Troubleshooting the Database

A customer sent me a short email about formatting in their new system. Simple enough, I created a sample in the system and sent them a specification for formatting, and told them “just delete the sample when you’re done”.

Alas. Ten minutes later, an email arrives “I deleted the sample… everything is gone and I can’t log in?!”

Shock.

Shock fades into denial – it must be the platform. Right? A database failure, or a virus. Glancing at the database, it appears to have rolled back 3 months. I sent off a support request to Heroku and they pulled a friendly face of “that’s very odd, our databases don’t do that kind of thing on their own”.

So, I took a closer look, and sure enough there are suggestions it isn’t just a rolled-back database – a few missing things were created before a few present things, so it isn’t just a plain rollback, more like a systematic deletion. And the automated database backup hasn’t run since last night, so the most recent work by the customer is lost.

Despite it apparently being a bug in the delete method, no matter how hard I tried, I wasn’t able to reproduce it in the CMS locally or on production. I created almost exact replicas of the structure prior to the error, and couldn’t get a repeat of the bug – maybe there was something malicious at play?

I played around for hours, loading up content, deleting it, renaming it, trying to be malicious, and nothing came up. I pored over the logs from the client working on the CMS,  and as I got further back in time noticed that they were testing content in the “client” side application as they added it.

I connected the client side application and worked through some exercises, deleted a few from the CMS. Nothing. Edited a few more things, completed a few more exercises, hit delete.

Long pause… “Signed out of CMS”. Ok… How odd. I couldn’t sign back in, either – user not found. I’m a system administrator, how can I be not found! This sounds exactly like the reported problem. There’s a lot to be said for examining the logs carefully – not just the immediate problem, but the surrounding context. Get a feel for what the user was doing when they encountered the problem!

Another database dump, and sure enough, my user account is deleted, along with quite a few (but not all) of the slides. Mixed feelings at this point – on one hand, I’ve reproduced the bug and am making some headway. On the other hand, the bug and subsequent data loss must be my fault – depressing.

At this point I wrote down the facts as I knew them

  • The user needs to use the application as well as the CMS for the bug to occur
  • A few User(s) are deleted
  • Many Exercises are deleted

Not much links users to exercises in the system. Just one little table called “UserProgress” which has a many-1 relationship with users and exercises, which entries are created in as the user progresses through exercises. A quick check of the database, and indeed all UserProgress rows are deleted as well. A lot of fingers now point at this relation.

Opening up the rails models, the UserProgress model has dependent => destroy set on it’s relations to User and Slide. On the other end, that’s also set – so if one end is destroyed, so is the other, regardless of which end is destroyed. This is not good. If a slide is destroyed here, it cascades – all the Exercise’s Progress relations are deleted, then all the Progress’s Users, then all the User’s Progresses, then all the Progress’s Exercises. We have come full circle – and it doesn’t have to end here… This explains why almost all users, and absolutely all slides vanished from the production database.

I created a simple test case and deleted a slide the user has marked progress on. The Bug is reproduced! This is a vital step in any troubleshooting. Another two unrelated deletion-of-progress bugs raise their heads – unrelated in the sense they don’t help fix the initial bug, but they take an hour to fix (they get in the way of running test cases for the show-stopper bug).

A quick removal of the offending dependent => destroy association from the UserProgress model (http://guides.rubyonrails.org/association_basics.html – deleting a customer deletes orders, which is sensible. Not the other way round!) and I am ready to re-run my test case.

Fixed! Hurray – off to the autobuild it goes.

Lessons learned:

  • Bugs usually clump together around bad code
  • Bad code is usually the less tested code
  • “System” level bugs that don’t occur until everything comes together require more complex tests
  • Be humble, expect the mistake to be in your code, not others
  • Don’t just throw answers together until they work. (i.e. putting the association property on both sides “just to be sure”)

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s