Troubleshooting the Database

A customer sent me a short email about formatting in their new system. Simple enough: I created a sample in the system, sent them a specification for the formatting, and told them “just delete the sample when you’re done”.

Alas. Ten minutes later, an email arrives: “I deleted the sample… everything is gone and I can’t log in?!”

Shock.

Shock fades into denial – it must be the platform. Right? A database failure, or a virus. Glancing at the database, it appears to have rolled back three months. I sent off a support request to Heroku, and they came back with a friendly “that’s very odd, our databases don’t do that kind of thing on their own”.

So I took a closer look, and sure enough there are signs that it isn’t a plain rollback: a few of the missing things were created before a few of the things still present, so it looks more like a systematic deletion. And the automated database backup hasn’t run since last night, so the customer’s most recent work is lost.

Despite it apparently being a bug in the delete method, no matter how hard I tried, I wasn’t able to reproduce it in the CMS locally or on production. I created almost exact replicas of the structure prior to the error, and couldn’t get a repeat of the bug – maybe there was something malicious at play?

I played around for hours, loading up content, deleting it, renaming it, trying to be malicious, and nothing came up. I pored over the logs from the client working on the CMS, and as I got further back in time I noticed that they were testing content in the “client” side application as they added it.

I connected the client side application and worked through some exercises, deleted a few from the CMS. Nothing. Edited a few more things, completed a few more exercises, hit delete.

Long pause… “Signed out of CMS”. Ok… How odd. I couldn’t sign back in, either – user not found. I’m a system administrator, how can I be not found! This sounds exactly like the reported problem. There’s a lot to be said for examining the logs carefully – not just the immediate problem, but the surrounding context. Get a feel for what the user was doing when they encountered the problem!

Another database dump, and sure enough, my user account is deleted, along with quite a few (but not all) of the slides. Mixed feelings at this point – on one hand, I’ve reproduced the bug and am making some headway. On the other hand, the bug and subsequent data loss must be my fault – depressing.

At this point I wrote down the facts as I knew them:

  • The user needs to use the application as well as the CMS for the bug to occur
  • A few Users are deleted
  • Many Exercises are deleted

Not much links users to exercises in the system. Just one little table called “UserProgress”, which has a many-to-one relationship with both users and exercises; rows are created in it as the user progresses through exercises. A quick check of the database, and indeed all UserProgress rows are deleted as well. A lot of fingers now point at this relation.

Opening up the Rails models, the UserProgress model has :dependent => :destroy set on its relations to User and Slide. On the other end, that’s also set, so if one end is destroyed, so is the other, regardless of which end is destroyed. This is not good. If a slide is destroyed here, it cascades: all the Exercise’s Progress relations are deleted, then all the Progress’s Users, then all the User’s Progresses, then all the Progress’s Exercises. We have come full circle, and it doesn’t have to end here… This explains why almost all users, and absolutely all slides, vanished from the production database.
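To make the cascade concrete, here is a minimal sketch in plain Ruby (no Rails, no database). It is not the real application code: the Store class, the Record struct and the sample data are all made up for illustration, and the both_directions flag is my own stand-in for having :dependent => :destroy declared on both ends of the association versus on the parent side only.

```ruby
# Plain-Ruby model of cascading destroys. Everything here is hypothetical:
# Store, Record and the sample data only imitate the behaviour described above.
Record = Struct.new(:kind, :id, :user_id, :exercise_id)

class Store
  # both_directions: true imitates :dependent => :destroy on BOTH ends of the
  # UserProgress associations; false cascades from parent to child only.
  def initialize(both_directions:)
    @both_directions = both_directions
    @rows = []
  end

  def add(kind, id, user_id: nil, exercise_id: nil)
    @rows << Record.new(kind, id, user_id, exercise_id)
  end

  def find(kind, id)
    @rows.find { |r| r.kind == kind && r.id == id }
  end

  def count(kind)
    @rows.count { |r| r.kind == kind }
  end

  def destroy(record)
    # Stop if the record is missing or already destroyed (ends the recursion).
    return unless record && @rows.delete(record)

    case record.kind
    when :user     then progress_for(:user_id, record.id).each { |p| destroy(p) }
    when :exercise then progress_for(:exercise_id, record.id).each { |p| destroy(p) }
    when :progress
      if @both_directions
        destroy(find(:user, record.user_id))         # child takes its user down
        destroy(find(:exercise, record.exercise_id)) # ...and its exercise
      end
    end
  end

  private

  def progress_for(column, id)
    @rows.select { |r| r.kind == :progress && r[column] == id }
  end
end

# Three users, three exercises; user 3 and exercise 3 have no progress rows.
def build_sample(both_directions:)
  store = Store.new(both_directions: both_directions)
  (1..3).each { |id| store.add(:user, id) }
  (1..3).each { |id| store.add(:exercise, id) }
  store.add(:progress, 1, user_id: 1, exercise_id: 1)
  store.add(:progress, 2, user_id: 1, exercise_id: 2)
  store.add(:progress, 3, user_id: 2, exercise_id: 2)
  store
end

def survivors_after_deleting_exercise_1(both_directions:)
  store = build_sample(both_directions: both_directions)
  store.destroy(store.find(:exercise, 1))
  { users: store.count(:user),
    exercises: store.count(:exercise),
    progress: store.count(:progress) }
end

# Both ends cascade: only user 3 and exercise 3 survive, no progress rows.
p survivors_after_deleting_exercise_1(both_directions: true)
# Parent-to-child only: exercise 1 and its single progress row are removed.
p survivors_after_deleting_exercise_1(both_directions: false)
```

With both directions enabled, deleting one exercise wipes out everything reachable through a progress row, which is the same pattern as the production data: only records that nobody had ever touched survive.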

I created a simple test case and deleted a slide the user has marked progress on. The bug is reproduced! This is a vital step in any troubleshooting. Another two deletion-of-progress bugs raise their heads. They are unrelated in the sense that they don’t help fix the initial bug, but they take an hour to fix, because they get in the way of running test cases for the show-stopper bug.

A quick removal of the offending :dependent => :destroy option from the UserProgress model’s associations (http://guides.rubyonrails.org/association_basics.html – deleting a customer deletes their orders, which is sensible. Not the other way round!) and I am ready to re-run my test case.

Fixed! Hurray – off to the autobuild it goes.

Lessons learned:

  • Bugs usually clump together around bad code
  • Bad code is usually the less tested code
  • “System” level bugs that don’t occur until everything comes together require more complex tests
  • Be humble: expect the mistake to be in your code, not in someone else’s
  • Don’t just throw answers together until they work (e.g. putting the association property on both sides “just to be sure”)

Using the Redmine Kanban

We’re working on a small project and wanted to see how well the Redmine kanban works – it fell to me to set it up.

I started by following the instructions from the Kanban plugin website. However, it requires a few extra steps to get it working at all:

  • Install the gem block_helpers (gem install block_helpers)
  • Read the README.rdoc that comes with the project. To save you some time, here it is in short: go to the administration page in Redmine and configure the kanban plugin. There, you need to configure the pane settings. Any pane you want to use must be linked to a status within your own Redmine. There’s a list of recommendations in the readme, but it’s just common sense.

Now, how does it fare in use? Unfortunately it has a showstopper bug: moving something twice on the kanban logs you out. In addition to that, some users were unable to see the kanban at all. This means that the kanban plugin, as it stands, is unusable for anything but an ‘overview’ of the projects (and a non-interactive overview at that).

Seeing as development seems to have ended a while ago, this plugin isn’t worth installing, and we have removed it from our Redmine setup. If active development resumes, I will do a fuller review of the plugin.

Another CUDA post – how to move to CUDA compute 1.3 safely

CUDA compute capability 1.3 (and higher) adds features that you might want to use, but it also adds double-precision support. This can be troublesome in performance-sensitive applications, as double arithmetic is much slower than float. Here are some tips for making your code use only floats. Please comment with any other tips you have!

The fix is simply to label every float literal explicitly. For example, the compiler interprets 0.1 as a double. That can propagate through your code (an expression that mixes float and double is evaluated in double), and in an inner loop it could seriously hurt your performance. Whenever you use a floating point literal and don’t want a double, add an ‘f’ to the end, thus: ‘0.1f’.

If you’ve already written a massive amount of code without this, here are some tips for finding doubles that have crept in.

  • Regex for “[0-9]+\.[0-9]+” (note the escaped dot), and match the whole word (add word start and end tags, or tick match whole word in your editor). This will identify all ‘0.0’s, and ignores all ‘0.0f’s (this tip works for non-CUDA code too)
  • Add -keep to the NVCC compile options. You can then open the PTX in a text editor, and search for f64. If there are any occurrences of f64, your code is using doubles at some point. If you successfully did the above step there should be none.
  • Watch the compiler output for warnings about float and double conversion.
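The first two tips above can be scripted. What follows is a rough sketch rather than a finished tool: the helper names are made up, the regex is my own (slightly broader than the editor search, since it also catches forms like ‘2.’ and ‘1.0e-3’), and it does not skip comments or string literals, so expect some false positives. Literals written without a decimal point, such as ‘1e-3’, are not caught.

```ruby
# Hypothetical helpers for hunting stray doubles; all names are illustrative.

# Decimal literals such as 0.1, 2., .5 or 1.0e-3 with no 'f'/'F' suffix.
# The lookarounds reject matches glued to identifiers or to other digits,
# so 0.5f and 1.0e-3f are left alone.
UNSUFFIXED_FLOAT = /(?<![\w.])(?:\d+\.\d*|\.\d+)(?:[eE][+-]?\d+)?(?![\w.])/

# Returns [line_number, literal] pairs for each unsuffixed literal in source.
def find_double_literals(source)
  hits = []
  source.each_line.with_index(1) do |line, number|
    line.scan(UNSUFFIXED_FLOAT) { |literal| hits << [number, literal] }
  end
  hits
end

# Returns the lines of a PTX listing that mention the 64-bit float type.
def find_f64_lines(ptx)
  ptx.each_line.map(&:chomp).select { |line| line.match?(/\.f64\b/) }
end

# A made-up kernel with one literal that is missing its suffix.
kernel = <<~CUDA
  __global__ void scale(float *x) {
      x[threadIdx.x] = x[threadIdx.x] * 0.1 + 0.5f;
  }
CUDA

find_double_literals(kernel).each do |number, literal|
  puts "line #{number}: #{literal}"   # prints "line 2: 0.1"
end

# Two hand-written PTX lines, only to exercise the search.
sample_ptx = "cvt.f64.f32 %fd1, %f1;\nmul.f32 %f2, %f1, %f1;\n"
puts find_f64_lines(sample_ptx)       # prints the cvt.f64.f32 line
```

For real use, read the files from disk instead of the inline samples, for example File.read on your .cu source and on the .ptx file that -keep leaves behind (the file names depend on your build).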