Lessons learnt from production problems

At last week’s dev meeting we swapped war stories about problems encountered in production, how we tracked down the root cause and lessons we learnt. Here are some highlights:

  • Environment is very important. When trying to reproduce a bug, you need to try to replicate the environment as closely as possible. Even very minor changes to libraries, frameworks, subsystems or operating system patches can make a major difference to whether the bug is seen or not.
  • “We haven’t touched anything”. The person telling you that nothing has changed may well believe this is the case. However, if a system mysteriously stops working it’s best to actually verify that this is the case. There was a bit of moaning about Windows at this point. At least with text file based configuration you can easily do a diff. Not so easy when someone has unchecked a vital property in Windows config settings.
  • Test your systems like a real user. We discussed a problem on a website where the functionality in question had been tested thoroughly by people within the organisation. Unfortunately, when real external users came to use the site, it didn’t work as expected. This was due to a private address being used that could only be seen within the corporate network. So it worked for internal users, but not for real customers. If it had been tested via an external network this would have come to light before real users hit problems.
  • The bug may be in old code. It’s possible that the bug you are seeing has been lying dormant for years and is being triggered by some innocuous change. We talked about a situation where a new release would cause a site to have severe performance problems. Much time was spent looking at all the changes going into that release to see if there was some change to database access or the like causing the problem. In the end it transpired that a small change to the cookies used was triggering a latent bug in a script used for load balancing. This script had been running fine in production for years until this seemingly minor change caused it fork processes like crazy and bring down the site.