So, where did this release go wrong? There were three main problems:
- A switch failure in one of the data centres that temporarily rendered some servers inaccessible.
- Mistakes in the manual actions in the release plan.
- A data inconsistency between two different data centres.
Unexpected Events During a Release
Murphy's Law: Anything that can go wrong, will go wrong.
Unfortunately, no matter how well you plan, there will always be things that you can't anticipate. In this particular case a hardware switch failed right in the middle of the release. An erroneous configuration in the data centre caused this failure to render a number of our servers unreachable. We were therefore mid-release, with no way forward and only an arduous rollback process to restore us to a working state.
Fortunately, thanks to diligent disaster recovery work undertaken previously, we were able to fail the entire site over to the alternative data centre. A pragmatic decision was eventually taken that we couldn't wait on the chance that we MIGHT be able to complete the release, so we rolled back. This was well planned and went smoothly enough that within 30 minutes of a new switch coming online we were back up and running on the original software versions. Nice work.
However, even flushed with the success of our miraculous escape from the clutches of Murphy, we have to consider what might have been. Our rollback was a largely manual process and, although rehearsed in the test environment, production is just different enough to add a level of uncertainty to the rollback process. There's also a greater chance of errors in manual steps when under the pressure of a live situation.
So how could we be more certain of our rollback plan?
Untested Manual Release Plans
Our release had a very detailed plan that had been well reviewed beforehand. A variation of the plan had been executed in test but, as I already said, the production environment is just different enough to require a different (and slightly more complex) plan. This release also had some extra complexities requiring a number of additional manual steps.
What we discovered during the release was that some of the steps specific to production were in fact not quite correct. Couple this with the chance of making errors when executing manual steps and we have a number of places with a high potential for failure.
While the plan did run pretty smoothly, how could we reduce the chances of errors and mistakes?
Problems That Can't Be Found Before Production
The final problem encountered on the release was one of database consistency. Databases in two different data centres were thought to be consistent. It turns out they were not! With hindsight, we should have checked this and been continually monitoring them over time. We live and learn.
What should really have happened is that we detected the possibility of inconsistent databases before we got to production. Unfortunately, all the lower environments only have a single data centre model for the database in question, so there was no way that anything could ever get out of sync in those cases. Hence, no way to detect the problem before reaching production.
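For illustration, something as simple as the following sketch, run on a schedule, could have turned the drift into a monitoring alert long before release day. The table name, the ordering column and the sqlite stand-in connections are all assumptions for the example, not our actual setup:

```python
# Hypothetical sketch: periodically compare a checksum of the same table
# in each data centre and alert when they drift apart.
import hashlib
import sqlite3  # stand-in for whichever database driver is actually in use


def table_checksum(connection, table):
    """Build a checksum over every row, ordered by primary key."""
    digest = hashlib.sha256()
    for row in connection.execute(f"SELECT * FROM {table} ORDER BY id"):
        digest.update(repr(row).encode("utf-8"))
    return digest.hexdigest()


def check_consistency(dc1_conn, dc2_conn, table):
    """Return True when both data centres hold the same data."""
    return table_checksum(dc1_conn, table) == table_checksum(dc2_conn, table)


if __name__ == "__main__":
    dc1 = sqlite3.connect("dc1.db")  # in reality: connection to data centre 1
    dc2 = sqlite3.connect("dc2.db")  # in reality: connection to data centre 2
    if not check_consistency(dc1, dc2, "accounts"):
        print("ALERT: the two data centres have diverged for table accounts")
```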
How can we find these sorts of problems earlier?
Automate Releases, Equivalent Environments
Fortunately, all of the above problems can be fixed with two solutions. I wanted to say SIMPLE solutions, but unfortunately they are not particularly simple and require a fair amount of work to get right. These solutions are:
- Automate the release and rollback pipelines.
- Ensure Integration and Test environments are equivalent to Production.
In a modern computing environment there is really no need to include manual steps in a release. Even when releases have lots of complexity it should be possible to script all the required steps. An automated script can be tested, fixed and re-tested many times in order to ensure it is correct before running it against a production environment. This is much better than a manual process that is difficult to test and open to human error.
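As a rough sketch only (the step and command names below are invented, not our actual release plan), the key idea is that every forward step is paired with a rollback action, and the whole sequence can be rehearsed end to end in a test environment before it ever touches production:

```python
# Hypothetical sketch of a scripted release: every step has a matching
# rollback action, so the same code can roll the release forward or back.
import subprocess


def run(command):
    """Run a shell command, failing loudly if it does not succeed."""
    print(f"running: {command}")
    subprocess.run(command, shell=True, check=True)


# Each release step pairs a forward action with the command that undoes it.
# These commands are placeholders, not a real release plan.
STEPS = [
    ("deploy-app --version 2.1", "deploy-app --version 2.0"),
    ("migrate-db --to 42", "migrate-db --to 41"),
    ("switch-traffic --to 2.1", "switch-traffic --to 2.0"),
]


def release():
    completed = []
    try:
        for forward, backward in STEPS:
            run(forward)
            completed.append(backward)
    except subprocess.CalledProcessError:
        print("release step failed, rolling back completed steps")
        for backward in reversed(completed):
            try:
                run(backward)
            except subprocess.CalledProcessError:
                print(f"rollback step failed, needs manual attention: {backward}")
        raise


if __name__ == "__main__":
    release()
```

Because the plan is just code, it can be run against the test environment over and over until both the release path and the rollback path are proven.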
Don't get me wrong, creating automated releases is hard, but smoother, faster and more frequent releases to production deliver enough business benefit to make the effort worthwhile.
Equivalent Environments
A manufacturer building a product wouldn't create a prototype and then go straight into mass production with a design that varied significantly from that prototype. Companies building software shouldn't do so either. The purpose of Continuous Integration is to find problems early when they are quicker and cheaper to fix. In order to work well, the lower environments need to match production in terms of applications, server structure, configuration and so on.
Additionally, if we are to build automated deployment plans then we need production-like environments to develop and test them against. An unproven automated plan is of no more value than a fully manual one.
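To make that concrete, here is a purely illustrative sketch (the host names and environment layout are invented) of one deployment routine driven by per-environment configuration, so the test environment preserves the same two-data-centre structure as production, just on smaller or virtualised hardware:

```python
# Purely illustrative: one deployment routine, several equivalent environments.
# The structure (two data centres, the same server roles) is preserved all the
# way down to the test environment; only the host lists differ.
ENVIRONMENTS = {
    "test": {
        "dc1": ["test-dc1-app1"],                  # e.g. virtualised hosts
        "dc2": ["test-dc2-app1"],
    },
    "production": {
        "dc1": ["prod-dc1-app1", "prod-dc1-app2"],
        "dc2": ["prod-dc2-app1", "prod-dc2-app2"],
    },
}


def deploy(environment, version):
    """The same plan runs against every environment; only the hosts change."""
    for dc, hosts in ENVIRONMENTS[environment].items():
        for host in hosts:
            print(f"deploying {version} to {host} in {dc}")  # real deploy step here


deploy("test", "v2.1")            # rehearse the automated plan
# deploy("production", "v2.1")    # then run the same plan for real
```

Even a model this small gives the automated plan something realistic to prove itself against before release day.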
Comments
Very insightful - hope all the relevant parties have it put in front of them!
Hum... Does google/amazon/akamai have a replica of Prod?
Scaling Test or Int to be like Prod at .com scale takes serious time and money, which you can trade off against risk and wasting time doing releases that do not work.
:-)
Just picked this up in Kathmandu, catch up next week for our new improved release plan :-(
It would be ideal if Test and Prod were identical. Looking at Greg's comment, do we know if the trade-off is saving us time and money in the long run?
I think Greg makes a very valid point. For large-scale .com platforms, having an exact, identical copy of live is just not feasible. That's why in this blog entry I use the word 'equivalent' and not 'identical' (I did have one place where I accidentally used identical, but I've corrected that now).
An equivalent environment is one that has all the similarities of the production environment, even if it is not completely identical. For example, if a production deployment targets multiple data centres then a test environment should attempt to provide an equivalent model. Now, this may just be one server running multiple virtualised containers that give the impression of two data centres, but it allows the software and automated release process to be tested in a realistic environment.
Ah... now I get asked nearly constantly for "identical" so that we can run perf tests etc., and others ask for logically similar. :-)
The costs of an environment are not just the tin; you need to include power/cooling, storage, networking (Cisco can be quite expensive!), build, decom, hardware support, sys admin time, licensing, monitoring etc., which can easily add up to 3x or 4x the cost of the tin (probably 10x over its life). I would love to find the www page, but Amazon/google factor that power (power/cooling) will cost more than anything else combined these days, and they have heavily automated systems.
Anyhooooo. I am not sure what the answer is, and someone usually gets to make the call.
Last post 25-Nov-2010 20:07 (and this one) by Greg BTW.