So, where did this release go wrong. There were three main problems that were encountered:
- A switch failure in one of the data centres that temporarily rendered some servers inaccessible.
- Mistakes in the manual actions in the release plan.
- A data inconsistency between two different data centres.
Unexpected Events During a ReleaseMurphy's Law: Anything that can go wrong, will go wrong.
Unfortunately, no matter how well you plan, there will always be things that you can't anticipate. In this particular release we had a hardware switch fail right in the middle of the release. An erroneous configuration in the data centre caused this failure to render a number of our servers unreachable. We were therefore mid-release with no way to go forward and an arduous rollback process to restore us back to a working state.
Fortunately due to diligent disaster recovery work undertaken previously we were able to fail the entire site over to the alternative data centre. A pragmatic decision was eventually taken that we couldn't wait on the chance that we MIGHT be able to complete the release so we rolled back. This was well planned and went smoothly enough that within 30 minutes of a new switch coming online we were back up and running on the original software versions. Nice work.
However, even flushed with the success of our miraculous escape from the clutches of Murphy, we have to consider what might have been. Our rollback was a largely manual process and although tested in the test environment, our production is just different enough as to add a level of uncertainty to the rollback process. There's also the greater chance of errors in manual steps when under pressure of a live situation.
So how could we be more certain of our rollback plan?
Untested Manual Release PlansOur release had a very detailed plan that had been well reviewed before hand. A variation of the plan had been executed in test but, as I already said, the production environment is just different enough as to require a different (and slightly more complex) plan. Additionally, this release had some additional complexities requiring a number of extra manual steps.
What we discovered during the release was that some of the steps specific to production were in fact not quite correct. Couple this with the chance of making errors when executing manual steps and we have a number of places with a high potential for failure.
While the plan did run pretty smoothly, how could we reduce the chances of errors and mistakes?
Problems That Can't be Found Before ProductionThe final problem encountered on the release was one of database consistency. Database in two different data centres were thought to be consistent. It turns out they were not! With hindsight, we should have checked this and have been continually monitoring these over time. We live and learn.
What should really have happened is that we detected the possibility of inconsistent databases before we got to production. Unfortunately all the lower environments only have a single data centre model for the database in question so there was no way that anything could ever get out of sync in those cases. Hence, no way to detect the problem before reaching production.
How can we find these sorts of problems earlier?
Automate Releases, Equivalent EnvironmentsFortunately, all of the above problems can be fixed with two solutions. I wanted to say SIMPLE solutions, but unfortunately they are not particularly simple and require a fair amount of work to get right. These solutions are:
- Automate the release and rollback pipelines.
- Ensure Integration and Test environments are
identicalequivalent to Production.
In a modern computing environment there is really no need to include manual steps in a release. Even when releases have lots of complexity it should be possible to script all the required steps. An automated script can be tested, fixed and re-tested many times in order to ensure it is correct before running it against a production environment. This is much better than a manual process that is difficult to test and open to human error.
Don't get me wrong, creating automated releases is hard, but the benefit of smoother, faster and more frequent releases to production gives significant enough business benefit to make the effort worthwhile.
A manufacturer building a product wouldn't create a prototype and then go straight into mass production with a design that varied significantly from that prototype. Companies building software shouldn't do so either. The purpose of Continuous Integration is to find problems early when they are quicker and cheaper to fix. In order to work well, the lower environments need to match production in terms of applications, server structure, configuration and so on.
Additionally, if we are to build automated deployment plans then we need production like environments to develop and test them against. An unproven automated plan is of no more value than a fully manual one.