For my first post on Test Republic I thought I'd start off with my current favorite topic: testing web services in production and, in this case, the hazards thereof. I wrote a brief post on the hazards of Testing in Production (TiP) on my blog on MSDN. I don't want to just repost that blog here, but I do have some additional thoughts to share since that post.
The post was motivated by the outage in the Microsoft Bing service last week. The outage was reported to be caused by a configuration change in support of a test cluster that went awry. The full post can be found here (http://blogs.msdn.com/kenj/), and I copied much of it below so you don't have to go to my MSDN blog. What I wanted to do here is share the recommendations that I came up with and that others shared, both within Microsoft and via other friends. My purpose in sharing on Test Republic is to solicit more input on the hazards of Testing in Production.
If you are going to test in production, and in particular abandon your test lab for the ability to leverage production as a place to test your code, which I believe is the right direction to go, you must manage the new risk it introduces. Here is my current list of vital risk mitigation factors:
1. A fully automated deployment system
a. With fully automated (and rapid) roll-back
2. Rock solid controls on change management approval
3. Every change must be a metered change so that it cannot roll across all of production too quickly
a. This way even a catastrophic mistake will not affect all of production
b. If a change is needed, of course aim for off-peak hours
4. Service monitoring needs to alert on which version of the production or test configuration is throwing the errors, so that operations can recognize whether new failures are related to a change in the service code or configuration, or to another problem such as hardware failure, network issues, or load.
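To make the list above concrete, here is a minimal sketch of how metered deployment, automated roll-back, and version-tagged health checks might fit together. All names (`metered_rollout`, `healthy`) and the 5% error threshold are hypothetical illustrations, not anything Bing has disclosed:

```python
def healthy(cluster_name, version, error_rates):
    """Check the error rate reported for this cluster/version pair.
    Keying metrics by config version is what lets operations tell a
    bad change apart from a hardware, network, or load problem."""
    return error_rates.get((cluster_name, version), 0.0) < 0.05  # assumed threshold

def metered_rollout(clusters, new_version, old_version, error_rates, batch_size=1):
    """Roll a change out one batch of clusters at a time; if any batch
    turns unhealthy, automatically roll every touched cluster back.
    Returns (succeeded, number_of_clusters_touched)."""
    deployed = []
    for i in range(0, len(clusters), batch_size):
        batch = clusters[i:i + batch_size]
        for c in batch:
            c["version"] = new_version        # fully automated deployment step
            deployed.append(c)
        if not all(healthy(c["name"], new_version, error_rates) for c in batch):
            for c in deployed:                # fully automated (and rapid) roll-back
                c["version"] = old_version
            return False, len(deployed)
    return True, len(deployed)
```

The point of the batching is item 3a above: even a catastrophic mistake stops at the first unhealthy batch instead of rolling across all of production.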
What do you think is needed to mitigate the risks of testing in production?
Parts of the original blog post are included below so you don't have to follow the link.
Bing Outage exposes Testing in Production (TiP) Hazards
04 December 09 06:28 PM
I have been a big proponent of shipping services into production more frequently and conducting more testing on the code once it is in production. Those of us who support this methodology tend to call it TiP (Testing in Production). Find links to previous blog posts on this subject below.
After the recent Bing outage (evening of 12/3/2009), I find myself thinking about the hazards of TiP and thought I might post about some lessons I have drawn from this production outage and what has been written about it so far. ZDNet posted a somewhat sarcastic blog with the title "Microsoft is making progress on search: You noticed Bing's glitch." According to the official blog post by the Bing team (here), the outage was, “The cause of the outage was a configuration change during some internal testing that had unfortunate and unintended consequences.”
Despite this black mark, I still believe TiP is the right direction to go for services testing but clearly there are some hazards and lessons we can extrapolate.
These two posts imply that the outage was widespread, noticed by a lot of individuals, and caused by an errant configuration change in support of a test. My assessment is that while there was clearly an attempt to run a test configuration in production, the test itself did not cause the outage. The problem was that the test configuration change somehow went to all of production.
The core concept of TiP is to minimize risk while TiP-ing. To accept the risk of less stable code in production in order to run tests, the less stable code must be easily sandboxed. Whatever happened here was likely a configuration management mistake, not a testing error.
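The sandboxing idea can be sketched in a few lines. This is a hypothetical illustration (the config keys, host names, and `effective_config` function are all made up, not Bing's actual mechanism): a test configuration applies only to an explicit allow-list of test machines, and everything else defaults to the production configuration, so an errant test change stays contained.

```python
# Production is the default; the test config is opt-in per host.
PROD_CONFIG = {"ranker": "stable", "timeout_ms": 500}
TEST_CONFIG = {"ranker": "experimental", "timeout_ms": 500}

# The only hosts permitted to see the test configuration.
TEST_SANDBOX = {"test-cluster-01"}

def effective_config(hostname):
    """Return the config a host should run. Because the test config is
    gated on an explicit allow-list, a mistake in TEST_CONFIG cannot
    propagate to all of production."""
    if hostname in TEST_SANDBOX:
        return TEST_CONFIG
    return PROD_CONFIG
```

The design point is that the safe state is the default: forgetting to update the allow-list leaves a host on production config, whereas the outage suggests the failure mode ran the other way.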
The reality is that the Bing system is highly automated. The team has shared some information about their infrastructure, so I won't go into details here lest I share something not disclosed. An outage like this, where a test configuration change impacts production, is clearly a case of fast-moving automation.
To enable TiP and take more risk into production, the change management system of a service must be rock solid and fully automated. Clearly, from what has been shared, they have a state-of-the-art system. In fact, it is likely this state-of-the-art system that allowed the errant change to propagate so quickly and require a full roll-back.
Therefore the gap must be in the safety mechanisms meant to prevent such a mistake, combined with how fast the mistake rolled out to all environments. Another factor in successful TiP is metering of change in production. This change simply moved too fast, and while the Bing system is highly automated, it still takes a long time to undo a change across so many servers.