Testing in Production (TiP) Hazards - Follow up.

For my first post on Test Republic I thought I'd start with my current favorite topic, testing web services in production, and in this case the hazards thereof. I wrote a brief post on the hazards of Testing in Production (TiP) on my MSDN blog. I don't want to just repost it here, but I do have some additional thoughts to share since that post.

The post was motivated by the outage in the Microsoft Bing service last week. The outage was reported to be caused by a configuration change in support of a test cluster that went awry. The full post can be found here (http://blogs.msdn.com/kenj/), and I copied much of it below so you don't have to go to my MSDN blog. What I wanted to do here was share the recommendations that I came up with and that others shared, both within Microsoft and via other friends. My purpose in sharing on Test Republic is to solicit more input on the hazards of Testing in Production.

If you are going to test in production, and in particular abandon your test lab for the ability to leverage production as a place to test your code, which I believe is the right direction to go, you must manage the new risk it introduces. Here is my current list of vital risk mitigation factors:

1. A fully automated deployment system
a. With fully automated (and rapid) roll-back
2. Rock solid controls on change management approval
3. Every change must be a metered change so that it cannot roll across all of production too quickly
a. This way even a catastrophic mistake will not affect all of production
b. If a change is needed of course aim for off peak hours
4. Service monitoring needs to report which version of the production or test config is throwing the errors, so that operations can recognize whether new failures are related to a change in the service code or configuration, or to another problem such as hardware failure, network, or load.
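
Item 4 is perhaps the easiest to picture in code. What follows is only a minimal sketch, assuming every error record carries the build and config version of the server that produced it. The `ErrorEvent` fields, the `suspect_change` helper, and the 0.8 threshold are all made up for illustration and do not describe any real monitoring system.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorEvent:
    """One error record; all field names here are hypothetical."""
    server: str
    build_version: str   # version of the service code on that server
    config_version: str  # version of the configuration on that server


def errors_by_version(events):
    """Count errors per (build_version, config_version) pair."""
    return Counter((e.build_version, e.config_version) for e in events)


def suspect_change(events, baseline, threshold=0.8):
    """Return a non-baseline version pair that accounts for at least
    `threshold` of all errors, or None if no such pair exists.

    None suggests the failures are not tied to one change, which
    points operations toward hardware, network, or load instead.
    """
    counts = errors_by_version(events)
    total = sum(counts.values())
    if total == 0:
        return None
    for pair, count in counts.most_common():
        if pair != baseline and count / total >= threshold:
            return pair
    return None
```

If nine out of ten errors come from servers running a test config, `suspect_change` returns that version pair and operations know where to look first. If the errors are spread across versions, it returns `None`.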

What do you think is needed to mitigate the risks of testing in production?

Parts of the original blog post, so you don't have to follow the link.

Bing Outage exposes Testing in Production (TiP) Hazards
04 December 09 06:28 PM
I have been a big proponent of shipping services into production more frequently and conducting more testing on the code once it is in production. Those of us that support this methodology tend to call it TiP (Testing in Production). Find links to previous blog posts on this subject below.

After the recent Bing outage (evening of 12/3/2009), I find myself thinking about the hazards of TiP and thought I might make a post about some lessons I have drawn from this production outage and what has been written about it so far. ZDNet posted a somewhat sarcastic blog with the title "Microsoft is making progress on search: You noticed Bing's glitch." According to the official blog post by the Bing team (here), “The cause of the outage was a configuration change during some internal testing that had unfortunate and unintended consequences.”

Despite this black mark, I still believe TiP is the right direction to go for services testing but clearly there are some hazards and lessons we can extrapolate.

These two posts imply that the outage was widespread, noticed by a lot of individuals, and caused by an errant configuration change in support of a test. My assessment is that while there was clearly an attempt to run a test configuration in production, the test did not cause the outage. The problem arose when the test configuration change somehow went to all of production.

The core concept of TiP is to minimize risk through TiP-ing. To accept the risk of putting less stable code into production in order to run tests, that code must be easily sandboxed. Whatever happened here was likely a configuration management mistake, not a testing error.
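
One way to picture "easily sandboxed" is a guard that refuses to apply a test change anywhere outside the test clusters. This is a minimal sketch under my own assumptions, not a description of how Bing or any other team does it. `ConfigChange`, `ScopeViolation`, `check_scope`, and the cluster names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfigChange:
    """A proposed configuration change; all fields are hypothetical."""
    change_id: str
    is_test: bool
    target_clusters: frozenset


class ScopeViolation(Exception):
    """Raised when a test change would reach a non-test cluster."""


def check_scope(change, test_clusters):
    """Return the change if it is safe to apply, else raise ScopeViolation.

    A change marked as a test may only target clusters that are
    listed in `test_clusters`. Non-test changes pass through and are
    expected to go through normal change management approval.
    """
    if change.is_test:
        outside = change.target_clusters - frozenset(test_clusters)
        if outside:
            raise ScopeViolation(
                f"test change {change.change_id} targets "
                f"non-test clusters: {sorted(outside)}"
            )
    return change
```

The point of the design is that the check runs inside the deployment automation itself, so a test change aimed at production by mistake is stopped before it moves rather than discovered afterwards.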

The reality is that the Bing system is very automated. The team has shared some information about their infrastructure, so I won’t go into details here lest I share something not disclosed. An outage like this, where a test configuration change impacts production, is clearly a case of fast-moving automation.

In order to enable TiP and to take more risk into production, the change management system of a service must be rock solid and fully automated. From what has been shared, they clearly have a state-of-the-art system. In fact, it is likely this state-of-the-art system that allowed the errant change to propagate so quickly and require a full rollback.

Therefore the gap must be in the safety mechanisms meant to prevent such a mistake, in combination with how fast the mistake rolled out to all environments. Another factor in successful TiP is metering of change in production. This change just moved too fast, and while the Bing system is highly automated, it still takes a long time to undo a change across so many servers.
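
To make "metering of change" concrete, here is a minimal sketch of a staged rollout that stops and rolls back as soon as a health check fails. It assumes the caller supplies `apply_change`, `roll_back`, and `is_healthy` callables. Those names and the stage fractions are hypothetical, and a real system would also wait some time between stages before trusting the health check.

```python
def metered_rollout(servers, apply_change, roll_back, is_healthy,
                    stage_fractions=(0.01, 0.10, 0.50, 1.0)):
    """Apply a change in growing stages; undo it on the first failure.

    Each stage extends the change to a larger fraction of the fleet.
    After every stage, all servers changed so far are health-checked.
    If any is unhealthy, every changed server is rolled back and the
    rollout stops, leaving the rest of production untouched.

    Returns (succeeded, servers_touched).
    """
    servers = list(servers)
    done = []
    for fraction in stage_fractions:
        # Cumulative number of servers that should have the change
        # once this stage completes (at least one).
        target = max(1, int(len(servers) * fraction))
        for server in servers[len(done):target]:
            apply_change(server)
            done.append(server)
        if not all(is_healthy(s) for s in done):
            for server in reversed(done):
                roll_back(server)
            return False, done
    return True, done
```

With 100 servers and the default stages, a change that breaks during the second stage has touched only 10 servers when it is rolled back. The other 90 never see it, which is the whole point of a metered change.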


Tags: SOA, Service, TiP, Web



Comment by Ken Johnston on December 9, 2009 at 12:00am
Within Microsoft we have Pre-Prod, but the typical use of Pre-Prod is a one-time run of a deployment shortly before a full rollout to production. I find that to be a very expensive safety net that, well, quite often fails us anyway. So I'm not in favor of the classical internal Microsoft version of PPE. Having many TCs in the data center, such that they are not exposed to end users and can be used for testing 24x7, is where I want to head.

Your comment about off-peak made me rethink that point. I commented on my MSDN blog about it but will comment again here. Basically, if a team has an off-peak period, it is okay to consider using it for higher-risk changes. The problem with thinking about deployments and change as an off-peak activity is that you start accepting corner cutting on deployment, because you are justifying off-peak as the way you will mitigate risk. It also assumes you won't have a world-class, high-demand service, and who wants to start designing with that in mind?

Comment by Ram on December 7, 2009 at 12:20pm
Good start on the Risk mitigation list.

I hope you aren't taking away the option of thorough testing in pre-production before making any changes to the production environment? There are still tremendous advantages to verifying changes in pre-prod before moving into production.

I totally agree with the staged / metered approach for testing in the production environment (3rd bullet). However, off-peak hours might be contextual, especially for services and apps like Bing, MSN, Twitter, Facebook, etc., where usage is about the same around the clock. But I believe taking a metered approach might mitigate this issue as well.

I would like to add one more item: dry-run the changes and write down all the pros and cons, along with the possible and potential impacts on related features and components. This would help track down the issue more quickly even if something goes wrong.
