
How the UK Met Office embraces 'chaos' to test its new cloud infrastructure

Scott Carey | July 6, 2017
The Met Office has run its first Chaos Day to test the capability of its CloudOps team and spot weaknesses in its new public cloud infrastructure.

Make backups that you can quickly restore from. You might break something you didn't intend to. Make sure you have a rollback and restore plan for every 'breakage' you make, so that you can fix any unintended consequences quickly!
Start simple -- real breakages and accidental changes often involve simple things too. Once you see how the team responds, you can increase the difficulty, for example by breaking multiple things at once.

Don't be tempted to be too clever too early. Remember, the goal is to find areas for improvement, not to defeat your CloudOps team!

Timebox the breakages - capping each one at about 30-45 minutes will help keep people engaged without losing focus.
Audit tools such as AWS CloudTrail can be your undoing with a clever team looking for changes. You can avoid this somewhat by using different users, or by having something such as a Lambda function or a cron job on an instance trigger the changes. Ultimately, however, you'll probably have to restrict your teams from jumping straight to CloudTrail, or it will get pretty boring fast!
Try to present your problems to the CloudOps team as users would: an email with screenshots, error messages and so on.
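
As a rough illustration of the rollback and timebox tips above, here is a minimal Python sketch. It is not the Met Office's tooling: the `Breakage` class, the `run_breakage` function and the in-memory `config` dictionary are all hypothetical names invented for this example, and a real Chaos Day would replace the `inject` and `rollback` callables with actual infrastructure changes. The idea is simply that every breakage is defined together with its undo step, and that the undo runs automatically if the team has not fixed the problem when the timebox expires.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class Breakage:
    """One deliberate fault, paired with the step that undoes it (hypothetical)."""
    name: str
    inject: Callable[[], None]    # makes the breaking change
    rollback: Callable[[], None]  # restores the previous state
    timebox_minutes: int = 45     # upper end of the 30-45 minute range


def run_breakage(breakage, is_resolved, poll_seconds=30.0,
                 clock=time.monotonic, sleep=time.sleep):
    """Inject a fault, then wait until the team fixes it or the timebox expires.

    Returns True if is_resolved() reported a fix in time. If the timebox
    expires, or an error occurs while waiting, the rollback is run.
    """
    breakage.inject()
    resolved = False
    try:
        deadline = clock() + breakage.timebox_minutes * 60
        while clock() < deadline:
            if is_resolved():
                resolved = True
                break
            sleep(poll_seconds)
    finally:
        if not resolved:
            breakage.rollback()
    return resolved


# Simulated usage: a dictionary stands in for real infrastructure state.
config = {"port_443_open": True}

close_port = Breakage(
    name="close-port-443",
    inject=lambda: config.update(port_443_open=False),
    rollback=lambda: config.update(port_443_open=True),
    timebox_minutes=30,
)

# The "team" fixes the problem by reopening the port, so the check passes
# on the first poll and no rollback is needed.
config_fixed_by_team = run_breakage(
    close_port,
    is_resolved=lambda: config.update(port_443_open=True) or config["port_443_open"],
)
```

Passing `clock` and `sleep` as parameters is a design choice for this sketch only: it lets you rehearse the timeout path with a fake clock before the day, rather than discovering a broken rollback 30 minutes into a live exercise.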

