The Met Office is embracing what it calls 'Chaos Days' -- where it purposefully introduces failures into a clone of its cloud environment -- as a way to test its newly formed CloudOps team and the resiliency of its cloud infrastructure.
The Met Office is the UK's national weather service and provides forecasts to consumers and the private sector, as well as data science around the issue of climate change. This makes it a highly data-intensive organisation, so a shift towards the cloud is an understandable move.
Richard Bevan, head of operational technology at the Met Office, told Computerworld UK that the Met Office has traditionally run an on-premises delivery model, but is increasingly shifting to the public cloud. He said: "Our strategy is to have an in-house cloud capability and we have developed a CloudOps team over the past twelve months to do so."
The Met Office has been working with consultancy Cloudreach to build out its CloudOps team and to aid in the adoption of AWS infrastructure. The first application to move to a cloud delivery model is its set of media-facing APIs, which are used to include weather information in web and mobile apps.
One practice Cloudreach promotes amongst its clients is the "Chaos Day", where the CloudOps team is encouraged to break parts of its own infrastructure. After spinning up a clone of the AWS cloud environment, the team spends the day breaking small parts of the system, then investigating what went wrong and how to fix it. This doubles as a training exercise for the team and shows them where the gaps in their knowledge and documentation are.
In promoting the exercise, Cloudreach took inspiration from Chaos Monkey, an open source software tool developed by Netflix engineers to test the resiliency and recoverability of their Amazon Web Services (AWS) infrastructure. The Met Office is currently working with AWS, but Bevan said that it isn't committed to a single vendor.
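For readers unfamiliar with the idea, the core of a Chaos Monkey-style exercise can be sketched in a few lines of Python. This is an illustrative simulation only, not the Met Office's or Cloudreach's tooling and not Netflix's implementation: the `Environment` and `Instance` classes and the `inject_failure` function are hypothetical names, and no real AWS calls are made. It picks one running instance at random in a cloned environment, stops it, and refuses to touch anything marked as production.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Instance:
    # Hypothetical stand-in for a cloud server; not a real AWS object.
    instance_id: str
    running: bool = True


@dataclass
class Environment:
    # Hypothetical model of an environment, e.g. a clone made for a Chaos Day.
    name: str
    is_production: bool
    instances: list[Instance] = field(default_factory=list)


def inject_failure(env: Environment, rng: random.Random) -> Instance:
    """Stop one randomly chosen running instance in a non-production environment."""
    if env.is_production:
        # Safety guard: only ever break the clone, never production.
        raise PermissionError(f"refusing to break production environment {env.name!r}")
    candidates = [inst for inst in env.instances if inst.running]
    if not candidates:
        raise RuntimeError("no running instances left to break")
    victim = rng.choice(candidates)
    victim.running = False
    return victim


if __name__ == "__main__":
    clone = Environment(
        name="chaos-clone",
        is_production=False,
        instances=[Instance(f"i-{n:04d}") for n in range(5)],
    )
    victim = inject_failure(clone, random.Random())
    print(f"Stopped {victim.instance_id}; the team now has to diagnose and recover it.")
```

In a real exercise the "stop" step would be carried out against the cloned cloud environment by whoever is running the day, and the interesting part is what follows: how quickly the team notices, and whether the documentation is good enough to recover.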
James Wells, a systems developer at Cloudreach, helped the Met Office run its first Chaos Day last month. He said: "We wanted to see what we don't know and iron that out." He explained that the aim is to "discover problems you may not have seen before. You see the documentation and gaps where you need to improve".
Wells has some tips for any CloudOps team looking to run its own Chaos Day, and said it's important to pitch the difficulty of the challenges at a level that keeps staff engaged.
His advice is:
Know your team. If your team is mainly networking specialists, it's going to be easy for them to find networking problems. If you've got a mix, do a range of things so everyone gets a chance to share their knowledge.
Be careful! In the cloud you can create copies of environments to test these things with. So spin one up. Don't risk your production data if you don't need to.