The project had four phases, Cowling says: "Build the system. Prove it correct. Scale it up and then optimise the hell out of it."
'Build the system'
"First you start from nothing," says Cowling. "Let's work out what we need to build, what our requirements are and also accept that we're not going to know the requirements a few years down the track. We were designing something for a small company, knowing we were going to be a big company."
His team took around six months to create the initial code using Python. During testing on standard hardware, the team rewrote the entire system in Go, with some elements in Rust, for greater efficiency and a smaller memory footprint.
"Designing a distributed storage system is a big challenge, but it's much harder to build one that operates reliably at scale, and supports all of the monitoring and verification systems and tooling that will ensure it's running correctly. It's also incredibly important to make technical decisions that are the right solution to the right problem, not just because they're cool and novel."
'Prove it correct'
"How could we guarantee to the company and to ourselves that this was correct?" says Cowling. "You can't just launch something haphazardly."
Having built the prototype, they then put it through its paces - injecting software failures and trying to simulate hardware failures.
"[We put] people on a plane to data centres to pull out circuit breakers on racks and we got a rack and boxed it up and waited for it to overheat and fail and made sure it'd come back up with the data. It was fun! Software's complicated but hardware fails in much more unexpected ways."
To be confident in their new system, the team began a 180-day countdown on a screen overlooking their San Francisco office work pod. The idea was to run the system without issue for the duration. They were on track, until day 40.
"We had a staging cluster - a copy of our test cluster - that we'd test out new code on. And we found a bug. It was pretty close. It didn't actually break the rules, but we wouldn't have felt good about ourselves launching."
They started again. And this time they made it. "We all clapped and we drank some champagne," says Cowling. "And then we were like - 'what's the next thing we're going to do?' - and we were straight back to work."
'Scale it up'
"Magic Pocket had to grow from our initial double-digit-petabyte prototypes to a multi-exabyte behemoth within the span of around six months - a fairly unprecedented transition," says Cowling. "We called it base jump. The metaphor is base jumping; you jump a cliff - we had to get our storage over this cliff with very little time to open the parachute and if not, it's not pretty!"