Another airline, another bad day -- weeks after "performance issues" at Southwest Airlines led to hundreds of flight cancellations or delays, a datacenter fire resulted in a systemwide failure at Delta Air Lines.
Having seen a few airline back ends, I'm not very surprised. Are such outages avoidable? Nearly always. You simply have to create a high-quality system with sufficient resilience and redundancy baked in. It helps to have good processes and programmers, too.
Reliable, redundant infrastructure
No matter how good your software and servers are, you may not survive an outage without redundant network infrastructure. This starts with your switches and routers, includes equipment such as load balancers, and goes a bit beyond. For instance, you need backup power, supplied by a reliable provider that routes your power appropriately.
High availability (HA) protects against problems such as a server or service failing. Let's say the runtime your web service is built on crashes and dumps core. No user should know -- they should simply be routed to another instance of the service. To accomplish this, you need to run multiple instances behind load balancers, then replicate data between them appropriately.
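To make the routing idea concrete, here is a minimal sketch of a balancer that retries a request on the next instance when one fails. The names `Balancer`, `crashed`, and `healthy` are hypothetical, and plain Python callables stand in for real service instances. A production load balancer would also run health checks and handle timeouts.

```python
from itertools import cycle


class Balancer:
    """Round-robin over instances, skipping any that raise an error."""

    def __init__(self, instances):
        self.instances = list(instances)
        self._ring = cycle(self.instances)

    def call(self, request):
        # Try each instance at most once per request.
        for _ in range(len(self.instances)):
            instance = next(self._ring)
            try:
                return instance(request)
            except Exception:
                continue  # this instance is down; route to the next one
        raise RuntimeError("all instances failed")


def crashed(request):
    # Stand-in for an instance whose runtime has died.
    raise ConnectionError("instance unavailable")


def healthy(request):
    return f"ok: {request}"


balancer = Balancer([crashed, healthy])
print(balancer.call("book flight"))  # prints "ok: book flight"
```

The caller never sees the failed instance. It only sees an error if every instance is down, which is the case disaster recovery is meant to cover.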
Disaster recovery (DR) protects against less likely events. Often, DR is implemented with relatively low expectations (sometimes hours or even days before recovery) and/or acceptable data loss. I'd argue that on the modern internet, recovery should take minutes at worst, and the same goes for access to your data. To accomplish DR, you need another datacenter that's geographically distant and an ironclad failover scheme. You also need to get the data there.
If you use a cloud hosting provider like AWS, opt for servers in multiple regions in case disaster strikes. Some larger organizations go as far as using more than one cloud infrastructure provider.
Rules of replication
Replication comes in multiple forms. There is "now" replication, which you can accomplish between two servers in the same room or (at worst) a few miles away. Then there's WAN replication. The two are fundamentally different.
We're used to thinking in throughput. (How big is your pipe? 10 gigabits? 100 gigabits?) But that ignores latency. As distance increases, so does latency. If you have a datacenter in New York City and one in Arizona, your data will take time to travel even if it's replicated as fast as possible.
You won't use a transactional distributed cache with ACID transactions between the two sites because your data will not get there faster than the speed of light. It's 2,331 miles from New York City to Arizona, so with light traveling at 186,000 miles per second, you're talking at least 12.5ms each way, or about 25ms for every round trip a transaction has to wait on.
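As a rough check on the physical floor, the arithmetic can be worked through with the distance and speed quoted above. The fiber factor of two-thirds is an assumption added here (light travels at roughly that fraction of its vacuum speed in glass). Measured latency on real routes is higher still, because paths are longer than a straight line and equipment adds delay.

```python
SPEED_OF_LIGHT_MILES_PER_S = 186_000  # figure used in the text
FIBER_FACTOR = 2 / 3  # assumption: light in fiber moves at about two-thirds of c


def one_way_ms(miles, factor=1.0):
    """Minimum one-way travel time in milliseconds over a straight path."""
    return miles / (SPEED_OF_LIGHT_MILES_PER_S * factor) * 1000


distance = 2_331  # New York City to Arizona, per the text

vacuum = one_way_ms(distance)
fiber = one_way_ms(distance, FIBER_FACTOR)

print(f"vacuum: {vacuum:.1f} ms one way, {2 * vacuum:.1f} ms round trip")
print(f"fiber:  {fiber:.1f} ms one way, {2 * fiber:.1f} ms round trip")
# vacuum: 12.5 ms one way, 25.1 ms round trip
# fiber:  18.8 ms one way, 37.6 ms round trip
```

A synchronous commit needs at least one round trip, so every transaction would wait tens of milliseconds before any real-world overhead is counted.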
In reality, the internet operates slower than the speed of light even across fiber, because switches and packet routing add overhead. Thus, you need a form of replication that doesn't hold up server performance, but makes a best effort to transport the data with certain expectations of loss in the event that aliens take out your datacenter, "Independence Day"-style.
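One way to picture this kind of replication is a local write that returns immediately while a background worker ships the change to the remote site. The sketch below is illustrative only: `AsyncReplicator`, its `send` callback, and the dict-based stores are hypothetical stand-ins, not any particular product's API. Whatever is still in the backlog when the primary site is lost is the data you have agreed to lose.

```python
import queue
import threading


class AsyncReplicator:
    """Commit locally, then replicate in the background on a best-effort basis."""

    def __init__(self, send, max_backlog=1000):
        self._send = send  # hypothetical callable that ships one update to the remote site
        self._backlog = queue.Queue(maxsize=max_backlog)
        threading.Thread(target=self._drain, daemon=True).start()

    def write(self, local_store, key, value):
        local_store[key] = value  # local commit; the caller never waits on the WAN
        try:
            self._backlog.put_nowait((key, value))
            return True
        except queue.Full:
            return False  # backlog full: this update will not be replicated

    def _drain(self):
        while True:
            key, value = self._backlog.get()
            try:
                self._send(key, value)
            except Exception:
                pass  # best effort; a real system would retry or log here
            finally:
                self._backlog.task_done()

    def flush(self):
        """Block until the backlog is empty (useful for tests and clean shutdown)."""
        self._backlog.join()


primary, remote = {}, {}
replicator = AsyncReplicator(send=remote.__setitem__)
replicator.write(primary, "seat-12A", "booked")
replicator.flush()
print(remote)  # prints {'seat-12A': 'booked'}
```

The bounded backlog is a deliberate choice in this sketch: it caps how far the remote copy can fall behind, at the cost of dropping updates when the link cannot keep up.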