One of the mantras of cloud computing is to prepare for failure. In fact, some cloud providers, including Microsoft and AWS, require customers to architect their systems for failure in order to meet the terms of the SLA. AWS, for example, requires that virtual machines be deployed across multiple Availability Zones (which are different data centers in AWS's cloud), and both copies of the VM must be unavailable for the SLA to be breached. Microsoft uses the term Availability Sets instead of Availability Zones, but the idea is the same. Customers must follow these best-practice architectures to ensure their systems comply with the terms of the SLA.
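The multi-zone rule described above can be written as a small check. This is only a sketch of the logic as the article states it, not any provider's actual SLA test; the `Instance` type, the zone labels, and the function name are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    zone: str        # hypothetical zone label, e.g. "zone-a"
    available: bool  # True if this copy of the VM is reachable


def sla_breach_possible(instances):
    """Return True only if copies span two or more zones and all are down.

    Assumption: this mirrors the rule as summarized in the article, where a
    deployment confined to one zone falls outside the SLA terms.
    """
    zones = {inst.zone for inst in instances}
    if len(zones) < 2:
        return False  # not deployed across multiple zones
    return all(not inst.available for inst in instances)
```

Under these assumptions, a deployment with one copy down and one copy still running in another zone returns False, as does a deployment where every copy sits in a single zone.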
Keep in mind that if you architect your system to be fault tolerant and to fail over to another VM or Availability Set, the failover itself could cause problems, such as a reboot. If your system goes down because it was not set up to handle a migration to a new set of VMs, that failure is not the provider's fault and will not count as a breach of the SLA. Tools like Chaos Monkey and Chaos Gorilla, part of Netflix's Simian Army, can help AWS customers test how well their systems tolerate outages.
In the example of the Texas company above, IT staff believed the outage was Microsoft's fault, which it was. But the service wasn't technically unavailable because web access was still an option, so it didn't count against the SLA. So if your app goes down, is it really your vendor's fault? Is the service unavailable from all access points? Similarly, sometimes cloud services go down but it's not the vendor's fault. For Microsoft's SLA to be breached, the service must be down because of "circumstances within Microsoft's control," the company states. When an outage occurs, check whether something on your end caused it. Is your network connection to the cloud good, for example? Customers have to prove that their vendor was at fault and that the service was truly down in order to be compensated for an SLA breach. Service health dashboards, where Microsoft and AWS report which services have been unavailable, are a helpful tool for determining whether your provider has had an outage.
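The questions above amount to a short checklist, sketched here in code. The function and parameter names are hypothetical, and the three conditions are simply the ones raised in this article, not the wording of any provider's SLA.

```python
def outage_qualifies(access_points, within_provider_control, customer_network_ok):
    """Rough checklist for whether an outage could count against an SLA.

    access_points maps a (hypothetical) access path name to True if the
    service is still reachable that way.
    """
    # The service counts as unavailable only if every access path is down.
    service_unavailable = bool(access_points) and not any(access_points.values())
    return service_unavailable and within_provider_control and customer_network_ok


# The Texas example: the provider was at fault, but web access still worked.
texas = {"desktop_client": False, "web": True}
texas_result = outage_qualifies(texas, within_provider_control=True, customer_network_ok=True)
```

In the Texas case `texas_result` is False, because one access point was still reachable even though the provider caused the problem.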
The cloud is a fast-moving industry, and offerings from providers can change. When offerings change, so too can the SLAs. Typically, SLAs will outline whether a provider has to notify customers of a change to the service or SLA, or whether customers should be prepared for a service disruption. But whether customers will be informed of changes can vary from provider to provider and from service to service. If a sudden change to a service would impact your workload, check that your provider will notify you of such changes.