Code hosting site GitLab.com suffered a major service outage on 31 January 2017 after data from its primary database server was accidentally removed. Recovery was hampered because its standard backup procedure had failed to run.
"The standard backup procedure uses pg_dump to perform a logical backup of the database. This procedure failed silently because it was using PostgreSQL 9.2, while GitLab.com runs on PostgreSQL 9.6," said GitLab in a post-mortem report published on 10 February.
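A mismatch of the kind GitLab describes can, in principle, be caught before a backup job starts. The Python sketch below is illustrative only and is not GitLab's actual fix: the function names `major_version` and `assert_dump_compatible` are hypothetical, and it assumes the two version strings have already been collected, for example from `pg_dump --version` and from the server's `SHOW server_version`.

```python
import re


def major_version(text: str) -> str:
    """Extract the PostgreSQL major version from a version string.

    Before PostgreSQL 10 the major version has two parts (e.g. "9.6");
    from 10 onwards it is the first number only (e.g. "13").
    """
    match = re.search(r"(\d+)(?:\.(\d+))?", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    first, second = match.group(1), match.group(2)
    if int(first) >= 10 or second is None:
        return first
    return f"{first}.{second}"


def assert_dump_compatible(pg_dump_version_output: str, server_version: str) -> str:
    """Raise loudly if the pg_dump client and the server differ in major version.

    Hypothetical helper: requiring an exact match is a deliberately strict
    policy chosen for this sketch.
    """
    client = major_version(pg_dump_version_output)
    server = major_version(server_version)
    if client != server:
        raise RuntimeError(
            f"pg_dump is version {client} but the server is version {server}; "
            "refusing to start the backup"
        )
    return client
```

Called with `"pg_dump (PostgreSQL) 9.2.18"` and `"9.6.1"`, the check raises an error instead of letting the job continue unnoticed. The error still has to reach someone, so a real job would also need alerting on failure.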
Although the site came back online on 1 February after almost 18 hours offline, some production data were lost and could not be recovered.
"Specifically, we lost modifications to database data such as projects, comments, user accounts, issues and snippets that took place between 17:20 and 00:00 UTC on 31 January. Our best estimate is that it affected roughly 5,000 projects, 5,000 comments and 700 new user accounts. Code repositories or wikis hosted on GitLab.com were unavailable during the outage, but were not affected by the data loss," GitLab reported.
The incident highlighted the importance of creating a comprehensive data management strategy, said Matthew Johnston, Vice President of Commvault in ASEAN & Korea, in an interview with CIO Asia.
To avoid a backup failure like GitLab's, Johnston advised companies to adopt regular, automated testing of recoveries "to ensure that all processes are working as expected prior to any unexpected outages. More regular backups are crucial as key applications need more than one recovery point a day."
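As a rough illustration of what an automated check might look like, the Python sketch below flags backup files that are missing, suspiciously small or too old. It is only a precondition check and not a full recovery test, which would also restore the backup and verify its contents. The function name `backup_problems` and its thresholds are hypothetical and are not taken from Commvault or GitLab.

```python
import os
import time
from typing import List, Optional


def backup_problems(
    path: str,
    max_age_seconds: float,
    min_bytes: int,
    now: Optional[float] = None,
) -> List[str]:
    """Return a list of human-readable problems found with a backup file.

    An empty list means the file exists, is at least min_bytes long and
    was modified within the last max_age_seconds.
    """
    if not os.path.isfile(path):
        return [f"{path}: backup file is missing"]

    problems: List[str] = []
    stat = os.stat(path)

    # A backup of only a few bytes usually means the dump itself failed.
    if stat.st_size < min_bytes:
        problems.append(
            f"{path}: only {stat.st_size} bytes, expected at least {min_bytes}"
        )

    # A stale file suggests the scheduled job has stopped running.
    current = time.time() if now is None else now
    age = current - stat.st_mtime
    if age > max_age_seconds:
        problems.append(
            f"{path}: last modified {age:.0f} seconds ago, limit is {max_age_seconds}"
        )
    return problems
```

A scheduled job could run this against the latest backup and raise an alert whenever the returned list is not empty.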
He added that "leveraging technologies, such as application-consistent snapshots and replication, to ensure that data is protected at regular intervals gives companies a better chance of recovery without data loss and in a short time, when failures occur."
Furthermore, Serguei Beloussov, CEO of Acronis, reminded organisations that while replication is good, they mustn't forget to back up their data. "Never put all your eggs in the same basket - at some point you will drop it and all the contents will be lost. So, while you can do replication for a faster RTO (recovery time objective), you also need an actual backup in a safe location."
In a nutshell, backup and recovery technologies are crucial in ensuring an organisation's continuous operation, especially if it relies heavily on data. However, as with many other technologies, backup and recovery tools have undergone significant changes over the years to adapt to the evolving needs of businesses.