Myth #5: Writing backup/recovery scripts for big data is easy. Writing scripts can work if you have engineering resources, a small amount of data and just one big data platform. Most organizations, however, have tens to hundreds of terabytes of big data spread over multiple big data platforms. It is not easy to write, test and maintain scripts for these types of environments. A separate script has to be written for each platform that is being backed up (e.g. one script for Hadoop, another for Cassandra, etc.). Scripts have to be tested at scale and retested as platform versions change (e.g. an upgrade from Cassandra 2.1 to 2.2). In some cases, scripts may also have to be updated periodically to support new platform features, new APIs, new data types, etc.
Most organizations do not realize that writing good backup scripts for big data platforms carries significant hidden costs and requires real expertise. The recovery process is much harder and more error-prone, since it involves locating the right backup copies, copying the data back to the appropriate nodes, and applying platform-specific recovery procedures to recover the data.
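To give a sense of what even the first recovery step involves, here is a minimal Python sketch of "locating the right backup copies." It assumes a hypothetical backup catalog in which every copy records its source node, completion time and storage path; the `BackupCopy` type, the `pick_restore_set` function and the paths are illustrative and do not belong to any particular platform or product.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class BackupCopy:
    node: str            # node the copy was taken from (hypothetical naming)
    taken_at: datetime   # when the backup completed
    path: str            # where the copy is stored (hypothetical layout)


def pick_restore_set(catalog, nodes, recover_to):
    """For each node, pick the newest backup taken at or before recover_to.

    Raises LookupError if any node has no usable copy, because restoring
    only some of the nodes would leave the cluster in a mixed state.
    """
    chosen = {}
    for node in nodes:
        candidates = [
            b for b in catalog
            if b.node == node and b.taken_at <= recover_to
        ]
        if not candidates:
            raise LookupError(f"no backup for {node} at or before {recover_to}")
        chosen[node] = max(candidates, key=lambda b: b.taken_at)
    return chosen


# Example: node2 missed its backup on January 2, so it falls back to January 1.
catalog = [
    BackupCopy("node1", datetime(2024, 1, 1, 2, 0), "/backups/node1/2024-01-01"),
    BackupCopy("node1", datetime(2024, 1, 2, 2, 0), "/backups/node1/2024-01-02"),
    BackupCopy("node1", datetime(2024, 1, 3, 2, 0), "/backups/node1/2024-01-03"),
    BackupCopy("node2", datetime(2024, 1, 1, 2, 0), "/backups/node2/2024-01-01"),
    BackupCopy("node2", datetime(2024, 1, 3, 2, 0), "/backups/node2/2024-01-03"),
]
recover_to = datetime(2024, 1, 2, 12, 0)
restore_set = pick_restore_set(catalog, ["node1", "node2"], recover_to)
for node, copy in restore_set.items():
    print(node, copy.path)
```

Even this small example surfaces a question a script author has to answer: the two nodes end up restored from backups taken a day apart, and whether that is acceptable depends on the platform's own recovery procedures.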
Myth #6: Big data backup/recovery operational costs are very small. Beyond the cost of periodically maintaining and testing scripts, there are other costs associated with backup and recovery. These include:
- People cost: someone responsible for running scripts, ensuring backups are successful, debugging when needed, performing ad hoc recoveries, etc.
- Storage cost: the spend needed to store backups
- Downtime cost: the time it takes the admin to locate the backup copies and restore the data to the desired state
These costs can add up significantly, especially as the big data environment gets bigger and more complex.
Myth #7: Snapshots are an effective backup mechanism for big data. Snapshots (the state of data frozen at a particular point in time) are sometimes used as a backup copy to protect against user errors or application corruption. There are a few considerations when using platform or storage snapshots for backup.
First, snapshots can be used to automate the backup process. However, when using storage snapshots, extra manual steps are needed to ensure consistency of the backup data and metadata. Second, snapshots are efficient only when the data is not changing rapidly. With big data platforms the rate of data change is high, and techniques such as compaction only add to it. As a result, snapshots require significant storage overhead (as much as 50%) to keep a few point-in-time copies.
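A back-of-the-envelope model shows how a high change rate turns into storage overhead. The Python sketch below assumes a deliberately simple model: the live data set stays the same size, and between consecutive snapshots a fixed fraction of it is rewritten (by updates or compaction), so each retained snapshot pins that fraction as old data that would otherwise have been freed. The function name and the numbers are illustrative assumptions, not measurements from any platform.

```python
def snapshot_overhead_tb(live_tb, rewritten_fraction, snapshots_kept):
    """Estimate the extra storage held by retained snapshots.

    Simplified model (an assumption, not a platform guarantee): each
    retained snapshot pins the data that was rewritten before the next
    snapshot was taken, i.e. rewritten_fraction * live_tb.
    """
    if not 0.0 <= rewritten_fraction <= 1.0:
        raise ValueError("rewritten_fraction must be between 0 and 1")
    if snapshots_kept < 0:
        raise ValueError("snapshots_kept must not be negative")
    return live_tb * rewritten_fraction * snapshots_kept


# Illustrative numbers: 100 TB of live data, 10% rewritten between
# snapshots, five snapshots retained.
live_tb = 100
overhead = snapshot_overhead_tb(live_tb, 0.10, 5)
print(f"extra storage: {overhead:.0f} TB ({overhead / live_tb:.0%} of live data)")
```

Under these assumed numbers, five retained snapshots hold about half as much storage again as the live data, which is the order of overhead described above. A higher rewrite rate or a longer retention period pushes the figure up proportionally.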
Finally, recovery from snapshots is a very tedious and manual process. The admin or DBA has to identify the snapshot files that correspond to the data that needs to be restored (e.g. a keyspace or table) and restore them from the snapshot to the respective nodes in the cluster. Any mistake during the restore process can result in permanent data loss.
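As an illustration of the identification step, the following Python sketch lists the files that belong to one table's snapshot on a single node. It assumes a Cassandra-style on-disk layout of `<data_dir>/<keyspace>/<table>-<id>/snapshots/<tag>/`; the function name and its parameters are hypothetical, and the layout should be checked against the platform version in use. The same lookup would have to be repeated on every node in the cluster.

```python
from pathlib import Path


def find_snapshot_files(data_dir, keyspace, table, tag):
    """Return the files of one table's snapshot on this node, sorted by path.

    Assumes the layout <data_dir>/<keyspace>/<table>-<id>/snapshots/<tag>/.
    Returns an empty list if the keyspace, table or snapshot tag is missing.
    """
    keyspace_dir = Path(data_dir) / keyspace
    matches = []
    # A table directory is named "<table>-<id>", so match on that prefix.
    for table_dir in sorted(keyspace_dir.glob(f"{table}-*")):
        snapshot_dir = table_dir / "snapshots" / tag
        if snapshot_dir.is_dir():
            matches.extend(sorted(p for p in snapshot_dir.iterdir() if p.is_file()))
    return matches
```

Finding the files is only the start: they still have to be copied back to the correct node and loaded using the platform's own procedure, which is where most of the manual effort and risk lies.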