BLOG: To reduce or dedupe, that is the big data question

Peter Eicher | June 6, 2013
Reducing data at the source is the smart way to do backup. But how should you do so?

Reducing data at the source is the smart way to do backup. That is the conclusion I came to in my last post, "If files were bricks, you'd change your backup strategy." But I also left off by saying "there are technologically different ways to do this, which have their own smart and dumb aspects." Let's take a look at them.

There are two common ways of reducing data at the host (as I mentioned last time, I am only considering traditional backup from servers, not disk-array snapshots). Since terminology can be used in different ways, I'll define the terms as I use them.

Data deduplication: A process that examines new data blocks using hashing, compares them to existing data blocks, and skips redundant blocks when data is transferred to the target.

Data reduction: A process that tracks block changes, usually using some kind of log or journal, and then transfers only new blocks to the backup target.
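To make the two definitions concrete, here is a minimal Python sketch of each approach. It is an illustration under assumptions, not any vendor's implementation: the function names, the tiny 4-byte block size, the use of SHA-256, and the idea of a journal as a set of changed block positions are all hypothetical choices made for readability.

```python
import hashlib

BLOCK_SIZE = 4  # assumed, deliberately tiny so the example is easy to follow


def split_blocks(data: bytes, size: int = BLOCK_SIZE) -> list[bytes]:
    """Split data into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def dedupe_transfer(blocks: list[bytes], target_index: set[str]) -> list[bytes]:
    """Deduplication: hash every block, send only blocks the target has not seen."""
    to_send = []
    for block in blocks:
        digest = hashlib.sha256(block).hexdigest()
        if digest not in target_index:
            target_index.add(digest)  # remember it so later copies are skipped
            to_send.append(block)
    return to_send


def journal_transfer(blocks: list[bytes], changed: set[int]) -> list[bytes]:
    """Data reduction: send the blocks the journal marked as changed, no hashing."""
    return [blocks[i] for i in sorted(changed)]


blocks = split_blocks(b"AAAABBBBAAAACCCC")
index: set[str] = set()

first_pass = dedupe_transfer(blocks, index)    # the second AAAA block is skipped
second_pass = dedupe_transfer(blocks, index)   # nothing is new, nothing is sent
journal_pass = journal_transfer(blocks, {2})   # only the block the journal flagged
```

Note that the journal version sends block 2 even though identical content already exists on the target. It knows that the block changed, not what the block contains, which is why the end results are similar but not the same.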

For the most efficient data backup, only changed data blocks should be moved.

The end results of the two processes are similar, but not the same, and there are major differences in how things work in the real world. Let's look at each in turn.

Deduplication is most effective at reducing the quantity of data that gets sent if the dedupe is universal, which it generally is for products that use it. Data reduction methods are limited to a single server because they rely on a local journal, and the journal doesn't know that a particular block of data may have already been sent from some other node. Realistically, most of this benefit comes in the initial backup set, when you can avoid sending the same operating system bits over and over. After that, there are far fewer differences.
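A rough way to see the initial-backup advantage is to count blocks sent under each model. The sketch below is a toy under assumptions: the server names and block contents are made up, and each "block" is just a labeled byte string standing in for real data.

```python
import hashlib


def blocks_sent_with_shared_index(servers: dict[str, list[bytes]]) -> int:
    """Initial backup where every server checks one global hash index."""
    seen = set()
    sent = 0
    for blocks in servers.values():
        for block in blocks:
            digest = hashlib.sha256(block).digest()
            if digest not in seen:  # first time any server has sent this content
                seen.add(digest)
                sent += 1
    return sent


def blocks_sent_with_local_journals(servers: dict[str, list[bytes]]) -> int:
    """Initial backup where each server only knows its own journal.

    On a first backup every block is new to the local journal, so all are sent.
    """
    return sum(len(blocks) for blocks in servers.values())


# Hypothetical servers that share the same operating system blocks.
servers = {
    "web01": [b"os-kernel", b"os-libs", b"web-content"],
    "db01": [b"os-kernel", b"os-libs", b"db-tables"],
}

shared = blocks_sent_with_shared_index(servers)   # 4: OS blocks go over once
local = blocks_sent_with_local_journals(servers)  # 6: OS blocks go over twice
```

With only two servers the saving is small, but the shared operating system blocks are sent once no matter how many servers are added, while the per-server journals resend them every time.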

The largest downside of deduplication is that it is computationally heavy. Hashing data uses significant system resources, which causes application impact and slower response times. This can be minimized by tracking which files change and hashing only those that have been updated, but that advantage disappears with systems that have high rates of change or large database files. For a database file, all it takes is a single transaction to make the entire file "new," meaning it has to be re-hashed. In fact, some vendors will even recommend you don't use their deduplication on databases because of the impact it creates.
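The file-tracking shortcut, and why a database defeats it, can be sketched in a few lines. This is an assumption-laden illustration: it treats a changed modification time as the signal that a file needs re-hashing, and the file names, timestamps, and sizes are invented.

```python
def bytes_to_hash(current: dict[str, tuple[float, int]],
                  previous: dict[str, float]) -> int:
    """Return how many bytes must be re-hashed on this backup run.

    current maps path -> (modification time, size in bytes);
    previous maps path -> modification time recorded at the last backup.
    A file is re-hashed in full if its modification time differs or it is new.
    """
    total = 0
    for path, (mtime, size) in current.items():
        if previous.get(path) != mtime:
            total += size
    return total


# Hypothetical state recorded at the last backup.
previous = {"report.doc": 100.0, "orders.db": 100.0}

# Since then, one transaction touched the 50 GB database; the document is unchanged.
current = {
    "report.doc": (100.0, 2_000_000),
    "orders.db": (250.0, 50_000_000_000),
}

work = bytes_to_hash(current, previous)  # the whole database file, nothing else
```

The unchanged document is skipped, but a single write to the database puts all 50 GB back in the hashing queue.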

That brings up another key thing to keep in mind about host deduplication. Vendors like to make claims about how much deduplication reduces backup time, but when you examine what they mean, they are referring only to the data transfer time. They conveniently leave out the data hashing time. If a backup takes five hours to hash the data set and then transmits the blocks in 30 minutes, vendors will say it was a "30-minute backup" when it was really a five-and-a-half-hour backup in terms of system impact. Watch out for this when you are evaluating deduplication products.
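The arithmetic behind that example is simple enough to check yourself. The snippet below just adds the two figures from the paragraph above; the variable names are mine.

```python
from datetime import timedelta

hashing_time = timedelta(hours=5)       # time the host spends hashing the data set
transfer_time = timedelta(minutes=30)   # the figure the vendor quotes

# System impact covers both phases, not just the transfer.
total_time = hashing_time + transfer_time
transfer_share = transfer_time / total_time  # fraction of the real backup window

print(total_time)               # 5:30:00
print(f"{transfer_share:.0%}")  # 9%
```

In this example the quoted "backup time" covers roughly 9 percent of the window during which the server is actually busy.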

