What is Data Deduplication?

wpirobotbuilder

What is Deduplication?

Put simply, it is a data reduction technique that shrinks the amount of space your data takes up. It works by comparing newly written data to data already on the disk. If an identical chunk already exists, that chunk isn't written again; a reference to the existing chunk is stored instead. The reference is tiny compared to the chunk itself, so overall storage efficiency improves because data that already exists is never written twice (i.e. never duplicated).
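
To make the idea concrete, here's a toy sketch in Python (my own illustration, not any real filesystem's code). It splits writes into fixed-size chunks, fingerprints each chunk with SHA-256, and stores a chunk only the first time that fingerprint is seen; every repeat costs only a small reference:

import hashlib

CHUNK_SIZE = 4096  # assumed fixed chunk size; real systems pick their own

class DedupStore:
    def __init__(self):
        self.chunks = {}  # fingerprint -> chunk bytes (each unique chunk stored once)
        self.files = {}   # filename -> list of fingerprints (the references)

    def write(self, name, data):
        refs = []
        for i in range(0, len(data), CHUNK_SIZE):
            chunk = data[i:i + CHUNK_SIZE]
            fp = hashlib.sha256(chunk).hexdigest()
            if fp not in self.chunks:   # only brand-new chunks consume space
                self.chunks[fp] = chunk
            refs.append(fp)             # duplicates cost only a tiny reference
        self.files[name] = refs

    def read(self, name):
        return b"".join(self.chunks[fp] for fp in self.files[name])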

For instance: If I write two identical 1GB files to a disk with deduplication enabled, in theory I should only have used 1GB worth of space. This is a very useful feature, particularly when you know that lots of duplicate copies of files will be stored (like on e-mail servers, where the same messages and attachments get saved over and over).
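
Continuing the toy sketch above, writing the same content twice only stores one copy of its chunks (the file sizes here are scaled way down from the 1GB example):

import os

store = DedupStore()
payload = os.urandom(10 * CHUNK_SIZE)        # stand-in for one large file
store.write("copy1.bin", payload)
store.write("copy2.bin", payload)            # identical second copy

print(len(store.files))                                 # 2 files visible to the user
print(sum(len(c) for c in store.chunks.values()))       # but only one file's worth of chunk data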

It's supported by a lot of enterprise storage equipment and by ZFS, and there are probably other implementations I don't know about.

However, there are some problems:

  • The implementation of dedup means that references to chunks of data must be stored in something called a dedup table, and this table takes up a lot of memory. For example, ZFS recommends a minimum of 4GB of additional memory for every TB of storage to be deduplicated, and there is no upper limit: the more deduplicatable data you store, the more memory you'll need. (A rough worked example follows this list.)
  • By definition, dedup stores files in a different on-disk format than the original. This makes data recovery from disks that held deduplicated data much more complex.
  • Also by definition, deduplication fragments your volume. Let's say, for instance, that the chunks needed to reconstruct a file are spread across many sectors in many different locations on a hard drive. To read the whole file back, the drive has to move the read head to each of those sectors. On mechanical hard drives this reduces performance to some degree; on SSDs much less so. Some enterprise equipment offers a technique called data rehydration, which defragments a file when it is read so that future reads go much faster.
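
To get a feel for the memory cost mentioned above, here's a back-of-the-envelope calculation. The ~320 bytes per dedup-table entry and the 128K record size are assumptions pulled from commonly quoted ZFS guidance, not figures from this post:

DDT_ENTRY_BYTES = 320            # rough, commonly quoted per-entry cost (assumption)
RECORD_SIZE = 128 * 1024         # ZFS default recordsize (assumption)

def ddt_ram_gib(unique_data_tib):
    unique_blocks = unique_data_tib * (1024 ** 4) / RECORD_SIZE
    return unique_blocks * DDT_ENTRY_BYTES / (1024 ** 3)

print(ddt_ram_gib(1))    # 2.5 GiB of table for 1 TiB of unique 128K blocks
print(ddt_ram_gib(10))   # 25.0 GiB -- it grows linearly with the unique data

Smaller block sizes multiply the number of entries, which is why rules of thumb like the 4GB-per-TB figure above err on the high side.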

So, when should I use deduplication?

  • Most people shouldn't use it at all. The space savings are minimal unless you know for sure that a lot of your data will be reduced this way, and the memory requirements are very high. Use compression instead unless you know deduplication will pay off for you.
  • If you have a lot of (mostly) identical virtual machines, this can save you a ton of space. A hundred Linux installations on a volume might need a terabyte or more of storage; deduplication could cut that down to the order of tens of gigabytes (a back-of-the-envelope estimate follows this list). This also makes SSD-based storage more practical, because the performance degradation due to fragmentation matters far less on SSDs. You will still need lots of memory, though, depending on how many VMs you have.
  • If your dedup implementation runs on a schedule (post-process rather than inline), use it if the space savings are worth the performance hit from fragmentation. If your backend storage is flash, there won't be nearly as much of a performance hit, and the space savings will help you get the most out of that flash.
  • If you are archiving data. It's less important to have archival storage be fast, so deduplication might be worth it here. Or you could use tape storage.
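
Here's the back-of-the-envelope estimate promised above for the mostly-identical-VMs case; every number is made up purely for illustration:

vm_count = 100
base_image_gb = 10        # OS data shared by every clone of the same image
unique_per_vm_gb = 0.5    # logs, configs, swap -- data unique to each VM

without_dedup = vm_count * (base_image_gb + unique_per_vm_gb)
with_dedup = base_image_gb + vm_count * unique_per_vm_gb

print(without_dedup)      # 1050.0 GB stored naively
print(with_dedup)         # 60.0 GB once the shared base is stored only once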

If you run an e-mail server like the one mentioned above in a medium-sized business, you could find a use for deduplication.

Or if you absolutely know that much of your data is going to be duplicated, you could find a use for it.

Here is an example of the space savings for virtual machines. Most VMs are Windows Server 2008 R2, but there are a couple of Linux machines too. All of these VMs are active.

[Screenshot: datastore summary showing provisioned capacity, stored data, and actual disk usage]

There is 952GB of provisioned storage (that the VMs could use if their virtual disks were full), but only about 525GB of actual data is stored on the datastore (each VM's virtual disk is a little over half full).

The actual disk space used is 25GB. That's 25GB versus 525GB if there were no deduplication: less than 5% of the total data stored. The storage backend is an all-flash SAN, and these systems boot up in less than 10 seconds.
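
Plugging those numbers into a quick calculation shows the reduction ratio:

logical_gb = 525     # data the VMs have actually written
physical_gb = 25     # space consumed on the datastore after deduplication

print(logical_gb / physical_gb)          # 21.0x reduction
print(physical_gb / logical_gb * 100)    # ~4.8% of the logical data actually stored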

That's the power of deduplication.
