Data Deduplication

Vangie Beal
Last Updated May 24, 2021 7:39 am

Data deduplication is a technique used to reduce the amount of storage space an organization needs to save its data. In most organizations, the storage systems contain duplicate copies of many pieces of data. For example, the same file may be saved in several different places by different users, or two or more files that aren’t identical may still include much of the same data.

Deduplication eliminates these extra copies by saving just one copy of the data and replacing the other copies with pointers that lead back to the original copy. Companies frequently use deduplication in backup and disaster recovery applications, but it can be used to free up space in primary storage as well.

Deduplication at the File or Block Level

In its simplest form, deduplication takes place on the file level; that is, it eliminates duplicate copies of the same file. This kind of deduplication is sometimes called file-level deduplication or single instance storage (SIS). Deduplication can also take place on the block level, eliminating duplicated blocks of data that occur in non-identical files.
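The idea of keeping one copy and pointing the duplicates back to it can be sketched in a few lines of Python. This is only an illustration, not how any particular product works: the `dedupe_files` function, the in-memory dictionaries and the sample file names are all hypothetical, and a SHA-256 digest is assumed as the way to recognize identical files.

```python
import hashlib

def dedupe_files(files):
    """Hypothetical file-level deduplication over a dict of path -> bytes.

    Returns (store, pointers): `store` holds one copy per unique content,
    `pointers` maps every path to the digest of its content.
    """
    store = {}      # digest -> the single stored copy
    pointers = {}   # path -> digest (the "pointer" to the stored copy)
    for path, content in files.items():
        digest = hashlib.sha256(content).hexdigest()
        store.setdefault(digest, content)   # keep only the first copy
        pointers[path] = digest
    return store, pointers

# Made-up sample data: two users saved the same report.
files = {
    "alice/report.doc": b"quarterly numbers",
    "bob/report-copy.doc": b"quarterly numbers",
    "carol/notes.txt": b"meeting notes",
}
store, pointers = dedupe_files(files)

# Reading a file back means following its pointer into the store.
restored = store[pointers["bob/report-copy.doc"]]
```

Three paths remain visible to users, but only two pieces of content are stored.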

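Block-level deduplication typically frees up more space than SIS because it also catches data repeated across files that are not identical. A particular type known as variable block or variable length deduplication, which sets block boundaries according to the data’s content rather than at fixed offsets, has become very popular. Often the phrase data deduplication is used as a synonym for block-level or variable length deduplication.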
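One way to see why variable length blocks help is to compare them with fixed-size blocks when a single byte is inserted at the start of a file. The sketch below is a toy under stated assumptions: `fixed_chunks` and `variable_chunks` are hypothetical helpers, and the boundary rule (cut wherever the sum of the last four bytes is divisible by 5) stands in for the rolling hash a real product would more likely use.

```python
def fixed_chunks(data, size=8):
    """Split data into blocks of a fixed size."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def variable_chunks(data, window=4, divisor=5):
    """Toy content-defined chunking (assumed rule, not a vendor algorithm).

    A block ends after byte i whenever the sum of the `window` bytes
    ending at i is divisible by `divisor`, so boundaries depend only on
    nearby content, not on the byte's offset in the file.
    """
    chunks, start = [], 0
    for i in range(window - 1, len(data)):
        if sum(data[i - window + 1:i + 1]) % divisor == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])   # trailing block with no boundary
    return chunks

original = bytes(range(64))
modified = b"\xff" + original        # one byte inserted at the front

# Blocks that both versions have in common can be stored once.
shared_fixed = set(fixed_chunks(original)) & set(fixed_chunks(modified))
shared_variable = set(variable_chunks(original)) & set(variable_chunks(modified))
```

With fixed blocks, the inserted byte shifts every later block, so the two versions share nothing. With content-defined boundaries, only the first block changes and the rest still match. Real implementations also enforce minimum and maximum block sizes, which this sketch omits.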

Benefits of Data Deduplication

The primary benefit of data deduplication is that it reduces the amount of disk or tape that organizations need to buy, which in turn reduces costs. NetApp reports that in some cases, deduplication can reduce storage requirements by up to 95 percent, but the type of data you’re trying to deduplicate and the amount of file sharing your organization does will influence your own deduplication ratio. While deduplication can be applied to data stored on tape, the relatively high costs of disk storage make deduplication a very popular option for disk-based systems. Eliminating extra copies of data saves money not only on direct disk hardware costs, but also on related costs such as electricity, cooling, maintenance and floor space.

Deduplication can also reduce the amount of network bandwidth required for backup processes, and in some cases, it can speed up the backup and recovery process.

Deduplication vs. Compression

Deduplication is sometimes confused with compression, another technique for reducing storage requirements. While deduplication eliminates redundant copies of data, compression uses algorithms to encode each file more concisely. Some compression is lossless, meaning that no data is lost in the process, but “lossy” compression, which is frequently used with audio and video files, actually deletes some of the less-important data included in a file in order to save space. By contrast, deduplication only eliminates extra copies of data; none of the original data is lost. Also, compression on its own doesn’t get rid of duplicate copies: the storage system could still contain multiple copies of the same compressed file.
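The difference can be shown with Python’s standard `zlib` and `hashlib` modules. In this rough sketch, the file names and the repeated sample payload are made up, and lossless zlib compression stands in for whatever compressor a storage system actually uses.

```python
import hashlib
import zlib

# Made-up sample: two users hold identical copies of the same file.
payload = b"status report " * 200
files = {"a.txt": payload, "b.txt": payload}

# Compression alone: each file shrinks, but both copies are still stored.
compressed = {name: zlib.compress(data) for name, data in files.items()}
compressed_total = sum(len(blob) for blob in compressed.values())

# Deduplication first, then compression: identical content is stored once.
unique = {hashlib.sha256(data).hexdigest(): zlib.compress(data)
          for data in files.values()}
dedup_total = sum(len(blob) for blob in unique.values())

# Lossless compression returns exactly the original bytes.
roundtrip = zlib.decompress(compressed["a.txt"])
```

Here `compressed` still holds two entries while `unique` holds one, so the combined approach needs half the space of compression alone for this sample.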

Deduplication often has a larger impact on backup file size than compression. In a typical enterprise backup situation, compression may reduce backup size by a ratio of 2:1 or 3:1, while deduplication can reduce backup size by up to 25:1, depending on how much duplicate data is in the systems. Often enterprises utilize deduplication and compression together in order to maximize their savings.
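Ratios such as 25:1 and percentages such as the 95 percent figure in the benefits section describe the same thing in two ways, and converting between them is simple arithmetic. The helper names below are hypothetical. `combined_ratio` assumes, as an idealization, that compression shrinks the data left after deduplication by its full quoted ratio, which real data may not achieve.

```python
def ratio_to_savings(ratio):
    """Fraction of space saved for a reduction ratio of `ratio`:1."""
    return 1 - 1 / ratio

def savings_to_ratio(savings):
    """Reduction ratio (as N in N:1) for a fractional saving such as 0.95."""
    return 1 / (1 - savings)

def combined_ratio(dedup_ratio, compression_ratio):
    """Idealized combined ratio, assuming the two effects simply multiply."""
    return dedup_ratio * compression_ratio

dedup_savings = ratio_to_savings(25)     # 25:1 means 96% less space
compress_savings = ratio_to_savings(2)   # 2:1 means 50% less space
ratio_for_95 = savings_to_ratio(0.95)    # 95% saved is roughly 20:1
best_case = combined_ratio(25, 3)        # 75:1 under the stated assumption
```

The jump from 2:1 to 25:1 is smaller in percentage terms than it looks (50 percent versus 96 percent saved), because each additional point of savings near 100 percent requires a much higher ratio.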

Implementing Data Deduplication

The process for implementing data deduplication technology varies widely depending on the type of product and the vendor. For example, if deduplication technology is included in a backup appliance or storage solution, the implementation process will be quite different from that for standalone deduplication software.

In general, deduplication technology can be deployed in one of two basic ways: at the source or at the target. In source deduplication, data copies are eliminated in primary storage before the data is sent to the backup system. The advantage of source deduplication is that it reduces the bandwidth requirements and time necessary for backing up data. On the downside, source deduplication consumes more processor resources, and it can be difficult to integrate with existing systems and applications.
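The bandwidth saving from source deduplication can be illustrated with a small model in which the source hashes its blocks, asks the backup system which digests it lacks, and sends only those blocks. Everything here is an assumption made for illustration: the `BackupTarget` class, its `missing` and `store` methods, and `source_side_backup` are hypothetical, and no real backup protocol is implied.

```python
import hashlib

class BackupTarget:
    """Hypothetical backup system that stores blocks by digest."""

    def __init__(self):
        self.blocks = {}

    def missing(self, digests):
        """Return the digests the target does not hold yet."""
        return [d for d in digests if d not in self.blocks]

    def store(self, digest, block):
        self.blocks[digest] = block

def source_side_backup(blocks, target):
    """Send only blocks the target lacks; return the bytes transferred."""
    # Hashing happens on the source, which is where the extra CPU cost falls.
    by_digest = {hashlib.sha256(b).hexdigest(): b for b in blocks}
    needed = target.missing(list(by_digest))
    for digest in needed:
        target.store(digest, by_digest[digest])
    return sum(len(by_digest[d]) for d in needed)

target = BackupTarget()
first_run = source_side_backup([b"aaaa", b"bbbb", b"aaaa"], target)
second_run = source_side_backup([b"aaaa", b"cccc"], target)
```

The first run transfers two unique blocks out of three, and the second run transfers only the one block the target has not seen before.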

By contrast, target deduplication takes place within the backup system and is often much easier to deploy. Target deduplication comes in two types: in-line and post-process. In-line deduplication takes place before the backup copy is written to disk or tape. The benefit of in-line deduplication is that it requires less storage space than post-process deduplication, but it can slow down the backup process. Post-process deduplication takes place after the backup has been written, so it requires that organizations have a great deal of storage space available for the original backup. However, because the deduplication work is deferred, the backup itself usually completes faster with post-process deduplication than with in-line deduplication.
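The storage trade-off between the two target-side approaches can be sketched as follows. The function names are hypothetical, the model tracks only peak space used and says nothing about speed, and the post-process figure counts just the raw landing area, which if anything understates its real peak.

```python
import hashlib

def digest(block):
    return hashlib.sha256(block).hexdigest()

def inline_backup(blocks):
    """Deduplicate each block before writing; return (store, peak_bytes)."""
    store, used, peak = {}, 0, 0
    for block in blocks:
        key = digest(block)
        if key not in store:          # duplicates are never written
            store[key] = block
            used += len(block)
            peak = max(peak, used)
    return store, peak

def post_process_backup(blocks):
    """Write the full backup first, then deduplicate it afterwards."""
    landing = list(blocks)                     # raw backup written as-is
    peak = sum(len(b) for b in landing)        # space for the full copy
    store = {digest(b): b for b in landing}    # deduplicated later
    return store, peak

backup = [b"aaaa", b"bbbb", b"aaaa", b"aaaa"]
inline_store, inline_peak = inline_backup(backup)
post_store, post_peak = post_process_backup(backup)
```

Both approaches end with the same deduplicated store, but for this sample the post-process run needs room for all four blocks at its peak, while the in-line run never holds more than the two unique ones.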

Deduplication Technology

Data deduplication is a highly proprietary technology. Deduplication methods vary widely from vendor to vendor, and many of those methods are patented. For example, Microsoft has a patent on single instance storage. In addition, Quantum owns a patent on variable length deduplication. Many other vendors also own patents related to deduplication technology.