Virtualization Technology News and Information
FalconStor 2013 Predictions: Deduplication Proves All Data Is Not Equal

VMblog Predictions

Virtualization and Cloud executives share their predictions for 2013.  Read them in this series exclusive.

Contributed article by Paul Kruschwitz, director of product management, at FalconStor Software

Deduplication Proves All Data Is Not Equal

We have all heard the term "deduplication" over the years as it pertains to our company's data and storage systems. Simply put, deduplication is the elimination of redundant data in backup and storage systems. It matters because backup solutions create large amounts of duplicated data by sending several copies of the same data to the secondary storage tier. History has shown that the powerful cost savings and efficiency of deduplication, first experienced by large enterprises, have since been adopted by businesses of all sizes. Armed with this knowledge, we can make reasonable predictions about this market in 2013 and better appreciate why all data, and all deduplication methods, are not equal.

Yesterday: Deduplication creates backup efficiency

A popular technology for more than five years, deduplication was created to solve two main data protection and backup problems: the generation of multiple copies of data, and the inefficient storage of those copies. Before deduplication, companies backed up all their data to tape systems, including multiple copies of the same file. They could not afford to replicate this information to an off-site data center, because the duplicate copies consumed large amounts of expensive network bandwidth. The whole backup and data protection environment was an overwhelming, costly burden. Unsure of what else to do, companies bought more backup media and threw money at the problem. With the introduction of deduplication technology, companies were able to cost-efficiently back up information to disk, replicate it to off-site backup systems, and store data for longer periods of time.

Deduplication solutions use various methods to identify duplicate data. Compression algorithms work within a limited buffer, removing repeated sequences so the data in the buffer can be represented with less information. Deduplication solutions, by contrast, retain information (normally a hash value) about all of the data stored, not just the last full buffer, which results in a greater overall reduction in the amount of stored data. The unique data that remains is run through a separate compression algorithm before being written to disk, so the reported data-reduction ratio is the product of the deduplication and compression ratios.
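As a rough sketch of this idea (fixed-size blocks and SHA-256 hashes are illustrative choices, not any particular vendor's implementation), hash-based deduplication stores each unique block once, compresses it, and keeps an ordered list of hashes so the original stream can be rebuilt:

```python
import hashlib
import zlib

def dedupe_store(data: bytes, block_size: int = 4096):
    """Split data into fixed-size blocks, store each unique block
    compressed, and record the sequence of block hashes."""
    store = {}   # hash -> compressed unique block
    recipe = []  # ordered hashes needed to reconstruct the stream
    for i in range(0, len(data), block_size):
        block = data[i:i + block_size]
        digest = hashlib.sha256(block).hexdigest()
        if digest not in store:          # a duplicate block is stored only once
            store[digest] = zlib.compress(block)
        recipe.append(digest)
    return store, recipe

def restore(store, recipe):
    """Rebuild the original byte stream from the recipe."""
    return b"".join(zlib.decompress(store[h]) for h in recipe)

# Two "backups" that are exact copies of each other
backup = b"A" * 8192 + b"B" * 4096
store, recipe = dedupe_store(backup + backup)
assert restore(store, recipe) == backup + backup
print(len(recipe), len(store))  # 6 blocks referenced, only 2 unique blocks stored
```

The key difference from plain compression is visible here: the `store` dictionary remembers every block ever written, so the second backup adds no new storage at all, no matter how far apart the duplicates are.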

Standard compression algorithms are becoming more effective because the buffers they use have been growing. Similarly, the effectiveness of deduplication normally improves as the amount of data stored in the repository increases, which is why solutions that can scale a single repository deliver greater efficiency and value. Unfortunately, some vendor solutions are adversely affected as they grow, leading to performance concerns.

Today: Deduplication is a commodity

Data is growing at a rate of 50 to 60 percent annually, which increases the need for effective data protection and storage solutions. Today, deduplication is an integral part of a modern storage strategy; in fact, it has become a commodity. Deduplication is often viewed as a simple feature of the backup environment, but in reality it is a complicated process that demands resources, well-defined processes, and attention from the IT staff on how best to manage and protect data. Not all data deduplicates well, so data that requires client-side compression or encryption must be managed separately to make better use of deduplicated storage resources.

There has been much discussion recently about deduplication for primary storage versus secondary storage. Deduplication is best done on the secondary storage tier, where the process will not affect production performance. Additionally, companies must recognize that all data is not created equal: IT managers must be able to identify data types and assign policies governing how each category of data is treated. The three basic policy options that a deduplication solution should provide are:

  1. Inline deduplication is ideal for small storage configurations or environments where immediate replication is desired. It has the primary benefit of minimizing storage requirements and in some cases allows data to be deduped faster for quicker replication.

  2. Post-process deduplication is ideal when the goal is to back up the data as quickly as possible, since it occurs after the backup process is complete and can be scheduled for any point in time.  It also facilitates more efficient transfer to physical tape or frequent restore activities by postponing the deduplication activity.

  3. Concurrent deduplication is similar to post-process but starts as soon as the first set of records has been written and runs concurrently with the backup. It allows deduplication solutions to make full use of available processing power while minimizing the impact on the incoming data stream. Concurrent deduplication is normally best suited to larger multi-node clustered solutions, allowing full use of all available computing resources.

Companies also can choose not to utilize deduplication for data that does not dedupe well or is to be exported to physical tape. This data includes image data, pre-compressed data or encrypted data. Turning off deduplication for these files allows companies to better utilize their deduplicated storage resources and reduce overall costs.
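The policy-driven approach described above can be pictured as a simple lookup from data type to treatment. The table and data types below are hypothetical, chosen only to illustrate mapping the three policies plus a pass-through for data that does not dedupe well:

```python
# Hypothetical policy table: data type -> dedupe treatment.
# "none" covers pre-compressed, encrypted, or tape-bound data
# that should bypass deduplication entirely.
POLICIES = {
    "database_dump":  "inline",        # small config, immediate replication
    "file_server":    "post-process",  # fastest possible backup window
    "vm_images":      "concurrent",    # large clustered repository
    "jpeg_archive":   "none",          # pre-compressed: skip dedupe
    "encrypted_docs": "none",          # encrypted: skip dedupe
}

def policy_for(data_type: str) -> str:
    """Return the dedupe policy for a data type, defaulting to
    post-process when no explicit policy has been assigned."""
    return POLICIES.get(data_type, "post-process")

print(policy_for("jpeg_archive"))  # none
print(policy_for("mail_store"))    # post-process (default)
```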

Tomorrow: Intelligent global integration

Deduplication has evolved significantly over the past five years, moving from an enterprise-only solution into the small and midsize markets. It is now a standard feature of many backup solutions, but companies must be wary of all-in-one storage or backup software offerings. Not all deduplication solutions are equal: deduplication is not something that can simply be bolted onto an appliance or into software, as such solutions will be limited in performance, scalability and reliability.

This has been proven by the history and development of deduplication technologies. Some vendors originally architected deduplication solutions for enterprise infrastructures: large amounts of data in rich, heterogeneous environments. These first solutions offered high availability, failover for data protection, scalability and large data repositories. Other vendors chose to build smaller-scale, appliance-based solutions with limited performance, scalability and features. Marketed as replacements for physical tape, they downplayed the value of integrating with tape, and they lacked the high-availability features required to ensure the solution was always available and adequately protected.

Even though some vendors have claimed for years that "tape is dead," that is far from true. Tape has an important role in the archival of data. Many companies must adhere to data retention requirements resulting from regulation or ongoing litigation, and the most cost-effective way to comply is to export data to lower-cost tape. Deduplication solutions must therefore provide a method for companies to continue using tape for archival purposes.

Larger customers often have multiple data centers and/or remote offices, and mergers and acquisitions often lead to consolidation of data protection and disaster recovery resources. Deduplication with replication can greatly simplify this effort and reduce its cost. For instance, one medical company has more than 120 hospitals that back up locally and send data to one disaster recovery data center. When a common block of data is sent to the central location by the first site, that block is not transferred again from the other 119 sites that hold the same data. The result is a large savings in the bandwidth required to move data from all sites. A single copy of the common block exists at the DR site rather than 120 copies, greatly reducing storage costs. Physical tape resources can be integrated at the DR site to archive long-retention data to lower-cost tape media, allowing the customer to realize a greater return on investment.
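The scale of those savings is easy to estimate. The figures below are hypothetical, chosen only to show the arithmetic for a 120-site scenario like the one described above:

```python
# Illustrative arithmetic for a 120-site deployment with global
# deduplication at the DR site (all GB figures are hypothetical).
sites = 120
common_gb_per_site = 500   # data every site has in common
unique_gb_per_site = 50    # data unique to each site

# Without dedupe, every site ships and stores its full data set.
without_dedupe = sites * (common_gb_per_site + unique_gb_per_site)

# With global dedupe, the common data crosses the wire and lands
# on disk only once; each site still contributes its unique data.
with_dedupe = common_gb_per_site + sites * unique_gb_per_site

print(without_dedupe)                      # 66000 GB
print(with_dedupe)                         # 6500 GB
print(round(without_dedupe / with_dedupe)) # roughly a 10x reduction
```

The more data the sites share, the closer the savings approach a factor equal to the number of sites, which is why this kind of consolidation pays off most for standardized environments.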

Looking ahead, companies need intelligent deduplication solutions that allow them to ensure that data is properly stored and protected. This intelligence may come from dynamically analyzing the data and automatically assigning the appropriate dedupe policies as well as automatically integrating physical tape as an additional transparent tier of storage. Companies looking to upgrade or install backup and data protection solutions in 2013 should seek out global, intelligent deduplication solutions that provide flexibility, scalability, performance and high availability of the data.


About the Author

Paul Kruschwitz is the director of product management at FalconStor Software. He has more than 20 years of experience in technology, with a specific focus on data protection and deduplication technologies.

Published Friday, December 07, 2012 6:25 AM by David Marshall