
Virtualization and Cloud executives share their predictions for 2013. Read them in this VMblog.com series exclusive.
Contributed article by Paul Kruschwitz, director of product management at FalconStor Software
Deduplication Proves All Data Is Not Equal
We have all heard the term "deduplication" over the years as it
pertains to our company's data and storage systems. Simply put, deduplication
is the elimination of redundant data in backup and storage systems. It is
important because backup solutions create large amounts of duplicated data by
sending several copies of the same data to the secondary storage tier. History has shown that the cost savings and efficiency deduplication first delivered to large enterprises have since been adopted by businesses of all sizes. With that history in mind, we can predict where this market is headed in 2013 and appreciate why all data, and all deduplication methods, are not equal.
Yesterday: Deduplication creates backup efficiency
A popular technology for more than five years, deduplication was
created to solve two main data protection and backup problems related to the
generation of multiple copies of data and its inefficient storage. Before
deduplication, companies were backing up all the data to tape systems, which
included multiple copies of the same file. Companies could not afford to
replicate this information to an off-site data center due to the multiple file
copies requiring large amounts of expensive network bandwidth. The whole backup
and data protection environment was an overwhelming, costly burden. Unsure of what
else to do, companies bought more backup media and threw lots of money at the
problem. With the introduction of deduplication technology, companies could cost-efficiently back up information to disk, replicate it to off-site systems, and retain data for longer periods of time.
Deduplication solutions use various methods to identify duplicate data. Compression algorithms work within a limited buffer, removing redundancy only inside that window and representing the buffered data with fewer bits. Deduplication solutions, by contrast, retain information (normally a hash value) about all of the data stored, not just the last full buffer, which yields a much greater overall reduction in the amount of stored data. The unique data that remains is run through a separate compression algorithm before it is written to disk, so overall data reduction is typically reported as the product of the deduplication and compression ratios.
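To make that distinction concrete, here is a minimal Python sketch of hash-indexed deduplication followed by compression of the unique chunks. It illustrates the general approach described above, not any vendor's implementation; the fixed 4 KB chunk size, SHA-256 fingerprints and zlib compression are assumptions chosen for brevity.

import hashlib
import zlib

CHUNK_SIZE = 4096  # fixed-size chunks; real products often use variable-size chunking

def dedupe_and_compress(data: bytes, store: dict) -> tuple:
    """Keep only unique chunks (identified by hash), compress them,
    and return (bytes_received, bytes_written)."""
    bytes_written = 0
    for offset in range(0, len(data), CHUNK_SIZE):
        chunk = data[offset:offset + CHUNK_SIZE]
        fingerprint = hashlib.sha256(chunk).hexdigest()  # hash retained for every chunk in the repository
        if fingerprint not in store:                     # only unique data reaches disk
            store[fingerprint] = zlib.compress(chunk)    # separate compression pass on the unique data
            bytes_written += len(store[fingerprint])
    return len(data), bytes_written

# Two "backups" of the same payload deduplicate against one repository.
repository = {}
payload = b"the same business data " * 50_000
received = written = 0
for _ in range(2):
    r, w = dedupe_and_compress(payload, repository)
    received, written = received + r, written + w
print(f"overall reduction: {received / written:.1f}:1")

Because the fingerprint index covers everything already in the repository, the second pass over identical data writes nothing new, and the reported reduction reflects both stages combined.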
Standard compression algorithm effectiveness is improving because the
size of the buffers used has been growing. Similarly, the effectiveness of
deduplication normally improves as the size of the data stored in the
repository increases. This is why solutions that scale within a single repository deliver greater efficiency and value. Unfortunately, some vendors' solutions degrade as the repository grows, leading to performance concerns.
Today: Deduplication is a commodity
Data is growing at the rate of 50 to 60 percent annually, which increases
the need for effective data protection and storage solutions. Today,
deduplication is an integral aspect of a modern storage strategy. In fact, deduplication
has become a commodity. It is often viewed as a simple feature of the backup environment, but in reality it is a complicated process that demands resources, processes and attention from the IT staff to manage and protect data well. Not all data deduplicates well, so data that requires client-side compression or encryption must be managed separately to make better use of deduplicated storage resources.
Now, there has been a lot of discussion around deduplication for
primary storage versus secondary storage. Deduplication is best done on the secondary storage tier, where the process does not affect production performance.
Additionally, companies must recognize that all data is not created
equal. IT managers must be able to identify data types and assign policies on
how these various categories of data will be treated. The three basic policy
options that a deduplication solution should provide are:
- Inline deduplication is ideal for small storage
configurations or environments where immediate replication is desired. It has
the primary benefit of minimizing storage requirements and in some cases allows
data to be deduped faster for quicker replication.
- Post-process deduplication is ideal when the
goal is to back up the data as quickly as possible, since it occurs after the
backup process is complete and can be scheduled for any point in time. It also facilitates more efficient transfer
to physical tape or frequent restore activities by postponing the deduplication
activity.
- Concurrent deduplication is similar to
post-processing but starts as soon as the first set of records has been written and runs concurrently with the backup. It allows deduplication solutions to make
full use of available processing power while minimizing the impact to the
incoming data stream. Concurrent deduplication is normally best suited for the
larger multi-node clustered solutions, allowing full use of all available computing
resources.
Companies can also choose not to use deduplication for data that does not dedupe well or that is destined for physical tape, such as image data, pre-compressed data or encrypted data. Turning off deduplication for these files lets companies make better use of their deduplicated storage resources and reduce overall costs.
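As a rough illustration of how such policies might be assigned per data category, the following Python sketch maps hypothetical data types to the three options above plus a "no deduplication" setting. The category names and the policy table are invented for the example and are not part of any specific product.

from enum import Enum

class DedupePolicy(Enum):
    INLINE = "inline"              # small configurations, immediate replication
    POST_PROCESS = "post-process"  # fastest backup window, dedup scheduled later
    CONCURRENT = "concurrent"      # large clustered systems, dedup alongside backup
    NONE = "none"                  # data that does not dedupe well

# Hypothetical policy table keyed by data category.
POLICY_TABLE = {
    "remote-office-backup": DedupePolicy.INLINE,        # small footprint, replicate quickly
    "database-full-backup": DedupePolicy.POST_PROCESS,  # shortest backup window, then dedupe
    "clustered-file-backup": DedupePolicy.CONCURRENT,   # spread the work across cluster nodes
    "encrypted-or-image-data": DedupePolicy.NONE,       # pre-compressed or encrypted data skips dedup
}

def policy_for(category: str) -> DedupePolicy:
    """Return the configured policy, defaulting to post-process."""
    return POLICY_TABLE.get(category, DedupePolicy.POST_PROCESS)

print(policy_for("encrypted-or-image-data").value)  # -> "none"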
Tomorrow: Intelligent global integration
Deduplication has evolved significantly over the past five years. It
has moved from an enterprise-only solution to the small and midsize markets. It
is now a standard feature within many backup solutions, but companies must be wary of all-in-one storage or backup software solutions. Not all deduplication solutions are equal. Deduplication is not something that can simply be slapped onto an appliance or into software; solutions built that way will be limited in terms of performance, scalability and reliability.
This has been proven by the history and development of deduplication
technologies. Originally, some vendors architected deduplication solutions for enterprise infrastructures, handling large amounts of data in rich, heterogeneous environments. These first deduplication solutions offered high availability,
data protection failover capabilities, scalability and large data repositories.
Other vendors chose to architect smaller scale appliance-based solutions with
limited performance, scalability and features. Marketed as replacements for
physical tape, they downplayed the value of being able to physically integrate
with tape. These smaller appliances also lacked any high availability features
required to ensure that the solution was always available and adequately protected. Even though some vendors have claimed for years that "tape is dead," it is far from true. Tape still has an important role in data archival. Many companies must adhere to data retention mandates driven by regulation or ongoing litigation, and the most cost-effective way to do so is to export data to lower
cost tape. Deduplication solutions must provide a method for companies to
continue their use of tape for archival purposes.
Larger customers often have multiple data centers and/or remote
offices. Mergers and acquisitions often lead to the consolidation of data
protection and disaster recovery resources. Deduplication with replication can
greatly simplify and reduce the cost of this effort. For instance, one medical company has more than 120 hospitals that back up locally and send data to a single disaster recovery data center. When a common block of data is
sent into the central location by the first site, that common block is not
transferred from the other 119 sites that have the same data. The result is a
large savings in bandwidth required to move data from all sites. A single copy
of the common block exists at the DR site rather than 120 copies, greatly
reducing the storage costs. Physical tape resources can be integrated at the DR
site to archive longer retention data to the lower cost tape media, allowing
the customer to realize a greater return on investment.
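A toy model of that scenario in Python shows why only one copy of a common block crosses the WAN: each site checks its blocks against a fingerprint index reflecting what the disaster recovery site already holds. The function names and block contents are illustrative assumptions, not a description of any particular product's replication protocol.

import hashlib

def replicate_site(site_blocks, dr_index):
    """Send only blocks whose fingerprints the DR site has not seen; return bytes sent."""
    bytes_sent = 0
    for block in site_blocks:
        fingerprint = hashlib.sha256(block).hexdigest()
        if fingerprint not in dr_index:   # common block already at the DR site: no transfer needed
            dr_index.add(fingerprint)
            bytes_sent += len(block)
    return bytes_sent

# 120 sites backing up a largely identical block to one disaster recovery data center.
dr_index = set()
common_block = b"operating system image block " * 1000
total_sent = sum(replicate_site([common_block], dr_index) for _ in range(120))
print(f"bytes sent across the WAN: {total_sent}")   # one copy's worth, not 120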
Looking ahead, companies need intelligent deduplication solutions that
allow them to ensure that data is properly stored and protected. This
intelligence may come from dynamically analyzing the data, automatically assigning the appropriate dedupe policies, and integrating physical tape as an additional transparent tier of
storage. Companies looking to upgrade or install backup and data protection
solutions in 2013 should seek out global, intelligent deduplication solutions
that provide flexibility, scalability, performance and high availability of the
data.
##
About the Author
Paul Kruschwitz is the director of product management at FalconStor Software. He has over 20 years of experience in technology, with a specific focus on data protection and deduplication technologies.