By Matthew Dewey, Technical Director at Quantum
Traditional storage architectures are being
overwhelmed by the growth of unstructured data such as scientific and medical
research data, satellite imagery and high-resolution video. Object storage is a promising
solution for organizations that produce large volumes of unstructured data. Object stores
have long found a home in the cloud and inside data centers as long-term repositories
for high-value data, but with demand for storage capacity growing daily, can
organizations reap the benefits of object stores within budget?
Providing Essential Data Retention and Protection
Object stores abstract away the location of an object,
enabling higher levels of redundancy. This protects against device failure - as
well as failures of entire nodes or even data centers. Abstracting away object
location also enables object stores to scale to sizes and topologies difficult
to achieve with file systems.
An object store user may not know exactly where data
is physically stored. What looks like a single object store may be distributed
across multiple geographic locations for greater reliability against natural
disasters. This level of durability can increase the capacity requirements of
the underlying hardware, but with smart erasure coding algorithms it can be
achieved using less capacity than by mirroring the data.
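The capacity comparison between erasure coding and mirroring can be illustrated with a little arithmetic. The sketch below is not drawn from any particular product: the 8+4 fragment layout and the three-copy mirror are illustrative assumptions, and it assumes an erasure code (such as Reed-Solomon) in which any 8 of the 12 fragments are enough to rebuild an object.

```python
def erasure_coded_raw_tb(usable_tb: float, data_fragments: int, parity_fragments: int) -> float:
    """Raw capacity needed when each object is split into data + parity fragments."""
    return usable_tb * (data_fragments + parity_fragments) / data_fragments


def mirrored_raw_tb(usable_tb: float, copies: int) -> float:
    """Raw capacity needed when each object is stored as full copies."""
    return usable_tb * copies


usable_tb = 1000.0  # hypothetical amount of user data

# Assumed 8 data + 4 parity layout: survives the loss of any 4 fragments.
ec_raw = erasure_coded_raw_tb(usable_tb, data_fragments=8, parity_fragments=4)

# Assumed 3-way mirror: survives the loss of only 2 copies.
mirror_raw = mirrored_raw_tb(usable_tb, copies=3)

print(f"Erasure coded: {ec_raw:.0f} TB raw")    # 1500 TB raw
print(f"Mirrored:      {mirror_raw:.0f} TB raw")  # 3000 TB raw
```

Under these assumptions the erasure-coded layout tolerates more simultaneous failures than the mirror while using half the raw capacity.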
One consideration for many entities exploring object
storage is data retention. The retention period for many kinds of data is
specified by legal and other compliance constraints. One might expect that data
not subject to compliance requirements is likely to be deleted sooner, but some
data has value indefinitely. Geological and genetic data are examples of data
sets with no expiration date that can represent a major investment.
Tape Can Play a Critical Role Managing Costs
The demand for storage capacity is growing at a compound
annual growth rate (CAGR) of more than 20 percent. Long-term repositories must become
cheaper and deeper without losing durability. Object stores must lower the
overall total cost of ownership (TCO) of storage - not just the cost of the
media, but associated expenses of owning equipment. Most object stores are hard
disk-based for performance and reliability, but the cost of power and physical
footprint is significant.
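To see what a 20 percent compound annual growth rate means for a long-term repository, the growth can be worked through directly. This is only a sketch: the 1 PB starting point and the five-year horizon are hypothetical values chosen for illustration.

```python
import math


def projected_capacity(current_pb: float, annual_growth: float, years: int) -> float:
    """Capacity after compounding annual_growth (0.20 = 20 percent) for a number of years."""
    return current_pb * (1 + annual_growth) ** years


def doubling_time_years(annual_growth: float) -> float:
    """Years for capacity demand to double at a constant compound growth rate."""
    return math.log(2) / math.log(1 + annual_growth)


# Hypothetical repository holding 1 PB today, growing 20 percent per year.
print(f"After 5 years: {projected_capacity(1.0, 0.20, 5):.2f} PB")    # 2.49 PB
print(f"Doubling time: {doubling_time_years(0.20):.1f} years")        # 3.8 years
```

At that rate, demand doubles in under four years, which is why media cost alone does not determine whether a repository stays within budget.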
Some object stores now use tape to lower TCO, an approach well suited
to large amounts of data stored for a long time. For sequential I/O, tape
outperforms disk for both reading and writing. Tape offers low media costs and,
when not being accessed, uses minimal power and cooling. However, the latency to
access data on tape is a consideration. Best-practice implementations
present tape as a separate tier so that applications can help manage data
access.
To realize tape's advantages in an object store, it
helps to have a thorough knowledge of how to manage and handle tape properly. The
object store must survive failure modes unique to tape. It must also manage
access patterns to reduce tape latencies and wear.
Because tape excels at
sequential access, large individual objects perform well. For small objects, a
well-implemented object store will group them into larger sequential
streams to and from tape. With the right expertise, organizations can implement
a tape-based object store for long-term data retention while controlling
storage costs.
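The idea of grouping small objects into larger sequential streams can be sketched as a simple batching routine. This is a minimal illustration, not how any specific object store does it: the function name `group_into_streams`, the `(name, size)` tuples and the target stream size are all hypothetical.

```python
from typing import Iterable, Iterator, List, Tuple

ObjectInfo = Tuple[str, int]  # (object name, size in bytes)


def group_into_streams(objects: Iterable[ObjectInfo], target_bytes: int) -> Iterator[List[ObjectInfo]]:
    """Pack objects, in arrival order, into batches of roughly target_bytes.

    Each batch would be written to tape as one sequential stream. An object
    larger than target_bytes simply gets a batch of its own.
    """
    batch: List[ObjectInfo] = []
    batch_bytes = 0
    for name, size in objects:
        # Close the current batch if adding this object would overshoot the target.
        if batch and batch_bytes + size > target_bytes:
            yield batch
            batch, batch_bytes = [], 0
        batch.append((name, size))
        batch_bytes += size
    if batch:
        yield batch


incoming = [("a", 40), ("b", 40), ("c", 40), ("big", 500), ("d", 10)]
streams = list(group_into_streams(incoming, target_bytes=100))
for stream in streams:
    print([name for name, _ in stream])
```

A production system would also have to record which stream and offset each object landed in so that it can be found again, and would likely group objects that tend to be read together.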
Data Cataloging and Management to Elevate Object Stores
Object stores can be petabytes
or even exabytes in size, yet individual objects are likely to range from kilobytes to megabytes,
so there are potentially trillions or even quadrillions of objects in an
exabyte-scale object store. How can we know what is in the data store? How can
we identify and select complete subsets of information in a pool of data this
vast? A catalog of the contents of the object store is a mechanism for
selecting subsets of the data. Data must be classified as it is added to the
object store. The initial classification may include standard attributes
as well as domain-specific classifications. The uses of data and the information
extracted from it will change and improve over time, so the classification
information must be malleable.
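A catalog of this kind can be pictured as a set of entries that are classified on ingest, can be reclassified later, and can be queried to select subsets. The sketch below is an in-memory toy under assumed names (`Catalog`, `CatalogEntry`, and the example tags are all hypothetical); a real catalog at this scale would need a database or index service behind it.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class CatalogEntry:
    object_id: str
    size_bytes: int
    tags: Dict[str, str] = field(default_factory=dict)  # classification attributes


class Catalog:
    def __init__(self) -> None:
        self._entries: Dict[str, CatalogEntry] = {}

    def add(self, object_id: str, size_bytes: int, **tags: str) -> None:
        """Classify an object as it is added to the store."""
        self._entries[object_id] = CatalogEntry(object_id, size_bytes, dict(tags))

    def reclassify(self, object_id: str, **tags: str) -> None:
        """Add or change tags later, as the uses of the data evolve."""
        self._entries[object_id].tags.update(tags)

    def select(self, **criteria: str) -> List[str]:
        """Return the ids of all objects whose tags match every criterion."""
        return sorted(
            oid for oid, entry in self._entries.items()
            if all(entry.tags.get(key) == value for key, value in criteria.items())
        )


catalog = Catalog()
catalog.add("scan-001", 2_000_000, domain="genomics", species="human")
catalog.add("scan-002", 1_500_000, domain="genomics", species="mouse")
catalog.add("img-001", 8_000_000, domain="satellite")

# A new classification is attached long after ingest.
catalog.reclassify("scan-002", study="trial-7")

print(catalog.select(domain="genomics"))  # ['scan-001', 'scan-002']
print(catalog.select(study="trial-7"))    # ['scan-002']
```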
What decides which objects go
on which storage media, and how is that decision made? How do you select the
sets of data required for a task? Proper data management ensures the data is
where the user needs it, when it is needed. When the data is not in use, it needs
to be where it can be kept safe at the lowest cost. Once an item of interest is
identified, the system ensures the data is placed for optimal processing.
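One way to picture the placement decision is as a small policy function that maps what is known about an object to a storage tier. The rule below is purely an assumption for illustration: the `choose_tier` name, the two tiers and the 30-day threshold are hypothetical, and a real policy would weigh many more factors.

```python
def choose_tier(days_since_access: int, needed_for_active_task: bool, disk_days: int = 30) -> str:
    """Pick a storage tier for an object under a simple, assumed policy.

    Data selected for an active task, or touched recently, stays on disk for
    fast access. Everything else goes to tape, the lower-cost tier.
    """
    if needed_for_active_task or days_since_access <= disk_days:
        return "disk"
    return "tape"


print(choose_tier(days_since_access=3, needed_for_active_task=False))    # disk
print(choose_tier(days_since_access=400, needed_for_active_task=False))  # tape
print(choose_tier(days_since_access=400, needed_for_active_task=True))   # disk
```

In this sketch, selecting an object through the catalog for a task is what flags it as needed and brings it back to the faster tier ahead of processing.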
A properly managed and cataloged
object store can be a vital tool for affordably preserving unstructured data
as a future asset.
About the Author
Matthew Dewey works within
the Technology Group to guide Quantum with forward-looking technology choices.