A couple of weeks ago, VMblog spoke with industry expert David Morris, VP of Product and Global Marketing at FalconStor, where we found out more about the company and its technology, persistent containers, and the storage industry in 2020. As a follow-on to that conversation, we spoke with Morris again, this time to learn more about the retention and reinstatement archival market.
VMblog: Last time, you alluded to the retention and reinstatement
archival market. What does that mean?
Morris: First, thank you for having us back. Yes, we
did chat briefly about it. I ask for your patience and a bit of latitude
to frame the answer. Let's split the storage market into the operational side,
where IT professionals manage active applications that run the business and
short-term data backup and recovery processes that keep the business running.
On the archival side, we transfer data ownership from IT operations to
custodians that manage data in passive storage and long-term retention
archives. Yes, we polarize the market into the two extremes with
"active" data on one pole and "passive" data on the other
for simplicity and emphasis, with an understanding of varied uses across the
spectrum.
On the operational side, there has been a significant
investment in new compute and high-performance storage technologies. Rapidly
gaining value from data is a differentiator for businesses; however, the
operational market is also a very competitive red ocean marketplace. On
the archival side, there has been significantly less investment as it has
traditionally been viewed as a cost center. However, three major growth
areas will change the archival business dynamics and require new features and
functionality for archived data, and they apply whether the data resides in a data
center or in the cloud.
The first growth wave was deriving value from archival data,
driven by the rise of data scientists, which is straightforward. The second
growth wave was the expansion of compliance, regulatory, legal, and various
privacy mandates and laws, which are continuously expanding their scope and
purview to include emerging data types, increasing data volumes, and extending
retention periods. The third growth wave for archival is the deployment of
widely available data-generating endpoints (IoT, IIoT, MIoT). Data archives will be
filled by around-the-clock machine-generated information rather than by the comparatively
limited human-generated data. Each of these growth areas is entwined in a virtuous
cycle driving growth in the others. Existing archival paradigms and products
are inadequate to meet the new archival demands.
The new archive will require retention policies and
capabilities as standard features, which is half the battle. Throughout the
archive lifecycle, the data's validity must be assessed and recorded, which is
the other half of the battle. The long-term archive will maintain, ensure, and
verify at periodic intervals the confidentiality, integrity, authenticity,
accessibility, nonrepudiation, and enforcement of the chain of custody
throughout its lifecycle. The archive must present a secured and journaled
record of all periodic integrity checks, data movement, access attempts, and
managing custodians for audit or evidentiary review.
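To make that journaled, tamper-evident record a little more concrete, here is a minimal Python sketch (my illustration, not a FalconStor implementation) of an append-only custody journal in which each entry is chained to the hash of the previous entry, so altering or removing any earlier record of an integrity check, data movement, or access attempt breaks the chain. The class and field names are hypothetical.

```python
import hashlib
import json
import time

class CustodyJournal:
    """Append-only journal of archive events; each entry is hash-chained
    to the previous entry so later tampering is detectable."""

    def __init__(self):
        self.entries = []
        self._last_hash = "0" * 64  # genesis value for the chain

    def record(self, event: str, object_id: str, custodian: str, detail: str = "") -> dict:
        entry = {
            "timestamp": time.time(),
            "event": event,            # e.g. "integrity-check", "move", "access-attempt"
            "object_id": object_id,
            "custodian": custodian,
            "detail": detail,
            "prev_hash": self._last_hash,
        }
        payload = json.dumps(entry, sort_keys=True).encode()
        entry["entry_hash"] = hashlib.sha256(payload).hexdigest()
        self._last_hash = entry["entry_hash"]
        self.entries.append(entry)
        return entry

    def verify(self) -> bool:
        """Recompute the chain to confirm no entry was altered or removed."""
        prev = "0" * 64
        for entry in self.entries:
            if entry["prev_hash"] != prev:
                return False
            body = {k: v for k, v in entry.items() if k != "entry_hash"}
            payload = json.dumps(body, sort_keys=True).encode()
            if hashlib.sha256(payload).hexdigest() != entry["entry_hash"]:
                return False
            prev = entry["entry_hash"]
        return True

# Hypothetical usage: log a periodic integrity check and a data movement.
journal = CustodyJournal()
journal.record("integrity-check", "archive-0001", "custodian-a", "quarterly SHA-256 verification")
journal.record("move", "archive-0001", "custodian-b", "migrated to new storage tier")
assert journal.verify()
```

A production archive would of course persist and secure such a journal rather than keep it in memory; the point of the sketch is only the hash-chaining idea behind nonrepudiation and chain of custody.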
With double-digit growth, this accelerating market combines new
features with compliance and legal requirements that are
significantly different from the products available today and compelling enough to
warrant a new market segment, which we call "Retention and
Reinstatement" archival.
VMblog: Is a retention and reinstatement archive needed across
industries?
Morris: Yes, it is. Let me prattle a bit about the
history and how we will get to the next endgame. We are currently entering the
third wave of data retention and reinstatement, which will produce electronic storage
archives 10,000 to 100,000 times larger than the first and second storage waves
combined, and that is a conservative estimate. Why did we not notice the first
and second waves? They weren't large enough for the average end-user or most
companies to see unless you were in storage or at a sizeable data-driven
company. During the second wave, AWS was adding enough servers each day
to support a $7B company in 2013 (James Hamilton, AWS re:Invent 2013). The third
wave will eclipse AWS's previous scale. If we believe our other friend, George
Gilder, he calculates that the mega data centers of Google, Facebook, and AWS
are near their theoretical limits. Perhaps this is why AWS is offering
micro data center pods to reside at a customer's headquarters.
The first traditional data archives were the following: oil
and gas companies keep their seismic data forever because, as cities grow or territorial
boundaries shift, they may never have another chance to conduct a
study. In pharmaceuticals and biotechnology, the records' retention
period is up to a hundred years (for good reasons...Zombies), which is where
EMC Documentum & OpenText did very well. The Sarbanes-Oxley Act mandates records
retention from five years to nearly forever for financial trading
communications, which is where EMC Centera storage came in. Insurance, construction,
movies and entertainment, and aircraft all have extended retention periods due
to regulations or inherent data value. Until 2008, the data that needed to be
under retention was focused, well defined, and typically documents or
electronic copies of documents. Collections were very targeted and
mostly done manually.
In 2008, electronic discovery (eDiscovery) emerged as a side
benefit of the patent battle between Broadcom and Qualcomm. Subsequently,
any electronic communication or data was discoverable in any legal proceeding,
which opened the data collection floodgates. Email, email archives, messenger
text, phone texts, computer hard drives, thumb drives, and more were collection
targets. With electronic discovery software, data collection became nearly
automatic, and savvy attorneys quickly learned every repository in the
enterprise that potentially held evidentiary data so that they could subpoena
it.
In scoping some of the eDiscovery collection systems,
initial discovery data volumes were often over one hundred petabytes in 2008,
and a majority of that data is still on legal hold today. With legal hold, storage
companies get the bonus plan. The data corpus is collected without changing the
data or the metadata. A SHA hashing algorithm verifies the data's
authenticity, and a complete copy is made and stored.
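As a rough sketch of that verification step (file paths and workflow are hypothetical; real eDiscovery products differ in details), a collection tool can fingerprint each item with SHA-256 at collection time and later recompute the digest of the preserved copy to prove it is bit-for-bit identical:

```python
import hashlib
from pathlib import Path

def sha256_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file in 1 MiB chunks so large evidence files fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Hypothetical paths: the original item and the preserved legal-hold copy.
original = Path("/evidence/source/mailbox_2008.pst")
preserved = Path("/evidence/hold/mailbox_2008.pst")

collected_hash = sha256_digest(original)   # recorded at collection time
verified_hash = sha256_digest(preserved)   # recomputed during a periodic check

if collected_hash != verified_hash:
    raise RuntimeError("Preserved copy no longer matches the collected original")
```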
The third wave is machine-driven: data creation will be
nonstop across all industries, and data analysis will deliver focused
competitive differentiation, from agriculture to logistics to medicine, with
significant effects. This data will also be co-opted for compliance,
regulatory, legal, and privacy uses, as well as other unintended purposes.
Data retention and reinstatement requirements will continue to extend to new data
types throughout the third wave as their historical, monetary, and legal merit comes
into focus.
VMblog: For retention and reinstatement, is the bar much higher
than for operational backup and recovery?
Morris: The scrutiny and burden of proof that long-term data
archival will demand is significantly higher than the existing standard
afforded a backup and recovery copy or a traditional archive. Furthermore, its
lifecycle spans decades to centuries versus weeks to months for a
backup copy. Comparing backup and recovery to retention and reinstatement
is an apples-to-oranges juxtaposition.
VMblog: Are traditional disk archives and physical tape archives not feasible for the Retention and Reinstatement market?
Morris: No, they are not feasible. Retention and reinstatement
features are an all-or-nothing proposition, with long-tailed expense
implications when archiving data for a half-century or longer.
Physical tape is difficult to access under tight
timelines for production or deletion requests, and the
physical chain of custody of tape is problematic. The need to refresh tapes
every five to seven years due to tape degradation is expensive, time-consuming,
and a security challenge. Historically, fifteen to twenty-five percent of
physical tapes degrade and are unrecoverable by the time of the refresh.
Disk archive solves the accessibility and digital chain of
custody challenges up to a point. However, a storage system's end of life is
around seven to ten years, and then the data must be copied to a new storage
system. Most storage systems entangle the data within the system, and it is
very problematic to move data from one system to another without changing
metadata, data, or retention policies, or to verify that the copied data
is identical. Historically, seamless portability across rival storage
systems has consistently been a low-priority feature for storage vendors and a
top priority for storage customers. As we discussed last time, the
nightmare scenario is the EMC Centera, a great product, but one where the data
retention period outlived the product lifecycle and led to an expensive and
time-consuming data migration challenge for customers (see LMC Associates for more details
on this challenge). This nightmare will be repeated as retention periods
extend due to new demands.
As we look to the cloud, the hardware is typically
abstracted or virtualized, so servers and storage can change with little impact
within a single vendor. Most clouds don't implement a zero-trust model, so there are
security challenges. S3 compatibility is the de facto standard for transferring data
between clouds of different vendors; however, not all S3 clouds are created
equal, and there are differences between cloud schemes that create problems.
The top three cloud vendors are the largest purchasers of tape systems due to
tape's low acquisition cost, while customers bear the total cost of ownership
over the extended retention periods. The first big lawsuit in which a cloud
vendor loses roughly 15% of a client's evidentiary data due to tape degradation or
mishandling will be interesting. Another factor for cloud customers is the
inability to manage their archival expenses over time. Most companies
initially viewed the cloud as an active and competitive market, but for today's
customers there is no easy way to move 10 or 100 petabytes of data from one
cloud vendor to another while maintaining data coherency, to say nothing of the
data egress fees. Like the traditional storage vendors, cloud vendors
prioritize data portability in line with their own interests rather than their customers'.
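To illustrate what "S3-compatible but not identical" looks like in practice, below is a minimal Python/boto3 sketch of streaming an object from one S3-compatible endpoint to another. The endpoints, bucket names, keys, and credentials are hypothetical, and a real petabyte-scale migration must also preserve metadata, retention and object-lock settings, and handle multipart uploads and vendor-specific behaviors that this sketch ignores.

```python
import boto3

# Hypothetical endpoints and credentials; each vendor exposes an S3-compatible API,
# but feature support (object lock, versioning, storage classes) differs.
source = boto3.client(
    "s3",
    endpoint_url="https://s3.source-cloud.example.com",
    aws_access_key_id="SRC_KEY",
    aws_secret_access_key="SRC_SECRET",
)
target = boto3.client(
    "s3",
    endpoint_url="https://s3.target-cloud.example.com",
    aws_access_key_id="DST_KEY",
    aws_secret_access_key="DST_SECRET",
)

key = "archive/2008/custodian-smith/mailbox.pst"

# Stream the object body from the source and re-upload it to the target.
obj = source.get_object(Bucket="legal-hold-src", Key=key)
target.upload_fileobj(obj["Body"], Bucket="legal-hold-dst", Key=key)

# ETags are only a coarse check; they are not comparable across multipart
# uploads or providers, so serious tooling recomputes a real digest instead.
src_head = source.head_object(Bucket="legal-hold-src", Key=key)
dst_head = target.head_object(Bucket="legal-hold-dst", Key=key)
print(src_head["ETag"], dst_head["ETag"])
```

Even this simple path runs through the customer's network and incurs egress fees on every byte, which is part of why large cross-cloud moves are so painful.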
VMblog: Archives have been around forever, so why are they still such a challenging problem?
Morris: As we mentioned earlier, the archive is
traditionally viewed as a cost center. There has been an overall lack of
investment in archival technologies. With the three waves reinforcing each
other, these growth drivers will leave many companies stranded, with protracted
retention mandates and hundreds of petabytes of archival data on the one hand and
cost-prohibitive, aging archival solutions on the other. With retention
periods increasing from 10 to 25 years up to 50 to 100 years, companies will need
to actively manage the long-term total cost of archival, or the expense could
quickly compound and sink many companies if new solutions are not available.
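A back-of-the-envelope calculation shows how quickly that expense compounds. The sketch below uses purely illustrative assumptions (a blended $/TB/year rate, a modest annual growth in archived data, and a 50-year horizon); the specific numbers are mine, not FalconStor figures.

```python
# Back-of-the-envelope archival cost model (all figures are assumptions).
cost_per_tb_year = 120.0   # assumed blended $/TB/year for long-term archive
annual_growth = 0.10       # assumed 10% new archival data added per year
start_tb = 1_000           # assumed starting archive of 1 PB (1,000 TB)
years = 50                 # retention horizon under discussion

tb = start_tb
total_cost = 0.0
for year in range(1, years + 1):
    total_cost += tb * cost_per_tb_year
    tb *= 1 + annual_growth  # under long retention, the archive only grows

print(f"Archive size after {years} years: {tb:,.0f} TB")
print(f"Cumulative storage cost: ${total_cost:,.0f}")
```

Even with these conservative inputs, the archive grows past 100 PB and the cumulative bill runs into nine figures, which is why actively managing the long-term total cost of archival matters.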
It is not just the technical challenges we must overcome
with Retention and Reinstatement. It is much more complicated and will become
more complex over time. Whether under a compliance audit, a legal case, or the General Data
Protection Regulation (GDPR), fines, sanctions, and even jail time
can be levied for mishandling information over its lifecycle. GDPR leverages
the awarded penalties to fund further GDPR investigations, which fuels more
audits and litigation, resulting in more archived data. The mandates and laws
continue to expand, with California, Nevada, and Brazil creating their own data
protection regulations that individuals and corporations are directed to follow
wherever they are physically located.
Today, the third wave of archival is upon us,
and with the machines, there are no limits. One big thinker who has intimate
knowledge of the third wave from an edge perspective is
Mark Thiele, CEO &
Founder at Edgevana Inc. There is increasing attention and scrutiny on data
custodianship, and growing demands for data integrity and authenticity. With
the new extended lifecycle, retention requirements and policy enforcement are
becoming standard practice, even as data volumes increase faster than they can be
archived appropriately.
##
David
Morris serves as VP of product and global marketing for FalconStor. He has more
than 25 years of leadership experience in storage systems and storage,
information, and compliance management. Before FalconStor, he worked with Cisco
and Huawei to develop and define new strategic imperatives, as the third era of
IT disrupted the technology sector. Recognized for his ability to identify new
markets and develop targeted solutions, Morris has worked with private equity
backed companies to position them into emerging high-growth markets, including
Kazeon, acquired by EMC, and Cetas, acquired by VMware. At NET, he led the
storage and network division turnaround efforts and repositioned the company
to raise $85 million in a private investment in public equity (PIPE) and toward its
subsequent acquisition by Sonus.
He
holds graduate degrees in marketing from the University of California,
Berkeley-Haas, in finance from Columbia University in the City of New York, and
in engineering from George Washington University, as well as a bachelor's degree in
physics from Auburn University. He currently advises Aerwave, a next-gen
security company, and Brite Discovery, a GDPR compliance and eDiscovery
company. He is active in and supports Compass Family Services, which serves
homeless and at-risk families in San Francisco; The Tech Museum of Innovation
in San Jose, CA; and The American Indian Science and Engineering Society.