Data
categorization and management present formidable challenges for organizations
due to regulatory, legal, and business concerns. Yet new technologies, such as
predictive analytics and cloud computing, provide a path towards the 'holy grail' of information governance: completely autonomous, computer-driven data
categorization and management. Bill Tolson, Vice President of Marketing at
Archive360, shares his insight into the past, present
and future of this field.
VMblog: What key points do
organizations need to know regarding information governance?
Bill Tolson: First, note that in the past, information governance was a task for the
end user of the data. Even in organizations with well-defined information
governance policies, the actual decisions regarding data - for example, which
documents to keep, which ones to discard, and when - fell into the hands of the
end user.
So when storage grew full, and the IT department informed end users that they could not save or send further documents until they freed up space, it was up to the end user to decide what to save and what to delete. This was often done
using irrelevant metrics, such as the creation date or file size, and without
regard to regulatory, legal, or business requirements for archiving.
VMblog: This sounds like an area ripe
for technological innovation. How did technology first begin to play a role in
the information governance field?
Tolson:
Around the turn of the millennium, tech firms started to offer 'records management systems' designed to simplify information management. However, these
electronic solutions were largely ineffective, since they lacked the ability to
decide whether a file needed to be kept - for legal, regulatory, or business
reasons - or could be discarded. They also could not process data into
different streams - for example, to note how long a file should be stored, or
move it into long-term storage.
Even today,
the most sophisticated organizations - private, public, and non-profit - are
struggling to take control of their information governance. The explosion of
data makes it practically impossible for end users to implement information
governance, even with software support.
VMblog: Since in your view it's
nearly impossible for organizations to maintain a full grasp on the information
in their possession, do you think they should simply focus on the most crucial
data, such as information that falls under legal or regulatory requirements,
and ignore the rest?
Tolson: This is the priority most organizations focus on - though it typically
accounts for a mere 6-10% of the overall data in the organization. That still
leaves a vast amount of data for end users to process, and in practice that
means this data is not managed at all - it's simply not possible for the end
user to properly categorize this volume of data. Instead, they stick it in a folder in their email or on their desktop - they don't delete it, but they do forget
about it...
VMblog: So, organizations can't
simply ignore this data, or leave it to the end user. What prospects do you see
for how this situation could be solved?
Tolson: The
ideal information governance solution would be entirely automated, able to make intelligent decisions about fresh data, and do so in a highly accurate manner.
Both the organization and the end users would benefit from this - after all,
information governance is not a core component of most positions, and those end
users don't see extra compensation or bonuses for effectively managing their
data.
Microsoft is actively working on this: in the opening keynote at a previous Microsoft Inspire conference, Microsoft CEO Satya Nadella introduced the application of predictive automation to information governance. He discussed
how predictive intelligence, and archiving in the cloud, could address data
issues before they occur.
For example, predictive automation could analyze data to decide whether it is
subject to regulatory, compliance, or legal mandates, where it should be
stored, for how long, and any limitations regarding access or security. This
would free end users from the responsibility of data governance, and ultimately
ensure more efficient and accurate information management.
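To make that concrete, here is a minimal Python sketch of the kind of decision such a system would automate. The categories, keyword rules, retention periods, and storage tiers are illustrative assumptions, not a description of Archive360's or Microsoft's technology; a real system would substitute a trained predictive model for the keyword matching.

```python
# Illustrative sketch: mapping a predicted document category to a
# governance decision (retention, storage location, access limits).
# All category names, periods, and tiers below are assumptions for
# demonstration only.
from dataclasses import dataclass

@dataclass
class GovernancePolicy:
    retention_years: int      # how long the record must be kept
    storage_tier: str         # e.g. hot, cool, or long-term archive
    restricted_access: bool   # whether access should be limited

# Assumed policy table keyed by the category a predictive model would assign.
POLICIES = {
    "financial_record": GovernancePolicy(7, "archive", True),
    "contract":         GovernancePolicy(10, "archive", True),
    "hr_record":        GovernancePolicy(6, "cool", True),
    "general":          GovernancePolicy(1, "hot", False),
}

def classify(text: str) -> str:
    """Stand-in for the predictive model: naive keyword matching."""
    lowered = text.lower()
    if "invoice" in lowered or "payment" in lowered:
        return "financial_record"
    if "agreement" in lowered or "contract" in lowered:
        return "contract"
    if "employee" in lowered or "salary" in lowered:
        return "hr_record"
    return "general"

def govern(text: str) -> GovernancePolicy:
    """Attach a retention/storage decision to a document, no user input needed."""
    return POLICIES[classify(text)]

if __name__ == "__main__":
    print(govern("Master services agreement between two suppliers"))
```

The point of the sketch is the separation of concerns: once a category is predicted, the retention, storage, and security decisions follow automatically from policy, with no end-user involvement.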
VMblog: It seems like this is an
ideal solution for organizations facing the typical data explosion. How would a
predictive automation solution work in practice?
Tolson: Here's an illustration: I previously worked in the eDiscovery industry.
We used predictive coding to automate the eDiscovery data review process, where
documentation is reviewed for relevance to a legal proceeding.
Prior to the predictive analytics innovation, eDiscovery organizations gathered
vast repositories of information, and performed an initial sort using keywords.
They then assigned teams of attorneys and paralegals to read each document and
consider its relevance to the case at hand. This was a costly and inefficient
endeavor: just a few years ago, the average cost for an eDiscovery review was
approximately $1.5 million - not including the actual trial or judgment
process.
Our solution
applied supervised machine learning to automate the eDiscovery process. We
collected previous eDiscovery results, and used these sets to train computers
to recognize relevant data points and interpret their meaning. We trained the
computers using anywhere from 2 to 50 training cycles, and each cycle helped the computer better recognize relevant information.
Typically, additional training cycles lower the error rate of the program, so the
computer can more accurately recognize relevant case material: manual reviews
have an error rate anywhere from 20 to 50%, but our predictive coding system
had an error rate as low as 2%. Given this highly accurate performance, the
courts began to accept predictive analytics as a legally acceptable tool for
discovery.
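As an illustration of the underlying technique, here is a minimal sketch of predictive coding as supervised text classification, assuming scikit-learn. The tiny labeled set and model choice are placeholders; a production review would train on thousands of attorney-labeled documents over multiple cycles.

```python
# Minimal sketch of predictive coding as supervised learning.
# The documents and labels below are invented for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Documents already labeled by reviewers in earlier cycles (1 = relevant).
train_docs = [
    "Email discussing the disputed supply contract terms",
    "Quarterly cafeteria menu for the main office",
    "Memo on penalty clauses in the supply agreement",
    "Invitation to the company holiday party",
]
train_labels = [1, 0, 1, 0]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_docs)

model = LogisticRegression()
model.fit(X_train, train_labels)

# Score the unreviewed collection; high-probability documents go to human
# reviewers, whose decisions feed the next training cycle.
new_docs = ["Draft amendment to the supply contract",
            "Parking lot repaving notice"]
scores = model.predict_proba(vectorizer.transform(new_docs))[:, 1]
for doc, score in zip(new_docs, scores):
    print(f"{score:.2f}  {doc}")
```

Each review cycle adds the newly labeled documents to the training set and refits the model, which is what drives the error rate down over successive cycles.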
VMblog: I understand you're able to
recognize the type of data you're dealing with, using predictive coding. Yet it
seems like you have to be present to help train the computer, so it is not a
truly automated system. Is there any way to completely eliminate the need for
human monitoring, so the computer is able to manage the information governance
process independently?
Tolson: What
you're describing is the 'holy grail' of predictive information governance: a
completely independent computer system that can recognize documents, and manage
them correctly, with no human input. In technical terms, we call this 'unsupervised machine learning.'
A fully developed, unsupervised machine learning system would be the
realization of a truly automated predictive information governance system. It
could gather, manage, store, safeguard, and decide whether to keep or delete
information.
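For contrast, here is a minimal sketch of the unsupervised counterpart: grouping documents by content with no labeled examples, again assuming scikit-learn. The documents and cluster count are illustrative; a real system would still need to map the discovered groups to retention and disposition actions.

```python
# Minimal sketch of unsupervised grouping of documents by content.
# No labels are supplied; the algorithm discovers the groups itself.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [
    "Invoice for Q3 professional services",
    "Payment reminder for outstanding invoice",
    "Team lunch scheduled for Friday",
    "Reminder: bring a dish to the team lunch",
]

X = TfidfVectorizer().fit_transform(docs)
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

for doc, cluster in zip(docs, clusters):
    print(cluster, doc)
```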
Even better would be such a system built on the cloud. A cloud-based solution
would lower the cost and complexity of managing and storing data, since the
cost of this capability would be shared by many organizations within the public
cloud environment.
Fortunately, we are almost there. Microsoft's cloud and Azure services are bringing us within sight of the holy grail of information governance: a fully automated, predictive governance system. Azure includes machine learning to help
organizations develop self-adapting security and analytics, among other
capabilities. Full data governance automation is on the horizon.
##
Bill Tolson is Vice President of
Marketing for Archive360 (www.archive360.com). He has more than 25
years of experience with multinational corporations and technology start-ups,
including 15-plus years in the archiving, ECM, information governance, regulatory compliance, and legal eDiscovery markets. Prior to joining
Archive360, Bill held leadership positions at Actiance, Recommind, Hewlett
Packard, Iron Mountain, Mimosa Systems, and StorageTek.