Industry executives and experts share their predictions for 2022. Read them in this 14th annual VMblog.com series exclusive.
Natural Language Processing for Legal: Improving the Data Sets to Increase Contract Accuracy
By Jonathan Herr and Dan Simonson, PhD, BlackBoiler
Like
other tools that derive from next-generation technologies, natural language
processing (NLP) relies on machine learning and AI to improve, accelerate, and
automate underlying text analytics functions to transform unstructured text
into usable data. Advances in underlying NLP technologies present opportunities
for knowledge worker industries, such as legal, to streamline operations by
automating tedious and time-consuming administrative tasks; however, most
advancements remain far from being a set of mature, easily deployable, and
commodifiable technologies. For the foreseeable future, developing such
applications still requires expert knowledge and skill to take from
cutting-edge research to viable product, as well as data that exemplifies the
task being automated.
In the
last few years, transformer-based methods, such as BERT, have come to dominate cutting-edge work in NLP research. In most computation, software written to process a list of items, such as the word tokens of a sentence, handles them in order, from start to finish. Transformers instead rely on an attention mechanism, directing the model's focus to wherever in the input is best suited for learning about a specific point in the data. This has a number of advantages.
Attention allows for specific contexts of words to be better captured. It also
allows for processing of large volumes of data to be better parallelized. This
has led to the rise of pre-trained models: models of language that can be fine-tuned to a specific task while leveraging large volumes of data to further extend the result beyond what would be reasonable to develop for every possible task.
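The attention mechanism described above can be illustrated with a minimal sketch. The scaled dot-product form below is the standard core operation in transformer models; the token vectors are invented purely for illustration.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Core attention operation used by transformer models.

    Each query attends to every key at once, so positions can be
    processed in parallel rather than strictly left to right.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # similarity of each query to each key
    # Softmax over each row turns scores into attention weights.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V  # each output is a weighted mix of the values

# Three toy vectors standing in for token embeddings of a short sentence.
tokens = np.array([[1.0, 0.0],
                   [0.0, 1.0],
                   [1.0, 1.0]])
out = scaled_dot_product_attention(tokens, tokens, tokens)
print(out.shape)  # (3, 2): one context-mixed vector per token
```

Because every row of the score matrix is computed independently, the whole operation is one batched matrix product, which is what makes transformer training so parallelizable.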
These methods, and methods that derive from them, will continue to dominate NLP into the foreseeable future, especially the next year. Even if there were a large breakthrough of some kind (the next BERT, or even an entirely new post-neural learning paradigm), most of the work to be published in the next year is currently being finished and will be submitted for peer review in the spring.
NLP
techniques will continue to improve, both due to increases in processing power
and research that further leverages the increased capabilities of modern
hardware and continues to push the state-of-the-art. However, these are not general-purpose
systems. Some NLP components, such as part-of-speech taggers or syntax parsers, address tasks that support other problems, but no existing NLP technique or system can simply start solving any problem out of the box.
Machine learning requires examples of the problem to learn; without examples,
there is no learning. For example, if you're trying to get a machine learning
system to learn to review contracts, you need data reflecting rounds of
contract review. Only with data that reflects the problem being solved can
an NLP system produce output which is relevant, accurate, and sophisticated
enough to help with the task at hand.
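The kind of data such a system learns from can be sketched as pairs of clauses before and after a round of review. The clauses below are invented for illustration; a real corpus would contain many thousands of such pairs.

```python
# Hypothetical shape of supervised contract-review data: each example
# pairs a clause as received with the clause as revised by counsel.
training_examples = [
    {
        "original": "Vendor may terminate this Agreement at any time.",
        "revised":  "Vendor may terminate this Agreement upon thirty (30) days' written notice.",
    },
    {
        "original": "Customer shall indemnify Vendor for all claims.",
        "revised":  "Each party shall indemnify the other for third-party claims arising from its negligence.",
    },
]

# A learning system generalizes from many such pairs; with none,
# there is nothing to fit.
for example in training_examples:
    print(example["original"], "->", example["revised"])
```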
Building Language Models
With increasing demands on legal professionals in terms of both turnaround and efficiency, the legal industry continues to embrace NLP-driven products as a way of improving turnaround times and relieving burdens on staff. Legal
entities that leverage NLP are automating a wide range of time-consuming,
labor-intensive tasks and enjoying the benefits of better-performing processes
and reduced operational drag. Applying NLP to laborious tasks at scale strengthens business agility and prevents burnout. However, one of the biggest challenges of adopting NLP-driven tools in the legal industry is training them to understand long contracts and other legal documents.
Whereas
a formal programming language is precisely defined, natural language is
ambiguous and is better understood by humans than machines. To try to resolve
these ambiguities, a statistical approximation is required. In a given NLP
application, a language model serves as the core component and is
essentially a parametric reflection of the statistical patterns of human
language it has been provided. While not remotely similar to the manner in
which humans process and understand language, it allows a computer program to
handle language in a way that is useful. A language model may have something
like GPT or BERT at its core, or it may even use one of the many methods from
statistical NLP's three-decade history.
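One of those long-standing statistical methods can be sketched in a few lines. The bigram model below is a toy, count-based language model over an invented two-sentence corpus; it captures exactly the kind of statistical patterns described above, just at a much smaller scale than GPT or BERT.

```python
from collections import Counter

# A toy corpus of contract-flavored text, invented for illustration.
corpus = (
    "the party shall notify the other party . "
    "the other party shall respond in writing ."
).split()

# The "parameters" of this model are simply word and word-pair counts.
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)

def bigram_prob(prev, word):
    """P(word | prev) with add-one smoothing over the observed vocabulary."""
    vocab = len(unigrams)
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab)

# Phrasings seen in the corpus score higher than unseen ones.
print(bigram_prob("the", "party") > bigram_prob("the", "writing"))  # True
```

Modern neural language models replace the counts with millions of learned parameters, but the underlying idea, assigning probabilities to word sequences based on observed data, is the same.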
On a practical level, fine-tuning a language model currently entails manual processes, though advances in automated probing techniques are on the horizon. For complex tasks, such as handling M&A documents, engineers are building powerful custom data sets for pre-training, which is promising.
Training the System
Because
NLP-based applications rely on large volumes of data to learn patterns, the
most pressing issue centers on training the system - largely because it
requires a critical mass of historical data to become intelligent enough to be
useful. Further complicating matters, vendors of NLP-driven products need to
procure the redlined versions of previous contracts and other historical data
from clients to extrapolate the preferred terminology, changes, or clauses,
which presents its own set of challenges.
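Recovering the preferred changes from a before-and-after pair of clauses can be sketched with the standard library's difflib; the clauses here are invented for illustration, and a production system would work over whole documents and tracked-changes formats rather than plain strings.

```python
import difflib

# Two versions of an invented clause: as received, and as revised
# by counsel in a historical redline.
before = "Vendor may terminate this Agreement at any time without notice."
after = "Vendor may terminate this Agreement upon thirty days written notice."

before_tokens = before.split()
after_tokens = after.split()

# SequenceMatcher aligns the two token sequences and reports the edits.
matcher = difflib.SequenceMatcher(None, before_tokens, after_tokens)
for op, i1, i2, j1, j2 in matcher.get_opcodes():
    if op != "equal":
        print(op, before_tokens[i1:i2], "->", after_tokens[j1:j2])
```

Aggregated over many historical redlines, edits like these are the raw material from which a system can extrapolate a client's preferred terminology and changes.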
Luckily,
because legal contractual language is largely consistent across documents, entirely novel or new language
occurs infrequently compared with language typically studied in the course of
NLP research, such as newswire or Wikipedia; this repetition allows the system
to reach parity performance with fewer documents than one would expect in other
genres of text. Further, most legal documents are written in precise language
where many of the phrasings and lexical choices are established in legal precedent.
Also, contracts often retain similarities to prior contracts through
institutional inertia, where previous
agreements may arbitrarily act as templates for new agreements.
Consequently, there is less ambiguity for NLP-driven systems in the legal space
to navigate.
Although the limitations associated with mapping context, specificity, and personalization are persistent concerns, the legal industry largely, and uniquely, operates on prescribed language from which an algorithmic system can more easily make accurate and useful inferences and predictions.
Applying NLP to Legal-Industry Use Cases
- Streamlining Legal Research
Conducting
thorough research is both time consuming and essential, which is why many
legal entities are using NLP to shorten timelines by streamlining
the research process. For example, the Indian legal system is well known to suffer from a backlog of cases. NLP-powered legal search engines can translate plain language into standardized legal language, which makes it significantly easier to sift through relevant documents and cases during discovery.
More
advanced NLP programs can search beyond mere keywords using topics and
embedding-driven broadening of search terms, which makes it easier for lawyers
to find what they need faster. Additionally, some NLP programs can analyze a
case study and suggest similar cases for lawyers to review - these
recommendations help lawyers obtain precedent quickly and reliably.
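Embedding-driven broadening of search terms can be sketched with a toy example: terms whose vectors lie near the query term's vector are added to the search. The vectors below are invented for illustration, not taken from a real embedding model.

```python
import numpy as np

# Invented word vectors; a real system would use embeddings learned
# from a large corpus.
embeddings = {
    "terminate": np.array([0.90, 0.10, 0.00]),
    "rescind":   np.array([0.80, 0.20, 0.10]),
    "cancel":    np.array([0.85, 0.15, 0.05]),
    "indemnify": np.array([0.00, 0.90, 0.40]),
}

def cosine(a, b):
    """Cosine similarity: 1.0 for identical directions, 0.0 for orthogonal."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Broaden the query with any term sufficiently close to it.
query = "terminate"
expanded = [word for word, vec in embeddings.items()
            if word != query and cosine(embeddings[query], vec) > 0.95]
print(expanded)  # near-synonyms of the query under these toy vectors
```

Under these toy vectors, a search for "terminate" would also surface documents mentioning "rescind" or "cancel", while the unrelated "indemnify" is left out.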
- Analyzing and Drafting Contracts and Documents
With
contracts generally containing highly repetitive language, NLP helps lawyers
review and negotiate contracts more efficiently and consistently, reducing errors and, consequently, risk. Using NLP contract tools, lawyers
receive redlined contracts that have been reviewed in accordance with the legal
entity's playbook - in minutes.
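The simplest form of a playbook rule can be sketched as a pattern and its preferred replacement. The rule and clause below are invented for illustration; real playbook-driven review is far more sophisticated, combining learned models with rules like this.

```python
import re

# An invented playbook rule: a disfavored phrase and the language the
# legal entity prefers in its place.
playbook = [
    (re.compile(r"at any time without notice", re.IGNORECASE),
     "upon thirty (30) days' written notice"),
]

clause = "Vendor may terminate this Agreement at any time without notice."

# Apply each rule to the clause, producing the redlined text.
for pattern, preferred in playbook:
    clause = pattern.sub(preferred, clause)
print(clause)
```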
Because
word choice and syntax are so consequential in legal documents, legal entities
are using NLP-driven tools to standardize language to prevent errors in
drafting documents that lead to unintended interpretations. Some NLP programs
can process documents in a variety of languages, and some can create templates
based on lawyer needs.
Because
it's so easy to make a costly mistake when drafting a document or reviewing and
negotiating a contract, running NLP-driven document review to prevent these
errors can pay for itself many times over.
- Exploring Emerging NLP Capabilities
Although
NLP on its own is not an automation technology, it can enable automation in
certain realms - for example, chatbots. With the widespread assumption that
lawyers should be available outside of business hours, chatbots close the gap,
providing round-the-clock support. Some NLP-based automation tools can even
produce basic contracts and automatically file documents based on the language
they contain.
Some
threads of cutting-edge legal NLP research extract case outcomes or
even analyze
past case studies to model how a court will likely rule in future cases.
Because NLP programs rely on machine learning, the more lawyers use them,
the smarter they become. Over time, these predictive capabilities will become
increasingly accurate and reliable.
Looking Forward
Like other technologies based on AI and machine learning, NLP has benefited from recent increases in processing power and from advances in techniques that leverage that power, opening new possibilities in what it can do; but specific problems still require work to implement those advances, both on the software side and in acquiring and annotating the data necessary to train such systems. The
efficacy of a given NLP-driven system relies on high-quality data - eliminating
silos to clean and unify data is an important first step.
You must also be sure the training data genuinely reflects the types and distribution of documents that will appear in practice; that's why the data used to train the models needs to be specific to a client's needs, not just drawn from documents on the internet.
There is still work to be done on modeling before systems can produce human-level sentences or understanding, such as telling or comprehending a story. To date, on the Story Cloze Task, which requires that a system choose between two possible endings of a story, one absurd and one chosen by human annotators, the best-performing systems hover around 88% accuracy. That is impressive, but still wrong on more than one story in ten, and even then on a greatly simplified task. What genuine story understanding would entail is still not completely
understood by narratologists and social scientists. Thus, the kind of data
required not only doesn't exist; it's not even well-understood what the form
and structure of such data would be. This, however, is why NLP models work well for legal: the language used in contracts performs a far more limited social function than storytelling or the many other difficult linguistic problems the NLP research community has set itself to solving.
NLP is
growing even faster due to the constant increase in processing power, research
in how to leverage that power, and theoretical understanding of human language.
In the near future, NLP is bound to become even more ubiquitous as data and
algorithms become more sophisticated, accurate, and powerful. As the algorithms
get smarter, we anticipate increased automation for improving language models,
and progress on new use cases in the legal industry.
ABOUT THE AUTHORS
Jonathan Herr captains BlackBoiler's talented research and development team. He has an ability few developers possess: bridging traditional research with functional software. As a machine learning and artificial intelligence expert, he has worked on numerous DARPA projects utilizing deep neural networks, machine translation, and other NLP methodologies to develop cutting-edge technologies.
Dan Simonson, PhD, tailors custom natural language processing (NLP) solutions for automating contract negotiation. With nearly a decade of experience in computational linguistics, he has worked on problems in
the legal, medical, and defense domains and has published research in the ACL
Anthology. He holds a PhD in computational linguistics from Georgetown
University.