Virtualization Technology News and Information
BlackBoiler 2022 Predictions: Natural Language Processing for Legal - Improving the Data Sets to Increase Contract Accuracy


Industry executives and experts share their predictions for 2022.  Read them in this 14th annual series exclusive.

Natural Language Processing for Legal: Improving the Data Sets to Increase Contract Accuracy

By Jonathan Herr and Dan Simonson, PhD, BlackBoiler

Like other tools that derive from next-generation technologies, natural language processing (NLP) relies on machine learning and AI to improve, accelerate, and automate underlying text analytics functions that transform unstructured text into usable data. Advances in underlying NLP technologies present opportunities for knowledge worker industries, such as legal, to streamline operations by automating tedious and time-consuming administrative tasks; however, most advancements remain far from being a set of mature, easily deployable, and commodifiable technologies. For the foreseeable future, developing such applications still requires expert knowledge and skill to carry them from cutting-edge research to viable product, as well as data that exemplifies the task being automated.

In the last few years, transformer-based methods, such as BERT, have come to dominate cutting-edge work in NLP research. In most computation, software written to process a list of things, such as the word tokens of a sentence, handles them in order, from start to finish. Transformers instead rely on an attention mechanism when learning, directing attention to where it is best suited for learning about a specific point in the data. This has a number of advantages. Attention allows the specific contexts of words to be better captured. It also allows the processing of large volumes of data to be better parallelized. This has led to the rise of pre-trained models: models of language that can be fine-tuned to a specific task while leveraging large volumes of data to extend the result beyond what would be reasonable to develop for every possible task.
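The attention computation at the heart of a transformer can be sketched in a few lines of plain Python. This is a toy illustration of scaled dot-product attention with made-up vectors, not a production implementation: each position's output is a weighted mixture of every position's value vector, which is what lets the model capture context from anywhere in the sequence and process positions in parallel rather than strictly left to right.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention(queries, keys, values):
    """Toy scaled dot-product attention: each query attends over all keys
    at once, and its output mixes the value vectors accordingly."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        scores = [dot(q, k) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs

# Three token positions, each a made-up 2-dimensional vector; using the same
# matrix for queries, keys, and values (self-attention in its simplest form).
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = attention(Q, Q, Q)
```

Because the weights for each position are computed independently, the loop over queries could run in parallel, which is one reason transformers scale so well on modern hardware.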

These methods, and methods that derive from them, will continue to dominate NLP for the foreseeable future, especially the next year. Even if there were a large breakthrough of some kind - the next BERT, or even an entirely new post-neural learning paradigm - most of the work to be published in the next year is currently being finished and will be submitted for peer review in the spring.

NLP techniques will continue to improve, both through increases in processing power and through research that leverages the capabilities of modern hardware to push the state of the art. However, these are not general-purpose systems. Some NLP tasks are focused on components that may support other problems, such as part-of-speech taggers or syntax parsers, but no existing NLP technique or system can simply start solving any problem out of the box. Machine learning requires examples of the problem to learn from; without examples, there is no learning. For example, if you want a machine learning system to learn to review contracts, you need data reflecting rounds of contract review. Only with data that reflects the problem being solved can an NLP system produce output that is relevant, accurate, and sophisticated enough to help with the task at hand.

Building Language Models

With increasing demands on legal professionals in terms of both turnaround and efficiency, the legal industry continues to embrace NLP-driven products as a way of improving turnaround and relieving burdens on staff. Legal entities that leverage NLP are automating a wide range of time-consuming, labor-intensive tasks and enjoying the benefits of better-performing processes and reduced operational drag. Applying NLP to laborious tasks at scale strengthens business agility and prevents burnout. However, one of the biggest challenges of adopting NLP-driven tools in the legal industry is training them to understand long contracts and other legal documents.

Whereas a formal programming language is precisely defined, natural language is ambiguous and is far better understood by humans than by machines. Resolving these ambiguities requires a statistical approximation. In a given NLP application, a language model serves as the core component; it is essentially a parametric reflection of the statistical patterns of the human language it has been provided. While not remotely similar to the manner in which humans process and understand language, it allows a computer program to handle language in a way that is useful. A language model may have something like GPT or BERT at its core, or it may use one of the many methods from statistical NLP's three-decade history.
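As a concrete, deliberately simplistic illustration of what "a parametric reflection of statistical patterns" means, the sketch below builds a bigram model, one of the oldest tools from statistical NLP: it records how often each word follows another and converts the counts to conditional probabilities. The two-sentence corpus is invented for the example; modern models like BERT and GPT are vastly more sophisticated, but the underlying statistical principle is the same.

```python
from collections import Counter, defaultdict

def train_bigram_model(sentences):
    """Count how often each word follows another, then normalize the
    counts into conditional probabilities P(next word | previous word)."""
    counts = defaultdict(Counter)
    for sent in sentences:
        tokens = ["<s>"] + sent.lower().split() + ["</s>"]
        for prev, cur in zip(tokens, tokens[1:]):
            counts[prev][cur] += 1
    return {
        prev: {w: c / sum(ctr.values()) for w, c in ctr.items()}
        for prev, ctr in counts.items()
    }

# A tiny invented corpus of contract-style sentences.
corpus = [
    "the party shall indemnify the other party",
    "the party shall notify the other party",
]
model = train_bigram_model(corpus)
# After "shall", the model splits probability evenly between
# "indemnify" and "notify", because it has seen each once.
```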

On a practical level, fine-tuning a language model currently entails manual processes, though advances in automated probing techniques loom on the horizon. For complex tasks, such as reviewing M&A documents, engineers are building powerful custom data sets on which to pre-train language models, which is promising.
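To make the idea of fine-tuning concrete, here is a deliberately minimal sketch: the "pre-trained" sentence embeddings are frozen, made-up vectors standing in for the output of a model like BERT, and only a small task-specific head (a logistic-regression classifier) is trained on a handful of labeled examples. Real fine-tuning updates millions of parameters with far more data, but the division of labor, a general-purpose representation plus a small task-specific layer, is the same.

```python
import math

# Hypothetical frozen "pre-trained" embeddings (made-up 3-d vectors);
# in practice these would come from a model such as BERT.
examples = [
    ([0.9, 0.1, 0.3], 1),   # e.g. clause acceptable under the playbook
    ([0.8, 0.2, 0.4], 1),
    ([0.1, 0.9, 0.7], 0),   # e.g. clause requiring revision
    ([0.2, 0.8, 0.6], 0),
]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Fine-tuning reduced to its simplest form: the embeddings stay frozen
# and only this small task head is trained by gradient descent.
weights = [0.0, 0.0, 0.0]
bias = 0.0
lr = 0.5
for _ in range(200):
    for x, y in examples:
        p = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
        err = p - y
        weights = [w - lr * err * xi for w, xi in zip(weights, x)]
        bias -= lr * err

def predict(x):
    """Probability that a new (hypothetical) embedding belongs to class 1."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
```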

Training the System

Because NLP-based applications rely on large volumes of data to learn patterns, the most pressing issue centers on training the system - largely because it requires a critical mass of historical data to become intelligent enough to be useful. Further complicating matters, vendors of NLP-driven products need to procure the redlined versions of previous contracts and other historical data from clients to extrapolate the preferred terminology, changes, or clauses, which presents its own set of challenges. 

Luckily, because legal contractual language is largely consistent across documents, entirely novel or new language occurs infrequently compared with language typically studied in the course of NLP research, such as newswire or Wikipedia; this repetition allows the system to reach parity performance with fewer documents than one would expect in other genres of text. Further, most legal documents are written in precise language where many of the phrasings and lexical choices are established in legal precedent. Also, contracts often retain similarities to prior contracts through institutional inertia, where previous agreements may arbitrarily act as templates for new agreements. Consequently, there is less ambiguity for NLP-driven systems in the legal space to navigate. 
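The repetitiveness claim is easy to probe with a crude measure like the type-token ratio: the fraction of distinct words in a passage, where lower values mean more repetition. Both snippets below are invented, and the ratio is sensitive to passage length, so this is an illustration of the idea rather than a rigorous corpus study.

```python
def type_token_ratio(text):
    """Fraction of distinct words in a passage; lower means more repetitive."""
    tokens = text.lower().split()
    return len(set(tokens)) / len(tokens)

# Invented snippets: boilerplate-heavy contract prose vs. newswire-style prose.
contract_like = (
    "the receiving party shall hold the confidential information of the "
    "disclosing party and the receiving party shall not disclose the "
    "confidential information of the disclosing party"
)
news_like = (
    "regulators approved a landmark merger yesterday while critics warned "
    "that consolidation could raise prices across several markets"
)

contract_ttr = type_token_ratio(contract_like)
news_ttr = type_token_ratio(news_like)
# The contract-style passage reuses far more of its vocabulary.
```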

Although the limitations associated with mapping context, specificity, and personalization are persistent concerns, the legal industry largely - and uniquely - operates on prescribed language, which makes it easier for an algorithmic system to draw accurate and useful inferences and predictions.

Applying NLP to Legal-Industry Use Cases

  • Streamlining Legal Research

Conducting thorough research is both time-consuming and essential, which is why many legal entities are using NLP to shorten timelines by streamlining the research process. The need is acute: the Indian legal system, for example, is well known to suffer from an enormous backlog of cases. NLP-powered legal search engines can translate plain language into standardized legal language, which makes it significantly easier to sift through relevant documents and cases for discovery.

More advanced NLP programs can search beyond mere keywords using topics and embedding-driven broadening of search terms, which makes it easier for lawyers to find what they need faster. Additionally, some NLP programs can analyze a case study and suggest similar cases for lawyers to review - these recommendations help lawyers obtain precedent quickly and reliably.
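Embedding-driven broadening can be sketched with toy vectors: terms are points in a vector space, and a query retrieves its nearest neighbors by cosine similarity, surfacing related phrasings that an exact keyword match would miss. The embeddings below are invented for illustration; a real system would use vectors learned by a model from large text collections.

```python
import math

# Hypothetical toy embeddings; real ones would be learned from data.
embeddings = {
    "indemnification": [0.90, 0.80, 0.10],
    "liability":       [0.85, 0.75, 0.20],
    "hold harmless":   [0.88, 0.70, 0.15],
    "termination":     [0.10, 0.20, 0.90],
}

def cosine(a, b):
    """Cosine similarity: 1.0 for identical directions, near 0 for unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def broaden(term, k=2):
    """Return the k terms closest to the query term in embedding space,
    so a search for one phrase also surfaces related phrasings."""
    query = embeddings[term]
    others = [(t, cosine(query, v)) for t, v in embeddings.items() if t != term]
    return [t for t, _ in sorted(others, key=lambda p: -p[1])[:k]]
```

A search for "indemnification" would thus also surface documents phrased in terms of "liability" or "hold harmless", while the unrelated "termination" is ranked far lower.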

  • Analyzing and Drafting Contracts and Documents

With contracts generally containing highly repetitive language, NLP helps lawyers review and negotiate contracts more efficiently and consistently, reducing errors and, consequently, risk. Using NLP contract tools, lawyers receive redlined contracts that have been reviewed in accordance with the legal entity's playbook - in minutes.
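A redline is, at bottom, a structured diff between contract versions. The sketch below uses Python's standard-library difflib to mark word-level deletions in brackets and insertions in braces; it only illustrates the output format. An actual NLP review tool of the playbook-driven kind described here learns which edits to propose from past negotiations, rather than merely diffing two texts it is given.

```python
import difflib

def redline(original, revised):
    """Word-level redline: deletions in [brackets], insertions in {braces}."""
    a, b = original.split(), revised.split()
    out = []
    for tag, i1, i2, j1, j2 in difflib.SequenceMatcher(None, a, b).get_opcodes():
        if tag in ("delete", "replace"):
            out.append("[" + " ".join(a[i1:i2]) + "]")
        if tag in ("insert", "replace"):
            out.append("{" + " ".join(b[j1:j2]) + "}")
        if tag == "equal":
            out.extend(a[i1:i2])
    return " ".join(out)

# Invented one-line example of a negotiated change to a notice period.
marked = redline("the term is 30 days", "the term is 60 days")
```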

Because word choice and syntax are so consequential in legal documents, legal entities are using NLP-driven tools to standardize language to prevent errors in drafting documents that lead to unintended interpretations. Some NLP programs can process documents in a variety of languages, and some can create templates based on lawyer needs. 

Because it's so easy to make a costly mistake when drafting a document or reviewing and negotiating a contract, running NLP-driven document review to prevent these errors can pay for itself many times over.

  • Exploring Emerging NLP Capabilities

Although NLP on its own is not an automation technology, it can enable automation in certain realms - for example, chatbots. With clients increasingly expecting lawyers to be available outside of business hours, chatbots close the gap, providing round-the-clock support. Some NLP-based automation tools can even produce basic contracts and automatically file documents based on the language they contain.

Some threads of cutting-edge legal NLP research extract case outcomes or even analyze past case studies to model how a court will likely rule in future cases. Because NLP programs rely on machine learning, the more lawyers use them, the smarter they become. Over time, these predictive capabilities will become increasingly accurate and reliable.

Looking Forward

Like other technologies based on AI and machine learning, NLP has benefited from recent increases in processing power and advances in techniques that leverage that power, opening new possibilities in what it can do; but specific problems still require work to implement those advances, both on the software side and in acquiring and annotating the data necessary to train such systems. The efficacy of a given NLP-driven system relies on high-quality data, so eliminating silos to clean and unify data is an important first step.

You must also be sure the training data genuinely reflects the types and distribution of documents that will appear in practice; that is why the data used to train the models needs to be specific to the needs of a client, not just scraped from documents on the internet.

There is still work to be done on modeling before systems can produce human-level sentences or understanding, such as telling or comprehending a story. To date, on the Story Cloze Task, which requires a system to choose between two possible outcomes of a story - one absurd, one chosen by human annotators - the best-performing systems hover around 88% accuracy. That is an impressive result, but still wrong on one story in ten, and even then on a greatly simplified task. What genuine story understanding would entail is still not completely understood by narratologists and social scientists; thus, the kind of data required not only doesn't exist, it isn't even well understood what the form and structure of such data would be. This limitation is also why NLP models work well for legal applications: the language used in contracts performs a far more limited social function than storytelling or the many other difficult linguistic problems the NLP research community has set itself to solving.

NLP is advancing ever faster thanks to constant increases in processing power, research into how to leverage that power, and a deepening theoretical understanding of human language. In the near future, NLP is bound to become even more ubiquitous as data and algorithms become more sophisticated, accurate, and powerful. As the algorithms get smarter, we anticipate increased automation for improving language models and progress on new use cases in the legal industry.



Jonathan Herr 

Jonathan Herr captains BlackBoiler's talented research and development team. He has a unique ability, one few developers possess, to bridge traditional research with functional software. As a machine learning and artificial intelligence expert, he has worked on countless DARPA projects utilizing deep neural networks, machine translation, and other NLP methodologies to develop cutting-edge technologies.

Dan Simonson 

Dan Simonson, PhD, tailors custom natural language processing (NLP) solutions for automating contract negotiation. With nearly a decade of experience in computational linguistics, he has worked on problems in the legal, medical, and defense domains and has published research in the ACL Anthology. He holds a PhD in computational linguistics from Georgetown University.

Published Friday, January 28, 2022 7:31 AM by David Marshall