How Contract Analytics Tools like ContraxSuite Work
Optical Character Recognition
Unfortunately, many documents were born "analog" and scanned as PDF or TIF files. Optical character recognition, often abbreviated as OCR, is a technology that can help convert scanned or handwritten material back into digital, "computer-friendly" documents.
Once upon a time, OCR was much more difficult. Today, high-quality OCR tools like the Google-sponsored Tesseract project are freely available for nearly 100 languages. If you've ever used Google Books, you've already used the OCR software under the hood in ContraxSuite.
Natural Language Processing
Over the last 50 years, computer scientists and linguists have made huge strides in the fields of natural language processing and computational linguistics. Luckily, almost all of the most important discoveries in these fields have been published in leading academic journals, and many of the resulting algorithms are freely available in packages like Stanford's CoreNLP and the Natural Language Toolkit (NLTK).
Today, tools like CoreNLP and NLTK allow users to quickly analyze the structure of words and sentences, look up synonyms and antonyms, tag parts of speech, and identify proper nouns like countries in dozens of the world's most common languages, including English, Spanish, French, German, Chinese/Mandarin, and Arabic.
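To make "analyzing the structure of words and sentences" a little more concrete, here is a minimal sketch of the first step such toolkits perform: splitting text into sentences and tokens. It uses only Python's standard library rather than CoreNLP or NLTK, the function names and the sample clause are invented for illustration, and the regular expressions are deliberately naive (they will mis-split abbreviations such as "Inc." followed by a capitalized word).

```python
import re

def split_sentences(text):
    """Naively split after ., ! or ? when followed by whitespace and a capital letter."""
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text.strip())
    return [p for p in parts if p]

def tokenize(sentence):
    """Return word tokens (keeping simple apostrophe forms) and punctuation tokens."""
    return re.findall(r"[A-Za-z0-9]+(?:'[A-Za-z]+)?|[^\sA-Za-z0-9]", sentence)

# Hypothetical sample text, not taken from any real contract.
text = "The Supplier shall deliver the Goods within 30 days. Payment is due on delivery."

sentences = split_sentences(text)
tokens = [tokenize(s) for s in sentences]

for sentence, toks in zip(sentences, tokens):
    print(sentence)
    print("  ", toks)
```

Real toolkits build on this kind of tokenization with trained models for part-of-speech tagging and named-entity recognition, which is why they handle edge cases that simple patterns like these cannot.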
Clustering is a powerful form of unsupervised learning that has been used for over 50 years in fields like biology, genetics, and statistics. By combining natural language processing with clustering, we are able to automatically categorize clauses and documents just like scientists group sequences of DNA or regions of the brain.
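As a rough illustration of how clauses can be grouped without any labels, the sketch below turns each clause into a bag-of-words vector and groups the vectors with a small k-means-style loop based on cosine similarity (sometimes called spherical k-means). This is not ContraxSuite's implementation: it uses only the standard library, the clause texts, stopword list, and function names are invented for the example, and the starting centers are chosen deterministically so the run is repeatable.

```python
import math
import re
from collections import Counter

# Tiny, hand-picked stopword list (an assumption for this example only).
STOPWORDS = {"the", "of", "by", "and", "to", "be", "is", "it",
             "shall", "this", "any", "each", "not", "under"}

def vectorize(text):
    """Bag-of-words term counts, lowercased, with stopwords removed."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w not in STOPWORDS)

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def centroid(vectors):
    """Sum of member vectors; cosine ignores scale, so no division is needed."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return total

def cluster(texts, k, iterations=10):
    vectors = [vectorize(t) for t in texts]
    # Deterministic start: the first text, then whichever text is least
    # similar to the centers chosen so far.
    centers = [vectors[0]]
    while len(centers) < k:
        centers.append(min(vectors, key=lambda v: max(cosine(v, c) for c in centers)))
    labels = []
    for _ in range(iterations):
        # Assign each text to its most similar center, then recompute centers.
        labels = [max(range(k), key=lambda i: cosine(v, centers[i])) for v in vectors]
        for i in range(k):
            members = [v for v, label in zip(vectors, labels) if label == i]
            if members:
                centers[i] = centroid(members)
    return labels

# Hypothetical clauses written for this example.
clauses = [
    "This Agreement shall be governed by the laws of the State of New York.",
    "This Agreement is governed by and construed under the laws of Delaware.",
    "Each party shall keep the Confidential Information secret and shall not disclose it.",
    "The receiving party shall not disclose Confidential Information to any third party.",
]

labels = cluster(clauses, k=2)
for label, clause in zip(labels, clauses):
    print(label, clause)
```

On this toy input, the two governing-law clauses land in one group and the two confidentiality clauses in the other, without the program ever being told what either category means.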
As a tool for modern data science, clustering is readily available in many open source and freely available frameworks such as scikit-learn, Apache Spark/Mahout, and Weka. Thanks to contributions from two generations of academics and companies like Facebook and Google, we can stand on the shoulders of giants instead of starting from scratch.
When you hear someone talk about "machine learning", they are probably referring to a mathematical approach called classification. Classification is a form of supervised learning (alongside regression, which you might remember from your schooling), in which a set of rules or equations helps distinguish between two or more types of "things."
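A small sketch may help show what "supervised" means in practice: the program is given clauses that a person has already labeled, learns word statistics for each label, and then guesses the label of a clause it has not seen. The example below is a minimal multinomial Naive Bayes classifier with add-one smoothing, written with the standard library only. It is one simple classification technique chosen for illustration, not the method ContraxSuite uses, and the training clauses, label names, and function names are all invented.

```python
import math
import re
from collections import Counter, defaultdict

def tokens(text):
    """Lowercased alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def train(examples):
    """Count documents per label and words per label from (text, label) pairs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        label_counts[label] += 1
        for tok in tokens(text):
            word_counts[label][tok] += 1
            vocab.add(tok)
    return label_counts, word_counts, vocab

def predict(model, text):
    """Return the label with the highest log-probability under Naive Bayes."""
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in label_counts.items():
        score = math.log(n_docs / total_docs)  # prior for this label
        total_words = sum(word_counts[label].values())
        for tok in tokens(text):
            if tok in vocab:  # ignore words never seen in training
                # Add-one (Laplace) smoothing avoids zero probabilities.
                score += math.log((word_counts[label][tok] + 1)
                                  / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical labeled training data written for this example.
training = [
    ("This Agreement shall be governed by the laws of the State of New York.", "governing_law"),
    ("This Agreement is governed by and construed under the laws of Delaware.", "governing_law"),
    ("Each party shall keep the Confidential Information secret.", "confidentiality"),
    ("The receiving party shall not disclose Confidential Information.", "confidentiality"),
]

model = train(training)
print(predict(model, "The laws of England govern this contract."))
print(predict(model, "The recipient must not disclose confidential information."))
```

With only four training clauses this is a toy, but the shape is the same at scale: more labeled examples generally give the classifier better statistics to draw on, which is why the source of training data matters so much.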
Back in 1983, when machine learning was old enough to warrant a history, scientists were already drawing distinctions between expert or rule-based systems and data-driven systems. As systems like Google's DeepMind or IBM's Watson show, the "battle for AI" has swung in favor of data. Luckily, the battle has also produced a rich open-source ecosystem of machine learning frameworks such as scikit-learn, Apache Spark/Mahout, and Weka.
If machine learning techniques are driven by data, where do contract analytics platforms get their data from? Generally, there are two sources for high-quality contracts and legal documents:
- public data, like Federal court evidence or SEC filings
- private data, like material acquired from clients
ContraxSuite is built on the LexPredict Legal Document Database, which draws entirely from public sources like Federal courts and SEC filings. Unfortunately, many vendors build their products on private data acquired from clients, which introduces serious legal and economic concerns. Because ContraxSuite is an open-source product that most organizations can use freely, they can benefit from contract analytics without worrying that their private data will be aggregated or re-used.