A few blogs back, we spoke about structuring your content by extracting business entities, such as asset tags, sites, and customer names. Usually, dictionaries of terms and regular expressions with proximity detection are enough for this type of entity extraction; the only technical challenge is doing this super fast and at scale. If you did get a chance to read the blog I mentioned, you'll recall that this was something we spent a significant amount of time perfecting at Aiimi.
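To make that concrete, here is a minimal, hypothetical sketch of the dictionary-and-regex approach with a simple proximity check. The terms, patterns and trigger words below are invented for illustration, not taken from our actual pipeline:

```python
import re

# Hypothetical dictionaries of known business terms -- a real deployment
# would load thousands of entries from a reference data store.
LITERAL_TERMS = {
    "site": ["Head Office", "Plant 7"],
    "customer": ["Acme Ltd"],
}

# A bare five-digit number only counts as an asset tag if a trigger word
# ("asset" or "tag") appears within `window` characters of it -- a very
# simple form of proximity detection.
ASSET_PATTERN = re.compile(r"\b\d{5}\b")
TRIGGER = re.compile(r"\b(asset|tag)\b", re.IGNORECASE)

def extract_entities(text, window=30):
    """Return (entity_type, matched_text) pairs found in the text."""
    hits = []
    # Literal dictionary terms: a straight substring match is enough.
    for entity_type, terms in LITERAL_TERMS.items():
        for term in terms:
            for match in re.finditer(re.escape(term), text):
                hits.append((entity_type, match.group()))
    # Pattern-based terms: only keep a match if a trigger word is nearby.
    for match in ASSET_PATTERN.finditer(text):
        context = text[max(0, match.start() - window):match.end() + window]
        if TRIGGER.search(context):
            hits.append(("asset_tag", match.group()))
    return hits

print(extract_entities("Asset 40172 was moved from Plant 7 to Head Office."))
```

Matching like this is extremely cheap per document, which is why the real engineering challenge is doing it at scale rather than the matching itself.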

But what if you want to extract people, location or organisation names? It is obviously not practical to create a dictionary for all of these, so what if we could use machine learning to understand language structure and extract them instead?

Looking into Natural Language Processing (NLP)

Natural Language Processing (NLP for short), a branch of text analytics with its roots in machine learning, offers us a way to dismantle a sentence and then extract nouns, verbs and so on. Combine this with some statistical machine learning models and you are then able to extract specific entities, such as people’s names, locations or even the business-specific things that we are interested in.
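As a taste of the very first step, dismantling text into sentences and tokens, here is a deliberately naive, standard-library-only sketch; real NLP toolkits go far beyond this, handling abbreviations, decimals, contractions and so on:

```python
import re

def sentences(text):
    # Naive split on sentence-ending punctuation followed by whitespace.
    # Proper tokenisers also cope with abbreviations ("Dr."), decimals, etc.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def tokens(sentence):
    # Words, plus standalone punctuation marks as their own tokens.
    return re.findall(r"\w+|[^\w\s]", sentence)

text = "NLP dismantles sentences. Then a model tags each token."
for s in sentences(text):
    print(tokens(s))
```

Once text is broken down like this, a part-of-speech tagger and a statistical model can be layered on top to pick out the entities themselves.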

OpenNLP

At Aiimi, our research started out by looking at OpenNLP, an Apache open source offering written in Java. OpenNLP provides a whole host of functionality, including basic paragraph and sentence tokenisation, followed by a series of ‘name finders’ which extract the actual terms. The statistical name finder requires a trained model for each entity type that you wish to extract, and OpenNLP provides a bunch of features that help you train models. However, herein lies the problem – the sheer volume of text that you need to mark up to train a model. It usually requires around 10,000 sentences to create a robust model with good precision and recall. Luckily though, OpenNLP comes with a few pre-trained models.
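To give a sense of the mark-up effort involved, OpenNLP's name finder is trained on sentences annotated with `<START:type> ... <END>` tags, one sentence per line. The examples below are invented for illustration; a robust model needs around 10,000 lines like these:

```
<START:person> Jane Smith <END> joined <START:organization> Acme Ltd <END> in March .
The audit at <START:location> Reading <END> was led by <START:person> Tom Jones <END> .
```

Annotating that many sentences by hand is exactly the bottleneck described above, which is why the pre-trained models matter so much.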

NLTK

Moving on from OpenNLP, we started to investigate the Natural Language Toolkit - NLTK for short. This is a Python-based set of libraries and pre-trained models that can be used to perform a wide range of text processing tasks. Luckily for us, we could easily integrate this with the InsightMaker enrichment process through our REST enrichment step and our Python Application Server, which we use to host Python-based enrichment steps.

Trials of the pre-built models for entities like person, location and organisation proved very successful across most text content - provided it was not semi-structured content, such as an invoice or a purchase order. This is because the models are trained on properly formed sentences; those types of documents seldom contain English that can be parsed and tokenised into proper sentences.

Google & Microsoft

Our research would not have been complete without looking at both the Google Cloud Platform and Microsoft Cognitive Services to see what they had to offer in the NLP space. Unsurprisingly, both provide some very good capabilities in these areas and are underpinned by a series of very well-trained models. Nothing less than what you would expect, since both index the internet! But one limitation is that you can’t provide your own model at this stage, so you can’t detect your business-specific entities.

Just before we close, I will also touch on some text summarisation work that we have been undertaking, which aims to summarise very large bodies of text and pull out the key sections for the user. The text summarisation is based on the Python library Sumy, which provides us with a series of algorithms that we can use to create summaries. These summaries can then be used to present a picture of a large set of documents, perhaps a case file. Combined with dynamic topics (more on this in a later blog!), this can give users a very efficient interface into what may previously have been an unwieldy set of documentation.

Cheers and speak soon, Paul

If you missed my previous blogs in the 12 Days of Information Enrichment series, you can catch up here.

Day 1 - What is enrichment? Creating wealth from information

Day 2 - Starting at the beginning with Text Extraction

Day 3 - Structuring the unstructured with Business Entity Extraction

Day 4 - Solving the GDPR, PII and PCI problem

Day 5 - Sustainable Document Classification

Day 6 - Image Enrichment: Giving your business vision