Named Entity Recognition For Better Data Management

Welcome to this blog from Aiimi Engineering, where we’ll introduce one of the exciting research areas we’re working on: Named Entity Recognition.

At Aiimi, we’re obsessed with finding new ways to use technology and AI to manage information intuitively. Natural Language Processing is a key area for us, and today we’re giving you the low-down on one branch of that – Named Entity Recognition. We’ll cover the what, the how, and the why, including its use in our next-gen search and discovery software, the Aiimi Insight Engine. Let’s begin!

What is Named Entity Recognition?

Named Entity Recognition (NER) is a way to find and classify useful information (entities) within unstructured text – entities such as people, organisations, and locations. NER is a sequence tagging task: it labels parts of a word sequence (usually a sentence) with the location and type of each entity. Entities are commonly one word long, but they can also span multiple words.
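
To make the sequence tagging idea concrete, here's a minimal sketch using the BIO scheme, a common labelling convention (not one this post commits to) in which B- marks the first word of an entity, I- a continuation, and O a word outside any entity. The example sentence and the helper name bio_to_spans are our own illustrative assumptions.

```python
from typing import List, Tuple


def bio_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (entity_text, entity_type) pairs."""
    spans = []
    current_tokens, current_type = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity starts; close any entity that was still open.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # Continuation of the entity that is currently open.
            current_tokens.append(token)
        else:
            # "O" (or an I- tag that does not continue the open entity).
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
    if current_tokens:
        spans.append((" ".join(current_tokens), current_type))
    return spans


tokens = ["Ada", "Lovelace", "visited", "London", "with", "Acme", "Ltd"]
tags = ["B-PER", "I-PER", "O", "B-LOC", "O", "B-ORG", "I-ORG"]
entities = bio_to_spans(tokens, tags)
print(entities)
# [('Ada Lovelace', 'PER'), ('London', 'LOC'), ('Acme Ltd', 'ORG')]
```

Note how the two-word entities "Ada Lovelace" and "Acme Ltd" are each recovered as a single span, while "London" is a one-word entity.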

How does Named Entity Recognition work?

There are a variety of different ways to extract entities from text. A successful NER model will not only identify whether an entity is present, but also classify the entity type. In this section, we’ll guide you through some of the methods used in Named Entity Recognition.

Dictionary-based NER

This method compares each word in a text to a pre-existing dictionary (or lexicon) of entities. If a word or series of words matches one of the items in the dictionary, it’s classified as an entity. The drawback to this approach is that it’s limited by the quality of the pre-existing entity information: it can’t find new entities or synonyms that aren’t already in the dictionary.
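
As a rough illustration, the sketch below matches a sentence against a tiny lexicon, trying the longest candidate first so that multi-word entries win. The lexicon contents, the entity labels, and the function name dictionary_ner are all hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical lexicon: lower-cased token tuples mapped to an entity type.
LEXICON: Dict[Tuple[str, ...], str] = {
    ("acme", "ltd"): "ORG",
    ("leeds",): "LOC",
    ("milton", "keynes"): "LOC",
}
MAX_LEN = max(len(key) for key in LEXICON)


def dictionary_ner(tokens: List[str]) -> List[Tuple[str, str]]:
    """Return (entity_text, entity_type) for every lexicon match."""
    found = []
    i = 0
    while i < len(tokens):
        match_len = 0
        # Try the longest candidate first so multi-word entries win.
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            key = tuple(t.lower() for t in tokens[i:i + n])
            if key in LEXICON:
                found.append((" ".join(tokens[i:i + n]), LEXICON[key]))
                match_len = n
                break
        # Skip past a match, or move on one token if nothing matched.
        i += match_len or 1
    return found


text = "Acme Ltd moved from Leeds to Milton Keynes and Manchester"
matches = dictionary_ner(text.split())
print(matches)
# [('Acme Ltd', 'ORG'), ('Leeds', 'LOC'), ('Milton Keynes', 'LOC')]
```

"Manchester" is missed simply because it isn't in the lexicon, which is exactly the limitation described above.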

Rule-based NER

This approach is based on the observation that certain types of entities have a predictable structure. To extract these from text, we apply a series of rules that correspond to each type of entity we’re looking for – also known as pattern-matching.

This method of entity extraction is more generalisable than the dictionary-based method, but it still requires some domain knowledge about how these entities are structured. Email addresses, phone numbers, and postcodes often have a consistent format, so they work well with this method.
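
Here's a small pattern-matching sketch using Python's built-in re module. The three patterns are deliberately simplified assumptions for illustration (real email, UK postcode, and phone formats vary far more), and the labels and function name are our own.

```python
import re
from typing import List, Tuple

# Deliberately simplified, illustrative patterns; real-world formats vary more.
RULES = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "UK_POSTCODE": re.compile(r"\b[A-Z]{1,2}\d[A-Z\d]? ?\d[A-Z]{2}\b"),
    "UK_PHONE": re.compile(r"\b0\d{4} ?\d{6}\b"),
}


def rule_based_ner(text: str) -> List[Tuple[str, str]]:
    """Apply every rule and return (matched_text, label) in reading order."""
    hits = []
    for label, pattern in RULES.items():
        for match in pattern.finditer(text):
            hits.append((match.start(), match.group(), label))
    hits.sort()  # order by position in the text
    return [(value, label) for _, value, label in hits]


text = "Email jo.bloggs@example.com or call 01632 960123; we are at AB1 2CD."
entities = rule_based_ner(text)
print(entities)
# [('jo.bloggs@example.com', 'EMAIL'), ('01632 960123', 'UK_PHONE'), ('AB1 2CD', 'UK_POSTCODE')]
```

No dictionary of known email addresses or postcodes is needed, but someone still has to know the format well enough to write each rule.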

Statistical NER (Machine Learning)

The machine learning approach creates a probabilistic model of where entities are likely to appear within natural language, based on previous experience. Trained with a large number of labelled examples, the model then uses what it has learned to identify entities in an unseen context.

The benefit of this approach is that, provided the model is trained over a diverse set of contexts, it should be able to work on a wide range of unseen examples. Because natural language is inconsistent, it’s important that any method of entity extraction can work in many different scenarios. The machine learning method achieves this, making it a far more powerful and generalisable solution than the dictionary-based or rule-based methods.
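
To show the idea at toy scale, the sketch below trains a naive Bayes token classifier on four labelled sentences and then tags words it has never seen. This is our own simplified stand-in, not the model used in practice: a real system would use far more data and a stronger model. The features (previous word, next word, capitalisation), the class name, and the training sentences are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


def features(tokens: List[str], i: int) -> List[str]:
    """Context features for token i; the word itself is left out on purpose."""
    prev_word = tokens[i - 1].lower() if i > 0 else "<s>"
    next_word = tokens[i + 1].lower() if i < len(tokens) - 1 else "</s>"
    return [f"prev={prev_word}", f"next={next_word}", f"cap={tokens[i][0].isupper()}"]


class NaiveBayesTagger:
    def __init__(self) -> None:
        self.tag_counts: Counter = Counter()
        self.feature_counts: Dict[str, Counter] = defaultdict(Counter)
        self.vocab = set()

    def train(self, sentences: List[List[Tuple[str, str]]]) -> None:
        """Count how often each feature occurs with each tag."""
        for sentence in sentences:
            tokens = [word for word, _ in sentence]
            for i, (_, tag) in enumerate(sentence):
                self.tag_counts[tag] += 1
                for feat in features(tokens, i):
                    self.feature_counts[tag][feat] += 1
                    self.vocab.add(feat)

    def predict(self, tokens: List[str]) -> List[str]:
        """Pick the most probable tag for each token independently."""
        total = sum(self.tag_counts.values())
        tags = []
        for i in range(len(tokens)):
            best_tag, best_score = None, -math.inf
            for tag, count in self.tag_counts.items():
                denom = sum(self.feature_counts[tag].values()) + len(self.vocab)
                score = math.log(count / total)
                for feat in features(tokens, i):
                    # Add-one smoothing keeps unseen features from zeroing a tag.
                    score += math.log((self.feature_counts[tag][feat] + 1) / denom)
                if score > best_score:
                    best_tag, best_score = tag, score
            tags.append(best_tag)
        return tags


training = [
    [("Alice", "PER"), ("works", "O"), ("at", "O"), ("Acme", "ORG")],
    [("Bob", "PER"), ("works", "O"), ("at", "O"), ("Globex", "ORG")],
    [("Carol", "PER"), ("lives", "O"), ("in", "O"), ("Leeds", "LOC")],
    [("Dave", "PER"), ("lives", "O"), ("in", "O"), ("York", "LOC")],
]
tagger = NaiveBayesTagger()
tagger.train(training)
print(tagger.predict("Erin works at Initech".split()))
# ['PER', 'O', 'O', 'ORG']
```

"Erin" and "Initech" never appear in the training sentences; the model labels them from context alone, which is what lets a statistical approach generalise beyond a fixed dictionary or rule set.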

Why is Named Entity Recognition so useful?

NER can provide us with many business benefits. It’s one of the key techniques used by the Aiimi Insight Engine to extract value from information and classify documents, such as identifying sensitive files for GDPR. By associating pieces of data with documents (e.g. invoices with purchase orders), Named Entity Recognition enables users to easily navigate information without relying on search. It also allows us to extract information like people’s names, geopolitical data, and organisations. Combined with technologies such as phrase and topic extraction, it enables us to establish links between these entities and events like fraud or terrorism.

We’ve covered several types of NER in this blog: dictionary-based, rule-based, and statistical (machine learning). With effective training, it’s the machine learning approach that tells us things we didn’t already know about a piece of information. In the context of case management, for example, we can present the user with a series of facts and entities about the case to directly help prioritise, organise, and manage cases.

Named Entity Recognition has excellent synergies with other technologies, such as phrase and topic extraction, and text classification. It also helps with:

Document summarisation – Named entities provide additional and useful context that allows a user to quickly access the key points within a document.
Automated pseudonymisation and anonymisation – Named entities allow us to automate this whole process and remove time spent redacting information.
Synonym detection – We can use NER to find master data that a business didn’t know about, like suppliers, assets, or pieces of equipment.
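
As an example of the pseudonymisation use case above, here's a minimal sketch that swaps each entity span for a stable placeholder, so the same name always maps to the same token. It assumes the spans come from an NER step and don't overlap; here they're found by plain string search just to keep the example self-contained. The function name and placeholder format are hypothetical.

```python
import re
from typing import Dict, List, Tuple


def pseudonymise(text: str, entities: List[Tuple[int, int, str]]) -> Tuple[str, Dict[str, str]]:
    """Replace each non-overlapping (start, end, label) span with a placeholder."""
    mapping: Dict[str, str] = {}
    counters: Dict[str, int] = {}
    pieces = []
    cursor = 0
    for start, end, label in sorted(entities):
        original = text[start:end]
        if original not in mapping:
            # First time we see this entity: give it the next number for its label.
            counters[label] = counters.get(label, 0) + 1
            mapping[original] = f"[{label}_{counters[label]}]"
        pieces.append(text[cursor:start])
        pieces.append(mapping[original])
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces), mapping


text = "Jane Doe emailed Acme Ltd. Jane Doe then called Acme Ltd."
# Stand-in for NER output: locate two known names by string search.
known = {"Jane Doe": "PER", "Acme Ltd": "ORG"}
spans = [
    (m.start(), m.end(), label)
    for name, label in known.items()
    for m in re.finditer(re.escape(name), text)
]
redacted, mapping = pseudonymise(text, spans)
print(redacted)
# [PER_1] emailed [ORG_1]. [PER_1] then called [ORG_1].
```

Keeping the mapping allows the substitution to be reversed by someone authorised to do so; discarding it moves the result closer to anonymisation.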

What’s Aiimi working on in this space?

At Aiimi, we’re looking beyond the common types of entity and finding new ways to extract business-focused entities, such as invoice numbers and project codes.

We’ve developed a technology called the ‘Trie-Entity Extractor’, which uses all three Named Entity Recognition approaches and is already baked into our Aiimi Insight Engine software. It applies dictionary-based and rule-based NER in a super-fast, scalable way, optimised for a business use case, and combines these with a machine learning model to create the most powerful solution.
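
This post doesn't go into the internals of the Trie-Entity Extractor, so the following is not its implementation. It's a generic sketch of why a trie (prefix tree) suits fast dictionary lookup: each sentence is scanned once, and the longest matching entry is found by walking down the tree token by token. The class names, entries, and labels are hypothetical.

```python
from typing import Dict, List, Optional, Tuple


class TrieNode:
    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.label: Optional[str] = None  # set when a full entry ends here


class EntityTrie:
    def __init__(self) -> None:
        self.root = TrieNode()

    def add(self, phrase: str, label: str) -> None:
        """Insert a phrase as a path of lower-cased tokens."""
        node = self.root
        for token in phrase.lower().split():
            node = node.children.setdefault(token, TrieNode())
        node.label = label

    def extract(self, tokens: List[str]) -> List[Tuple[str, str]]:
        """Return the longest match starting at each position, left to right."""
        results = []
        i = 0
        while i < len(tokens):
            node, j = self.root, i
            best_end, best_label = None, None
            # Walk as far as the trie allows, remembering the longest full match.
            while j < len(tokens) and tokens[j].lower() in node.children:
                node = node.children[tokens[j].lower()]
                j += 1
                if node.label is not None:
                    best_end, best_label = j, node.label
            if best_end is not None:
                results.append((" ".join(tokens[i:best_end]), best_label))
                i = best_end
            else:
                i += 1
        return results


trie = EntityTrie()
trie.add("Project Falcon", "PROJECT")
trie.add("Project Falcon Phase Two", "PROJECT")
trie.add("Acme Ltd", "ORG")
tokens = "Acme Ltd approved Project Falcon Phase Two and Project Falcon budgets".split()
found = trie.extract(tokens)
print(found)
# [('Acme Ltd', 'ORG'), ('Project Falcon Phase Two', 'PROJECT'), ('Project Falcon', 'PROJECT')]
```

Because entries that share a prefix share a path, lookup cost depends on the length of the match rather than the size of the dictionary.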

This research project is focused on building an NER machine learning model that can identify business entities to a high degree of accuracy. There are a variety of pre-trained models available on an open-source basis. Unfortunately, the performance of these models in a business context is not satisfactory, because many of the academic datasets used to train these NER models are built primarily on news data. The two most popular academic datasets are:

CoNLL-2003 4-class dataset - https://www.aclweb.org/anthology/W03-0419.pdf

OntoNotes 18-class dataset - https://catalog.ldc.upenn.edu/LDC2013T19

As the models already available don’t perform well enough on business data, we need to develop our own NER model. This means either fine-tuning a pre-trained model for a business context or building a new model from scratch.

With our own document set, we’re able to formalise a methodology that can be deployed on any client data.

The key elements of this project are:

Data Collection - Channelling appropriate data sources for use as a training set for this model
Dataset Labelling - Marking-up documents with known entity information
Word Embeddings - Processing words in a way that maximises the ability for entities to be successfully recognised and categorised
Model Training - Automatically digesting labelled documents to train a model
Model Evaluation - Identifying useful metrics to quantify the performance of the NER system
Model Deployment - Building a scalable enrichment pipeline to deploy this technology
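
For the Model Evaluation step, one common choice (we're not claiming it's the only metric used in this project) is entity-level precision, recall, and F1 with exact matching: a predicted entity only counts as correct if both its boundaries and its type match the labelled one. The span format and function name below are illustrative assumptions.

```python
from typing import Dict, Set, Tuple

Span = Tuple[int, int, str]  # (start_token, end_token, label)


def entity_scores(gold: Set[Span], predicted: Set[Span]) -> Dict[str, float]:
    """Exact-match precision, recall, and F1 over entity spans."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


gold = {(0, 2, "PER"), (3, 4, "LOC"), (5, 7, "ORG")}
# One wrong type at (3, 4) and one spurious entity at (8, 9).
predicted = {(0, 2, "PER"), (3, 4, "ORG"), (5, 7, "ORG"), (8, 9, "LOC")}
scores = entity_scores(gold, predicted)
print({name: round(value, 3) for name, value in scores.items()})
# {'precision': 0.5, 'recall': 0.667, 'f1': 0.571}
```

Scoring whole entities rather than individual words penalises partial matches, which matters when a half-extracted invoice number or project code is no use to anyone.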

Find out more about how the Aiimi Insight Engine helps businesses classify, enrich, protect, and control data, or take a deep dive into our latest insights on Machine Learning and neural networks.

Aiimi Engineering on… Named Entity Recognition.