Aiimi Labs on… Named-Entity Recognition.
Welcome to the first blog in a new series from Aiimi Labs, where we’ll be introducing some of the exciting research projects we’re working on right now.
At Aiimi, we’re obsessed with finding new ways to use technology and AI to manage information intuitively. Natural Language Processing is a key area for us, and today we’re giving you the low-down on one branch of that – Named-Entity Recognition. We’ll cover the what, the how, and the why, including its use in our next-gen search and discovery software, InsightMaker. Let’s begin!
What is Named-Entity Recognition?
Named-Entity Recognition (NER) is a way to find and classify useful information (entities) within unstructured text – entities such as people, organisations, and locations. NER is a sequence tagging task, used to label parts of a word sequence (usually a sentence) with the location and type of each entity. Entities are commonly one word long, but they can also span multiple words.
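To make the sequence tagging idea concrete, here's a minimal Python sketch. It assumes the widely used BIO labelling scheme (B- starts an entity, I- continues it, O means outside any entity), which we haven't named above, and the sentence and labels are made up for illustration.

```python
def spans_from_bio(tokens, tags):
    """Group BIO-tagged tokens into (entity_text, entity_type) pairs."""
    entities = []
    current_words, current_type = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity starts; close any entity that was still open.
            if current_words:
                entities.append((" ".join(current_words), current_type))
            current_words, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # Continuation of the entity that is currently open.
            current_words.append(token)
        else:
            # "O" (or an I- tag with no matching open entity) ends the entity.
            if current_words:
                entities.append((" ".join(current_words), current_type))
            current_words, current_type = [], None
    if current_words:
        entities.append((" ".join(current_words), current_type))
    return entities


tokens = ["Ada", "Lovelace", "visited", "London", "."]
tags = ["B-PER", "I-PER", "O", "B-LOC", "O"]
print(spans_from_bio(tokens, tags))
# [('Ada Lovelace', 'PER'), ('London', 'LOC')]
```

Note how the two-word entity "Ada Lovelace" is recovered from one B- tag followed by one I- tag.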
How does Named-Entity Recognition work?
There are a variety of different ways to extract entities from text. A successful NER model will not only identify whether an entity is present, but also classify the entity type. In this section, we’ll guide you through some of the methods used in NER.
Dictionary-based NER
This method compares each word in a text to a pre-existing dictionary (or lexicon) of entities. If a word or series of words matches one of the items in the dictionary, it’s classified as an entity. The drawback to this approach is that it’s limited by the quality of pre-existing entity information, so new entities or synonyms that aren’t in the dictionary can’t be found.
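As a rough illustration of a dictionary lookup, here's a small Python sketch. The lexicon entries, the `dictionary_ner` name, and the longest-match-first strategy are all assumptions made for this example, not a description of any particular product.

```python
# Hypothetical lexicon: lower-cased entity phrase -> entity type.
LEXICON = {
    "ada lovelace": "PER",
    "acme ltd": "ORG",
    "new york": "LOC",
}
MAX_WORDS = max(len(phrase.split()) for phrase in LEXICON)
PUNCTUATION = ".,;:!?"


def dictionary_ner(text):
    """Return (entity_text, entity_type) pairs found by exact lexicon lookup."""
    words = text.split()
    found = []
    i = 0
    while i < len(words):
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(MAX_WORDS, len(words) - i), 0, -1):
            candidate = " ".join(words[i:i + n]).strip(PUNCTUATION)
            if candidate.lower() in LEXICON:
                found.append((candidate, LEXICON[candidate.lower()]))
                i += n
                break
        else:
            i += 1  # no phrase starting at this word is in the lexicon
    return found


print(dictionary_ner("Ada Lovelace emailed Acme Ltd from New York."))
# [('Ada Lovelace', 'PER'), ('Acme Ltd', 'ORG'), ('New York', 'LOC')]
print(dictionary_ner("Grace Hopper joined Acme Ltd"))
# [('Acme Ltd', 'ORG')]
```

The second call shows the drawback in action: "Grace Hopper" isn't in the lexicon, so it's never found.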
Rule-based NER
This approach is based on the observation that certain types of entities have a predictable structure. To extract these from text, we apply a series of rules that correspond to each type of entity we’re looking for – also known as pattern-matching.
This method of entity extraction is more generalisable than the dictionary-based method, but it still requires a level of domain knowledge in how these entities are structured. Email addresses, phone numbers, and postcodes often have a consistent format, so they work well with this method.
Here’s an example of a pattern-matching query (using regex) that extracts email addresses:
You can test your own regex at: https://regex101.com/
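For readers who'd like something to paste into a Python session, here's a minimal sketch of rule-based extraction. The pattern below is a deliberately simplified assumption of our own choosing for illustration; it won't cover every valid email address.

```python
import re

# Simplified pattern: a local part, "@", a domain, and a top-level
# domain of two or more letters. Real-world addresses can be more varied.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def extract_emails(text):
    """Return every substring of text that matches the email pattern."""
    return EMAIL_PATTERN.findall(text)


sample = "Contact jane.doe@example.com or sales@example.co.uk for details."
print(extract_emails(sample))
# ['jane.doe@example.com', 'sales@example.co.uk']
```

The same idea extends to phone numbers or postcodes by swapping in a pattern that describes their format.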
Statistical NER (Machine Learning)
The machine learning approach creates a probabilistic model of where the entities are likely to appear within natural language, based on previous experience. Trained with a large number of labelled examples, this model then uses those learnings to identify entities in an unseen context.
The benefit of this approach is that, provided the model is trained on a diverse set of contexts, it should be able to work on a wide range of unseen examples. Because natural language is inconsistent, it’s important that any method of entity extraction can work in many different scenarios. The machine learning method achieves this, making it a far more powerful and generalisable solution than the dictionary-based or rule-based methods.
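To show the principle on a very small scale, here's a toy Python sketch: a Naive Bayes token classifier with add-one smoothing, trained on three made-up sentences. This is an assumption chosen for brevity and is far simpler than the models used in practice (typically conditional random fields or neural networks). It only looks at whether a word is capitalised and which word precedes it, never at the word itself, so it has to rely on context.

```python
import math
from collections import Counter, defaultdict


def features(words, i):
    """Two context features; the word itself is deliberately left out."""
    prev = words[i - 1].lower() if i > 0 else "<s>"
    return [f"cap={words[i][0].isupper()}", f"prev={prev}"]


def train(sentences):
    """Count how often each tag and each (tag, feature) pair occurs."""
    tag_counts = Counter()
    feature_counts = defaultdict(Counter)
    vocab = set()
    for words, tags in sentences:
        for i, tag in enumerate(tags):
            tag_counts[tag] += 1
            for feat in features(words, i):
                feature_counts[tag][feat] += 1
                vocab.add(feat)
    return tag_counts, feature_counts, vocab


def predict(words, model):
    """Pick the most probable tag for each word under Naive Bayes."""
    tag_counts, feature_counts, vocab = model
    total = sum(tag_counts.values())
    result = []
    for i in range(len(words)):
        scores = {}
        for tag, n in tag_counts.items():
            denom = sum(feature_counts[tag].values()) + len(vocab)
            score = math.log(n / total)  # log prior
            for feat in features(words, i):
                # Add-one smoothing keeps unseen features from scoring zero.
                score += math.log((feature_counts[tag][feat] + 1) / denom)
            scores[tag] = score
        result.append(max(scores, key=scores.get))
    return result


TRAINING = [
    ("Alice joined Acme in Paris".split(), ["PER", "O", "ORG", "O", "LOC"]),
    ("Bob left Globex in Berlin".split(), ["PER", "O", "ORG", "O", "LOC"]),
    ("Carol joined Initech in Madrid".split(), ["PER", "O", "ORG", "O", "LOC"]),
]
model = train(TRAINING)
print(predict("Dave joined Hooli in Rome".split(), model))
# ['PER', 'O', 'ORG', 'O', 'LOC']
```

None of "Dave", "Hooli", or "Rome" appears in the training sentences, yet each is labelled from its context. That's the generalisation a dictionary can't offer, although a model this small would fall over quickly on real text.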
Why is Named-Entity Recognition so useful?
NER can provide us with many business benefits. It’s one of the key techniques used by our InsightMaker software to extract value from information and classify documents, such as identifying sensitive files for GDPR. By associating pieces of data with documents (e.g. invoices with purchase orders), NER enables users to easily navigate information without relying on search. It also allows us to extract information like people’s names, geopolitical data, and organisations. Combined with technologies such as phrase and topic extraction, it enables us to establish links between these entities and events like fraud or terrorism.
We’ve covered several types of NER in this blog: dictionary-based, rule-based, and statistical (machine learning). With effective training, it’s the machine learning approach that tells us things we didn’t already know about a piece of information. In the context of case management, for example, we can present the user with a series of facts and entities about the case to directly help prioritise, organise, and manage cases.
NER has excellent synergies with other technologies, such as phrase and topic extraction, and text classification. It also helps with:
- Document summarisation – Named entities provide additional and useful context that allows a user to quickly access the key points within a document.
- Automated pseudonymisation and anonymisation – Named entities allow us to automate this whole process and remove time spent redacting information.
- Synonym detection – We can use NER to find master data that a business didn’t know about, like suppliers, assets, or pieces of equipment.
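To illustrate the pseudonymisation point above, here's a minimal Python sketch. It assumes an NER step has already produced character spans as `(start, end, type)` tuples; the `pseudonymise` function, the placeholder format, and the sample spans are all hypothetical.

```python
def pseudonymise(text, entities):
    """Replace each (start, end, type) span with a placeholder like [PER_1].

    The same surface text always gets the same placeholder, so a reader
    can still follow who did what without seeing the real names.
    """
    placeholders = {}
    counters = {}
    pieces = []
    cursor = 0
    for start, end, entity_type in sorted(entities):
        key = (entity_type, text[start:end])
        if key not in placeholders:
            counters[entity_type] = counters.get(entity_type, 0) + 1
            placeholders[key] = f"[{entity_type}_{counters[entity_type]}]"
        pieces.append(text[cursor:start])  # untouched text before the entity
        pieces.append(placeholders[key])
        cursor = end
    pieces.append(text[cursor:])  # untouched text after the last entity
    return "".join(pieces)


text = "Jane Doe met John Smith. Jane Doe then left London."
# Spans as they might be returned by an NER model (assumed, non-overlapping).
entities = [(0, 8, "PER"), (13, 23, "PER"), (25, 33, "PER"), (44, 50, "LOC")]
print(pseudonymise(text, entities))
# [PER_1] met [PER_2]. [PER_1] then left [LOC_1].
```

For full anonymisation, the placeholders could simply all be replaced with a fixed redaction marker instead.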
What’s Aiimi working on in this space?
At Aiimi, we’re looking beyond the common types of entity and finding new ways to extract business-focused entities, such as invoice numbers and project codes.
We’ve developed a technology called the ‘Trie-Entity Extractor’, which uses all three NER approaches and is already baked into our InsightMaker software. It applies dictionary-based and rule-based NER in a super-fast, scalable way, optimised for a business use case, and combines these with a machine learning model to create the most powerful solution.
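We haven't described the internals of the Trie-Entity Extractor here, so the following is not its implementation. It's only a generic Python sketch of the data structure the name refers to: a trie (prefix tree) keyed on words, which lets a dictionary lookup find the longest matching phrase without scanning the whole lexicon. The lexicon and function names are made up for this example.

```python
END = object()  # sentinel key marking the end of a complete phrase


def build_trie(lexicon):
    """Build a word-level trie from a {phrase: entity_type} lexicon."""
    root = {}
    for phrase, entity_type in lexicon.items():
        node = root
        for word in phrase.lower().split():
            node = node.setdefault(word, {})
        node[END] = entity_type
    return root


def trie_extract(words, root):
    """Return the longest lexicon match starting at each position."""
    found = []
    i = 0
    while i < len(words):
        node, match = root, None
        for j in range(i, len(words)):
            node = node.get(words[j].lower())
            if node is None:
                break  # no phrase in the lexicon continues this way
            if END in node:
                match = (j + 1, node[END])  # longest match seen so far
        if match:
            end, entity_type = match
            found.append((" ".join(words[i:end]), entity_type))
            i = end
        else:
            i += 1
    return found


lexicon = {"New York": "LOC", "New York Times": "ORG", "Acme": "ORG"}
trie = build_trie(lexicon)
print(trie_extract("She left Acme for the New York Times".split(), trie))
# [('Acme', 'ORG'), ('New York Times', 'ORG')]
```

The work done at each position depends on the length of the phrases, not on how many entries the lexicon holds, which is one reason tries suit large dictionaries.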
This research project is focused on building an NER machine learning model that can identify business entities to a high degree of accuracy. There are a variety of pre-trained models available on an open-source basis. Unfortunately, the performance of these models in a business context is not satisfactory, because many of the academic datasets used to train these NER models are built primarily on news data. The two most popular academic datasets are:
- CoNLL-2003 4-class dataset - https://www.aclweb.org/anthology/W03-0419.pdf
- OntoNotes 18-class dataset - https://catalog.ldc.upenn.edu/LDC2013T19
As the models already available don’t meet our needs, we’re tasked with developing our own NER model. This means either fine-tuning a pre-trained model for a business context or building a new model from scratch.
With our own document set, we’re able to formalise a methodology that can be deployed on any client data.
The key elements of this project are:
- Data Collection - Gathering appropriate data sources for use as a training set for this model
- Dataset Labelling - Marking-up documents with known entity information
- Word Embeddings - Processing words in a way that maximises the ability for entities to be successfully recognised and categorised
- Model Training - Automatically digesting labelled documents to train a model
- Model Evaluation - Identifying useful metrics to quantify the performance of the NER system
- Model Deployment - Building a scalable enrichment pipeline to deploy this technology
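For the Model Evaluation step, the list above doesn't say which metrics the project uses. A common convention for NER is entity-level precision, recall, and F1, where a prediction only counts if both the span and the type match exactly. The sketch below assumes that convention, with made-up gold and predicted spans.

```python
def entity_prf(gold, predicted):
    """Entity-level precision, recall, and F1 with exact span and type match."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Each entity is (start_token, end_token, type); values are illustrative.
gold = {(0, 2, "PER"), (3, 4, "ORG"), (6, 7, "LOC")}
predicted = {(0, 2, "PER"), (3, 4, "LOC"), (6, 7, "LOC"), (8, 9, "ORG")}

precision, recall, f1 = entity_prf(gold, predicted)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# precision=0.50 recall=0.67 f1=0.57
```

Here the model found the right span at (3, 4) but gave it the wrong type, so it counts as both a missed entity and a false alarm.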
How you can get involved
We’re currently looking for talented data science researchers to contribute to exciting projects like this one.
While you’re here, take a look at InsightMaker and see how organisations use it to help discover, manage, and govern their information.