Here at Aiimi Labs, we've been doing some research into how we might describe a set of documents and indicate to the user what they are all about. You’re probably thinking, "what about document summaries - don't they do this already?". Sort of, but summaries tend to be long (I'm talking several paragraphs!), and what we really want is a short, concise label that indicates the topics, concepts and themes contained within the text.

Clustering and labelling documents

Our research started out by looking into clustering algorithms to group documents together (you can read more about our use of clustering in my previous blog, Accelerating classification with Document Clustering). However, on its own, this did not provide us with a label that would actually mean something to the user. Then we stumbled across the Lingo algorithm.
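To give a flavour of what that grouping step looks like, here's a toy sketch: a greedy single-pass clustering over bag-of-words vectors. This is purely illustrative (the document strings and the similarity threshold are made up), not the actual clustering algorithm we use, which is covered in the earlier blog:

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def cluster(docs, threshold=0.5):
    """Greedy single-pass clustering: each document joins the first
    existing cluster it is similar enough to, else starts a new one."""
    vectors = [Counter(d.lower().split()) for d in docs]
    clusters = []  # each cluster is a list of document indices
    for i, vec in enumerate(vectors):
        for members in clusters:
            if cosine(vec, vectors[members[0]]) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters
```

Note the problem the rest of this post tackles: the clusters come back as bare lists of indices, with nothing that tells a user what each group is actually about.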

The Lingo algorithm works by examining a body of text and then extracting something called ‘complete phrases’ from it. There's a lot of very clever maths behind the scenes, but, put simply, a complete phrase is a recurring phrase in a body of text that is said to be 'right and left complete'. What does that mean? Well, a complete phrase is one that cannot be ‘extended’ by adding a preceding or trailing word, because the words surrounding it vary from one occurrence to the next... Lost? Don’t worry, all you need to know is that these complete phrases become those magic labels that we need to help us describe our group of documents.
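To give a feel for the idea, here's a simplified pure-Python sketch (not the real Lingo implementation, which is considerably cleverer): treat a recurring n-gram as left- and right-complete when the words on either side of it vary across its occurrences, so extending it in either direction would lose occurrences.

```python
from collections import defaultdict

def complete_phrases(tokens, max_len=4, min_count=2):
    """Find recurring phrases that are left- and right-complete:
    phrases that cannot be extended by one word on either side
    without losing occurrences."""
    phrases = []
    for n in range(1, max_len + 1):
        # collect each n-gram with the words seen before and after it
        contexts = defaultdict(lambda: {"count": 0, "left": set(), "right": set()})
        for i in range(len(tokens) - n + 1):
            gram = tuple(tokens[i:i + n])
            ctx = contexts[gram]
            ctx["count"] += 1
            ctx["left"].add(tokens[i - 1] if i > 0 else None)
            ctx["right"].add(tokens[i + n] if i + n < len(tokens) else None)
        for gram, ctx in contexts.items():
            # complete on both sides: the neighbouring word varies (or the
            # phrase sits at a boundary), so extending it would drop matches
            if ctx["count"] >= min_count and len(ctx["left"]) > 1 and len(ctx["right"]) > 1:
                phrases.append((" ".join(gram), ctx["count"]))
    return sorted(phrases, key=lambda p: -p[1])
```

Run over a snippet like "the data science team loves data science and data science tools", this picks out "data science" as a complete phrase: "data" alone is always followed by "science" (so it's not right-complete), and "science" is always preceded by "data" (not left-complete), but "data science" as a whole has varied neighbours on both sides.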

If you’re going to use this method of examining text on a set of search results, for instance, then you need to make sure it's super-fast, so it can be performed at the same time as the search. If not, you'll have some pretty unhappy users.

And therein lies the problem with the Lingo algorithm – computing those complete phrases is very expensive, especially over a large set of documents.

Speeding things up with InsightMaker

One of our data science team, Jack, suggested that we utilise the InsightMaker enrichment pipeline to pre-calculate these complete phrases. This means that, at the time of a search, all we need to do is the clustering, which on its own is pretty quick. Luckily for us, a few hours after suggesting it, Jack also wrote the code in his spare time.

If you're wondering what InsightMaker is, it's our AI-powered discovery and insights platform, and it has formed the basis of my discussion of enrichment steps throughout this blog series. InsightMaker has this concept of enrichment, which essentially allows us to add context, labels and other metadata to documents and data as they are ingested into the platform.

So, we created a new enrichment step for InsightMaker to compute these complete phrases from Lingo and add them as metadata to the documents. Then, when it's time to perform a search within InsightMaker, we compute the dynamic topics (the clusters) and let users filter and navigate their search results with them. All with sub-second timing.

Thinking now about the use cases for Dynamic Topics, it's great for when users want to understand a set of documents - for example, if they have received a batch of documents that relate to a case or some other investigation - or for any user working in research mode. Dynamic Topics allow users to quickly ascertain what a set of documents is about, and then zoom in on the ones they are interested in. Best of all, none of this requires us to train models; it's all done on the fly using machine learning.

Hope that was interesting. We certainly think so…

Cheers and see you tomorrow for our final installment of this Enrichment series! Paul


If you missed my previous blogs in the 12 Days of Information Enrichment series, you can catch up here.

Day 1 - What is enrichment? Creating wealth from information

Day 2 - Starting at the beginning with Text Extraction

Day 3 - Structuring the unstructured with Business Entity Extraction

Day 4 - Solving the GDPR, PII and PCI problem

Day 5 - Sustainable Document Classification

Day 6 - Image Enrichment: Giving your business vision

Day 7 - Advanced Entity Extraction with Natural Language Processing

Day 8 - Understanding customers with Speech to Text translation

Day 9 - Accelerating classification with Document Clustering

Day 10 - Giving users what they need, when they need it