Enrichment: Understanding documents with Dynamic Topics.

by Paul Maker

Here in Aiimi Labs, we've been doing some research around how we might describe a set of documents and indicate to the user what they are all about. You’re probably thinking, "what about document summaries - don't they do this already?". Sort of, but summaries tend to be long (I'm talking several paragraphs!), and what we really want is a short, concise label that indicates the topics, concepts and themes contained within the text.

Clustering and labelling documents

Our research started out by looking into clustering algorithms to group documents together (you can read more about our use of clustering in my previous blog, Accelerating classification with Document Clustering). However, on its own, this did not provide us with a label that would actually mean something to the user. Then we stumbled across the Lingo algorithm.

The Lingo algorithm works by examining a body of text and then extracting something called ‘complete phrases’ from it. There's a lot of very clever maths behind the scenes, but, put simply, a complete phrase is a recurring phrase in a body of text that is said to be 'right and left complete'. What does that mean? Well, a complete phrase cannot be ‘extended’ by adding a preceding or trailing word without losing some of its occurrences, because the words immediately before (or after) those occurrences are not all the same... Lost? Don’t worry, all you need to know is that these complete phrases become those magic labels that we need to help us describe our group of documents.
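If you'd like to see the 'right and left complete' idea in action, here is a minimal Python sketch. It is a naive illustration only, not the Lingo implementation (which relies on far more efficient data structures), and the function name `complete_phrases` and its parameters `min_count` and `max_len` are made up for this example. It counts every word n-gram, records which words appear just before and just after each occurrence, and keeps the recurring phrases that have more than one distinct neighbour on both sides.

```python
from collections import defaultdict

def complete_phrases(docs, min_count=2, max_len=4):
    """Return {phrase: count} for recurring phrases that are left- and right-complete.

    Naive illustration: enumerates every n-gram up to max_len words.
    """
    left = defaultdict(set)    # phrase -> distinct tokens seen just before it
    right = defaultdict(set)   # phrase -> distinct tokens seen just after it
    count = defaultdict(int)   # phrase -> number of occurrences

    for d, text in enumerate(docs):
        tokens = text.lower().split()
        n = len(tokens)
        for i in range(n):
            for j in range(i + 1, min(i + max_len, n) + 1):
                phrase = tuple(tokens[i:j])
                count[phrase] += 1
                # A document boundary counts as a unique neighbour, so a phrase
                # at the very start or end of a document cannot be extended there.
                left[phrase].add(tokens[i - 1] if i > 0 else ("<start>", d))
                right[phrase].add(tokens[j] if j < n else ("<end>", d))

    return {
        " ".join(p): c
        for p, c in count.items()
        if c >= min_count and len(left[p]) > 1 and len(right[p]) > 1
    }

docs = [
    "the data risk assessment is free",
    "our data risk assessment helps",
    "a data risk report",
]
print(complete_phrases(docs))
```

In this toy example, "risk assessment" is dropped because it is always preceded by "data" (so it can be extended to the left without losing anything), while "data risk" and "data risk assessment" both survive as candidate labels.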

If you’re going to use this method of examining text on a set of search results, for instance, then you need to make sure it's super-fast so it can be performed at the same time as the search. If not, you'll have pretty unhappy users.

And therein lies the problem with the Lingo algorithm – computing those complete phrases is very expensive, especially over a large set of documents.

Speeding things up with InsightMaker

One of our data science team, Jack, suggested that we utilise the InsightMaker enrichment pipeline to pre-calculate these complete phrases. This means that, at the time of a search, all we need to do is the clustering, which on its own is pretty quick. Luckily for us, a few hours after suggesting it, Jack also wrote the code in his spare time.

If you're wondering what InsightMaker is, it's our AI-powered discovery and insights platform, and it has formed the basis of my discussion of enrichment steps throughout this blog series. InsightMaker has this concept of enrichment, which essentially allows us to add context, labels and other metadata to documents and data as they are ingested into the platform.

So, we created a new enrichment step for InsightMaker to compute these complete phrases from Lingo and add them as metadata to the documents. Then, when it's time to perform a search within InsightMaker, we compute the dynamic topics (the clusters) and let users filter and navigate their search results with them. All with sub-second timing.
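To make the split between ingest time and search time a bit more concrete, here is a rough Python sketch of the search-time half. It is an assumption-laden illustration rather than InsightMaker's actual code or API: the `phrases` and `id` fields, the `dynamic_topics` function and its parameters are all hypothetical, and simple document-frequency counting stands in for the real clustering step. The point it shows is that, once the expensive phrases are already sitting in each document's metadata, the work left at search time is cheap.

```python
from collections import Counter

def dynamic_topics(results, max_topics=3, min_docs=2):
    """Group search results under their most widely shared precomputed phrases.

    `results` is a list of dicts with hypothetical keys:
      "id"      - document identifier
      "phrases" - complete phrases stored as metadata at ingest time
    """
    # Count how many results carry each phrase (once per document).
    doc_freq = Counter()
    for doc in results:
        doc_freq.update(set(doc["phrases"]))

    # Keep phrases shared by enough results, most widespread first,
    # breaking ties alphabetically so the output is stable.
    labels = sorted(
        (p for p, n in doc_freq.items() if n >= min_docs),
        key=lambda p: (-doc_freq[p], p),
    )[:max_topics]

    # A document may appear under more than one topic.
    topics = {label: [] for label in labels}
    for doc in results:
        for label in labels:
            if label in doc["phrases"]:
                topics[label].append(doc["id"])
    return topics

results = [
    {"id": 1, "phrases": ["data risk", "risk assessment"]},
    {"id": 2, "phrases": ["data risk", "search results"]},
    {"id": 3, "phrases": ["search results"]},
    {"id": 4, "phrases": ["data risk"]},
]
print(dynamic_topics(results))
```

Each returned label can then act as a filter: selecting "data risk" in this toy example would narrow the results to documents 1, 2 and 4.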

Thinking now about the use cases for Dynamic Topics, they're great when users want to understand a set of documents - for example, if they have received a batch of documents that relate to a case or some other investigation - or for any user working in a research mode. Dynamic Topics allow users to quickly ascertain what a set of documents is about, and then zoom in on the ones they are interested in. Best of all, none of this requires us to train models; it's all done on the fly using machine learning.

Hope that was interesting. We certainly think it is…

Cheers and see you tomorrow for our final installment of this Enrichment series! Paul

If you missed my blogs in the 12 Days of Information Enrichment series, you can catch up here.
