How do you unlock hidden wealth from your organisation’s information, and how do you do this without masses of human intervention? In fact, how do you do this without any human intervention? Now you’re intrigued, right?! This short series of blogs is going to show you some of the work that we are doing at Aiimi Labs in the information enrichment space with our product InsightMaker.

I will start by saying that this is not a sales pitch for InsightMaker. I will touch on the technology, but the real focus will be on the techniques we use and how they help our customers. So, with that out of the way, what is information enrichment and why might we want to do it?

Information enrichment is the process of taking either unstructured data (such as Microsoft Word documents and PDF files) or structured data (such as data from SAP or a CRM system) and adding context to it. This context typically takes the form of labels, metadata and classifications that we can use to better structure, navigate and use the information.

For example, we might extract all the site and asset details from CAD drawings so that we can automatically attach them to their SAP asset records, creating a unified world of structured and unstructured asset data. Or perhaps we might categorise emails arriving at a customer service centre and then route them automatically to the department best placed to handle them. We may even prioritise them based on sentiment analysis to improve our customer service KPIs.

So, how does this work technically?

The InsightMaker platform has connectors which pull information from source systems. Once we have the information, for example a PDF invoice, we pass this through something we call an enrichment pipeline. The pipeline will be configured with a whole series of enrichment steps that each have their own task, such as extracting key metadata from the invoice which we can then associate with it.
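To make the idea of a pipeline concrete, here is a minimal sketch in Python. It is not InsightMaker code: the `Document` class, the step functions and the invoice layout are all hypothetical, chosen only to show how a series of steps can each add their own piece of metadata to the same item.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Document:
    """A piece of content pulled from a source system (hypothetical model)."""
    source: str
    text: str
    metadata: dict = field(default_factory=dict)


# An enrichment step takes a document and returns it with extra metadata.
EnrichmentStep = Callable[[Document], Document]


def extract_invoice_number(doc: Document) -> Document:
    # Assumes the invoice contains a line such as "Invoice No: INV-2041".
    match = re.search(r"Invoice\s+No\.?[:\s]+([A-Z0-9-]+)", doc.text)
    if match:
        doc.metadata["invoice_number"] = match.group(1)
    return doc


def extract_total_due(doc: Document) -> Document:
    # Assumes the total appears as "Total due: 1,250.00".
    match = re.search(r"Total\s+due[:\s]+([\d,]+\.\d{2})", doc.text)
    if match:
        doc.metadata["total_due"] = match.group(1).replace(",", "")
    return doc


def run_pipeline(doc: Document, steps: List[EnrichmentStep]) -> Document:
    # Each step runs in order and sees the metadata added by earlier steps.
    for step in steps:
        doc = step(doc)
    return doc


invoice = Document(
    source="invoices/example.pdf",
    text="Invoice No: INV-2041\nTotal due: 1,250.00 GBP",
)
enriched = run_pipeline(invoice, [extract_invoice_number, extract_total_due])
print(enriched.metadata)
```

Keeping each step as a small function with the same signature means steps can be added, removed or reordered in configuration without touching the others.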

Building the enrichment pipeline

In terms of enrichment steps, there are lots of different things that we have been researching and building in Aiimi Labs.

We started by focusing on extracting the text content from as many document types as possible. For this, we landed on the open source Apache Tika library. We had some teething troubles at the start around memory usage when running it at scale in the enterprise, so we modified it, and we now have much more granular control over how it works.

We then progressed into Named Entity Recognition. Essentially, this is the ability to extract key business entities from information; for example site codes, site addresses, asset numbers and so on. Interestingly, we have built a lot of IP in this space. In particular, we have focussed on how to run Named Entity Recognition at scale and at high speed, something that really matters if you are processing half a billion files (which, yes, we do do for one of our customers – more on that another day).
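As a rough illustration of what extracting business entities can look like, here is a small pattern-based sketch in Python. It is an assumption-laden toy, not our implementation: the `ST-1234` site code format and the `A` plus six digits asset number format are invented for the example, and production Named Entity Recognition usually combines patterns like these with dictionaries and trained models.

```python
import re
from typing import List, NamedTuple


class Entity(NamedTuple):
    label: str
    value: str
    start: int
    end: int


# Hypothetical formats, invented for this example only.
PATTERNS = {
    "site_code": re.compile(r"\bST-\d{4}\b"),
    "asset_number": re.compile(r"\bA\d{6}\b"),
}


def extract_entities(text: str) -> List[Entity]:
    """Return every pattern match, ordered by position in the text."""
    entities = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            entities.append(Entity(label, match.group(), match.start(), match.end()))
    return sorted(entities, key=lambda entity: entity.start)


sample = "Pump A004512 at site ST-0931 was replaced by A004777."
for entity in extract_entities(sample):
    print(entity.label, entity.value)
```

Compiling the patterns once, outside the function, avoids repeating that work for every document, which starts to matter when the same function is called millions of times.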

From there, we ventured into classification, clustering, image recognition, extracting content from hidden databases and CAD drawings, identifying PII (to support GDPR compliance), detecting payment card information, advanced fact extraction and more. These are all things that I will be talking about in more detail in the subsequent blogs in this series.
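To give a flavour of one of these steps, here is a minimal sketch of spotting payment card numbers in text using Python. It is a common general-purpose approach, not a description of how InsightMaker does it: find runs of 13 to 19 digits (optionally separated by spaces or hyphens), then keep only those that pass the Luhn checksum used by card numbers. The function names and the sample text are made up for the example.

```python
import re
from typing import List

# 13 to 19 digits, each optionally followed by a single space or hyphen.
CANDIDATE = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")


def luhn_valid(digits: str) -> bool:
    """Luhn checksum: double every second digit from the right and sum."""
    total = 0
    for index, char in enumerate(reversed(digits)):
        digit = int(char)
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def find_card_numbers(text: str) -> List[str]:
    """Return digit strings that look like card numbers and pass Luhn."""
    found = []
    for match in CANDIDATE.finditer(text):
        digits = re.sub(r"[ -]", "", match.group())
        if luhn_valid(digits):
            found.append(digits)
    return found


# 4111 1111 1111 1111 is a widely used test number, not a real card.
sample = "Card 4111 1111 1111 1111 was charged; reference 4111 1111 1111 1112."
print(find_card_numbers(sample))
```

The checksum filters out most long digit runs that are not card numbers, such as reference numbers, although some will still pass by chance, so a real step would add further checks.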

Why bother with enrichment?

We believe information enrichment is a crucial enabler for organisations that want to extract value and wealth from the masses of information that flow through their core processes. Automating it in this way offers organisations the chance to unlock value that was previously impossible to liberate. After all, users would never manually label or classify content, and, even if they did, what about all that historic content that’s been growing for years across your networks and legacy systems? Food for thought!

Cheers, and see you soon for the second instalment of my 12 Days of Information Enrichment!

Paul, CTO at Aiimi