Enrichment: Accelerating classification with Document Clustering.

by Paul Maker

In my Day 5 blog in this series, Sustainable Document Classification, I talked about some of the challenges with document classification, most of which come down to how long it takes to train an accurate model.

How can Clustering help with training a model?

To train a model, we need a pre-labelled set of training data for each document class (i.e. document type – for example, invoice). Pre-labelling a set of training data means marshalling a whole lot of users to find the documents and then label them – and therein lies the problem. Users tend to have better things to do, like their day job, so asking them to undertake a task like this is unlikely to yield much success!

What if we could use the power of machine learning to have a first go at labelling the documents and take this pain away from the business and the user?

There's a branch of machine learning called unsupervised machine learning that requires no pre-labelling of data or prebuilt models. Specifically, a process called cluster analysis, or 'clustering', can take a large set of documents or data and sort it into groups, or clusters, of similar items. We can then use these clusters to create an initial document classification model and refine it further with user-driven reinforcement and feedback.

How Clustering works using TF-IDF

Logically, the primary way to differentiate documents is to look at their content, specifically their text content. The key insight we want from a document's text is how it differs from other documents and which terms make it unique. For example, you would expect invoice documents to have a high occurrence of the word “invoice”.

This is achieved using a method called “Term Frequency – Inverse Document Frequency”, or “TF-IDF” for short, which measures how significant the occurrence of a word is in one document compared to all other documents. Under this method, words that occur across almost all documents get a low score, while words that occur frequently in a given document but appear in only a small subset of the collection get a high score.
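To make the scoring concrete, here is a minimal sketch of TF-IDF in plain Python. It assumes the simplest textbook form, term frequency multiplied by log(N / document frequency); production libraries usually apply smoothing and normalisation on top of this. The toy documents and the `tf_idf` function name are made up for illustration and are not taken from any Aiimi product.

```python
import math
from collections import Counter

# Toy corpus (hypothetical documents, for illustration only)
docs = {
    "inv_1": "invoice number total amount due invoice",
    "inv_2": "invoice payment total due",
    "cv_1": "experience skills education total",
}

def tf_idf(docs):
    """Return {doc_id: {term: score}} using tf * log(N / df), with no smoothing."""
    tokenised = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    n_docs = len(tokenised)
    # Document frequency: how many documents contain each term at least once
    df = Counter(term for tokens in tokenised.values() for term in set(tokens))
    scores = {}
    for doc_id, tokens in tokenised.items():
        counts = Counter(tokens)
        scores[doc_id] = {
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        }
    return scores

scores = tf_idf(docs)
```

In this toy corpus, "total" appears in every document, so its score is zero everywhere, while "invoice" scores above zero in the two invoice documents and "experience" scores highest in the CV.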

Once we have obtained this “TF-IDF” score for each word in each document, we can run a clustering model which will group documents that have high levels of similarity between TF-IDF scores. Following the previous example, if the word “invoice” is scored highly in two documents, these two documents are more likely to be clustered together.
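As a rough illustration of the grouping step, the sketch below compares documents by the cosine similarity of their TF-IDF vectors and puts similar ones in the same cluster. The post does not say which clustering algorithm is used, so this is a deliberately simple greedy, threshold-based stand-in rather than a real algorithm such as k-means. The vectors, the `threshold` value and the function names are all assumptions made for the example.

```python
import math

# Hypothetical TF-IDF vectors (term -> score); the values are made up
vectors = {
    "inv_1": {"invoice": 0.14, "amount": 0.18, "due": 0.07},
    "inv_2": {"invoice": 0.10, "payment": 0.27, "due": 0.10},
    "cv_1": {"experience": 0.27, "skills": 0.27, "education": 0.27},
    "cv_2": {"experience": 0.20, "skills": 0.20, "referees": 0.30},
}

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def cluster(vectors, threshold=0.2):
    """Greedy single pass: join the first cluster whose seed document is
    similar enough, otherwise start a new cluster."""
    clusters = []  # each cluster is a list of doc ids; the first id is the seed
    for doc_id, vec in vectors.items():
        for members in clusters:
            if cosine(vec, vectors[members[0]]) >= threshold:
                members.append(doc_id)
                break
        else:
            clusters.append([doc_id])
    return clusters

clusters = cluster(vectors)
```

Here the two invoice-like documents share the terms "invoice" and "due" and end up together, while the two CV-like documents share "experience" and "skills" and form a second cluster.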

Making it easier to train a classification model

Now that we have these groups, or 'clusters' as we call them, we can use them to train a supervised classification model, which can then be used to classify documents. Before we start this training process, we may want to inspect the clusters and make any minor corrections that we see fit. To help with this, we have been working on some innovative user interfaces that simplify this process.
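One way to picture the hand-off from clusters to a supervised model is sketched below: once a person has reviewed and named each cluster, the named groups become training data. The classifier here is a simple nearest-centroid model, chosen only because it fits in a few lines; the post does not say which model Aiimi trains. The labels, vectors and function names are hypothetical.

```python
import math
from collections import defaultdict

# Hypothetical reviewed clusters: a person has given each cluster a label
labelled_clusters = {
    "invoice": [
        {"invoice": 0.14, "amount": 0.18, "due": 0.07},
        {"invoice": 0.10, "payment": 0.27, "due": 0.10},
    ],
    "cv": [
        {"experience": 0.27, "skills": 0.27, "education": 0.27},
        {"experience": 0.20, "skills": 0.20, "referees": 0.30},
    ],
}

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def train(labelled_clusters):
    """Build one centroid (mean TF-IDF vector) per label."""
    centroids = {}
    for label, vecs in labelled_clusters.items():
        total = defaultdict(float)
        for vec in vecs:
            for term, score in vec.items():
                total[term] += score
        centroids[label] = {term: s / len(vecs) for term, s in total.items()}
    return centroids

def classify(vec, centroids):
    """Return the label whose centroid is most similar to the document."""
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))

model = train(labelled_clusters)
prediction = classify({"invoice": 0.2, "total": 0.05}, model)
```

A new document that mentions "invoice" lands closest to the invoice centroid, so `prediction` is `"invoice"`, even though nobody labelled individual documents by hand.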

Another cool thing we are working on at Aiimi is how we reinforce a classification model and make it more accurate over time. For this, we are looking at how we can crowdsource classification corrections through the InsightMaker user interface. We can then take these corrections and automatically retrain the classification model in the background.
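The feedback loop could be sketched along the following lines: corrections are collected as users make them, and a retrain is triggered once enough have accumulated. This is purely an illustration of the idea and not how InsightMaker implements it. The `CorrectionQueue` class, the `retrain` hook and the `batch_size` threshold are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class CorrectionQueue:
    """Collects user corrections and triggers a retrain once enough arrive."""
    retrain: Callable[[dict], None]  # hypothetical hook that rebuilds the model
    batch_size: int = 3              # assumed threshold before retraining
    labels: dict = field(default_factory=dict)  # doc_id -> current label
    pending: int = 0                 # corrections since the last retrain

    def correct(self, doc_id, new_label):
        if self.labels.get(doc_id) == new_label:
            return  # the label is already right, nothing to learn
        self.labels[doc_id] = new_label
        self.pending += 1
        if self.pending >= self.batch_size:
            # Pass a copy so the retrain job sees a stable snapshot
            self.retrain(dict(self.labels))
            self.pending = 0

retrain_calls = []  # stands in for a real background training job
queue = CorrectionQueue(
    retrain=retrain_calls.append,
    batch_size=2,
    labels={"doc_1": "invoice", "doc_2": "invoice", "doc_3": "cv"},
)
queue.correct("doc_2", "cv")       # first real correction, below the threshold
queue.correct("doc_3", "cv")       # already labelled "cv", so it is ignored
queue.correct("doc_1", "receipt")  # second real correction, triggers a retrain
```

Batching corrections like this keeps retraining off the critical path: users keep working while the model is rebuilt from the latest snapshot of labels.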

Through our research, we have discovered that we can dramatically reduce the time it takes to build classification models, something that has always been the Achilles heel of document classification.

I hope you found today's post interesting - perhaps it's inspired you to take the plunge and look into document classification for your organisation. If you'd like to find out more about the value of Document Classification for businesses, don't forget to check out Day 5's post - linked below.

Cheers and speak soon, Paul

If you missed my blogs in the 12 Days of Information Enrichment series, you can catch up here.
