If you recall my blog on Day 5 of this series, Sustainable Document Classification, I talked about some of the challenges with document classification, most of which have to do with the length of time it takes to train an accurate model.

How can Clustering help with training a model?

To train a model, we need a pre-labelled set of training data for each document class (i.e. document type – for example, invoice). Pre-labelling a set of training data means marshalling a whole lot of users to find the documents and then label them – and therein lies the problem. Users tend to have better things to do, like their day job, so asking them to undertake a task like this is unlikely to yield much success!

What if we could use the power of machine learning to have a first go at labelling the documents and take this pain away from the business and the user?

There's a branch of machine learning called unsupervised machine learning that requires no pre-labelling of data or prebuilt models. Specifically, a process called cluster analysis, or 'clustering', can take a large set of documents or data and sort it into groups, or clusters, of similar items. We can then use these clusters to create an initial document classification model and refine it further with user-driven reinforcement and feedback.

How Clustering works using TF-IDF

Logically, the primary way to differentiate documents is to look at their content, specifically their text content. The key insight we want from a document’s text is how it differs from other documents and which terms make it distinctive. For example, you would expect invoice documents to have a high occurrence of the word “invoice”.

This is achieved using a method called “Term Frequency – Inverse Document Frequency” or “TF-IDF” for short, which measures how significant a word is in one document compared to all other documents. Using this method, a word that occurs frequently across all documents gets a low score, while a word that occurs frequently in one document but appears in only a small subset of the documents gets a high score.
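To make the scoring concrete, here is a minimal sketch in plain Python. It assumes the common textbook formulation (term frequency multiplied by log(N / document frequency)) and naive whitespace tokenisation; the function name `tf_idf` and the sample documents are made up for illustration, and production libraries typically add smoothing and normalisation on top.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: score} dict per document.

    Assumption: score = (count / doc length) * log(N / df), where df is
    the number of documents containing the term.
    """
    tokenised = [doc.lower().split() for doc in documents]
    n_docs = len(tokenised)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for tokens in tokenised for term in set(tokens))
    scores = []
    for tokens in tokenised:
        counts = Counter(tokens)
        scores.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores

# Illustrative documents only
docs = [
    "the invoice total due",
    "the invoice amount payable",
    "the meeting agenda attached",
    "the project plan attached",
]
scores = tf_idf(docs)
```

In this toy set, "the" appears in every document, so its score is zero, while "invoice" appears in only two of the four documents and gets a positive score in each of them.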

Once we have obtained this “TF-IDF” score for each word in each document, we can run a clustering model which will group documents that have high levels of similarity between TF-IDF scores. Following the previous example, if the word “invoice” is scored highly in two documents, these two documents are more likely to be clustered together.
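The post doesn't tie this step to a particular algorithm, so the sketch below makes its own choices: a small k-means-style loop that compares documents by the cosine similarity of their TF-IDF vectors, with a deterministic seeding rule so the result is repeatable. The helper names (`tf_idf`, `cosine`, `cluster`) and the sample documents are hypothetical.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Sparse TF-IDF vectors: (count / doc length) * log(N / df)."""
    tokenised = [doc.lower().split() for doc in documents]
    df = Counter(t for tokens in tokenised for t in set(tokens))
    n = len(tokenised)
    return [
        {t: (c / len(tokens)) * math.log(n / df[t])
         for t, c in Counter(tokens).items()}
        for tokens in tokenised
    ]

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = (math.sqrt(sum(w * w for w in a.values()))
            * math.sqrt(sum(w * w for w in b.values())))
    return dot / norm if norm else 0.0

def mean_vector(vectors):
    """Average a list of sparse vectors term by term."""
    total = {}
    for vec in vectors:
        for term, weight in vec.items():
            total[term] = total.get(term, 0.0) + weight
    return {term: weight / len(vectors) for term, weight in total.items()}

def cluster(vectors, k, max_iter=20):
    """Assign each vector to one of k clusters; returns a list of labels."""
    # Deterministic seeding (an assumption of this sketch): start with the
    # first document, then add the one least similar to the seeds so far.
    seeds = [0]
    while len(seeds) < k:
        rest = [i for i in range(len(vectors)) if i not in seeds]
        seeds.append(min(rest, key=lambda i: max(
            cosine(vectors[i], vectors[s]) for s in seeds)))
    centroids = [vectors[s] for s in seeds]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each document joins its most similar centroid.
        new_labels = [max(range(k), key=lambda c: cosine(v, centroids[c]))
                      for v in vectors]
        if new_labels == labels:
            break  # assignments are stable, so we are done
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = mean_vector(members)
    return labels

# Illustrative documents only
docs = [
    "the invoice total due",
    "the invoice amount payable",
    "the meeting agenda attached",
    "the meeting minutes attached",
]
labels = cluster(tf_idf(docs), k=2)
```

With these four documents, the two that share the term "invoice" end up in one cluster and the two that share "meeting" and "attached" end up in the other. On a real corpus you would also need to choose the number of clusters, which this sketch simply takes as an input.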

Making it easier to train a classification model

Now that we have these groups, or 'clusters' as we call them, we can use them to train a supervised classification model, which can then be used to classify documents. Before we start this training process, we may want to inspect the clusters and make any minor corrections that we see fit. To help with this, we have been working on some innovative user interfaces that simplify this process.
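As a rough illustration of turning reviewed clusters into a classifier, the sketch below trains a very simple nearest-profile model. Everything in it is an assumption for the sake of the example: the `reviewed_clusters` data stands in for cluster output that a person has named and corrected, plain term counts stand in for TF-IDF weights, and a real system would more likely use an established classification library.

```python
import math
from collections import Counter, defaultdict

def vectorise(text):
    """Plain term counts; a TF-IDF weighting could be substituted here."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(w * b.get(t, 0) for t, w in a.items())
    norm = (math.sqrt(sum(w * w for w in a.values()))
            * math.sqrt(sum(w * w for w in b.values())))
    return dot / norm if norm else 0.0

def train(labelled_docs):
    """Build one summed term-count profile per label from (text, label) pairs."""
    profiles = defaultdict(Counter)
    for text, label in labelled_docs:
        profiles[label].update(vectorise(text))
    return dict(profiles)

def classify(model, text):
    """Return the label whose profile is most similar to the new text."""
    vec = vectorise(text)
    return max(model, key=lambda label: cosine(vec, model[label]))

# Hypothetical output of the clustering step, after a reviewer has named
# each cluster and corrected any obvious mistakes.
reviewed_clusters = [
    ("the invoice total due", "invoice"),
    ("the invoice amount payable", "invoice"),
    ("the meeting agenda attached", "meeting"),
    ("the meeting minutes attached", "meeting"),
]
model = train(reviewed_clusters)
prediction = classify(model, "invoice amount due next week")
```

The point is that the labels come from the clusters rather than from users labelling documents one by one; the reviewer only has to name each cluster and fix the exceptions.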

Another cool thing which we are working on at Aiimi is how we reinforce a classification model and make it more accurate over time. For this, we are looking at how we can crowdsource classification corrections through the InsightMaker user interface. We can then take these corrections and automatically retrain the classification model in the background.
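The general idea of a correction loop can be sketched in a few lines. This is not how InsightMaker does it; it is a toy example under invented assumptions, where `training_set`, `corrections`, and the tiny word-count "model" are all hypothetical stand-ins for a real document store, feedback queue, and classifier.

```python
from collections import Counter, defaultdict

def train(labelled):
    """Toy model: summed term counts per label (stand-in for a real classifier)."""
    profiles = defaultdict(Counter)
    for text, label in labelled.values():
        profiles[label].update(text.lower().split())
    return dict(profiles)

def classify(model, text):
    """Pick the label whose profile shares the most word occurrences."""
    words = text.lower().split()
    return max(model, key=lambda label: sum(model[label][w] for w in words))

def apply_corrections(labelled, corrections):
    """Return a new training set with user corrections overriding old labels."""
    updated = dict(labelled)
    for doc_id, new_label in corrections.items():
        text, _old_label = updated[doc_id]
        updated[doc_id] = (text, new_label)
    return updated

# Hypothetical training data: doc-2 was put in the wrong class.
training_set = {
    "doc-1": ("invoice total due", "invoice"),
    "doc-2": ("purchase order for laptops", "invoice"),
    "doc-3": ("purchase order for desks", "purchase_order"),
}
# Hypothetical correction submitted by a user.
corrections = {"doc-2": "purchase_order"}

before = classify(train(training_set), "order for laptops")
retrained = train(apply_corrections(training_set, corrections))
after = classify(retrained, "order for laptops")
```

Here the mislabelled document pulls the first model towards the wrong class, and a single correction followed by a retrain is enough to change the prediction for a similar document.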

Through our research, we have discovered that we can dramatically reduce the time it takes to build classification models, something that has always been the Achilles heel of document classification.

I hope you found today's post interesting - perhaps it's inspired you to take the plunge and look into document classification for your organisation. If you'd like to find out more about the value of Document Classification for businesses, don't forget to check out Day 5's post - linked below.

Cheers and speak soon, Paul

If you missed my previous blogs in the 12 Days of Information Enrichment series, you can catch up here.

Day 1 - What is enrichment? Creating wealth from information

Day 2 - Starting at the beginning with Text Extraction

Day 3 - Structuring the unstructured with Business Entity Extraction

Day 4 - Solving the GDPR, PII and PCI problem

Day 5 - Sustainable Document Classification

Day 6 - Image Enrichment: Giving your business vision

Day 7 - Advanced Entity Extraction with Natural Language Processing

Day 8 - Understanding customers with Speech to Text translation