In this blog I will help to demystify the complexities surrounding text analytics, machine learning and unstructured content. If your business is one of the many trying to join the conversation about data science and artificial intelligence, then this is a must-read for you.
I was inspired to write this article after I noticed an increasing number of businesses grappling with the fundamentals of data science and machine learning. Many seem to see text analytics as the holy grail for solving their business problems, and terms like sentiment analysis and topic modelling have become their buzzwords du jour.
But what they don’t know is that text analysis isn’t as straightforward as it seems. Getting it wrong carries a number of risks, not least damaging your organisation’s faith in the methodology and perhaps even machine learning in general.
So, to help ground the potential issues in the real world, I’ll use the example of an email classification solution I recently developed for a large UK utilities provider.
This solution automated the previously laborious human task of manually sorting emails into ten different categories. Not only has our solution created huge cost savings, it also demonstrates how – when done right – we can rely on automation to create highly accurate results.
At its heart, this sort of text analytics and classification is a machine learning problem, so let’s start by understanding exactly what we mean by machine learning.
Wikipedia defines machine learning as giving “computers the ability to learn without being explicitly programmed”. Sounds like witchcraft, right? But much of it actually rests on basic statistics. In fact, many of the statistical techniques underpinning machine learning pre-date the computer.
Machine learning comes in two main types: supervised and unsupervised. Supervised learning requires a pre-classified “training set” to learn from. In other words, the computer learns from examples that a human has already labelled with the correct answer. Unsupervised learning, meanwhile, does not need these labels. It defines its own categories.
For my email classification project, the computer was trained using a supervised approach, meaning we showed it various sets of emails, each tagged with its category. The machine then learned to associate characteristics of those emails with the categories.
It would have been easy to jump the gun and dive straight into linguistic techniques here, focusing just on the email text. Text is a form of unstructured data: unlike structured data, it can’t always be stored neatly and succinctly in a table, which makes it notoriously tricky to deal with computationally. Structured data, in contrast, is well-ordered (think Excel spreadsheets). Fortunately for us, emails also contain a wealth of structured data. Fields such as the time and date and the number of attachments are all structured, and in fact hold a lot of predictive power of their own.
A predictive model can be built on patterns and trends in this data. For example, emails from other businesses, such as letting agents, arrive exclusively during business hours, while submissions of certain forms come with a scanned attachment. Initially, I applied a classification algorithm called Random Forest. With this, we were able to achieve 60% accuracy from the structured data alone.
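To make the idea of structured features concrete, here is a minimal sketch of turning email metadata into numbers. The function name, the feature names and the 09:00 to 17:00 definition of business hours are my own assumptions for illustration, not the fields used in the actual project.

```python
from datetime import datetime

def structured_features(received: datetime, n_attachments: int) -> dict:
    """Turn email metadata into numeric features a classifier can use."""
    is_weekday = received.weekday() < 5        # Monday=0 ... Friday=4
    in_office_hours = 9 <= received.hour < 17  # assumed 09:00-17:00 working day
    return {
        "hour": received.hour,
        "weekday": received.weekday(),
        "business_hours": int(is_weekday and in_office_hours),
        "n_attachments": n_attachments,
        "has_attachment": int(n_attachments > 0),
    }

# A Tuesday-morning email with one attachment (hypothetical example).
example = structured_features(datetime(2024, 3, 5, 10, 30), n_attachments=1)
```

Rows like this, one per email, are the kind of table a tree-based classifier such as a Random Forest can be trained on directly.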
Now to the main event. How to process the unstructured data, the text itself? This is always a complicated procedure, because a computer cannot “read” in the same way a human does. Therefore, when it comes to text analysis, words must at some point be converted to numbers. How do we do this?
First, we vectorise the text. This means we take frequently occurring words and give them ID numbers. This in turn gives each email structured, numerical ‘properties’ – such as the number of occurrences of each word.
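As a rough illustration of vectorisation, the sketch below builds a vocabulary from the most frequent words and counts how often each one appears in an email. The helper names, the simple regular-expression tokeniser and the sample emails are assumptions made for this example.

```python
import re
from collections import Counter

def tokenize(text):
    """Lower-case the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def build_vocabulary(texts, max_words=1000):
    """Give each of the most frequent words an ID number."""
    counts = Counter(tok for text in texts for tok in tokenize(text))
    return {word: idx for idx, (word, _) in enumerate(counts.most_common(max_words))}

def vectorise(text, vocab):
    """Count occurrences of each vocabulary word in one email."""
    vector = [0] * len(vocab)
    for tok in tokenize(text):
        if tok in vocab:  # words outside the vocabulary are ignored
            vector[vocab[tok]] += 1
    return vector

emails = [
    "Please send my meter reading form",
    "My bill is wrong, please check my bill",
]
vocab = build_vocabulary(emails)
vector = vectorise("please check my bill", vocab)
```

Each email becomes a list of counts with one position per word ID, which is exactly the kind of numeric input a classifier needs.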
Next, we use the vectorised emails to train a machine learning model. I chose to use a Naïve Bayes model in this case. Historically, Naïve Bayes has proven effective for text classification despite the relatively light computational effort required, allowing me to scale the model significantly.
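For readers who want to see why Naïve Bayes is so cheap to compute, here is a small from-scratch sketch of a multinomial Naïve Bayes classifier with add-one smoothing. It is an illustration only, not the production model: the class name, the two categories and the training sentences are all made up for the example.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Tiny multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)       # emails per category
        self.word_counts = defaultdict(Counter)   # word counts per category
        for tokens, label in zip(docs, labels):
            self.word_counts[label].update(tokens)
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def log_scores(self, tokens):
        total_docs = sum(self.class_counts.values())
        scores = {}
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for tok in tokens:
                if tok in self.vocab:              # skip words never seen in training
                    score += math.log((counts[tok] + 1) / denom)
            scores[label] = score
        return scores

    def predict(self, tokens):
        scores = self.log_scores(tokens)
        return max(scores, key=scores.get)

# Hypothetical training data with two categories.
train = [
    ("my bill is too high".split(), "billing"),
    ("query about my latest bill".split(), "billing"),
    ("here is my meter reading".split(), "meter"),
    ("submitting a new meter reading".split(), "meter"),
]
model = NaiveBayes().fit([t for t, _ in train], [l for _, l in train])
prediction = model.predict("please check my bill".split())
```

Training is just counting words per category, and prediction is a handful of additions per word, which is why the approach scales so comfortably.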
The results of this text model can then be passed into the Random Forest model along with the structured data in an ensemble model.
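One common way to combine the two models is to treat the text model’s per-category probabilities as extra columns alongside the structured features. The sketch below shows that idea under my own assumptions: the function names, the example scores and the feature values are hypothetical, and the exact way the outputs were combined in the real solution may differ.

```python
import math

def softmax(log_scores):
    """Convert per-category log scores into probabilities that sum to 1."""
    peak = max(log_scores.values())
    exps = {k: math.exp(v - peak) for k, v in log_scores.items()}
    total = sum(exps.values())
    return {k: v / total for k, v in exps.items()}

def combined_row(text_log_scores, structured, categories):
    """Build one feature row: text-model probabilities, then structured features."""
    probs = softmax(text_log_scores)
    return [probs.get(c, 0.0) for c in categories] + list(structured)

# Hypothetical text-model scores plus [hour, business_hours, n_attachments].
row = combined_row(
    {"billing": -4.2, "meter": -6.9},
    structured=[10, 1, 1],
    categories=["billing", "meter"],
)
```

The second-stage model then sees both what the text suggests and what the metadata suggests, and can learn how much weight to give each.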
The process is fairly straightforward, but there are a few issues to beware of. For instance, a word such as “good” may become associated with positive emails, but a model that looks at single words in isolation cannot tell that a preceding “not” reverses its meaning. This means we could end up identifying emails containing the phrase “not good” as positive.
The solution to this problem is to use n-grams as well as single words for vectorisation. An n-gram is a sequence of n consecutive words, treated as a single feature. In the sentence “it was a cloudy day”, ordinary vectorisation would only recognise five words: it, was, a, cloudy and day. Using two-word n-grams (bigrams), however, we can recognise those five words, plus four pairs of words: it-was, was-a, a-cloudy and cloudy-day.
Going back to our previous example, an n-gram technique would be able to recognise “good” as a positive word and “not-good” as a negative.
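The n-gram idea described above is simple enough to sketch in a couple of lines. The function name and the hyphen used to join the words are just choices made for this illustration.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined with a hyphen."""
    return ["-".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "it was a cloudy day".split()
bigrams = ngrams(tokens, 2)            # the four word pairs
features = ngrams(tokens, 1) + bigrams  # single words plus word pairs
```

Running the same function over “this is not good” produces the feature “not-good”, which a classifier can learn to treat separately from “good” on its own.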
So that’s vectorisation and n-grams, but we’re not done yet.
Another natural language processing technique I used was “stemming”. Stemming reduces words to their roots. This means “chase”, “chasing” and “chased” would all be treated as the same word, “chase”. This helps by condensing the information passed to the classifier, improving results.
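To show the principle, here is a deliberately crude suffix-stripping stemmer. It is a toy written for this post, not the Porter algorithm or whichever stemmer a real project would use, and its output is a truncated stem (“chas”) rather than a dictionary word. What matters is that all three forms collapse to the same token.

```python
def toy_stem(word):
    """Crude stemmer: strip one common suffix, then a trailing 'e'."""
    word = word.lower()
    for suffix in ("ing", "ed", "es", "s"):
        # Only strip if at least three letters would remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
            break
    if word.endswith("e") and len(word) > 3:
        word = word[:-1]
    return word

stems = {w: toy_stem(w) for w in ("chase", "chasing", "chased")}
```

In practice you would reach for an established implementation, such as the Porter stemmer available in NLTK, which handles far more cases than this sketch.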
Especially when dealing with multiple topics, there are many nuances that must be considered. In addition to the Naïve Bayes model, I used Latent Dirichlet Allocation (LDA) to capture topics, and extracted numerous other features, such as counts of the exclamation marks, numbers and symbols in the text.
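The simple count-based features mentioned above are easy to sketch (LDA is not covered here). The function name, the feature names and the definition of a “symbol” as any character that is neither alphanumeric nor whitespace are assumptions made for this example.

```python
import re

def surface_features(text):
    """Count simple surface characteristics of an email body."""
    return {
        "n_exclamations": text.count("!"),
        "n_numbers": len(re.findall(r"\d+", text)),  # runs of digits
        "n_symbols": sum(
            1 for ch in text if not ch.isalnum() and not ch.isspace()
        ),
    }

# Hypothetical email text.
feats = surface_features("URGENT!! Account #12345 overcharged by 30 pounds!")
```

Features like these cost almost nothing to compute, yet they can help separate, say, an angry complaint from a routine form submission.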
What does this all mean?
Even looking at this email classification problem from a purely algorithmic perspective, it can be very easy to fall into traps. There is very often a compulsion to jump straight into text classification without considering the nuances that come with it (such as the use of n-grams to properly capture sentiment), or to completely ignore the value that can be gained from associated metadata (structured information), such as the time of day.
You should remember that machine learning isn’t just some magic solution which is trivially implemented. A lot of consideration is required to make sure a computer can “see” what may seem obvious to a human – this is especially true in text analytics.
However, even if you get the data science right on paper, this is only half the battle. Achieving real business value from a machine learning solution is worthy of a blog in itself. See my follow-up for more information on the email solution’s road to production.