In the last post in my 12 Days of Information Enrichment series, I explained how we extract business entities from information and then use those entities to better structure, navigate and use that information. It is a simple concept, yet one that is fundamental to information enrichment.

So, how does this help your organisation comply with GDPR? Well, about 2 years ago, as the hype around GDPR was reaching its peak, we decided that we could use our entity extraction capabilities to help find Personally Identifiable Information (PII) and Payment Card Information (PCI). We could then couple this with some GDPR-specific apps so that users could quickly and easily understand their risk, get their house in order and be compliant with the new legislation.

Developing a PII & PCI finder solution

As with all new solutions, there was quite a bit of learning to come as we started to solve this problem for our customers. The main challenge was the dreaded false positive. What is this? Essentially, a false positive in this case means determining that a piece of data is, let's say, a national insurance number when in fact it isn’t. This is an obvious problem: it makes it very hard for the customer to see the wood for the trees, and even harder to rely on the solution to help them achieve GDPR compliance.

After thinking through this issue, we decided that we needed to research some approaches that we could use to contextually reinforce that the pieces of information we were finding were genuine examples of PII or PCI.

Identifying PII with proximity indicators

We started out by looking at using proximity indicators. These work by checking the distance between the extracted entity, say a national insurance number, and a word that would reinforce the determination, for example, 'NI Number'. We extended this to include synonyms for each indicator word, increasing the reliability of the process. In addition, by using the distance between the indicator word and the entity in question, we were able to compute a confidence score to give as much transparency as possible to the user.
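To make the idea concrete, here is a minimal sketch of a proximity check in Python. It is not our production implementation. The indicator phrases, the simplified NI number pattern, the 50-character window and the linear scoring formula are all assumptions chosen for illustration.

```python
import re

# Hypothetical indicator phrases, including synonyms, for a UK NI number.
INDICATORS = ["NI number", "NI no", "national insurance"]

# Simplified NI number shape: two letters, six digits, a suffix letter A-D.
# (Assumption: real validation rules are stricter than this.)
NINO_PATTERN = re.compile(r"\b[A-Z]{2}\s?\d{2}\s?\d{2}\s?\d{2}\s?[A-D]\b")

MAX_DISTANCE = 50  # assumed window, in characters


def gap(a, b):
    """Characters separating two (start, end) spans; 0 if they touch or overlap."""
    return max(0, max(a[0], b[0]) - min(a[1], b[1]))


def find_ni_numbers(text, max_distance=MAX_DISTANCE):
    """Return (candidate, confidence) pairs for NI-number-shaped strings."""
    # Locate every indicator phrase, ignoring case.
    indicator_spans = [
        m.span()
        for phrase in INDICATORS
        for m in re.finditer(r"\b" + re.escape(phrase) + r"\b", text, re.IGNORECASE)
    ]
    results = []
    for match in NINO_PATTERN.finditer(text):
        nearest = min((gap(match.span(), s) for s in indicator_spans), default=None)
        if nearest is None or nearest > max_distance:
            confidence = 0.0  # no supporting indicator nearby
        else:
            # Closer indicators give higher confidence (linear fall-off).
            confidence = 1.0 - nearest / max_distance
        results.append((match.group(), round(confidence, 2)))
    return results


print(find_ni_numbers("Employee NI Number: AB123456C was recorded."))
print(find_ni_numbers("Order ref AB123456C shipped"))
```

In the first call the indicator sits two characters from the candidate, so the score is high. In the second there is no indicator at all, so the same string is reported with a confidence of zero rather than silently dropped, which keeps the result transparent to the user.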

Using context to identify PII

Next, we looked at how we could infer meaning based on the presence of multiple pieces of information. A great example of this is payment card information, or PCI as it’s known. A 3-digit number means little in isolation, but if it occurs in a document along with a 16-digit number, a postcode, a name and a date, suddenly this looks like credit card details. We built algorithms to intelligently detect multiple items and then classify the document based on what was present, using a risk score to help users focus on the right things first.
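A rough sketch of this kind of co-occurrence scoring is shown below. The regular expressions, the weights and the rule that supporting signals only count when a card-like number is present are illustrative assumptions, not the algorithms we actually shipped. Name detection is left out because it needs more than a pattern match.

```python
import re

# Hypothetical detectors for the individual signals.
DETECTORS = {
    "card_number": re.compile(r"\b(?:\d{4}[ -]?){3}\d{4}\b"),
    "security_code": re.compile(r"\b\d{3}\b"),
    "expiry_date": re.compile(r"\b(?:0[1-9]|1[0-2])/\d{2}\b"),
    "postcode": re.compile(r"\b[A-Z]{1,2}\d[A-Z\d]?\s?\d[A-Z]{2}\b"),
}

# Assumed weights: the 16-digit number anchors the score, the rest reinforce it.
WEIGHTS = {
    "card_number": 0.4,
    "security_code": 0.2,
    "expiry_date": 0.2,
    "postcode": 0.2,
}


def pci_risk(text):
    """Return (risk_score, signals_found) for a document's text."""
    found = sorted(name for name, pattern in DETECTORS.items() if pattern.search(text))
    if "card_number" not in found:
        # A 3-digit number or a date on its own means little.
        return 0.0, found
    return round(sum(WEIGHTS[name] for name in found), 2), found


doc = "Card 4111 1111 1111 1111, exp 09/27, CVV 123, billing postcode SW1A 1AA"
print(pci_risk(doc))
print(pci_risk("Room 101 is booked"))
```

The first document contains all four signals and gets the maximum score. The second contains a 3-digit number and nothing else, so it scores zero.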

How visible is your information?

This story would not be complete without another capability that we developed: the information visibility metric.

This metric tells a user how visible a piece of data or a document is within the enterprise. This is possible because, as part of the enrichment process, we store all the access permissions for each piece of information that we index. We can take these visibility details and use them to boost documents that can be seen by lots of people to the top of the queue. The rationale is that you are more likely to get yourself into GDPR-related hot water if you are storing sensitive data in a location that is wide open to the whole business.
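As a simple illustration, the ranking could look something like the following. The `Finding` class, the visibility formula (share of users with read access) and the choice to multiply visibility by a detection confidence are assumptions made for this sketch, not a description of the real metric.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    path: str          # where the document lives (hypothetical field)
    confidence: float  # 0..1 likelihood that it really contains PII or PCI
    readers: int       # number of users whose permissions allow read access


def visibility(readers, total_users):
    """Share of the organisation that can see the document, from 0 to 1."""
    if total_users <= 0:
        return 0.0
    return min(readers, total_users) / total_users


def prioritise(findings, total_users):
    """Order findings so that widely visible, high-confidence items come first."""
    return sorted(
        findings,
        key=lambda f: f.confidence * visibility(f.readers, total_users),
        reverse=True,
    )


findings = [
    Finding("hr/private/payroll.xlsx", confidence=0.9, readers=5),
    Finding("shared/everyone/old_export.csv", confidence=0.6, readers=500),
]
for f in prioritise(findings, total_users=500):
    print(f.path)
```

Here the export on the open share is listed first, even though its confidence is lower, because everyone in the business can read it.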

Something we quickly learnt about PII and PCI is that customers have far more of it than they realise. Undertaking this process of discovery usually reveals sensitive personal data sitting in places like network drives, where it remains for years and is usually undiscoverable. And, because customers have so much information like this, remedying it manually becomes an impossible task. Being able to attach a confidence score to the items found, along with prioritising those items that are accessible to numerous users across the business, offers organisations a pragmatic and progressive way to address their personal data problem for GDPR.

Cheers, and see you soon, Paul…

If you missed my previous blogs in the 12 Days of Information Enrichment series, you can catch up here.

Day 1 - What is enrichment? Creating wealth from information

Day 2 - Starting at the beginning with Text Extraction

Day 3 - Structuring the unstructured with Business Entity Extraction