Structure your unstructured data with automated discovery and enrichment, steering business success with timely and cost-effective insights.
Our CTO, Paul Maker, and Content Lead, Matt Eustace, discuss the differences between unstructured and structured information – and reveal how structuring the unstructured with automated discovery and enrichment unlocks insight, saves time, and reduces costs for real business value.
Join their conversation...
Matt: All organisations have unstructured and structured data spread across their vast repositories and systems. I classify unstructured data as content, like Word documents, Excel spreadsheets, PDFs, and CAD drawings, while spreadsheets and databases can sometimes cross over into semi-structured data. This blurs the line between the unstructured and structured worlds.
Paul: That’s the thing. Some people consider Excel spreadsheets and CSV files to be unstructured data, simply because they’re often stored on unstructured systems like SharePoint. But they’re actually quite structured. So, yes, it’s a grey area. But on the opposite side of the coin, structured data is very clear cut – it’s your system servers, databases, CRMs, and SAPs. I guess the questions we really need to answer are: Why does it matter? And what are the impacts?
Matt: I think the traditional problem with unstructured and semi-structured data is working out how to find the information you need from what you’ve already got stored. What tools do businesses have in place to do that effectively? And what are the solutions to help them derive information, discover it, and classify it properly?
Paul: Right. That’s the first challenge, the discovery problem. You need a platform that can consistently dip into all those diverse repositories and separate systems to create one unified index. It needs to constantly do this and be accessible to all users, so they don’t need to be familiar with back-end repositories and systems. Once you’ve got that in place, you’ve got a one-stop-shop. You can then find and discover what you need.
The next challenge is how to enrich that information to add value. Importantly, you need to add context to it. Often that context manifests itself as a series of labels. For example, take an invoice. You might classify it as a finance document pertaining to a specific customer, and then add its status as ‘paid’. By adding this context, you’ve enriched that piece of unstructured data, making it more structured. At that point, you’re arming yourself with the ability to organise it better and treat it more like structured data. You can now easily partner that data with its counterparts in the true structured world, such as your SAPs and CRMs. This is the step that follows discovery, and we call it enrichment. This insight capability really brings the unstructured and structured data worlds together.
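To make the invoice example concrete, here is a minimal Python sketch of enrichment as labelling. Everything in it is an assumption made for illustration: the `Document` class, the `enrich` function, the keyword rules, and the customer name are hypothetical, and a real platform would use far richer classification than keyword matching.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Document:
    """A piece of unstructured content plus the labels added during enrichment."""
    path: str
    text: str
    labels: dict[str, str] = field(default_factory=dict)


def enrich(doc: Document, known_customers: list[str]) -> Document:
    """Attach context labels to a document using simple, illustrative keyword rules."""
    text = doc.text.lower()

    # Classify the document type.
    if "invoice" in text:
        doc.labels["type"] = "finance"

    # Link the document to the first known customer mentioned in it.
    for customer in known_customers:
        if customer.lower() in text:
            doc.labels["customer"] = customer
            break

    # Record the status; the word boundary stops "unpaid" counting as "paid".
    if re.search(r"\bpaid\b", text):
        doc.labels["status"] = "paid"

    return doc


invoice = Document("invoices/0042.pdf", "INVOICE 0042 for Acme Water Ltd. Status: PAID")
enrich(invoice, ["Acme Water Ltd"])
print(invoice.labels)
```

Once the labels exist, the document can be filtered, grouped, and joined to customer records much like a row in a structured system.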
Matt: I guess the traditional approach to insight capability, without an Insight Engine, is to use content management platforms. And try to get all your users to classify all your data in the same way with a uniform schema for classifying information. That, I think, can be a real challenge.
Paul: That’s right. It’s a huge challenge because it requires all your users to interact with your platforms and classify information according to dropdown lists, adding text to different metadata fields, such as status ‘paid’ for an invoice. You’ll deploy these platforms, and for the first six months your users will do what’s been asked, but then the wheels will quickly start to fall off. And that massive investment you’ve made just turns into another big bucket full of disorganised data and information, which isn’t neatly classified, isn’t correctly tagged, and has no meaningful metadata. To tackle this problem, you need to remove all these requirements from your users and let technology automate that process. That’s sustainable.
Matt: Yes, when you try to adopt content management solutions, the main objection from users is that they just want to do their job. They don’t want to be stopped in their tracks to classify something they’re not actively saving. Particularly in professional services where users are producing high volumes of content. Automation really delivers extra value here, linking new content to structured data you’ve already got. Now, the information being classified is automatically connected to the entities you deal with, such as your assets, customers, invoices, and so on.
Paul: Plus, if you happened to get your labelling incorrect during classification and through your metadata, you can easily reconcile it with master data that exists in, say, your structured SAP or CRM systems. You could, for example, validate customer names on invoices or phone numbers on bank statements against your master data. You’ve almost completely automated its extraction, tagging, and validation, massively reducing the burden on your end users.
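As a rough illustration of reconciling extracted labels against master data, the sketch below fuzzy-matches a customer name pulled from an invoice against a list of master records, using `difflib` from the Python standard library. The `reconcile` function, the sample customer names, and the 0.8 similarity cutoff are all assumptions chosen for the example, not a description of any particular product.

```python
import difflib
from typing import Optional


def reconcile(extracted_name: str, master_names: list[str], cutoff: float = 0.8) -> Optional[str]:
    """Return the master-data name closest to an extracted name, or None if nothing is close enough."""
    # Compare in lower case, but keep a lookup back to the canonical spelling.
    canonical = {name.lower(): name for name in master_names}
    matches = difflib.get_close_matches(
        extracted_name.lower(), list(canonical), n=1, cutoff=cutoff
    )
    return canonical[matches[0]] if matches else None


# Hypothetical master data, as might be exported from a CRM.
master = ["Acme Water Ltd", "Borough Gas plc"]

print(reconcile("Acme Watr Ltd", master))  # a misspelt name is corrected to the master record
print(reconcile("Zzz Unknown", master))    # no close match, so it can be flagged for review
```

Names that fail to match can be routed to a person for review, so users only step in for the exceptions.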
Matt: So, as a result of these enrichment and discovery processes, you end up with a structured repository of information, and you’re no longer relying on your unstructured repositories. You’ve now got data that’s properly structured, properly classified, treated in the same way as any other structured data, and you can perform analytics across it.
Paul: And at the same time, you can still connect back to your unstructured information. You’ve enabled your users to see a unified view of everything related to a given entity, like a water utility you might be working with. And, as you say, you’ve also created a repository that can be used to drive analytics. You might’ve been using fact extraction, for example, across PDFs, pulling out data that wasn’t previously available in your SAPs or CRMs. Now, you can start to use that to drive insight, analytics, and better decision-making too. You’ve now created a structured repository and connections across it, spanning analytics, discovery, and enterprise search, to support a whole plethora of use cases in different industries.
By structuring content, you can easily find what you need, when you need it, so you also avoid recreating it. There’s an enormous wealth of valuable data stored in your content, which many of your users can simply never access. You can’t unlock insight if you don’t even know it’s there in the first place, right?
Matt: I guess one of the challenges I come across when I talk to risk and compliance teams, for example, is shared by all businesses – Data Subject Access Requests or the UK Data Protection Act, those kinds of things. These teams need to find a structured piece of information that lies buried in a huge repository or plethora of different systems. So, I take unstructured data that these teams are likely to be very interested in, such as a name or National Insurance number, and make it into a structured entity. I then supply a search interface or visualisation showing them where all these entities exist at source.
Paul: That’s right. You’ve totally spun the problem around, because without extracting those entities and structuring that data, you could never quantify your risk. All you could do is search for, say, an individual National Insurance number and say that it exists, rather than saying you’ve got 50,000 National Insurance numbers stored across your systems. So, that’s the benefit of structuring unstructured information.
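The difference between searching for one number and quantifying them all can be sketched in a few lines of Python. This is an illustrative assumption rather than a production approach: the pattern is a simplified take on the National Insurance number format (two letters, six digits, a suffix letter from A to D) that ignores the official rules on excluded prefixes, and the `find_ni_numbers` function and sample documents are hypothetical.

```python
import re
from collections import defaultdict

# Simplified NI number pattern: two letters, six digits, suffix A-D, with optional spaces.
NI_PATTERN = re.compile(r"\b[A-Z]{2}\s?\d{2}\s?\d{2}\s?\d{2}\s?[A-D]\b")


def find_ni_numbers(documents: dict[str, str]) -> dict[str, set[str]]:
    """Map each NI number found to the set of source documents that contain it."""
    index: dict[str, set[str]] = defaultdict(set)
    for path, text in documents.items():
        for match in NI_PATTERN.findall(text):
            normalised = re.sub(r"\s", "", match)  # "QQ 12 34 56 C" becomes "QQ123456C"
            index[normalised].add(path)
    return dict(index)


# Hypothetical content drawn from different repositories.
documents = {
    "hr/letter.txt": "NI number: QQ 12 34 56 C",
    "finance/payroll.csv": "Smith,QQ123456C\nJones,AB654321D",
    "notes.txt": "no identifiers here",
}

index = find_ni_numbers(documents)
print(len(index), "distinct NI numbers found")
for number, sources in sorted(index.items()):
    print(number, "appears in", sorted(sources))
```

Because every number is stored as an entity alongside its sources, the same index answers both questions: where a given individual's data lives, and how much of this kind of data is held overall.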
Matt: And you could use that in quite a few different use cases. I’ll give you an example of an engineering customer I work with. They’ve been in operation for around 80 years, and they’ve got data that stretches back throughout that time on products still in service today. At the moment, that information exists as images, so if they receive a query from a customer about a very old piece of equipment, they need to find related information. In this use case, the organisation could take the original source content and discover and enrich it using structured data queries.
Paul: And possibly even exploit things like vision technology for image recognition, perhaps pulling out information from scanned copies of old drawings. For example, we’ve looked at information site plans that are 60 years old and, using vision technology, we’ve pulled out the site code and measurements, and then structured that archived information to make it discoverable in the future.
Matt: And then there’s language. You could be searching for words or phrases, and easily apply ontological classification to structure that type of data too.
Paul: You can also gain insight. Moving into the space of Insight Engine capability, you can pull out phrases, topics, concepts, classifications, and summarisations, enabling you to take a document and hang a structure around it. But, usually, the use case for that is more along the lines of intelligence and case processing. Say you receive 20 documents related to a case. You can run them through summarisation, phrases and topics, concept analysis, and named entity recognition, and suddenly you’ve created a dashboard describing exactly what that case is about. Now that you’ve got a structured set of data, you can prioritise cases and assign work to the right people, such as call centre, complaints, or processing teams. You’re taking unstructured information and structuring it to exploit a use case. But in terms of value, what do you get with these Insight Engines?
Matt: The scenario closest to my heart, and one that I see and talk to a lot of people about, is when you’ve got no choice but to look at your unstructured content in a holistic fashion. Insight Engines let you do that with minimal effort and minimal cost. For instance, if you have a team of people tasked with reading documents purely to understand where data is stored and whether it meets your discovery criteria, like disclosures, you’ll find it takes up a lot of your resources. But Insight Engines automate that discovery and enrichment process, saving you a lot of time, effort, and money.
Paul: Right, those are some of the fundamental benefits, plus better decision-making and potentially reducing your time to market.