Enrichment: Starting at the beginning with Text Extraction.

by Paul Maker

If you recall the first blog in this series, I introduced something that we call 'information enrichment'. Just to recap, this is the process of taking a piece of information – it could be data or a document – and then adding additional business context to it. This context is often labels, metadata, classifications and other such things that we can use to better structure, navigate and use that information.

This blog is going to talk about the first step in the process of document enrichment – text extraction.

Managing multiple document formats

On the face of it, this sounds simple – a document contains text, let’s just extract it. However, the reality behind the scenes is far from simple. Whilst documents contain text, that text is wrapped up in a whole host of structures that control format, rich media and so on.

Our challenge in Aiimi Labs was to find an efficient way for our product, InsightMaker, to handle the myriad of document formats out there. Even just looking at Microsoft Office documents, we have numerous format versions that date back decades. Consider Word, for instance: the 32-bit Office version was released in 1995!

We tested a series of different approaches, from third-party vendor options, such as the Microsoft Filter interface (which Microsoft developed to support its indexing service), to Open Source offerings, such as Apache Tika.

After several rounds of functionality and performance testing, we settled on Apache Tika. Its huge breadth of supported document formats and the fact that it was Open Source gave us a scalable approach to document conversion, as well as the flexibility and control we needed – something that would pay back in spades later in our journey.

Time(out) trials...

After deploying this at numerous customer sites, we found a problem. Some document formats (especially those containing embedded documents, which Tika handles recursively) could cause the Java process's memory consumption and CPU usage to spiral out of control. In an enterprise processing half a billion documents, this very quickly becomes a problem.

To solve this challenge, we decided to fork the Apache Tika code and make some internal modifications to the REST endpoints. Specifically, we added a timeout parameter. This allows us to control the maximum time that is spent trying to convert a document. If the conversion exceeds this time limit, we internally terminate the conversion process. We also have plans in our InsightMaker product roadmap to add memory utilisation monitoring. This will further enhance how we can reliably convert masses of documents, since long execution time alone is not always an indicator of a problem!
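To make the timeout idea concrete, here is a minimal sketch in Python. It is not the change we made to the Tika REST endpoints; it only illustrates the general pattern of running a conversion in a separate process and terminating it once a time limit passes. The function name `convert_with_timeout` and the stand-in converter commands are hypothetical, chosen purely for illustration.

```python
import subprocess
import sys
from typing import Optional, Sequence


def convert_with_timeout(command: Sequence[str], timeout_seconds: float) -> Optional[str]:
    """Run a converter command in a child process.

    Returns the text written to stdout, or None if the conversion
    failed or did not finish within timeout_seconds.
    (Hypothetical helper, not part of Tika.)
    """
    try:
        completed = subprocess.run(
            list(command),
            capture_output=True,
            text=True,
            timeout=timeout_seconds,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising,
        # so a runaway conversion does not keep consuming CPU.
        return None
    if completed.returncode != 0:
        return None
    return completed.stdout


if __name__ == "__main__":
    # Stand-in "converters": one finishes quickly, one hangs.
    quick = [sys.executable, "-c", "print('extracted text')"]
    stuck = [sys.executable, "-c", "import time; time.sleep(30)"]

    print(convert_with_timeout(quick, timeout_seconds=30))
    print(convert_with_timeout(stuck, timeout_seconds=0.5))
```

Running the conversion in its own process is what makes termination safe here: the caller can abandon a stuck document and move on to the next one without being affected itself.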

Unpacking complex CAD drawings

Whilst we found Apache Tika great for Office documents, PDF files, ZIP files and numerous other unusual formats, it did not handle things like CAD drawings.

CAD is an important format for us to be able to unpack and extract text from because we have several customers in the Asset Management and Engineering sector. Unless you’re a domain expert, CAD drawings are often some of the most difficult files to find in an organisation. CAD files also contain a wealth of information which we can convert to metadata and entities, allowing us to begin joining up drawings with their structured SAP data.

For CAD, we decided to use ASPOSE for conversion to text; but, as with Tika, we found that certain large drawings took an excessive amount of time to process. To resolve this, all we needed to do was carry over the same timeout approach that we used with Tika. We also invested a lot of time making sure that we only extract useful text from CAD drawings – not the thousands of numerical measurement values.
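As a rough illustration of what "only extract useful text" can mean, the sketch below drops strings that look like bare measurement values and keeps everything else. The pattern, the list of units and the function name `keep_useful_text` are assumptions made for this example; the real filtering rules in our pipeline are more involved and are not shown here.

```python
import re
from typing import Iterable, List

# Assumed definition of a "measurement": an optional sign or diameter
# symbol, a number with an optional decimal part, and an optional unit.
MEASUREMENT = re.compile(
    r"[-+±Ø]?\s*\d+(?:[.,]\d+)?\s*(?:mm|cm|m|in|ft|°)?",
    re.IGNORECASE,
)


def keep_useful_text(strings: Iterable[str]) -> List[str]:
    """Return the strings that are not blank and not bare measurements."""
    useful = []
    for raw in strings:
        text = raw.strip()
        if not text:
            continue  # skip empty labels
        if MEASUREMENT.fullmatch(text):
            continue  # skip values such as "1250.00" or "12.5 mm"
        useful.append(text)
    return useful


if __name__ == "__main__":
    labels = ["PUMP STATION 3", "1250.00", "12.5 mm", "VALVE V-101", "Ø 45"]
    print(keep_useful_text(labels))
```

Note that labels which merely contain digits, such as an asset tag, are kept, because the pattern has to match the whole string before a value is discarded.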

So, that’s the first step of enrichment covered – text extraction. Tomorrow, I will talk about how we extract business entities from information and start to put them to work in the business.

Cheers, and see you soon, Paul…

If you missed my blogs in the 12 Days of Information Enrichment series, you can catch up here.
