Intelligent Document Processing has a Data Problem
April 28, 2021 / Intelligent Document Processing, Intelligent Process Automation
Intelligent automation has a data problem. That is, it typically takes enormous quantities of data to effectively automate processes that involve unstructured content. But there is a solution, and it involves intelligent document automation platforms that make effective use of artificial intelligence (AI) technologies including deep learning and transfer learning.
The problem with unstructured content
Unstructured content accounts for some 80% or more of all the data in a typical company. Nearly everything outside of highly structured content such as spreadsheets and databases is considered unstructured, including email, Word documents, PDFs, images and more. (See this previous post for a deeper dive on unstructured vs. structured content.)
The endless variety of potential formats and content inherent in unstructured data presents a significant problem when it comes to applying AI technology to automate processes that involve such data.
Approaches to training AI models
At a high level, AI technology works by training models, or algorithms, on how to perform a given function. In the case of automated document processing, that means training a model to understand what data you’re looking to extract from a given document. To automate the life insurance underwriting process, for example, you need to train a model to identify data points that are critical to that process – such as an applicant’s age, health history, occupation and so on.
When applying AI to automate document processing, there are essentially two ways to train a model. One is to present numerous examples of the sorts of documents you’re dealing with and identify the data points you want to extract from each. With that approach, you’d need thousands or even hundreds of thousands of documents per process to train a model to any reasonable degree of accuracy.
Deep learning, however, flips that approach on its head. You simply provide examples of what you want the end result to be, and the platform figures out how to create a model that produces the desired result. (Under the covers there’s lots of wonky technology involving things like neural networks, but we don’t need to go there.)
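To make the "provide examples of the end result" idea concrete, here is a minimal sketch of supervised training from labeled examples. It uses a simple scikit-learn text classifier as a stand-in for a deep network, and the snippets, field names and labels are hypothetical, not Indico's actual data or API:

```python
# Minimal sketch: train a model by showing it labeled examples of the
# desired result. A simple scikit-learn classifier stands in for a deep
# network; the documents and field labels below are made up.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Labeled examples: snippets of document text and the underwriting
# data point each one contains.
examples = [
    ("Date of birth: 03/14/1975", "age"),
    ("Applicant is 46 years old", "age"),
    ("Diagnosed with hypertension in 2015", "health_history"),
    ("No prior surgeries or chronic conditions", "health_history"),
    ("Occupation: commercial airline pilot", "occupation"),
    ("Employed as a software engineer", "occupation"),
]
texts, labels = zip(*examples)

# Fit a text classifier on the labeled examples; the pipeline learns
# which words signal which field.
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(texts, labels)

# Ask the trained model to label an unseen snippet.
prediction = model.predict(["Works as a registered nurse"])[0]
print(prediction)
```

A real deep-learning platform replaces the hand-built pipeline with a neural network, but the workflow is the same: you supply examples of the result you want, and the system fits a model that produces it.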
Related Article: 3 Keys to Scaling Document Process Automation Enterprise-Wide
How Indico reduces data requirements
In practice, you do still need a sizeable database of labeled data points in order to create effective models. But a good intelligent document processing platform will have that base covered for you. The Indico Intelligent Process Automation platform, for example, is built on a database of some 500 million labeled data points. (We spent the first couple of years of our existence building that database.) That’s enough to provide context behind virtually any type of document or image you throw at it.
Transfer learning enables you to take our generalized model and apply it to your specific process, but with only a fraction of the data you would normally need to train a model. To automate that insurance underwriting process, for example, it would take just 100 to 200 of the documents that are actually involved in the process to train a model with accuracy rates of 90% or better.
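The mechanics of transfer learning can be sketched in a few lines: keep the weights of a large pretrained model frozen and train only a small head on your own labeled examples. This toy NumPy version uses made-up 2-D data and a hand-wired "pretrained" feature extractor purely for illustration; it is not Indico's model:

```python
# Toy sketch of transfer learning: a frozen, pretrained feature
# extractor plus a small trainable head fit on a modest labeled set.
import numpy as np

rng = np.random.default_rng(0)

def pretrained_features(x):
    # Stand-in for a model trained on a huge general corpus; its
    # weights stay frozen during fine-tuning.
    W = np.array([[1.0, 0.0], [0.5, 2.0], [-1.5, 0.3]])
    return np.tanh(x @ W.T)

# A small labeled set from the target process (on the order of the
# 100-200 documents mentioned above; here just toy 2-D points).
X = rng.normal(size=(200, 2))
y = (X[:, 0] > 0).astype(float)

# Train only the lightweight head (logistic regression on the
# frozen features) with plain gradient descent.
feats = pretrained_features(X)
w = np.zeros(feats.shape[1])
b = 0.0
lr = 0.5
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(feats @ w + b)))
    grad = p - y
    w -= lr * feats.T @ grad / len(y)
    b -= lr * grad.mean()

accuracy = ((p > 0.5) == (y > 0.5)).mean()
print(f"head-only accuracy: {accuracy:.2f}")
```

Because only the small head is trained, a few hundred labeled examples suffice, which is the essence of why transfer learning slashes the data requirement.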
Compared with a traditional AI approach, that reduces your data requirements by a factor of 100 to 1,000. In effect, it makes intelligent automation feasible for the vast majority of companies that don’t have the funds, time or ability to collect the required data on their own.
Limited compute requirements
Another benefit is that the Indico approach dramatically reduces the computing power required for automated document processing. Because we’ve done most of the model training up front, our platform can run on just one or two GPUs – not the 10 or more required with traditional AI approaches (an issue we covered in this previous post).