Rule-based or templated approaches to document process automation struggle to deal with unstructured content, which makes up the vast majority of content in a typical organization. Intelligent automation solutions, however, are able to understand unstructured documents much like a human does, enabling them to effectively automate processes involving unstructured content for process automation use cases in Financial Services, Insurance, and more.

Download the Everest Group Whitepaper: Unstructured Data Process Automation

Download Now

Unstructured Content Fast Facts

= 85% of Data

in the Enterprise

Less than 2%

is leveraged by AI

80% reduction

in resources with AI

Manually Extracting

relevant information is laborious and error-prone

Document Variation

makes rule-based workflow automation impractical

Intelligent Automation

is needed to understand document context

The Unstructured Data Challenge

unstructured content processes including PDF, excel and other documents, flowing

As companies continually seek to automate as many processes as they can, they often hit a wall when it comes to unstructured data such as long form financial documents, contracts and emails. The reason is simple: by its very nature, unstructured content varies dramatically from one document to the next.

Historically companies dealt with unstructured content by using the only means at their disposal: having humans read the documents, find relevant information and manually enter it into the desired destination system. Such a process is inherently unscalable and error-prone, as humans tend to get tired and bored.

More recently, companies have tried to use rule-based approaches such as OCR Templating and or Robotic Process Automation to automate document processing. This is often met with little to no success because these templates are too brittle to handle the variation inherent in unstructured content. Intelligent process automation software, on the other hand, offers a viable solution in many cases.

This guide will explain the various approaches for dealing with unstructured document processing so you can make an informed decision on which will work best for you.

As companies continually seek to automate as many processes as they can, they often hit a wall when it comes to unstructured data such as long form financial documents, contracts and emails. The reason is simple: by its very nature, unstructured content varies dramatically from one document to the next.

Historically companies dealt with unstructured content by using the only means at their disposal: having humans read the documents, find relevant information and manually enter it into the desired destination system. Such a process is inherently unscalable and error-prone, as humans tend to get tired and bored.

More recently, companies have tried to use rule-based approaches such as OCR Templating and or Robotic Process Automation to automate document processing. This is often met with little to no success because these templates are too brittle to handle the variation inherent in unstructured content. Intelligent process automation software, on the other hand, offers a viable solution in many cases.

This guide will explain the various approaches for dealing with unstructured document processing so you can make an informed decision on which will work best for you.

Machine readable text

What is OCR and do you need it?

Among the first tools you’re likely to encounter while researching solutions for automating processes involving unstructured content is optical character recognition, or OCR. 

You’d use OCR to deal with scanned documents. This means the documents are effectively images, even if they are saved in PDF format.  Images, of course, are not readily machine-readable, so computers can’t immediately process what humans can clearly see is text in the document. OCR addresses that issue by identifying text in such documents and converting it to a digitized format that computers can deal with.

In some instances, you may not even need OCR to interpret a PDF document. If you take a Word document, for example, and save it as a PDF, the text information should be automatically preserved in text format, and any system can then process it directly. 

Caution! Beware OCR Templating…

You may have come across the term “OCR templating” in your search for document processing solutions. It’s often confused with vanilla OCR. While it also uses OCR in its workflow (i.e., turning images into readable text), the key difference arises in the term “templating.” This means that these vendors offer to produce rule-based templates built from a handful of your documents.

Among the first tools you’re likely to encounter while researching solutions for automating processes involving unstructured content is optical character recognition, or OCR. 

You’d use OCR to deal with scanned documents. This means the documents are effectively images, even if they are saved in PDF format.  Images, of course, are not readily machine-readable, so computers can’t immediately process what humans can clearly see is text in the document. OCR addresses that issue by identifying text in such documents and converting it to a digitized format that computers can deal with.

In some instances, you may not even need OCR to interpret a PDF document. If you take a Word document, for example, and save it as a PDF, the text information should be automatically preserved in text format, and any system can then process it directly. 

Caution! Beware OCR Templating…

You may have come across the term “OCR templating” in your search for document processing solutions. It’s often confused with vanilla OCR. While it also uses OCR in its workflow (i.e., turning images into readable text), the key difference arises in the term “templating.” This means that these vendors offer to produce rule-based templates built from a handful of your documents.

Rule-Based Systems for Automated Document Processing

Another approach to automating unstructured document processing is to use rule-based tools. 

This approach requires humans to mark up a sample document by drawing bounding boxes around the relevant bits of information you want to extract and then label each box with the type of data you’re trying to capture – name, address, material changes, etc. 

You’re basically creating an expert system based on three essential elements:

Topology, meaning where elements appear on the page. For example, a date may always be near the top while a total, as in an invoice, typically appears at the bottom. 

Regex, which is a computer programming technique for searching words or information according to letter patterns. For example, you may write a rule to identify Social Security numbers by searching for the pattern: xxx-xx-xxxx. If, however, OCR misidentifies a number or letter (which is often the case), like the number “1” for the letter “l”, regex rules tend to unravel.

Taxonomy, which is the classification of rules, often following a hierarchy. “Number” may be a main category, followed by numerous subcategories, such as “dollar amount,” “social security number,” “street address” and so on. 

For example, you could write a rule to pick up the material changes note in an SEC filing like so:

  1. Look for word “primarily” because it is usually an indicator of the note
  2. Find the first period after the word “primarily” to find the end of the sentence
  3. Extract everything from “primarily” up until the period, classify it as the note

gif showing rule-based templates failing

This sort of approach works well if you need to process many documents that all look the same and will remain so – perhaps financial statements for personal bank accounts that all come from the same bank. If all the statements are from the same financial institution and all the data you’re after is in the same place on every single statement, then something like an OCR templating approach may be a viable solution.

 

But as soon as any variation enters the equation, like statements from a different bank, or variation in language for a longer form of content, such as picking up material changes in SEC filings, humans will need to intervene to create a new or revised template. It’s easy to see how coming up with all the required templates can quickly become unwieldy – and costly. And for use cases consisting of thousands of documents or more, all unique, rule-based approaches will fail. 

Intelligent Automation for Unstructured Content Processing

Turn documents into data

Intelligent process automation (IPA) takes a different approach. Rather than create templates for each possible variation, intelligent process automation tools use technologies such as natural language processing (NLP) that enable them to accurately understand text, tables and images within the context of any given document.

IPA tools do use OCR, to “read” text so that it can be input to the IPA platform. However, rather than requiring rules that identify each specific element in a document, IPA uses deep learning to contextually understand what content you’re looking to extract from a given set of documents.

There’s no need to create templates for every possible variation of a document you may have to deal with, because IPA tools can take what they’ve learned from one document and apply it to others – a concept known as transfer learning.

For example, if you were building a bot to help analysts quickly review SEC filings, you might want it to automatically highlight revenue increases or decreases to assess how well each company is performing. You could write a rule to look for the word “increase,” but could it also accommodate synonyms, misspellings, and implied meaning? If the factors that contributed to the increase are found not next to the word “increase,” but in a completely different sentence, how do you write rules for that?

Because IPA understands language and context, it can pick up on revenue change information no matter what synonym you use, and attribute it to the correct catalysts. In this same vein, OCR misspelling errors that throw off regex rules present little challenge to IPA. Humans can easily read misspelled words because the brain automatically recognizes the mistake and corrects for it. NLP models can do much the same thing, enabling them to more accurately interpret scanned images than rule-based systems can. With IPA, template-based solutions for highly variable content become obsolete.

In the example above, from an Apple 10-Q form, how do you write enough rules to show that “increase,” “favorable,” and multitudes of synonyms are all related, and that iPhone sales and a strong U.S. dollar were important contributing factors? With this example you can see that rule-based approaches are not automation, they are insanity. 

 

How to Evaluate Intelligent Automation Solution Providers

Rule-based vs. Intelligent Automation

Both rule-based and intelligent automation approaches have their place. Which you choose just depends on your exact application requirements. 

To determine which is best for you, there’s basically one big question to consider; What is the nature of the documents you’re dealing with? 

  • If there’s little to no variation between document formats, a rule-based approach should work.
  • If you’ve got lots of variation, you may want to consider intelligent process automation. 

Many vendors purport to have intelligent automation solutions that will meet your requirements for unstructured data digitization, but not all are selling “real” solutions with sound artificial intelligence technology behind them. So you’ll need to conduct due diligence for your organization. 

Following are some questions to consider as you vet vendors

What’s your algorithm strategy?

Some vendors have homegrown algorithms while others, like Indico, use open source algorithms that are constantly being improved. If a vendor can’t explain where its algorithm comes from and what it does, consider that a big red flag. 

What’s your database strategy? 

Answers here again will vary. In Indico’s case, the strategy centers around our base model that comprises more than 500 million data points that enable the IPA platform to understand human language and context. Users then customize the model for their specific application but require 100x to 1000x less data than if starting from scratch. 

What’s your application strategy?

AI solutions must come with applications that make it accessible to those who actually have to use it, including business people. Indico’s platform, for example, has a point-and-click user interface that makes it simple for anyone to create working process automation solutions and models in just hours. 

As you can see, there are many options for processing unstructured documents. With the advances in machine learning that have been made lately, many industries are increasingly choosing intelligent process automation over the rule-based techniques of the past. 

Both rule-based and intelligent automation approaches have their place. Which you choose just depends on your exact application requirements. 

To determine which is best for you, there’s basically one big question to consider; What is the nature of the documents you’re dealing with? 

  • If there’s little to no variation between document formats, a rule-based approach should work.
  • If you’ve got lots of variation, you may want to consider intelligent process automation. 

Many vendors purport to have intelligent automation solutions that will meet your requirements for unstructured data digitization, but not all are selling “real” solutions with sound artificial intelligence technology behind them. So you’ll need to conduct due diligence for your organization. 

Following are some questions to consider as you vet vendors

What’s your algorithm strategy?

Some vendors have homegrown algorithms while others, like Indico, use open source algorithms that are constantly being improved. If a vendor can’t explain where its algorithm comes from and what it does, consider that a big red flag. 

What’s your database strategy? 

Answers here again will vary. In Indico’s case, the strategy centers around our base model that comprises more than 500 million data points that enable the IPA platform to understand human language and context. Users then customize the model for their specific application but require 100x to 1000x less data than if starting from scratch. 

What’s your application strategy?

AI solutions must come with applications that make it accessible to those who actually have to use it, including business people. Indico’s platform, for example, has a point-and-click user interface that makes it simple for anyone to create working process automation solutions and models in just hours. 

As you can see, there are many options for processing unstructured documents. With the advances in machine learning that have been made lately, many industries are increasingly choosing intelligent process automation over the rule-based techniques of the past.