Outlining the Difference Between Structured, Unstructured and Semi-structured Data
March 25, 2021 / Intelligent Process Automation, Robotic Process Automation
As you delve into process automation, before long you’ll learn about the three basic forms of data and why they matter for automation: structured, unstructured and semi-structured data. In a nutshell, you can automate processes involving structured data with simple tools, but when it comes to unstructured and semi-structured data, you’ll need an intelligent automation platform.
In this post, we’ll walk you through the attributes of each type of data and explain why the data type matters when it comes to intelligent document processing.
Structured data: best for RPA and templates
As its name implies, structured data is highly organized, typically in a database or spreadsheet with rows and columns. As a result, each piece of data can be mapped to a specific, fixed field or location.
Structured data is often managed using the Structured Query Language (SQL), a common programming language for relational databases. With relational databases, it’s possible to view data by various criteria, such as customers by region, and to answer queries such as, “customers who spent more than $500 with us last year.”
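To make the query above concrete, here is a minimal sketch using Python’s built-in sqlite3 module. The orders table, its column names and the sample rows are all invented for illustration, and “last year” is assumed to be 2020.

```python
import sqlite3

# Hypothetical schema and sample rows, invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (customer TEXT, region TEXT, order_year INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [
        ("Acme", "East", 2020, 300.0),
        ("Acme", "East", 2020, 250.0),
        ("Birch", "West", 2020, 120.0),
        ("Cedar", "West", 2020, 900.0),
        ("Cedar", "West", 2019, 50.0),
    ],
)

# "Customers who spent more than $500 with us last year" (assumed to be 2020).
big_spenders = conn.execute(
    """
    SELECT customer, SUM(amount) AS total
    FROM orders
    WHERE order_year = ?
    GROUP BY customer
    HAVING SUM(amount) > 500
    ORDER BY customer
    """,
    (2020,),
).fetchall()

print(big_spenders)  # [('Acme', 550.0), ('Cedar', 900.0)]
```

Because every value sits in a known column, the question reduces to a filter, a grouping and a sum. No interpretation of the content is needed.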
It’s relatively easy to automate processes that involve structured data. Robotic process automation (RPA) tools or solutions that use optical character recognition and templates work well with structured data. You can build automation routines that tell the tools exactly where the data they need resides. So long as there’s no deviation from that norm, the tools should work well to automate simple, repetitive tasks, such as extracting data from a spreadsheet and entering it into a customer relationship management (CRM), enterprise resource planning (ERP) or other downstream system.
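The kind of routine described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not how any particular RPA product works: the spreadsheet columns and the CRM field names are hypothetical.

```python
import csv
import io

# Hypothetical spreadsheet export; the column names are assumptions.
SPREADSHEET = """name,email,region
Ada Lovelace,ada@example.com,East
Alan Turing,alan@example.com,West
"""

# Fixed mapping from spreadsheet column to a (hypothetical) CRM field.
COLUMN_TO_CRM_FIELD = {
    "name": "full_name",
    "email": "contact_email",
    "region": "sales_region",
}


def rows_to_crm_records(csv_text):
    """Map each spreadsheet row to a CRM record using the fixed column mapping."""
    reader = csv.DictReader(io.StringIO(csv_text))
    records = []
    for row in reader:
        # A KeyError here means the layout deviated from what the routine expects.
        records.append({crm: row[col] for col, crm in COLUMN_TO_CRM_FIELD.items()})
    return records


records = rows_to_crm_records(SPREADSHEET)
print(records[0])
```

The routine works only because it is told exactly where each value lives. Rename a single column and it fails with a KeyError, which is the “deviation from that norm” the paragraph above warns about.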
Unstructured data: requires intelligent automation
Unstructured data, on the other hand, adheres to no special format. Types of unstructured data include the text in an email message, PDFs, Word files, photos, presentations, call center or legal transcripts and more.
It’s widely accepted that the vast majority – at least 80% – of all data in any given organization is unstructured. Given it follows no predetermined format, it’s much more difficult to automate processes involving unstructured data. Indeed, until the relatively recent advent of artificial intelligence technology, it was all but impossible.
But AI changes the game. With enough data, we can now train models to “read” unstructured data much like a human does, complete with an understanding of the context behind any given document or image. The models can extract key data elements required to automate a given process, such as financial figures, social security numbers, names, addresses and so on. Or, a model may be fed an image of a damaged car and be smart enough to know, “This car has been in an accident and has damage to the right front fender.”
Semi-structured data: usually requires intelligence
Semi-structured data falls somewhere in between the other two categories. Back to the email example, while the text of the email is unstructured, the header contains structured elements: the “to” and “from” fields, date and time, for example. So, as a whole, an email may be considered an example of semi-structured content.
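The split between the structured header and the unstructured body is easy to see with Python’s standard email package. The message below is made up for illustration.

```python
from email import message_from_string
from email.utils import parsedate_to_datetime

# A made-up message, used only to illustrate the header/body split.
RAW_EMAIL = """\
From: pat@example.com
To: claims@example.com
Date: Thu, 25 Mar 2021 09:30:00 -0400
Subject: Fender damage

Hi, my car was hit in a parking lot yesterday and the right front
fender is dented. Can you tell me what to do next?
"""

msg = message_from_string(RAW_EMAIL)

# Structured part: each header is a named field that can be read directly.
structured = {
    "from": msg["From"],
    "to": msg["To"],
    "sent": parsedate_to_datetime(msg["Date"]),
}

# Unstructured part: free text with no fields to look up.
body = msg.get_payload()

print(structured["from"], structured["sent"].year)
print(body)
```

A simple tool can sort or route messages on the header fields alone. Working out from the body that this is a damage claim involving a front fender requires actually reading the text.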
Digital photos are another example. Typically, they will also contain a date, time and perhaps location where the photo was taken – all structured elements, although the image is wholly unstructured.
For such cases, it’s possible to use an RPA or templated tool to automate some of a process for handling these data types, such as categorizing by date. But you’ll still need an intelligent automation tool to find and extract the relevant data. Keeping in mind that the intelligent tool can also be used on structured data, it makes more sense just to use it to automate the entire document processing effort.
Invoices are often touted as an example of semi-structured content. That may be the case if your company gets invoices from only four or five suppliers, and they consistently use the same format. In that case, it’s conceivable you could train an RPA or templated tool to extract key data elements to automate invoice processing.
But large companies likely receive invoices from dozens if not hundreds of companies that use many different formats. You’d be hard-pressed to create templates to handle each of them, and would forever be troubleshooting them as they change over time. Here again, it makes more sense to treat the invoices as unstructured content and use an intelligent data processing tool to automate invoice processing.
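To illustrate why templates are brittle, here is a small Python sketch in which a regular expression stands in for a template. Both invoice layouts and the pattern are hypothetical; real templated tools are more elaborate, but they share the same dependence on a known layout.

```python
import re

# A "template" written for one hypothetical supplier's invoice layout.
SUPPLIER_A_TEMPLATE = re.compile(
    r"Invoice No: (?P<number>\w+)\s+Total Due: \$(?P<total>[\d.]+)"
)


def extract_with_template(text):
    """Return the captured fields, or None if the layout does not match."""
    match = SUPPLIER_A_TEMPLATE.search(text)
    return match.groupdict() if match else None


invoice_a = "Invoice No: A1001\nTotal Due: $250.00"
invoice_b = "Amount payable 250.00 USD (ref. B-77)"  # same facts, different layout

result_a = extract_with_template(invoice_a)
result_b = extract_with_template(invoice_b)

print(result_a)  # {'number': 'A1001', 'total': '250.00'}
print(result_b)  # None
```

The second invoice carries the same kind of information, yet the template extracts nothing from it. Multiply that by hundreds of suppliers whose layouts change over time and the maintenance burden becomes clear.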
Indico’s Intelligent Process Automation platform can handle the full gamut of document processing needs, whether they involve highly structured documents, completely unstructured ones or something in between. It’s effective because it’s built on a database of more than 500 million labeled data points, providing a deep base of knowledge that gives it the context required to “read” and understand virtually any type of content.
Taking advantage of AI technology known as transfer learning, Indico makes it easy for business process owners to put that database to use to automate their own processes. Our intuitive tools enable business process owners to quickly label actual documents, telling the model which data to focus on. In a matter of hours, you can build a model that will be up to 95% accurate.