PDF, or Portable Document Format, files were developed to let people on different operating systems open, review, and print files without altering any of the file’s original elements or design. Successful adoption of this format over the years means that there is now a wealth of information stored in PDF files. Extracting that information programmatically, however, remains a difficult task. That’s because different programs (e.g. Microsoft Word, Pages) generate PDFs differently. While the final product may look the same to the human eye, it looks very different to a computer. As a result, most PDF extraction tools run into performance and formatting issues. For instance, columns of text may be combined into a single paragraph with sentences mashed up or out of order. Such problems prevent us from efficiently and automatically analyzing the data stored in these files, so our team developed an API to address them. The API extracts body text, images, metadata, and tables. It is compatible with Custom Collections and, with a bit of reformatting, CrowdLabel.
There are five key pain points that we focused on while developing this API:
- Headers identified as body text
- Footers and footnotes identified as part of body text
- Columns of text combined into a single paragraph with incorrect sentences
- Contents of tables included as part of body text
- Very slow performance
To give you some insight into how the API addresses these issues, let’s walk through one of our main challenges: extracting data from tables. The goal is to preserve the integrity of each table’s contents and prevent them from getting jumbled into the rest of the body text. First, we need to identify existing tables. To do so, the API finds all path elements in the PDF that are straight lines, and renders them as black-and-white images. Next, it uses some old-school computer vision filters to find points where lines cross, and marks these as the boundaries of the table. If there are enough “cells” in the image for our API to consider it a table, it finds the pixel coordinates of the table’s four corners and queries for all text elements contained within those boundaries. The API then removes those text elements from the article body to produce cleaner plain text. If you want the data from the tables, the API will return it separately via an optional argument (see the code in the following section!).
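The intersection-finding step can be sketched in plain NumPy. This is a simplified illustration, not the API’s actual implementation: given binary masks of the rendered horizontal and vertical line segments, cell corners are simply the pixels where both masks are set, and the `min_cells` threshold is a made-up stand-in for whatever heuristic the API uses.

```python
import numpy as np

def find_table_corners(horizontal, vertical, min_cells=4):
    """Return the bounding box of a table implied by line intersections.

    horizontal, vertical: 2-D boolean arrays of rendered line pixels.
    If the grid implies at least `min_cells` cells (a hypothetical
    threshold), return (x_min, y_min, x_max, y_max); otherwise None.
    """
    # A corner is any pixel covered by both a horizontal and a vertical line.
    corners = np.logical_and(horizontal, vertical)
    ys, xs = np.nonzero(corners)
    n_rows = len(np.unique(ys))   # distinct horizontal rule positions
    n_cols = len(np.unique(xs))   # distinct vertical rule positions
    n_cells = max(n_rows - 1, 0) * max(n_cols - 1, 0)
    if n_cells < min_cells:
        return None               # too few cells: probably not a table
    # Pixel coordinates of the table's four corners.
    return (int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max()))

# A synthetic grid of three horizontal and three vertical rules -> 2x2 cells.
h = np.zeros((50, 50), dtype=bool)
h[[10, 25, 40], 5:45] = True
v = np.zeros((50, 50), dtype=bool)
v[5:45, [10, 25, 40]] = True
print(find_table_corners(h, v))   # -> (10, 10, 40, 40)
```

The returned box is what you would then use to query for the text elements falling inside the table.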
Using the PDF Extraction API
If you haven’t already installed indico, follow our Quickstart Guide. It will walk you through the process of getting your API key and installing the indicoio Python library. If you run into any problems, check the Installation section of the docs, or reach out to us via chat. Assuming your account is all set up and you’ve installed everything, let’s get started!
Using the API is as simple as running these three lines of code:
```python
import indicoio
indicoio.config.api_key = 'YOUR_API_KEY'
indicoio.pdf_extraction("filename", images=True, metadata=True, text=True, tables=True)
```
By default, the images, metadata, text, and tables arguments are set to False. Specify the parts you need by setting them to True.
What does each argument return?
Images is self-explanatory, and text returns all PDF body text (separately from headers, footers, and tables).
Metadata includes the following information:
- Whether or not it is encrypted
- Program it was created with
- File size
- PDF version
- Modification date
- Creation date
- Number of pages
- Page size
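Since the API returns JSON, the metadata comes back as an ordinary Python dictionary, so you can inspect whatever fields are present without hard-coding key names. The sketch below mocks the result (the key names shown are invented for illustration; run pdf_extraction with metadata=True to see the real ones):

```python
# Mocked result standing in for indicoio.pdf_extraction(..., metadata=True).
# The metadata key names here are hypothetical, not the API's documented keys.
result = {
    'metadata': {
        'encrypted': False,
        'creator': 'Microsoft Word',
        'pages': 12,
    }
}

def summarize_metadata(result):
    """Format each metadata field as a 'key: value' line, sorted by key."""
    meta = result.get('metadata', {})
    return [f"{key}: {value}" for key, value in sorted(meta.items())]

for line in summarize_metadata(result):
    print(line)
```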
For the moment, tables returns the cells of each row as a series of lists.
For example, a simple four-column table is returned like so:
```python
u'tables': [[[u'Column 1 ', u'Column 2 ', u'Column 3 ', u'Column 4 '],
             [u'1 ', u'2 ', u'3 ', u'4 '],
             [u'11 ', u'22 ', u'33 ', u'44 ']]]
```
We’ll be the first to admit that this format is not ideal, and we’re working hard on making it better! (At the moment columns sometimes get mashed together into a single column, or text is included in the wrong cells. Turns out it’s a tricky problem to solve, so we’d rather return something that’s not malformed, even if it’s unwieldy.) For now, our emphasis is on accurately separating and removing the table content from the body text.
The PDF Extraction API returns the data in JSON format so you can easily transfer it to Custom Collections for training your own machine learning models, or analyze it using our pre-trained models. If you need someone to label the data, you can reformat it and save it to a CSV, then upload it to CrowdLabel for seamless labeling and model training.
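That reformatting step can be as simple as flattening the nested tables structure with Python’s standard csv module. A minimal sketch, assuming the nested list-of-rows layout shown in the example output earlier (the filename is arbitrary):

```python
import csv

# Nested structure as returned under the 'tables' key: a list of tables,
# each a list of rows, each a list of cell strings (mocked here).
extracted = {
    'tables': [[
        ['Column 1 ', 'Column 2 ', 'Column 3 ', 'Column 4 '],
        ['1 ', '2 ', '3 ', '4 '],
        ['11 ', '22 ', '33 ', '44 '],
    ]]
}

def tables_to_csv(tables, path):
    """Write every extracted table into one CSV file, one row per line,
    stripping the stray whitespace the extractor leaves on each cell."""
    with open(path, 'w', newline='') as f:
        writer = csv.writer(f)
        for table in tables:
            for row in table:
                writer.writerow(cell.strip() for cell in row)

tables_to_csv(extracted['tables'], 'table.csv')
```

The resulting file can be uploaded to CrowdLabel as-is.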
We’re intent on creating an API that can extract data from any type of PDF, but as we discussed earlier, PDFs can be structured in many different ways depending on the program and system that created them. For now, our API is unable to extract text from PDFs produced using Firefox and Safari, although it can still pull the metadata.
There’s clearly a plethora of (relatively) easily accessible information available for analysis and model training online. Offline documents contain equally vital data, but the limited capabilities of existing tools make it difficult to tap into that information in a scalable and efficient manner. With tools like our PDF Extraction API for obtaining otherwise difficult-to-access data, combined with Custom Collections for building customized machine learning models, we’re aiming to make machine learning an accessible and seamless solution for people trying to solve diverse problems.
Questions about indico’s PDF Extraction API or feedback on how to improve it? Reach out to us at firstname.lastname@example.org.