PDF Extraction

pdf_extraction(data, {[api_key]: String, [cloud]: String, [images]: Boolean, [text]: Boolean, [metadata]: Boolean, [tables]: Boolean})
Extract images, body text, metadata, and tables from PDF documents.

Arguments

data – String | List – required – filename of the PDF to analyze, or a list of filenames for batch processing
[api_key] – String – optional – your indico API key
[cloud] – String – optional – your private cloud subdomain
[images] – Boolean (defaults to False) – optional – when True, returns all the images in the PDF file in PIL format
[text] – Boolean (defaults to False) – optional – when True, returns all body text (separately from headers, footers, and tables)
[metadata] – Boolean (defaults to False) – optional – when True, returns the following: tagged, form, producer, author, encryption status, program used to create the file, file size, PDF version, optimized, modification date, title, creation date, number of pages, page size
[tables] – Boolean (defaults to False) – optional – when True, returns the contents of tables in the document, separately from body text
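
The optional flags can be collected into a single hash before the call. The following is a minimal sketch, assuming the options are passed as a Ruby hash with symbol keys that match the argument names above. `build_pdf_options` and `PDF_FLAGS` are hypothetical names used for illustration and are not part of the indico gem.

```ruby
# Hypothetical helper: not part of the indico gem.
# Builds an options hash from the documented boolean flags.
PDF_FLAGS = [:images, :text, :metadata, :tables].freeze

def build_pdf_options(*flags, api_key: nil, cloud: nil)
  unknown = flags - PDF_FLAGS
  raise ArgumentError, "unknown flag(s): #{unknown.join(', ')}" unless unknown.empty?

  # Every documented flag defaults to false; enable only the requested ones.
  options = PDF_FLAGS.map { |flag| [flag, flags.include?(flag)] }.to_h
  options[:api_key] = api_key if api_key
  options[:cloud] = cloud if cloud
  options
end

options = build_pdf_options(:text, :tables)
# Assumed call shape, based on the signature above:
# Indico.pdf_extraction("filename", options)
```

Rejecting unknown flags early keeps a misspelled option from being silently ignored.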

Output

PDF Extraction returns the extracted information as JSON, so you can transfer it to Custom Collections to train your own machine learning models or analyze it with our pre-trained models. If you need the data labeled, you can reformat it and save it as a CSV file for use with CrowdLabel.

# single output
[
    {
        'images': '[,  ]',
        'text':  
'This is some text. 
And a little more text. 
Here\'s more text after the table.',
        'metadata':  {'tagged': 'no', 'form': 'none', 'producer': 'Mac OS X 10.7.5 Quartz PDFContext', 'author': 'somedude', 'encrypted': 'no', 'creator': 'Safari', 'file size': '5871217 bytes', 'pdf version': '1.3', 'optimized': 'no', 'moddate': 'Thu Feb 28 05:31:00 2012', 'title': 'Check this PDF Out', 'creationdate': 'Thu Feb 28 05:31:00 2012', 'pages': '23', 'page size': '612 x 792 pts (letter) (rotated 0 degrees)'},
        'tables':  [[['	', 'Column 1', 'Column 2', 'Column 3', 'Column 4'], ['Info1', 'Info2', 'Info3', 'Info4'], ['Info11', 'Info22', 'Info33', 'Info44'], ['Info111', 'Info222', 'Info333', 'Info444'], ['	']]]

    }
]

# batch output
[
    [
    {
        'images': '[,  ]',
        'text':  
'This is some text. 
And a little more text. 
Here\'s more text after the table.',
        'metadata':  {'tagged': 'no', 'form': 'none', 'producer': 'Mac OS X 10.7.5 Quartz PDFContext', 'author': 'somedude', 'encrypted': 'no', 'creator': 'Safari', 'file size': '5871217 bytes', 'pdf version': '1.3', 'optimized': 'no', 'moddate': 'Thu Feb 28 05:31:00 2012', 'title': 'Check this PDF Out', 'creationdate': 'Thu Feb 28 05:31:00 2012', 'pages': '23', 'page size': '612 x 792 pts (letter) (rotated 0 degrees)'},
        'tables':  [[['	', 'Column 1', 'Column 2', 'Column 3', 'Column 4'], ['Info1', 'Info2', 'Info3', 'Info4'], ['Info11', 'Info22', 'Info33', 'Info44'], ['Info111', 'Info222', 'Info333', 'Info444'], ['	']]]

    }
    ],
    [
    {
        'images': '[,  ]',
        'text':  
'This is some text. 
And a little more text. 
Here\'s more text after the table.',
        'metadata':  {'tagged': 'no', 'form': 'none', 'producer': 'Mac OS X 10.7.5 Quartz PDFContext', 'author': 'somedude', 'encrypted': 'no', 'creator': 'Safari', 'file size': '5871217 bytes', 'pdf version': '1.3', 'optimized': 'no', 'moddate': 'Thu Feb 28 05:31:00 2012', 'title': 'Check this PDF Out', 'creationdate': 'Thu Feb 28 05:31:00 2012', 'pages': '23', 'page size': '612 x 792 pts (letter) (rotated 0 degrees)'},
        'tables':  [[['	', 'Column 1', 'Column 2', 'Column 3', 'Column 4'], ['Info1', 'Info2', 'Info3', 'Info4'], ['Info11', 'Info22', 'Info33', 'Info44'], ['Info111', 'Info222', 'Info333', 'Info444'], ['	']]]

    }
    ]
]
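
Saving an extracted table as CSV might look like the sketch below. It assumes the parsed result is a Ruby hash with string keys and that each table is an array of rows, as in the sample output above. `table_to_csv` is a hypothetical helper, and the sample data is a shortened placeholder rather than real API output.

```ruby
require 'csv'

# Placeholder data shaped like the 'tables' field in the sample output.
result = {
  'tables' => [
    [
      ['Column 1', 'Column 2'],
      ['Info1', 'Info2'],
      ["\t"]
    ]
  ]
}

# Hypothetical helper: converts one extracted table (an array of rows)
# to CSV text, skipping rows that contain only whitespace cells.
def table_to_csv(table)
  CSV.generate do |csv|
    table.each do |row|
      next if row.all? { |cell| cell.to_s.strip.empty? }
      csv << row
    end
  end
end

csv_text = table_to_csv(result['tables'].first)
# File.write('table_1.csv', csv_text)  # uncomment to save to disk
```

Rows made up only of tab or blank cells, like the trailing row in the sample output, are dropped so that they do not appear as empty lines in the CSV file.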

Example

require 'indico'
Indico.api_key = 'YOUR_API_KEY'

# single example
Indico.pdf_extraction("filename")

# batch example
Indico.pdf_extraction([
    "filename",
    "filename2"
])
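
When working with batch results, it can help to key each result by its input filename. The sketch below assumes the batch output contains one entry per input file, in the same order as the inputs; the documentation above does not state this, so verify it against your own results. `pair_results` is a hypothetical helper, and the `results` value is placeholder data standing in for a real API response.

```ruby
# Hypothetical helper: pairs each input filename with its batch result.
# Assumes results come back in the same order as the inputs.
def pair_results(filenames, results)
  unless filenames.length == results.length
    raise ArgumentError, "expected #{filenames.length} results, got #{results.length}"
  end
  filenames.zip(results).to_h
end

filenames = ["filename", "filename2"]
# results = Indico.pdf_extraction(filenames)
results = [[{ 'text' => 'first' }], [{ 'text' => 'second' }]] # placeholder data
by_file = pair_results(filenames, results)
```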