Extraction API Documentation
Introduction
The Extraction API provides a powerful interface for extracting structured information from unstructured documents such as PDFs and images. It's particularly useful for processing commercial invoices and other trade-related documents. Key features include:
- Pipeline Execution: Run pre-defined or custom extraction pipelines to process documents and extract relevant information.
- Custom Extractions: Create tailored extraction schemas to target specific fields and tables within documents.
- Document Linking: Connect extracted content with structured data stored in an index, enhancing the context and usability of the extracted information.
- Context-Aware Extraction: Utilize document context and specific hints to improve extraction accuracy for different document types and formats.
This API is ideal for businesses involved in international trade, logistics, and supply chain management, as well as developers looking to automate document processing and improve data accuracy.
Key Concepts
To effectively use the Extraction API, it's important to understand the following key concepts:
Entities
Entities are the fundamental building blocks of extracted information. An entity represents a single piece of data within a document and has the following key properties:
- field_name: The name of the field as it appears in the document.
- field_key: A unique identifier for the field in the extraction schema.
- type: The data type of the field (e.g., "Text", "Date", "Number", "Quantity").
- description: A detailed description of what the field represents and where it might be found in the document.
- examples: (Optional) Sample values for the field to aid in extraction.
For example:
"invoice_number": {
  "field_name": "invoice_number",
  "field_key": "invoice_number",
  "type": "Text",
  "description": "The invoice number.",
  "examples": ["A-12345"]
}
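When entity definitions are generated programmatically, a small helper can keep the properties consistent. The following is a minimal Python sketch, not part of the API: make_entity and KNOWN_TYPES are hypothetical names, and the set of types contains only the four named in this documentation, so the API may accept others.

```python
from typing import Any, Dict, List, Optional

# Data types named in this documentation; the full set the API accepts may differ.
KNOWN_TYPES = {"Text", "Date", "Number", "Quantity"}


def make_entity(
    field_key: str,
    type_: str,
    description: str,
    examples: Optional[List[str]] = None,
    field_name: Optional[str] = None,
) -> Dict[str, Any]:
    """Build one entity definition (hypothetical helper, not an API call)."""
    if type_ not in KNOWN_TYPES:
        raise ValueError(f"unknown type {type_!r}; expected one of {sorted(KNOWN_TYPES)}")
    return {
        # field_name defaults to field_key, as in the examples in this document.
        "field_name": field_name or field_key,
        "field_key": field_key,
        "type": type_,
        "description": description,
        "examples": list(examples or []),
    }


invoice_number = make_entity("invoice_number", "Text", "The invoice number.", ["A-12345"])
```

Rejecting unknown types on the client side is a design choice of this sketch; it surfaces typos such as "text" before a request is sent.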
Key Value Sets
Key Value Sets represent logical groupings of related entities. In the context of our extraction schema, they are typically used for non-tabular data in the document. Each key-value pair consists of a field definition similar to an entity.
Tables
Tables represent structured data organized in rows and columns. In the Extraction API, tables are defined using the following structure:
- key: A unique identifier for the table.
- description: A description of what the table represents.
- columns: An object containing definitions for each column in the table.
Each column is defined similarly to an entity, with properties like field_name, field_key, type, and description.
For example:
"items": {
  "key": "items",
  "description": "The table of items in the invoice. The table contains the following columns:",
  "columns": {
    "purchase_order_number": {
      "field_name": "purchase_order_number",
      "field_key": "purchase_order_number",
      "type": "Text",
      "description": "The purchase order number (PO#) or customer purchase order number.",
      "examples": ["3611245"]
    },
    // ... other columns
  }
}
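The table structure above can also be assembled from a list of column definitions. This is a minimal Python sketch under stated assumptions: make_table is a hypothetical helper, and the duplicate-key check is a client-side precaution rather than documented API behavior.

```python
from typing import Any, Dict, List


def make_table(key: str, description: str, columns: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Build a table definition keyed by each column's field_key (hypothetical helper)."""
    by_key: Dict[str, Dict[str, Any]] = {}
    for column in columns:
        column_key = column["field_key"]
        if column_key in by_key:
            raise ValueError(f"duplicate column key: {column_key}")
        by_key[column_key] = column
    return {"key": key, "description": description, "columns": by_key}


po_column = {
    "field_name": "purchase_order_number",
    "field_key": "purchase_order_number",
    "type": "Text",
    "description": "The purchase order number (PO#) or customer purchase order number.",
    "examples": ["3611245"],
}

items_table = make_table(
    "items",
    "The table of items in the invoice. The table contains the following columns:",
    [po_column],
)
```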
Example: Creating an Extraction for Invoice Number
Here's an example request that creates an extraction for the invoice number of a given invoice:
POST /extractions
{
  "description": "Extract invoice number from invoice",
  "generationInstruct": {
    "document_context": "The document is a commercial invoice with typical attributes common in international trade.",
    "instruction": {
      "invoice_number": {
        "field_name": "invoice_number",
        "field_key": "invoice_number",
        "type": "Text",
        "description": "The invoice number.",
        "examples": []
      }
    }
  }
}
This extraction creates a simple pipeline to extract the invoice number. The API will use the provided description to identify and extract the invoice number from the document.
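The request above could be prepared from Python roughly as follows. This is a sketch using only the standard library. The host in BASE_URL, the bearer-token Authorization header, and the function name build_create_extraction_request are all assumptions, since this document specifies neither the base URL nor the authentication scheme. Only the POST /extractions path and the JSON body come from the example.

```python
import json
import urllib.request
from typing import Any, Dict

BASE_URL = "https://api.example.com"  # assumption: replace with the real API host

payload: Dict[str, Any] = {
    "description": "Extract invoice number from invoice",
    "generationInstruct": {
        "document_context": "The document is a commercial invoice with typical "
                            "attributes common in international trade.",
        "instruction": {
            "invoice_number": {
                "field_name": "invoice_number",
                "field_key": "invoice_number",
                "type": "Text",
                "description": "The invoice number.",
                "examples": [],
            }
        },
    },
}


def build_create_extraction_request(
    base_url: str, body: Dict[str, Any], token: str
) -> urllib.request.Request:
    """Prepare (but do not send) a POST /extractions request."""
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/extractions",
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={
            "Content-Type": "application/json",
            # Assumption: the auth scheme is not documented here.
            "Authorization": f"Bearer {token}",
        },
    )


request = build_create_extraction_request(BASE_URL, payload, token="YOUR_TOKEN")
# To actually send it:
#   with urllib.request.urlopen(request) as response:
#       print(response.status)
```

Building the request separately from sending it makes the body easy to inspect or log before any network call is made.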
Example: Creating an Extraction for Commercial Invoice Line Items
Here's an example request that creates an extraction for the line items of a commercial invoice:
POST /extractions
{
  "description": "Extract line items from commercial invoice",
  "generationInstruct": {
    "document_context": "The document is a commercial invoice with typical attributes common in international trade.",
    "instruction": {
      "items": {
        "key": "items",
        "description": "The table of items in the invoice. The table contains the following columns:",
        "columns": {
          "purchase_order_number": {
            "field_name": "purchase_order_number",
            "field_key": "purchase_order_number",
            "type": "Text",
            "description": "The purchase order number (PO#) or customer purchase order number. This is not the item number.",
            "examples": ["3611245"]
          },
          "quantity": {
            "field_name": "quantity",
            "field_key": "quantity",
            "type": "Quantity",
            "description": "The number of items shipped.",
            "examples": []
          },
          "article_description": {
            "field_name": "article_description",
            "field_key": "article_description",
            "type": "Text",
            "description": "The description of the article in the line.",
            "examples": []
          },
          "unit_price": {
            "field_name": "unit_price",
            "field_key": "unit_price",
            "type": "Number",
            "description": "The price per item.",
            "examples": []
          },
          "order_value": {
            "field_name": "order_value",
            "field_key": "order_value",
            "type": "Number",
            "description": "The amount of the item.",
            "examples": []
          },
          "delivery_date": {
            "field_name": "delivery_date",
            "field_key": "delivery_date",
            "type": "Date",
            "description": "The date of the delivery of the item.",
            "examples": []
          }
        }
      }
    },
    "hints": [
      "When the article description contains the word electronics, set the purchase_order_number to none."
    ]
  }
}
This extraction defines a table schema for the line items of a commercial invoice, with columns for purchase order number, quantity, article description, unit price, order value, and delivery date. The hints array shows how to supply custom context, in this case a rule that sets purchase_order_number to none when the article description contains the word electronics. The API uses this schema to identify and extract the line item data from the invoice document.
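Because the column definitions are repetitive, the request body can be generated from a compact list of column specs and serialized with a JSON library, which also avoids hand-written syntax slips such as trailing commas. The following Python sketch reproduces the line-item request body. COLUMN_SPECS and build_line_items_payload are hypothetical names introduced for illustration.

```python
import json
from typing import Any, Dict, List, Tuple

# (field_key, type, description, examples) for each column of the items table.
COLUMN_SPECS: List[Tuple[str, str, str, List[str]]] = [
    ("purchase_order_number", "Text",
     "The purchase order number (PO#) or customer purchase order number. "
     "This is not the item number.", ["3611245"]),
    ("quantity", "Quantity", "The number of items shipped.", []),
    ("article_description", "Text", "The description of the article in the line.", []),
    ("unit_price", "Number", "The price per item.", []),
    ("order_value", "Number", "The amount of the item.", []),
    ("delivery_date", "Date", "The date of the delivery of the item.", []),
]


def build_line_items_payload(document_context: str, hints: List[str]) -> Dict[str, Any]:
    """Assemble the line-item extraction body from COLUMN_SPECS."""
    columns = {
        key: {
            "field_name": key,
            "field_key": key,
            "type": type_,
            "description": description,
            "examples": examples,
        }
        for key, type_, description, examples in COLUMN_SPECS
    }
    return {
        "description": "Extract line items from commercial invoice",
        "generationInstruct": {
            "document_context": document_context,
            "instruction": {
                "items": {
                    "key": "items",
                    "description": "The table of items in the invoice. "
                                   "The table contains the following columns:",
                    "columns": columns,
                }
            },
            # hints sits next to instruction, as in the request body above.
            "hints": hints,
        },
    }


payload = build_line_items_payload(
    "The document is a commercial invoice with typical attributes common in international trade.",
    ["When the article description contains the word electronics, "
     "set the purchase_order_number to none."],
)
request_body = json.dumps(payload, indent=2)  # valid JSON, ready to send as the POST body
```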
Best Practices for Creating Extractions
- Provide Context: Use the document_context field to describe the type of document you're extracting information from.
- Be Specific in Field Descriptions: Clearly describe what each field represents and where it might be found in the document.
- Use Correct Data Types: Specify the correct data type for each field (Text, Number, Date, Quantity) to ensure proper parsing and formatting of extracted data.
- Provide Examples: When possible, include examples for fields that might have specific formats or variations.
- Define Structured Data: Use the table structure (like the items field in the example) to define how repeated structured data should be extracted.
- Include Custom Hints: The hints section can be used to provide additional context. For example, if you're dealing with invoices from multiple companies, include hints about company-specific formats or labeling.
- Handle Edge Cases: Include instructions for handling missing data or alternative formats.
By following these practices and customizing your extraction schema to your specific needs, you can create highly accurate and reliable document extraction processes using the Extraction API.
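Some of these practices can be checked automatically before a request is sent. The sketch below is one possible client-side lint pass in Python, not a feature of the API. The function name lint_generation_instruct, the four-word threshold for a "short" description, and the exact warning texts are arbitrary choices made for illustration.

```python
from typing import Any, Dict, List

# Data types listed in this documentation; the API may accept others.
KNOWN_TYPES = {"Text", "Number", "Date", "Quantity"}


def lint_generation_instruct(gen: Dict[str, Any]) -> List[str]:
    """Return best-practice warnings for a generationInstruct object."""
    warnings: List[str] = []

    if not gen.get("document_context", "").strip():
        warnings.append("document_context is empty")

    def lint_field(path: str, field: Dict[str, Any]) -> None:
        if field.get("type") not in KNOWN_TYPES:
            warnings.append(f"{path}: unknown type {field.get('type')!r}")
        # Arbitrary threshold: fewer than four words is treated as too terse.
        if len(field.get("description", "").split()) < 4:
            warnings.append(f"{path}: description is very short")
        if not field.get("examples"):
            warnings.append(f"{path}: no examples given")

    for name, definition in gen.get("instruction", {}).items():
        if "columns" in definition:
            # Table definition: lint each column like an entity.
            for column_name, column in definition["columns"].items():
                lint_field(f"{name}.{column_name}", column)
        else:
            lint_field(name, definition)
    return warnings


example = {
    "document_context": "",
    "instruction": {
        "invoice_number": {
            "field_name": "invoice_number",
            "field_key": "invoice_number",
            "type": "Text",
            "description": "The invoice number.",
            "examples": [],
        }
    },
}
for warning in lint_generation_instruct(example):
    print(warning)
```

The warnings are advisory: examples are optional, and a short description may be perfectly adequate for an unambiguous field.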