Extract API

Odoo provides a service that automates the processing of invoices, expenses, and resumes.

The service scans documents using an OCR engine and then uses AI-based algorithms to extract fields of interest: the total, due date, and invoice lines for invoices; the total, date, and payment reference for expenses; the name, email, and phone number for resumes.

This is a paid service: processing a document costs one credit. Credits can be bought on iap.odoo.com.

You can either use this service directly in the Accounting, Expense, or Recruitment App or through the API. The Extract API, which is detailed in the next section, allows you to integrate our service directly into your own projects.

Overview

The Extract API uses the JSON-RPC2 protocol; its endpoint routes are located at https://iap-extract.odoo.com.

Version

The version of the Extract API is specified in the route.

The latest versions are:
  • invoices: 122

  • expenses: 132

  • applicant: 102

Flow

The flow is the same for the three services.

  1. Call /parse to submit your documents (one call for each document). On success, you receive a document_uuid in the response.
  2. You then have to regularly poll /get_result to get the document’s parsing status.
    Alternatively, you can provide a webhook_url at the time of the call to /parse and you will be notified (via a POST request) when the result is ready.
  3. Once the result is received, you can validate it by calling /validate and sending the expected values.
    This step is optional but greatly helps the system to improve.

The HTTP POST method should be used for all routes. A Python implementation of the full flow for invoices can be found here and a token for integration testing is provided in the integration testing section.
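The three steps of the flow can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the reference implementation: `post` is a hypothetical callable that you supply, which sends the JSON-RPC request to a route and returns the `result` dictionary. The rule that any status other than success or an error_* value means "not ready yet" is also an assumption, since the name of the pending status is not listed in this document.

```python
import time
from typing import Callable

# Hypothetical transport: post(route, params) -> the "result" dictionary.
Post = Callable[[str, dict], dict]


def run_invoice_flow(post: Post, parse_params: dict, version: int,
                     poll_seconds: float = 5.0, max_polls: int = 20) -> dict:
    """Submit one invoice, then poll /get_result until a final status."""
    parsed = post("/api/extract/invoice/1/parse", parse_params)
    if parsed["status"] != "success":
        raise RuntimeError(parsed.get("status_msg", parsed["status"]))

    query = {"version": version, "document_uuid": parsed["document_uuid"]}
    for _ in range(max_polls):
        result = post("/api/extract/invoice/1/get_result", query)
        status = result["status"]
        if status == "success":
            return result
        if status.startswith("error_"):
            raise RuntimeError(result.get("status_msg", status))
        # Assumption: any other status means the document is not ready yet.
        time.sleep(poll_seconds)
    raise TimeoutError("no final status after polling")
```

Injecting the transport keeps the polling logic independent of the HTTP library, and makes it easy to exercise the flow with a stub before spending credits.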

Parse

Request the processing of a document from the OCR. The route will return a document_uuid (it replaces document_id, which is deprecated); you can use it to obtain the result of your request.

Routes

  • /api/extract/invoice/1/parse

  • /api/extract/expense/1/parse

  • /api/extract/applicant/1/parse

  • /iap/invoice_extract/parse (deprecated)

  • /iap/expense_extract/parse (deprecated)

Request

jsonrpc (required)

see JSON-RPC2

method (required)

see JSON-RPC2

id (required)

see JSON-RPC2

params
account_token (required)

The token of the account from which credits will be taken. Each successful call costs one credit.

version (optional)

The version will determine the format of your requests and the format of the server response. Some results can be unavailable in older versions. If not specified, the latest version will be used.

documents (required)

The document must be provided as a string in the ASCII encoding. The list should contain only one string. If multiple strings are provided only the first string corresponding to a pdf will be processed. If no pdf is found, the first string will be processed. This field is a list only for legacy reasons. The supported extensions are pdf, png, jpg and bmp.

dbuuid (optional)

Unique identifier of the Odoo database.

webhook_url (optional)

A webhook URL can be provided. An empty POST request will be sent to webhook_url/document_uuid when the result is ready.

user_infos (optional)

Information concerning the person sending the document to the extract service. It can be the client or the supplier (depending on the perspective). This information is not required in order for the service to work but it greatly improves the quality of the result.

user_company_vat (optional)

VAT number of the user.

user_company_name (optional)

Name of the user’s company.

user_company_country_code (optional)

Country code of the user. Format: ISO3166 alpha-2.

user_lang (optional)

The user language. Format: language_code + _ + locale (e.g. fr_FR, en_US).

user_email (optional)

The user email.

purchase_order_regex (optional)

Regex for purchase order identification. Will default to Odoo PO format if not provided.

perspective (optional)

Can be client or supplier. This field is useful for invoices only. client means that the user information provided relates to the client of the invoice. supplier means that it relates to the supplier. If not provided, client will be used.
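As an illustration of the webhook_url parameter described above, the following sketch shows one way to receive the notification with Python's standard library. Since the empty POST request is sent to webhook_url/document_uuid, the receiver only needs to read the last path segment. The handler name, the helper name, and the port in the comment are hypothetical.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlparse


def document_uuid_from_path(path: str) -> str:
    """Return the last path segment, where the document_uuid is appended."""
    return urlparse(path).path.rstrip("/").rsplit("/", 1)[-1]


class ExtractWebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # The request body is empty; the path carries the document_uuid.
        document_uuid = document_uuid_from_path(self.path)
        # A real receiver would now call /get_result for this document.
        print("result ready for", document_uuid)
        self.send_response(200)
        self.end_headers()


# To serve on a hypothetical port:
# HTTPServer(("", 8080), ExtractWebhookHandler).serve_forever()
```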

{
    "jsonrpc": "2.0",
    "method": "call",
    "params": {
        "account_token": string (hex),
        "version": int,
        "documents": [string],
        "user_infos": {
            "user_company_vat": string,
            "user_company_name": string,
            "user_company_country_code": string,
            "user_lang": string,
            "user_email": string,
            "purchase_order_regex": string,
            "perspective": string,
        },
    },
    "id": string,
}

Note

The user_infos parameter is optional but it greatly improves the quality of the result, especially for invoices. The more information you can provide, the better.
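A /parse request body could be assembled along these lines. This is a sketch, not the official client: the function names are hypothetical, and encoding the file as base64 before decoding it to an ASCII string is an assumption about how the document string is expected to be produced.

```python
import base64
import json
import urllib.request
import uuid

ENDPOINT = "https://iap-extract.odoo.com"


def build_parse_request(account_token, file_bytes, version, user_infos=None):
    """Build the JSON-RPC body for a /parse call on a single document."""
    params = {
        "account_token": account_token,
        "version": version,
        # Assumption: the file is sent base64-encoded, as an ASCII string.
        "documents": [base64.b64encode(file_bytes).decode("ascii")],
    }
    if user_infos:
        params["user_infos"] = user_infos
    return {
        "jsonrpc": "2.0",
        "method": "call",
        "params": params,
        "id": uuid.uuid4().hex,
    }


def post_json(route, payload):
    """POST a JSON body to the given route and return the decoded response."""
    request = urllib.request.Request(
        ENDPOINT + route,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

For example, `post_json("/api/extract/invoice/1/parse", build_parse_request(token, pdf_bytes, 122, {"perspective": "client"}))` would submit one invoice.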

Response

jsonrpc

see JSON-RPC2

id

see JSON-RPC2

result

Dictionary containing the following content.

status (replaces status_code)

The code indicating the status of the request. “success” in case of success. Other values are detailed in the table below.

status_msg

A string giving verbose details about the request status.

document_uuid (replaces document_id)

Only present if the request is successful.

status_code (deprecated)

The code indicating the status of the request. 0 in case of success. Other values are detailed in the table below.

document_id (deprecated)

Only present if the request is successful.

status                    status_msg                                                      status_code (deprecated)
success                   Success                                                         0
error_internal            An error occurred                                               2
error_no_credit           You don’t have enough credit                                    3
error_unsupported_format  Unsupported file format                                         6
error_maintenance         Server is currently under maintenance. Please try again later.  9

{
    "jsonrpc": "2.0",
    "id": string,
    "result": {
        "status": string,
        "status_code": int,  // deprecated
        "status_msg": string,
        "document_uuid": string,
        // "document_id": int,  // deprecated
    }
}

Warning

The document_id field is deprecated and will be removed in the future. Please use document_uuid instead.

Version introducing document_uuid:
  • invoices: 122

  • expenses: 132

  • applicant: 102

Note

The API does not actually use the JSON-RPC error scheme. Instead the API has its own error scheme bundled inside a successful JSON-RPC result.
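Because errors are reported inside a successful JSON-RPC result, a client has to inspect the status field itself. The helper below is a minimal sketch of that check for a /parse response; `ExtractError` and `unwrap_result` are hypothetical names.

```python
class ExtractError(Exception):
    """Raised when the result carries a status other than "success"."""

    def __init__(self, status, message):
        super().__init__(f"{status}: {message}")
        self.status = status


def unwrap_result(response: dict) -> dict:
    """Return the "result" dictionary, or raise if it reports an error."""
    result = response["result"]
    if result["status"] != "success":
        raise ExtractError(result["status"], result.get("status_msg", ""))
    return result
```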

Get results

Routes

  • /api/extract/invoice/1/get_result

  • /api/extract/invoice/1/get_result_batch

  • /api/extract/expense/1/get_result

  • /api/extract/expense/1/get_result_batch

  • /api/extract/applicant/1/get_result

  • /api/extract/applicant/1/get_result_batch

  • /api/extract/invoice/1/get_results (deprecated)

  • /iap/invoice_extract/get_result (deprecated)

  • /iap/expense_extract/get_result (deprecated)

Request

jsonrpc (required)

see JSON-RPC2

method (required)

see JSON-RPC2

id (required)

see JSON-RPC2

params

Dictionary containing the following content.

version (required)

Same as for /parse.

documents_uuids (required, replaces documents_ids)

The list of document_uuid for which you want to get the current parsing status.

documents_ids (deprecated)

The list of document_id for which you want to get the current parsing status.

{
    "jsonrpc": "2.0",
    "method": "call",
    "params": {
        "version": int,
        // "documents_ids": [],  // deprecated
        "documents_uuids": [],
    },
    "id": string,
}

Note

The code snippet shows the request to the /api/extract/invoice/1/get_result_batch route. You can use the endpoint /api/extract/invoice/1/get_result to get the result of a single document. In that case, you don’t need to provide a list in documents_uuids but a single document_uuid.
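The difference between the single and the batch route can be captured in a small helper. This sketch picks the route from the type of its argument; the function name and the `kind` parameter are hypothetical.

```python
def build_get_result_request(version, document_uuids, kind="invoice",
                             request_id="1"):
    """Return (route, body) for /get_result or /get_result_batch.

    A single string selects the single-document route; any other iterable
    selects the batch route.
    """
    if isinstance(document_uuids, str):
        route = f"/api/extract/{kind}/1/get_result"
        params = {"version": version, "document_uuid": document_uuids}
    else:
        route = f"/api/extract/{kind}/1/get_result_batch"
        params = {"version": version, "documents_uuids": list(document_uuids)}
    body = {"jsonrpc": "2.0", "method": "call", "params": params,
            "id": request_id}
    return route, body
```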

Response

When getting the results from the parse, the detected fields vary a lot depending on the type of document. Each response is a list of dictionaries, one for each document. The keys of the dictionary are the names of the fields and the values are the values of the fields.

jsonrpc

see JSON-RPC2

id

see JSON-RPC2

result

Dictionary where each key is a document_id. For each document_id, the value contains the following fields.

status

Same as for /parse.

status_code

Same as for /parse.

status_msg

Same as for /parse.

results

Only present if the request is successful.

full_text_annotation

Contains the unprocessed full result from the OCR for the document.

{
    "jsonrpc": "2.0",
    "id": string,
    "result": {
        "document_id_1": {
            "status": string,
            "status_code": int,  // deprecated
            "status_msg": string,
            "results": [
                {
                    "full_text_annotation": string,
                    "feature_1_name": feature_1_result,
                    "feature_2_name": feature_2_result,
                    ...
                },
                ...
            ]
        },
        "document_id_2": {
            "status": string,
            "status_code": int,  // deprecated
            "status_msg": string,
            "results": [
                {
                    "full_text_annotation": string,
                    "feature_1_name": feature_1_result,
                    "feature_2_name": feature_2_result,
                    ...
                },
                ...
            ]
        },
        ...
    }
}

Note

The example shows the response from the /api/extract/invoice/1/get_result_batch route. When using the /api/extract/invoice/1/get_result route (note the singular), the response will be the value of the key associated with the invoice.

Warning

Result keys are strings despite the fact that the document_ids given in the request body are integers.
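A batch result can be split into ready and not-ready documents as sketched below. The function name is hypothetical; keys are normalized with `str()` so that lookups behave the same whether the caller kept integer ids or strings.

```python
def split_batch_result(result: dict):
    """Separate documents with results from those without.

    Returns (ready, other): `ready` maps each key to its "results" list,
    `other` maps each remaining key to its reported status.
    """
    ready, other = {}, {}
    for key, document in result.items():
        if document["status"] == "success":
            ready[str(key)] = document["results"]
        else:
            other[str(key)] = document["status"]
    return ready, other
```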

Common fields

feature_result

Each field of interest we want to extract from the document, such as the total or the due date, is called a feature. An exhaustive list of all the extracted features associated with a type of document can be found in the sections below.

For each feature, we return a list of candidates and we spotlight the candidate our model predicts to be the best fit for the feature.

selected_value (optional)

The best candidate for this feature.

selected_values (optional)

The best candidates for this feature.

candidates (optional)

List of all the candidates for this feature ordered by decreasing confidence score.

"feature_name": {
    "selected_value": candidate_12,
    "candidates": [candidate_12, candidate_3, candidate_4, ...]
}

candidate

For each candidate we give its representation and position in the document. Candidates are sorted by decreasing order of suitability.

content

Representation of the candidate.

coords

[center_x, center_y, width, height, rotation_angle]. The position and dimensions are relative to the size of the page and are therefore between 0 and 1. The angle is a clockwise rotation measured in degrees.

page

Page of the original document on which the candidate is located (starts at 0).

"candidate": [
    {
        "content": string|float,
        "coords": [float, float, float, float, float],
        "page": int
    },
    ...
]
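The feature_result and candidate structures can be read as in the following sketch. The helper names are hypothetical. The box conversion ignores the rotation angle and assumes the page origin is the top-left corner, which this document does not state.

```python
def selected_content(feature_result: dict):
    """Return the content of the selected candidate, or None if absent."""
    selected = feature_result.get("selected_value")
    return None if selected is None else selected["content"]


def coords_to_pixel_box(coords, page_width, page_height):
    """Convert relative coords to a (left, top, right, bottom) pixel box.

    Assumptions: top-left origin, rotation angle ignored.
    """
    center_x, center_y, width, height, _angle = coords
    left = round((center_x - width / 2) * page_width)
    top = round((center_y - height / 2) * page_height)
    right = round((center_x + width / 2) * page_width)
    bottom = round((center_y + height / 2) * page_height)
    return left, top, right, bottom
```

Such a box is useful, for instance, to highlight the selected candidate on a rendered page.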

Invoices

Invoices are complex and can have a lot of different fields. The following table gives an exhaustive list of all the fields we can extract from an invoice.

Feature name

Specificities

SWIFT_code

content is a dictionary encoded as a string.

It contains information about the detected SWIFT code (or BIC).

Keys:

bic

detected BIC (string).

name (optional)

bank name (string).

country_code

ISO3166 alpha-2 country code of the bank (string).

city (optional)

city of the bank (string).

verified_bic

True if the BIC has been found in our DB (bool).

Name and city are present only if verified_bic is true.

iban

content is a string

aba

content is a string

VAT_Number

content is a string

Depending on the value of perspective in the user_infos, this will be the VAT number of the supplier or the client. If perspective is client, it’ll be the supplier’s VAT number. If it’s supplier, it’s the client’s VAT number.

qr-bill

content is a string

payment_ref

content is a string

purchase_order

content is a string

Uses selected_values instead of selected_value

country

content is a string

currency

content is a string

date

content is a string

Format: YYYY-MM-DD

due_date

Same as for date

total_tax_amount (previously global_taxes_amount)

content is a float

invoice_id

content is a string

subtotal

content is a float

total

content is a float

supplier

content is a string

client

content is a string

email

content is a string

website

content is a string

feature_result for the invoice_lines feature

It follows a more specific structure. It is basically a list of dictionaries where each dictionary represents an invoice line. Each value follows a feature_result structure.

"invoice_lines": [
    {
        "description": feature_result,
        "discount": feature_result,
        "product": feature_result,
        "quantity": feature_result,
        "subtotal": feature_result,
        "total": feature_result,
        "taxes": feature_result,
        "unit": feature_result,
        "unit_price": feature_result
    },
    ...
]
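Since each value in an invoice line follows the feature_result structure, the lines can be flattened into plain rows by keeping only the selected content. A minimal sketch, with a hypothetical function name:

```python
def flatten_invoice_lines(invoice_lines: list) -> list:
    """Reduce each invoice line to {field name: selected content or None}."""
    rows = []
    for line in invoice_lines:
        row = {}
        for name, feature in line.items():
            # selected_value is optional, so a field may have no selection.
            selected = feature.get("selected_value")
            row[name] = selected["content"] if selected else None
        rows.append(row)
    return rows
```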

Expense

The expenses are less complex than invoices. The following table gives an exhaustive list of all the fields we can extract from an expense report.

Feature name

Specificities

description

content is a string

country

content is a string

date

content is a string

total

content is a float

currency

content is a string

bill_reference

content is a string

Applicant

This third type of document is meant for processing resumes. The following table gives an exhaustive list of all the fields we can extract from a resume.

Feature name

Specificities

name

content is a string

email

content is a string

phone

content is a string

mobile

content is a string

Validate

The validation step is optional but strongly recommended. By telling the system whether it was right or wrong for each feature, you provide valuable feedback. It has no direct impact, but it helps the system improve its prediction accuracy for the documents you will send in the future.

Routes

  • /api/extract/invoice/1/validate

  • /api/extract/invoice/1/validate_batch

  • /api/extract/expense/1/validate

  • /api/extract/expense/1/validate_batch

  • /api/extract/applicant/1/validate

  • /api/extract/applicant/1/validate_batch

  • /iap/invoice_extract/validate (deprecated)

  • /iap/expense_extract/validate (deprecated)

Request

jsonrpc (required)

see JSON-RPC2

method (required)

see JSON-RPC2

id (required)

see JSON-RPC2

params (/validate route only)

dictionary containing the following fields

document_uuid (required, replaces document_id)

Same as for /parse.

values

Contains the validation for each feature. For invoices, the field merged_lines indicates whether the lines were merged or not.

document_id (deprecated)

Same as for /parse.


params (/validate_batch route only)

dictionary containing the following fields

documents

Contains the validation for each feature of each document. The document_uuid values are the keys, and each associated value is the content of the values field of the /validate route.

// for the /validate route
{
    "jsonrpc": "2.0",
    "method": "call",
    "params": {
        // "document_id": int,  // deprecated
        "document_uuid": string,
        "values": {
            "merged_lines": bool,  // for invoices
            "feature_name_1": validation_1,
            "feature_name_2": validation_2,
            ...
        }
    },
    "id": string,
}

// for the /validate_batch route
{
    "jsonrpc": "2.0",
    "method": "call",
    "params": {
        "documents": {
            "document_uuid_1": {
                "merged_lines": bool,  // for invoices
                "feature_name_1": validation_1,
                "feature_name_2": validation_2,
                ...
            },
            "document_uuid_2": {
                "merged_lines": bool,  // for invoices
                "feature_name_1": validation_1,
                "feature_name_2": validation_2,
                ...
            },
            ...
        }
    },
    "id": string,
}

Note

You don’t have to validate all the features in order for the validation to succeed. However, /validate can’t be called multiple times for the same document. Therefore, you should validate all the features you want to validate at once.
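Because a document can only be validated once, it helps to collect every expected value first and build the request in one go. The sketch below wraps each value in the content dictionary used for simple features; the function name is hypothetical, and invoice_lines, which has its own format, is not handled here.

```python
def build_validate_request(document_uuid, expected, merged_lines=None,
                           request_id="1"):
    """Build the JSON-RPC body for a single /validate call.

    `expected` maps feature names to their expected value (str or float).
    """
    values = {name: {"content": value} for name, value in expected.items()}
    if merged_lines is not None:
        values["merged_lines"] = merged_lines  # invoices only
    return {
        "jsonrpc": "2.0",
        "method": "call",
        "params": {"document_uuid": document_uuid, "values": values},
        "id": request_id,
    }
```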

validation

A validation for a given feature is a dictionary containing the textual representation of the expected value for this feature. This format applies to all the features except invoice_lines, which has a more complex validation format.

"feature_name": {
    "content": string|float
}

invoice_lines

lines is a list of dictionaries. Each dictionary represents an invoice line. The dictionary keys speak for themselves. Note that there is no content for this feature.

"invoice_lines": {
    "lines": [
        {
            "description": string,
            "quantity": float,
            "unit_price": float,
            "product": string,
            "taxes_amount": float,
            "taxes": [
                {
                    "amount": float,
                    "type": "fixed"|"percent",
                    "price_include": bool
                },
                ...
            ],
            "subtotal": float,
            "total": float
        },
        ...
    ]
}
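The structure above could be filled in as follows for a line with a single percentage tax. This is an illustrative sketch only: the function names are hypothetical, and the arithmetic assumes that the tax is not included in the unit price and that amount holds the percentage rate, neither of which is specified in this document.

```python
def build_invoice_line(description, quantity, unit_price, tax_percent,
                       product=""):
    """Build one validation line with a single tax excluded from the price."""
    subtotal = round(quantity * unit_price, 2)
    # Assumption: "amount" is the percentage rate, applied to the subtotal.
    taxes_amount = round(subtotal * tax_percent / 100, 2)
    return {
        "description": description,
        "quantity": quantity,
        "unit_price": unit_price,
        "product": product,
        "taxes_amount": taxes_amount,
        "taxes": [
            {"amount": tax_percent, "type": "percent", "price_include": False}
        ],
        "subtotal": subtotal,
        "total": round(subtotal + taxes_amount, 2),
    }


def build_invoice_lines_validation(lines):
    """Wrap the lines in the structure expected under "invoice_lines"."""
    return {"invoice_lines": {"lines": list(lines)}}
```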

Response

jsonrpc

see JSON-RPC2

id

see JSON-RPC2

result
status

Same as for /parse.

status_msg

Same as for /parse.

status_code (deprecated)

Same as for /parse.

status                   status_msg                      status_code (deprecated)
success                  Success                         0
error_validation_format  Validation format is incorrect  12

{
    "jsonrpc": "2.0",
    "id": string,
    "result": {
        "status": string,
        // "status_code": int,  // deprecated
        "status_msg": string,
    }
}

Integration Testing

You can test your integration by using integration_token as account_token in the /parse request.

Using this token puts you in test mode and allows you to simulate the entire flow without really parsing a document and without being billed one credit for each successful document parsing.

The only technical differences in test mode are that the document you send is not parsed by the system and that the response you get from /get_result is a hard-coded one.
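One way to switch between test mode and a real account is to fall back to integration_token when no account token is configured. In this sketch, the environment variable name EXTRACT_ACCOUNT_TOKEN and the function name are hypothetical.

```python
import os

INTEGRATION_TOKEN = "integration_token"


def pick_account_token(env=None) -> str:
    """Return the configured account token, or the test token if unset."""
    env = os.environ if env is None else env
    # EXTRACT_ACCOUNT_TOKEN is a hypothetical variable name.
    return env.get("EXTRACT_ACCOUNT_TOKEN") or INTEGRATION_TOKEN
```

With this fallback, a continuous integration job that has no credentials runs the whole flow in test mode and is never billed.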

A Python implementation of the full flow for invoices can be found here.