Document Intelligence API

OCR documents,
extract data,
and build workflows.

One API for document intelligence: clean OCR text, schema-based extraction, classification, semantic location, and workflow primitives.

Keep JSON Schema for extraction. Use OCR, classification, and location directly when your workflow needs them.

Extraction is solved.
Everything around it isn't.

Every AI vendor can pull text from a PDF. The real problems are everything else.

01

Setup overhead

Training models, building templates, labeling examples — for every new document type. Then re-doing it when the format changes.

02

The trust gap

Getting a value with no way to verify it. No confidence score. No source reference. No audit trail. Just "here's what we think."

03

Workflow fragmentation

Separate tools for classification, extraction, reconciliation, and code mapping. Three vendors and a Zapier integration to process one document.

04

Format sprawl

Different pipelines for PDFs, images, CSVs, and Excel files. Different parsers, different edge cases, different failure modes.

Nugget replaces all of it with one composable API for OCR, extraction, classification, location, and reasoning.

A JSON Schema is all you need

Define the output you want. Nugget figures out how to get it.

1

Define your schema

Standard JSON Schema — describe the fields you want extracted, their types, and which are required.

{
  "type": "object",
  "properties": {
    "vendor": { "type": "string" },
    "total":  { "type": "number" }
  }
}
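
A minimal Python sketch of step 1. It repeats the two fields above and adds a `required` list, which the example above omits. The helper `missing_required_definitions` is a hypothetical local sanity check, not part of Nugget.

```python
import json

# The two-field schema from step 1, plus an illustrative "required" list.
schema = {
    "type": "object",
    "properties": {
        "vendor": {"type": "string"},
        "total": {"type": "number"},
    },
    "required": ["vendor"],
}


def missing_required_definitions(schema: dict) -> list[str]:
    """Return required field names that have no entry under "properties"."""
    properties = schema.get("properties", {})
    return [name for name in schema.get("required", []) if name not in properties]


# Serialize once; this string is what a client would send as the schema.
schema_field = json.dumps(schema)
```

Catching a misspelled name in `required` locally is cheaper than finding it after a document has been processed.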
2

Send a document

Upload any supported format. Nugget handles OCR, parsing, and chunking automatically.

Supported formats: PDF, JPEG, PNG, TIFF, WebP, CSV, XLSX
3

Get verified data

Structured JSON matching your schema — with confidence scores and source evidence on every field.

{
  "value": "Acme Corp",
  "confidence": 0.97,
  "evidence": [{ ... }]
}
POST /v1/ocr
curl -X POST "$NUGGET_URL/v1/ocr" \
  -H "x-api-key: $NUGGET_API_KEY" \
  -F "file=@document.pdf"
A 200 OK response returns the OCR result inline. A 202 Accepted response returns a job_id and a status_url for polling.
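
One way a client might branch on these two outcomes, sketched in Python. The function name and the returned tuples are illustrative assumptions; only `job_id` and `status_url` come from the text above.

```python
def interpret_ocr_response(status_code: int, body: dict) -> tuple[str, dict]:
    """Classify an OCR response as finished or queued.

    Returns ("done", body) for 200 and ("queued", {...}) for 202.
    """
    if status_code == 200:
        # The OCR result arrived inline; pass the body through untouched.
        return ("done", body)
    if status_code == 202:
        # The job was queued; keep only what is needed to poll later.
        return ("queued", {"job_id": body["job_id"], "status_url": body["status_url"]})
    raise RuntimeError(f"unexpected status code: {status_code}")
```

Because the service decides between inline and queued on its own, client code should handle both branches on every call.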

See it in action

A complete extraction in one call — the result is pushed to your webhook.

Request
POST /v1/extractions
curl -X POST "$NUGGET_URL/v1/extractions" \
  -H "x-api-key: $NUGGET_API_KEY" \
  -F "file=@invoice.pdf" \
  -F "callback_url=https://yourapp.com/webhooks/nugget" \
  -F "callback_signing_secret=$NUGGET_SIGNING_SECRET" \
  -F 'schema={
    "type": "object",
    "properties": {
      "vendor_name": {
        "type": "string",
        "description": "Company that issued the invoice"
      },
      "invoice_number": {
        "type": "string"
      },
      "total_amount": {
        "type": "number",
        "description": "Total amount due"
      },
      "line_items": {
        "type": "array",
        "items": {
          "type": "object",
          "properties": {
            "description": { "type": "string" },
            "quantity": { "type": "integer" },
            "unit_price": { "type": "number" }
          }
        }
      }
    },
    "required": ["vendor_name", "invoice_number"]
  }'
Webhook delivery
POST → your callback_url
X-Nugget-Event: extraction.completed
X-Nugget-Signature: sha256=9f86d081884c…

{
  "job_id": "8f3c2a91",
  "status": "succeeded",
  "result": {
    "data": {
      "vendor_name": {
        "value": "Acme Corp",
        "confidence": 0.97,
        "evidence": [{
          "page": 1,
          "quote": "ACME CORP\n123 Business Lane",
          "bbox": {
            "x_min": 0.05, "y_min": 0.02,
            "x_max": 0.45, "y_max": 0.08
          }
        }]
      },
      "invoice_number": {
        "value": "INV-2024-0847",
        "confidence": 0.99,
        "evidence": [{
          "page": 1,
          "quote": "Invoice #: INV-2024-0847",
          "bbox": {
            "x_min": 0.62, "y_min": 0.04,
            "x_max": 0.93, "y_max": 0.09
          }
        }]
      },
      "total_amount": {
        "value": 4250.00,
        "confidence": 0.95,
        "evidence": [{
          "page": 2,
          "quote": "Total Due: $4,250.00",
          "bbox": {
            "x_min": 0.64, "y_min": 0.86,
            "x_max": 0.92, "y_max": 0.91
          }
        }]
      }
    }
  }
}
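
A hedged sketch of checking the `X-Nugget-Signature` header in Python. It assumes the signature is a hex-encoded HMAC-SHA256 of the raw request body, keyed with the `callback_signing_secret` sent in the request. The header format matches the example above, but the exact signed payload is an assumption to confirm against the API docs.

```python
import hashlib
import hmac


def verify_signature(raw_body: bytes, header_value: str, signing_secret: str) -> bool:
    """Check a header of the form "sha256=<hex digest>" against the raw body."""
    scheme, _, received = header_value.partition("=")
    if scheme != "sha256" or not received:
        return False
    # Assumption: the digest covers the exact bytes of the request body.
    expected = hmac.new(signing_secret.encode(), raw_body, hashlib.sha256).hexdigest()
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(expected.encode(), received.encode())
```

Verify against the raw bytes as received, before any JSON parsing, since re-serializing the body can change whitespace and key order.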

Prefer to poll? Omit callback_url and GET the returned status_url instead.
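
A minimal polling loop in Python, as a sketch. `fetch_status` is a stand-in for whatever HTTP helper performs the GET and returns the parsed JSON. The `"succeeded"` status appears in the webhook example above, while `"failed"` is an assumed terminal status.

```python
import time
from typing import Callable

# "succeeded" comes from the example payload; "failed" is an assumption.
TERMINAL_STATUSES = {"succeeded", "failed"}


def poll_until_done(
    fetch_status: Callable[[str], dict],
    status_url: str,
    interval_s: float = 2.0,
    max_attempts: int = 30,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """GET status_url repeatedly until the job reaches a terminal status."""
    for _ in range(max_attempts):
        job = fetch_status(status_url)
        if job.get("status") in TERMINAL_STATUSES:
            return job
        sleep(interval_s)
    raise TimeoutError(f"job not finished after {max_attempts} attempts")
```

Passing `sleep` as a parameter keeps the loop testable without real delays.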

Every field has a receipt.

Nugget doesn't just give you data; it proves where every value came from, with per-field confidence scores, exact evidence quotes from the source text, page numbers, and bounding boxes around the source region.

In regulated industries — trade compliance, financial services, insurance — "we think this is the right value" isn't good enough. You need an audit trail. Nugget builds one automatically.

Confidence score: 0.97
Source page: 1
Evidence quote: "ACME CORP"
Bounding box (x_min, y_min): 0.05, 0.02
Field result
{
  "path": "vendor_name",
  "value": "Acme Corp",
  "confidence": 0.97,
  "evidence": [
    {
      "page": 1,
      "quote": "ACME CORP\n123 Business Lane\nSan Francisco, CA 94102",
      "bbox": {
        "x_min": 0.051,
        "y_min": 0.023,
        "x_max": 0.448,
        "y_max": 0.081
      }
    }
  ],
  "error": null
}
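
A short Python sketch of how a consumer might use these receipts: flag fields for human review, and convert a bounding box for drawing on a rendered page. The review threshold is arbitrary. The conversion assumes the `bbox` values are fractions of page width and height, which the example values suggest but the text does not state.

```python
def fields_needing_review(data: dict, threshold: float = 0.96) -> list[str]:
    """Return names of fields that reported an error or scored below threshold."""
    return sorted(
        name
        for name, field in data.items()
        if field.get("error") is not None or field["confidence"] < threshold
    )


def bbox_to_pixels(bbox: dict, page_width_px: int, page_height_px: int) -> tuple[int, int, int, int]:
    """Scale a fractional bbox to (left, top, right, bottom) pixel coordinates."""
    # Assumption: coordinates are in the range 0..1 relative to the page size.
    return (
        round(bbox["x_min"] * page_width_px),
        round(bbox["y_min"] * page_height_px),
        round(bbox["x_max"] * page_width_px),
        round(bbox["y_max"] * page_height_px),
    )
```

Fields returned by `fields_needing_review` can be shown to a reviewer alongside the evidence quote and the highlighted region.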

Eight operations. One API.
The full document lifecycle.

Not just an extractor — a composable pipeline for real document workflows.

OCR

Get clean text, page text, tokens, geometry, and optional raw DocAI output through one endpoint. Nugget returns OCR inline or queues the job automatically.

POST /v1/ocr

Split

Handle composite document bundles without manual separation.

POST /v1/split

Classify

Route documents to the right workflow before you even open them.

POST /v1/classify

Locate

Find semantic regions before extraction — sections, tables, appendices, or signature blocks.

POST /v1/locate

Extract

Turn documents into clean, typed data — without building extraction rules.

POST /v1/extractions

Resolve

When multiple sources disagree, get the right answer automatically.

POST /v1/resolve

Map

Connect messy real-world descriptions to your canonical master data.

POST /v1/mappings

Infer

Derive structured conclusions — regulatory flags, categorizations, summaries — from extracted data.

POST /v1/infer

Compose them into a pipeline
OCR → Split → Classify → Locate → Extract → Resolve → Map → Infer

Process a composite customs bundle: split the PDF into individual documents, classify each one, locate relevant regions, extract the fields you need, resolve conflicts between overlapping sources, map product descriptions to HS codes, and infer regulatory requirements — all through one API.
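
The first steps of such a pipeline might be composed as in the Python sketch below. Everything about the payloads is hypothetical: `call` stands in for an HTTP POST helper, and the `segments` and `label` keys are invented for illustration. Only the endpoint paths and the idea of referencing documents by `file_id` come from this page.

```python
from typing import Callable

# Stand-in for a helper that POSTs a payload to a path and returns parsed JSON.
Call = Callable[[str, dict], dict]


def process_bundle(call: Call, file_id: str, schemas_by_label: dict) -> list[dict]:
    """Split a bundle, classify each segment, and extract where a schema exists."""
    results = []
    # Hypothetical response shape: {"segments": [{"file_id": "..."}, ...]}
    for segment in call("/v1/split", {"file_id": file_id})["segments"]:
        # Hypothetical response shape: {"label": "..."}
        label = call("/v1/classify", {"file_id": segment["file_id"]})["label"]
        schema = schemas_by_label.get(label)
        if schema is None:
            continue  # no schema registered for this document type
        extraction = call(
            "/v1/extractions",
            {"file_id": segment["file_id"], "schema": schema},
        )
        results.append({"label": label, "extraction": extraction})
    return results
```

Resolve, Map, and Infer would follow the same pattern, each consuming the output of the step before it.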

Built for real document workflows

From trade compliance to invoice processing — if your workflow involves documents and structured data, Nugget handles it.

Trade & Customs

From dock to database in seconds

Process commercial invoices, packing lists, and bills of lading. Extract line items, match HS codes with fuzzy mapping, and flag regulatory requirements automatically.

// Split a customs bundle, then extract from each segment
{
  "hs_code": { "type": "string" },
  "country_of_origin": { "type": "string" },
  "declared_value": { "type": "number" }
}
Invoice & AP

Any vendor, any format, one schema

Extract vendor details, line items, and totals from any invoice format. No template setup per vendor — write a schema once and it works across all of them.

// Works for every vendor
{
  "vendor": { "type": "string" },
  "invoice_no": { "type": "string" },
  "due_date": { "type": "string" },
  "total": { "type": "number" }
}
Document Pipelines

End-to-end, not point solution

Classify incoming documents, split composite PDFs, extract fields, resolve conflicts between sources, and push clean structured data to your downstream systems.

// Webhook delivery on completion
{
  "callback_url": "https://example.com/webhook",
  "cache": true
}

Under the hood

Built for production from day one.

Intelligent caching

Fingerprint-based caching at every layer — re-processing identical documents is nearly free. Upload once, then reference by file_id across all endpoints.
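
The same idea can be mirrored on the client side, sketched here in Python. It assumes a fingerprint is a SHA-256 hash of the file bytes, which is a common choice but not confirmed by this page. `upload` is a hypothetical callable that sends the bytes and returns a `file_id`.

```python
import hashlib
from typing import Callable


def fingerprint(content: bytes) -> str:
    """Content hash used as the cache key (assumed: SHA-256 of the raw bytes)."""
    return hashlib.sha256(content).hexdigest()


class UploadCache:
    """Remember the file_id for each distinct document so it is uploaded once."""

    def __init__(self, upload: Callable[[bytes], str]) -> None:
        self._upload = upload
        self._file_ids: dict[str, str] = {}

    def file_id_for(self, content: bytes) -> str:
        key = fingerprint(content)
        if key not in self._file_ids:
            # First time this exact content is seen: upload and remember it.
            self._file_ids[key] = self._upload(content)
        return self._file_ids[key]
```

With the `file_id` in hand, later calls to other endpoints can reference the document without sending the bytes again.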

Webhooks & async

Nugget picks sync vs async automatically, so you never choose. Async jobs get at-least-once webhook delivery with HMAC signing, so you don't have to poll.
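
At-least-once delivery means the same webhook can arrive more than once, so a receiver should be idempotent. The Python sketch below deduplicates on `job_id`, the field shown in the webhook example. The in-memory set is for illustration only; a real receiver would likely use a durable store.

```python
from typing import Callable


class WebhookDeduplicator:
    """Process each job's webhook once, even if it is delivered repeatedly."""

    def __init__(self) -> None:
        self._seen: set[str] = set()

    def handle(self, payload: dict, process: Callable[[dict], None]) -> bool:
        """Run process(payload) for new jobs; return False for duplicates."""
        job_id = payload["job_id"]
        if job_id in self._seen:
            return False
        process(payload)
        # Mark as seen only after processing succeeds, so a failed attempt
        # can be retried when the webhook is delivered again.
        self._seen.add(job_id)
        return True
```

Returning a success status for duplicates as well tells the sender to stop retrying.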

Multi-tenant

API key auth, per-tenant rate limits, isolated storage. Built for shared infrastructure.

Large document support

Up to 200 pages per PDF, with intelligent chunking and automatically queued OCR for large uncached documents.

Field normalization

Built-in ops: trim, collapse whitespace, uppercase, titlecase, strip periods, and custom hints.
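
To make the listed operations concrete, here is a local Python sketch of what each one plausibly does to a string. The operation names used as dictionary keys are assumptions, and this is not Nugget's implementation. Custom hints are left out because their form is not described here.

```python
import re

# Hypothetical op names mapped to plain string transformations.
OPS = {
    "trim": lambda s: s.strip(),
    "collapse_whitespace": lambda s: re.sub(r"\s+", " ", s),
    "uppercase": lambda s: s.upper(),
    "titlecase": lambda s: s.title(),
    "strip_periods": lambda s: s.replace(".", ""),
}


def normalize(value: str, ops: list[str]) -> str:
    """Apply the named operations to value, in the order given."""
    for op in ops:
        value = OPS[op](value)
    return value
```

Order matters: trimming before collapsing whitespace, for example, avoids leaving a single leading or trailing space behind.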

Auto-repair

Schema-validated outputs with automatic retry and repair when the model output doesn't match the schema.
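
A validate-and-retry loop of this kind can be sketched in Python as follows. This is a generic illustration of the pattern, not Nugget's internals. `generate` is a hypothetical callable that receives the previous attempt's errors and returns a new candidate output, and `validate` covers only required fields and a few basic types.

```python
from typing import Callable

# Subset of JSON Schema types this sketch knows how to check.
PYTHON_TYPES = {"string": str, "integer": int, "number": (int, float), "array": list, "object": dict}


def validate(output: dict, schema: dict) -> list[str]:
    """Return a list of problems; an empty list means the output is acceptable."""
    errors = [
        f"missing required field: {name}"
        for name in schema.get("required", [])
        if name not in output
    ]
    for name, spec in schema.get("properties", {}).items():
        expected = PYTHON_TYPES.get(spec.get("type"))
        if name not in output or expected is None:
            continue
        value = output[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            errors.append(f"wrong type for {name}: expected {spec['type']}")
    return errors


def generate_with_repair(generate: Callable[[list[str]], dict], schema: dict, max_attempts: int = 3) -> dict:
    """Call generate until its output validates, feeding back the errors each time."""
    errors: list[str] = []
    for _ in range(max_attempts):
        output = generate(errors)
        errors = validate(output, schema)
        if not errors:
            return output
    raise ValueError(f"output still invalid after {max_attempts} attempts: {errors}")
```

Feeding the validation errors back into the next attempt is what distinguishes repair from a blind retry.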

Text & protected inputs

Extract or classify straight from raw text, and process password-protected PDFs — decrypted in memory, never stored.

Ready to try it?

Explore the interactive API docs, try a request, and see the response shape. No sign-up required to browse.