One API for document intelligence: clean OCR text, schema-based extraction, classification, semantic location, and workflow primitives.
Keep JSON Schema for extraction. Use OCR, classification, and location directly when your workflow needs them.
Every AI vendor can pull text from a PDF. The real problems are everything else.
Training models, building templates, labeling examples — for every new document type. Then re-doing it when the format changes.
Getting a value with no way to verify it. No confidence score. No source reference. No audit trail. Just "here's what we think."
Separate tools for classification, extraction, reconciliation, and code mapping. Three vendors and a Zapier integration to process one document.
Different pipelines for PDFs, images, CSVs, and Excel files. Different parsers, different edge cases, different failure modes.
Nugget replaces all of it with one composable API for OCR, extraction, classification, location, and reasoning.
Define the output you want. Nugget figures out how to get it.
Standard JSON Schema — describe the fields you want extracted, their types, and which are required.
{
  "type": "object",
  "properties": {
    "vendor": { "type": "string" },
    "total": { "type": "number" }
  }
}
Upload any supported format. Nugget handles OCR, parsing, and chunking automatically.
Structured JSON matching your schema — with confidence scores and source evidence on every field.
{
  "value": "Acme Corp",
  "confidence": 0.97,
  "evidence": [{ ... }]
}
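One way to use the per-field confidence is to send low-confidence values to manual review. The sketch below assumes each extracted field arrives as an object with "value" and "confidence" keys, as in the sample above. The function name, the 0.9 threshold, and the sample numbers are illustrative choices, not part of Nugget's API.

```python
from typing import Any


def split_by_confidence(
    fields: dict[str, dict[str, Any]], threshold: float = 0.9
) -> tuple[dict[str, Any], list[str]]:
    """Return (accepted values, names of fields that need review).

    Assumes every field looks like {"value": ..., "confidence": float}.
    The threshold is an arbitrary illustrative default.
    """
    accepted: dict[str, Any] = {}
    needs_review: list[str] = []
    for name, field in fields.items():
        confidence = field.get("confidence")
        if confidence is not None and confidence >= threshold:
            accepted[name] = field.get("value")
        else:
            # Missing or low confidence: leave the decision to a human
            needs_review.append(name)
    return accepted, needs_review


# Made-up sample data in the shape shown above
sample = {
    "vendor": {"value": "Acme Corp", "confidence": 0.97},
    "total": {"value": 4250.0, "confidence": 0.62},
}
accepted, needs_review = split_by_confidence(sample)
print(accepted)      # {'vendor': 'Acme Corp'}
print(needs_review)  # ['total']
```

The right threshold depends on how costly a wrong value is in your workflow, so treat 0.9 as a starting point only.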
curl -X POST "$NUGGET_URL/v1/ocr" \
  -H "x-api-key: $NUGGET_API_KEY" \
  -F "file=@document.pdf"
A complete extraction in one call — the result is pushed to your webhook.
curl -X POST "$NUGGET_URL/v1/extractions" \
  -H "x-api-key: $NUGGET_API_KEY" \
  -F "file=@invoice.pdf" \
  -F "callback_url=https://yourapp.com/webhooks/nugget" \
  -F "callback_signing_secret=$NUGGET_SIGNING_SECRET" \
  -F 'schema={
    "type": "object",
    "properties": {
      "vendor_name": { "type": "string", "description": "Company that issued the invoice" },
      "invoice_number": { "type": "string" },
      "total_amount": { "type": "number", "description": "Total amount due" },
      "line_items": {
        "type": "array",
        "items": {
          "type": "object",
          "properties": {
            "description": { "type": "string" },
            "quantity": { "type": "integer" },
            "unit_price": { "type": "number" }
          }
        }
      }
    },
    "required": ["vendor_name", "invoice_number"]
  }'
X-Nugget-Event: extraction.completed
X-Nugget-Signature: sha256=9f86d081884c…

{
  "job_id": "8f3c2a91",
  "status": "succeeded",
  "result": {
    "data": {
      "vendor_name": {
        "value": "Acme Corp",
        "confidence": 0.97,
        "evidence": [{
          "page": 1,
          "quote": "ACME CORP\n123 Business Lane",
          "bbox": { "x_min": 0.05, "y_min": 0.02, "x_max": 0.45, "y_max": 0.08 }
        }]
      },
      "invoice_number": {
        "value": "INV-2024-0847",
        "confidence": 0.99,
        "evidence": [{
          "page": 1,
          "quote": "Invoice #: INV-2024-0847",
          "bbox": { "x_min": 0.62, "y_min": 0.04, "x_max": 0.93, "y_max": 0.09 }
        }]
      },
      "total_amount": {
        "value": 4250.00,
        "confidence": 0.95,
        "evidence": [{
          "page": 2,
          "quote": "Total Due: $4,250.00",
          "bbox": { "x_min": 0.64, "y_min": 0.86, "x_max": 0.92, "y_max": 0.91 }
        }]
      }
    }
  }
}
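Before trusting a webhook, the receiver would normally check the X-Nugget-Signature header. The sketch below assumes the header is "sha256=" followed by the hex HMAC-SHA256 of the raw request body, keyed with the callback_signing_secret. That scheme is a common convention and an assumption here, so confirm the exact format in the API docs. The function name and the test secret are hypothetical.

```python
import hashlib
import hmac


def verify_signature(raw_body: bytes, header_value: str, secret: str) -> bool:
    """Check an X-Nugget-Signature header against the raw request body.

    Assumption: header = "sha256=" + hex HMAC-SHA256(body, signing secret).
    """
    scheme, _, received = header_value.partition("=")
    if scheme != "sha256" or not received:
        return False
    expected = hmac.new(secret.encode(), raw_body, hashlib.sha256).hexdigest()
    # compare_digest avoids leaking the mismatch position through timing
    return hmac.compare_digest(expected.encode(), received.encode())


# Self-check with a locally computed signature (made-up secret and body)
body = b'{"job_id": "8f3c2a91", "status": "succeeded"}'
secret = "test-secret"
good = "sha256=" + hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()
print(verify_signature(body, good, secret))         # True
print(verify_signature(body + b" ", good, secret))  # False
```

Verify against the exact bytes received, before any JSON parsing, because re-serializing the payload can change whitespace and break the comparison.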
Prefer to poll? Omit callback_url and GET the returned status_url instead.
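A polling loop could look like the sketch below. It takes a fetch_status callable (which would GET the status_url and return parsed JSON) so the example runs without network access. The "status" key and the "succeeded" value come from the webhook sample; the "failed" and "processing" values, the function name, and the timing defaults are assumptions.

```python
import time
from typing import Any, Callable


def poll_job(
    fetch_status: Callable[[], dict[str, Any]],
    interval_s: float = 2.0,
    max_attempts: int = 30,
    sleep: Callable[[float], None] = time.sleep,
) -> dict[str, Any]:
    """Call fetch_status until the job reaches a terminal state.

    Assumption: terminal states are "succeeded" and "failed".
    """
    for _ in range(max_attempts):
        job = fetch_status()
        if job.get("status") in ("succeeded", "failed"):
            return job
        sleep(interval_s)
    raise TimeoutError("job did not finish within the polling budget")


# Demo with canned responses instead of real HTTP calls
responses = iter([{"status": "processing"}, {"status": "succeeded"}])
final = poll_job(lambda: next(responses), sleep=lambda _s: None)
print(final["status"])  # succeeded
```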
Nugget doesn't just give you data — it proves where every value came from. Per-field confidence scores, exact evidence quotes from the source text, page numbers, and pixel-level bounding boxes.
In regulated industries — trade compliance, financial services, insurance — "we think this is the right value" isn't good enough. You need an audit trail. Nugget builds one automatically.
{
  "path": "vendor_name",
  "value": "Acme Corp",
  "confidence": 0.97,
  "evidence": [
    {
      "page": 1,
      "quote": "ACME CORP\n123 Business Lane\nSan Francisco, CA 94102",
      "bbox": {
        "x_min": 0.051,
        "y_min": 0.023,
        "x_max": 0.448,
        "y_max": 0.081
      }
    }
  ],
  "error": null
}
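To highlight evidence on a rendered page, the bounding box has to be scaled to the image size. The sketch below assumes the bbox values are fractions of page width and height measured from the top-left corner, which is what the 0 to 1 range in the sample suggests but is not confirmed here. The function name and the 1000 x 2000 page size are illustrative.

```python
def bbox_to_pixels(
    bbox: dict[str, float], width_px: int, height_px: int
) -> tuple[int, int, int, int]:
    """Scale a normalized bbox to (left, top, right, bottom) in pixels.

    Assumption: coordinates are fractions of the page, origin at top-left.
    """
    left = round(bbox["x_min"] * width_px)
    top = round(bbox["y_min"] * height_px)
    right = round(bbox["x_max"] * width_px)
    bottom = round(bbox["y_max"] * height_px)
    return left, top, right, bottom


# bbox copied from the sample record above; the page size is made up
bbox = {"x_min": 0.051, "y_min": 0.023, "x_max": 0.448, "y_max": 0.081}
box = bbox_to_pixels(bbox, width_px=1000, height_px=2000)
print(box)  # (51, 46, 448, 162)
```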
Not just an extractor — a composable pipeline for real document workflows.
Get clean text, page text, tokens, geometry, and optional raw DocAI output through one endpoint. Nugget returns OCR results inline or queues the job automatically.
POST /v1/ocr
Handle composite document bundles without manual separation.
POST /v1/split
Route documents to the right workflow before you even open them.
POST /v1/classify
Find semantic regions before extraction — sections, tables, appendices, or signature blocks.
POST /v1/locate
Turn documents into clean, typed data — without building extraction rules.
POST /v1/extractions
When multiple sources disagree, get the right answer automatically.
POST /v1/resolve
Connect messy real-world descriptions to your canonical master data.
POST /v1/mappings
Derive structured conclusions — regulatory flags, categorizations, summaries — from extracted data.
POST /v1/infer
Process a composite customs bundle: split the PDF into individual documents, classify each one, locate relevant regions, extract the fields you need, resolve conflicts between overlapping sources, map product descriptions to HS codes, and infer regulatory requirements — all through one API.
From trade compliance to invoice processing — if your workflow involves documents and structured data, Nugget handles it.
Process commercial invoices, packing lists, and bills of lading. Extract line items, match HS codes with fuzzy mapping, and flag regulatory requirements automatically.
// Split a customs bundle, then
// extract from each segment
{
  "hs_code": { "type": "string" },
  "country_of_origin": { "type": "string" },
  "declared_value": { "type": "number" }
}
Extract vendor details, line items, and totals from any invoice format. No template setup per vendor — write a schema once and it works across all of them.
// Works for every vendor
{
  "vendor": { "type": "string" },
  "invoice_no": { "type": "string" },
  "due_date": { "type": "string" },
  "total": { "type": "number" }
}
Classify incoming documents, split composite PDFs, extract fields, resolve conflicts between sources, and push clean structured data to your downstream systems.
// Webhook delivery on completion
{
  "callback_url": "https://example.com/webhook",
  "cache": true
}
Built for production from day one.
Fingerprint-based caching at every layer — re-processing identical documents is nearly free. Upload once, then reference by file_id across all endpoints.
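The same idea can be mirrored on the client to avoid re-uploading a file you have already sent. The sketch below uses a SHA-256 content hash as the fingerprint and remembers the file_id returned for it. How Nugget computes its own fingerprints is not stated here, so the hash choice, the UploadCache class, and the fake uploader are all assumptions for illustration.

```python
import hashlib
from typing import Callable


def fingerprint(data: bytes) -> str:
    """Content hash used as a cache key (assumed scheme: SHA-256)."""
    return hashlib.sha256(data).hexdigest()


class UploadCache:
    """Upload each distinct file once and reuse its file_id afterwards."""

    def __init__(self, upload: Callable[[bytes], str]) -> None:
        self._upload = upload  # callable that uploads bytes, returns a file_id
        self._file_ids: dict[str, str] = {}

    def file_id_for(self, data: bytes) -> str:
        key = fingerprint(data)
        if key not in self._file_ids:
            self._file_ids[key] = self._upload(data)
        return self._file_ids[key]


# Stand-in uploader so the example runs without network access
calls: list[bytes] = []


def fake_upload(data: bytes) -> str:
    calls.append(data)
    return f"file_{len(calls)}"


cache = UploadCache(fake_upload)
first = cache.file_id_for(b"%PDF-1.7 sample")
second = cache.file_id_for(b"%PDF-1.7 sample")
print(first == second, len(calls))  # True 1
```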
Nugget picks sync vs async automatically, so you never have to choose. Async jobs get at-least-once webhook delivery with HMAC signing, so you don't have to poll.
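At-least-once delivery means the same webhook can arrive more than once, so the receiver should be idempotent. The sketch below deduplicates on job_id, assuming that value is the same across redeliveries. The WebhookDeduper class is hypothetical, and the in-memory set stands in for whatever persistent store a real service would use.

```python
from typing import Any, Callable


class WebhookDeduper:
    """Process each job's webhook once, even if it is delivered repeatedly."""

    def __init__(self, handler: Callable[[dict[str, Any]], None]) -> None:
        self._handler = handler
        self._seen: set[str] = set()  # use a database or cache in production

    def handle(self, payload: dict[str, Any]) -> bool:
        """Return True if the payload was processed, False if it was a repeat."""
        job_id = payload["job_id"]
        if job_id in self._seen:
            return False
        self._handler(payload)
        # Mark as seen only after the handler succeeds, so a failed
        # attempt can still be retried on redelivery.
        self._seen.add(job_id)
        return True


processed: list[dict[str, Any]] = []
deduper = WebhookDeduper(processed.append)
payload = {"job_id": "8f3c2a91", "status": "succeeded"}
first_delivery = deduper.handle(payload)
second_delivery = deduper.handle(payload)
print(first_delivery, second_delivery, len(processed))  # True False 1
```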
API key auth, per-tenant rate limits, isolated storage. Built for shared infrastructure.
Up to 200 pages per PDF with intelligent chunking and automatic queued OCR for large uncached documents.
Built-in ops: trim, collapse whitespace, uppercase, titlecase, strip periods, and custom hints.
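To make the effect of these ops concrete, here is a local approximation of the string ones. The op names, their exact semantics, and the order of application are guesses based on the list above, not Nugget's implementation, and custom hints are left out because their behavior is not described.

```python
import re
from typing import Callable

# Hypothetical local equivalents of the listed ops
OPS: dict[str, Callable[[str], str]] = {
    "trim": str.strip,
    "collapse_whitespace": lambda s: re.sub(r"\s+", " ", s),
    "uppercase": str.upper,
    "titlecase": str.title,
    "strip_periods": lambda s: s.replace(".", ""),
}


def normalize(value: str, ops: list[str]) -> str:
    """Apply the named ops to value, left to right."""
    for name in ops:
        value = OPS[name](value)
    return value


cleaned = normalize(
    "  acme   corp.  ",
    ["trim", "collapse_whitespace", "strip_periods", "titlecase"],
)
print(cleaned)  # Acme Corp
```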
Schema-validated outputs with automatic retry and repair when the model output doesn't match.
Extract or classify straight from raw text, and process password-protected PDFs — decrypted in memory, never stored.
Explore the interactive API docs, try a request, see the response shape. No sign-up required to browse.