How accurate is AI invoice processing?

On well-tuned pipelines we routinely see 85–92% auto-post rates, with human review on the rest. The rate depends heavily on layout variety, vendor mix, and how cleanly your business rules can be encoded. We measure accuracy per field (recall and precision), not as a single vanity number, and tune until the math justifies deploying.

What about handwritten invoices, receipts, or low-resolution scans?

Vision LLMs handle these meaningfully better than 2020-era OCR. Handwriting accuracy is lower but workable for cleanly written numerical fields (totals, dates). Phone-camera receipts work well when the shot is reasonably framed. Low-res scans hit a quality floor below which any system struggles — we identify the threshold during the prototype phase.

Do I need to retrain the model for my vendors?

Usually no. General vision LLMs (Claude vision, GPT-4o vision, Gemini) handle layout variation without retraining. For very high-volume vendors with consistent layouts, we sometimes use per-vendor templates as a cost/latency optimisation — but the general model is the fallback for everything else.

What's the cost per invoice in production?

LLM cost: €0.01–€0.05 per invoice depending on length and number of pages. Infrastructure: trivial at typical volumes. Reviewer time on the ~10–15% that go to review: ~30 seconds per invoice. Loaded cost typically lands at €0.10–€0.30/invoice vs €3–€5 for fully manual processing.

Can it integrate with our ERP?

Yes. We've shipped integrations with NetSuite, Microsoft Dynamics 365 Business Central, Xero, Sage Intacct, and QuickBooks Online. Custom field mapping per ERP is part of the build. Where the ERP supports webhooks we use them; otherwise we poll and use idempotency keys to avoid double-posting.

How AI invoice processing actually works (and where it breaks)

May 5, 2026 · updated May 21, 2026 · 6 min read

The 90-second version

You receive a stack of invoices. Someone keys them into your ERP. We replace that with code.

A modern AI invoice pipeline:

Ingest — email, portal, SFTP. Deduplicate via content hash.
Extract — Claude / GPT-4o / Gemini vision against a typed schema (Zod or Pydantic).
Validate — schema parses cleanly; line items sum to total within tolerance; tax rates valid.
Match — query ERP for candidate POs; apply tolerance rules.
Decide — high confidence → auto-post; medium → human review; low → reject with structured reason.
Post — write to ERP with original PDF attached, audit log entry, source email marked.
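
The Decide step can be sketched as a small routing function. This is a minimal illustration, not our production code: it assumes the pipeline yields a single confidence score between 0 and 1, and the thresholds and reason codes are hypothetical placeholders that would normally come out of the eval suite.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    AUTO_POST = "auto_post"
    REVIEW = "review"
    REJECT = "reject"


@dataclass(frozen=True)
class Decision:
    route: Route
    reason: str  # structured reason code, useful for rejects and the audit log


# Hypothetical thresholds; real values would be tuned against labelled invoices.
HIGH_CONFIDENCE = 0.95
LOW_CONFIDENCE = 0.60


def decide(confidence: float, validation_errors: list[str]) -> Decision:
    """Map an extraction confidence and validation results to a route."""
    if confidence < LOW_CONFIDENCE:
        return Decision(Route.REJECT, "low_extraction_confidence")
    if validation_errors:
        # A failed business rule goes to a human, however confident the model is.
        return Decision(Route.REVIEW, ";".join(validation_errors))
    if confidence >= HIGH_CONFIDENCE:
        return Decision(Route.AUTO_POST, "all_checks_passed")
    return Decision(Route.REVIEW, "medium_confidence")
```

Sending every validation failure to review, regardless of model confidence, is a design choice in this sketch: it keeps auto-posting conservative at the cost of a slightly longer queue.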

The model is not the hard part. The hard parts are the schema, the reviewer UI, and the eval suite.

Where it breaks

Predictable failure modes, in roughly the order they bite:

Layout variation

Different vendors send different layouts. Some put line items in tables, others in free-form text. Some use abbreviations ("Qty" vs "Quantity"). Some have tax codes that need decoding.

How we handle it: general-purpose vision extraction with a strict Zod schema. The model figures out which field is which from semantic context, not position. For very high-volume vendors with consistent layouts, we layer per-vendor templates on top as an optimisation.
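
To make "strict schema" concrete, here is a rough Python stand-in for what a Zod or Pydantic schema does, using only standard-library dataclasses. The field names (`vendor_name`, `invoice_number`, and so on) are illustrative assumptions, not the schema from any real engagement. The point is that model output either parses into typed values or fails loudly.

```python
from dataclasses import dataclass
from datetime import date
from decimal import Decimal, InvalidOperation


@dataclass(frozen=True)
class LineItem:
    description: str
    quantity: Decimal
    unit_price: Decimal


@dataclass(frozen=True)
class Invoice:
    vendor_name: str
    invoice_number: str
    invoice_date: date
    currency: str
    line_items: tuple[LineItem, ...]
    total: Decimal


def _decimal(raw: dict, key: str) -> Decimal:
    try:
        return Decimal(str(raw[key]))
    except (KeyError, InvalidOperation) as exc:
        raise ValueError(f"missing or non-numeric field: {key}") from exc


def _text(raw: dict, key: str) -> str:
    value = raw.get(key)
    if not isinstance(value, str) or not value.strip():
        raise ValueError(f"missing or empty field: {key}")
    return value.strip()


def parse_invoice(raw: dict) -> Invoice:
    """Strictly parse model output; missing or malformed fields raise ValueError."""
    items = raw.get("line_items")
    if not isinstance(items, list) or not items or not all(isinstance(i, dict) for i in items):
        raise ValueError("line_items must be a non-empty list of objects")
    currency = _text(raw, "currency").upper()
    if len(currency) != 3 or not currency.isalpha():
        raise ValueError("currency must be a three-letter code")
    return Invoice(
        vendor_name=_text(raw, "vendor_name"),
        invoice_number=_text(raw, "invoice_number"),
        invoice_date=date.fromisoformat(_text(raw, "invoice_date")),  # expects YYYY-MM-DD
        currency=currency,
        line_items=tuple(
            LineItem(_text(i, "description"), _decimal(i, "quantity"), _decimal(i, "unit_price"))
            for i in items
        ),
        total=_decimal(raw, "total"),
    )
```

Nothing in the parser refers to where a value sat on the page; layout handling is left entirely to the extraction step.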

Multi-page invoices

Long invoices, often with continuation lines or split totals.

How we handle it: pass all pages to the vision model in one call (most production models handle 50+ page documents reasonably). For very long documents, chunk and run a second-pass merge step.

Tax math that doesn't add up

The vendor invoice says €100 subtotal, €19 tax, €119 total, but the extracted line items (quantity × unit price) don't sum to the €100 subtotal. The math doesn't reconcile. Could be rounding. Could be a missing line. Could be a discount we missed.

How we handle it: tolerance rules (€1 or 0.5% mismatch acceptable for rounding). Beyond tolerance, the invoice goes to review with the discrepancy highlighted.
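
A minimal sketch of such a tolerance check follows. One assumption to flag: the rule above doesn't say how the €1 and 0.5% allowances combine, so this example applies whichever is larger. The function names are illustrative.

```python
from decimal import Decimal

ABS_TOLERANCE = Decimal("1.00")   # €1, from the rule above
REL_TOLERANCE = Decimal("0.005")  # 0.5%, from the rule above


def within_tolerance(expected: Decimal, actual: Decimal) -> bool:
    """True if the mismatch is within €1 or 0.5% of the expected amount.

    Assumption: the larger of the two allowances applies.
    """
    allowed = max(ABS_TOLERANCE, abs(expected) * REL_TOLERANCE)
    return abs(expected - actual) <= allowed


def reconcile(line_items, subtotal, tax, total):
    """Return discrepancy messages; an empty list means the math reconciles.

    line_items is an iterable of (quantity, unit_price) Decimal pairs.
    """
    issues = []
    line_sum = sum((q * p for q, p in line_items), Decimal("0"))
    if not within_tolerance(subtotal, line_sum):
        issues.append(f"line items sum to {line_sum}, subtotal says {subtotal}")
    if not within_tolerance(total, subtotal + tax):
        issues.append(f"subtotal + tax is {subtotal + tax}, total says {total}")
    return issues
```

The returned messages are what a reviewer UI could highlight next to the affected fields.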

PO matching across small differences

The invoice says "Wireless Mouse v2" and the PO says "Mouse, wireless, model X-200." Same item. Different description.

How we handle it: vector similarity between invoice line descriptions and PO line descriptions, plus exact match on price within tolerance. The agent picks the most likely PO and surfaces alternatives if confidence is low.
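
The shape of that matching step can be sketched as below. To keep the example self-contained, a bag-of-words vector stands in for a real embedding model, so the similarity scores are far cruder than what an embedding would give. The function names and the default price tolerance are hypothetical.

```python
import math
import re
from collections import Counter
from decimal import Decimal


def _vector(text: str) -> Counter:
    """Bag-of-words vector; a stand-in for a real embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def rank_po_lines(invoice_desc, invoice_price, po_lines, price_tolerance=Decimal("0.01")):
    """Rank PO lines whose price is within tolerance by description similarity.

    po_lines is a list of (po_number, description, unit_price) tuples.
    Returns (po_number, similarity) pairs, best first.
    """
    query = _vector(invoice_desc)
    candidates = [
        (po_number, cosine(query, _vector(desc)))
        for po_number, desc, price in po_lines
        if abs(price - invoice_price) <= price_tolerance
    ]
    return sorted(candidates, key=lambda c: c[1], reverse=True)
```

Returning the full ranked list rather than a single winner is what lets the caller surface alternatives when the top score is low.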

Duplicate invoices

Same vendor, same invoice number, sent twice. Or sent once via email and once via portal.

How we handle it: hash the (vendor, invoice number, total) tuple. Duplicates get caught at ingestion and held with a "possible duplicate" flag for review.
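
As a rough sketch, the duplicate check might look like this. The normalisation rules (case-folding, two-decimal totals) and the in-memory index are assumptions made for illustration; a real pipeline would keep the index in its queue store.

```python
import hashlib
from decimal import Decimal


def dedup_key(vendor_id: str, invoice_number: str, total: Decimal) -> str:
    """Stable hash of the (vendor, invoice number, total) tuple."""
    normalised = "|".join([
        vendor_id.strip().lower(),
        invoice_number.strip().upper(),
        f"{total:.2f}",  # assumes a two-decimal currency
    ])
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


class DuplicateDetector:
    """In-memory stand-in for the queue's dedup index."""

    def __init__(self) -> None:
        self._seen: dict[str, str] = {}

    def check(self, source_id: str, vendor_id: str, invoice_number: str, total: Decimal):
        """Return the source of the first sighting if this looks like a duplicate, else None."""
        key = dedup_key(vendor_id, invoice_number, total)
        first = self._seen.get(key)
        if first is None:
            self._seen[key] = source_id
        return first
```

Because the key is built from extracted fields rather than file bytes, the same invoice arriving once by email and once through a portal still collides.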

Unknown vendors

A new supplier sends their first invoice. No vendor record exists.

How we handle it: route to a "new vendor" review queue. Reviewer either creates the vendor record (then auto-post going forward) or rejects.

Currency and locale

Decimal separators differ (1,500.00 vs 1.500,00). Date formats vary (DD/MM/YYYY vs MM/DD/YYYY). Currency codes are often implicit.

How we handle it: explicit currency and locale fields in the schema. Vision LLMs are generally good at inferring from context, but we validate and ask for review on ambiguous cases.
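
Once the locale is an explicit field, the validation side can be sketched as follows. The parameter names (`decimal_separator`, `day_first`) are hypothetical, and the ambiguity rule shown here (both leading numbers could be a month) is one simple heuristic among several.

```python
from datetime import date
from decimal import Decimal


def parse_amount(text: str, decimal_separator: str) -> Decimal:
    """Parse '1,500.00' or '1.500,00' given the declared decimal separator."""
    if decimal_separator not in {".", ","}:
        raise ValueError("decimal_separator must be '.' or ','")
    grouping = "," if decimal_separator == "." else "."
    cleaned = text.strip().replace(" ", "").replace(grouping, "")
    return Decimal(cleaned.replace(decimal_separator, "."))


def parse_date(text: str, day_first: bool) -> tuple[date, bool]:
    """Parse 'DD/MM/YYYY' or 'MM/DD/YYYY'. Returns (date, ambiguous).

    A date is flagged ambiguous when both leading numbers could be a month
    and they differ, so the other reading would also be a valid date.
    Raises ValueError if the declared order yields an impossible date.
    """
    first, second, year = (int(part) for part in text.strip().split("/"))
    day, month = (first, second) if day_first else (second, first)
    ambiguous = first <= 12 and second <= 12 and first != second
    return date(year, month, day), ambiguous
```

An `ambiguous` flag of `True` is the kind of signal that could route an invoice to review instead of trusting the inferred locale.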

Auth-gated supplier portals

Some vendors send invoices through their portal where you have to log in and download.

How we handle it: where supported, OAuth or API integration. Where not, a credential-managed scraper. Where impractical, manual upload by your AP team kicks off the rest of the pipeline.

Anatomy of a working pipeline

[Email inbox] [Portal upload] [SFTP]
            \   |   /
             [Firestore queue + dedup hash]
                       ↓
             [Claude vision extraction → Zod-typed payload]
                       ↓
             [Business rules: tax math, PO match, vendor whitelist, dup detect]
                       ↓
             [Confidence routing]
              ├─ high → auto-post to ERP
              ├─ med  → review queue (Next.js admin UI)
              └─ low  → reject with structured reason
                       ↓
             [Audit log + dashboard]

Every box observable. Every transition logged. Every decision reversible.

What you should measure

Don't measure "accuracy" without disaggregating. What we actually track:

Metric	Why
Auto-post rate	% of invoices that skip human review
Per-field recall	% of fields present on the invoice that were extracted correctly, by field
Per-field precision	% of extracted values that are correct, by field
Review queue latency	how fast humans clear the queue
Cost per invoice (LLM + reviewer time)	the real economic question
Cycle time (receipt → posted)	downstream business metric
Error escape rate	% of posted invoices later corrected

Vanity numbers like "99% accuracy" without disaggregation are unfalsifiable.
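
Per-field precision and recall are cheap to compute once you have a labelled set. The sketch below assumes predictions and ground truth are parallel lists of dicts, with `None` (or a missing key) meaning "not extracted" or "not on the document"; the data layout is an assumption for illustration.

```python
from collections import defaultdict


def per_field_metrics(predictions, ground_truth):
    """Compute precision and recall per field.

    predictions and ground_truth are parallel lists of dicts (field name -> value).
    """
    extracted = defaultdict(int)  # fields the model returned a value for
    present = defaultdict(int)    # fields that exist in the labelled truth
    correct = defaultdict(int)    # returned values that match the truth

    for pred, truth in zip(predictions, ground_truth):
        for field in set(pred) | set(truth):
            p, t = pred.get(field), truth.get(field)
            if p is not None:
                extracted[field] += 1
            if t is not None:
                present[field] += 1
            if p is not None and p == t:
                correct[field] += 1

    return {
        field: {
            "precision": correct[field] / extracted[field] if extracted[field] else None,
            "recall": correct[field] / present[field] if present[field] else None,
        }
        for field in sorted(set(extracted) | set(present))
    }
```

A field can score well on precision and badly on recall: a model that only returns a PO number when it is certain will rarely be wrong, but will miss many.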

The reviewer UI matters more than the model

Reviewer ergonomics dictate whether the system actually saves time. A bad reviewer UI means a human spends 3 minutes per flagged invoice; a good one means 30 seconds.

What a good UI has:

Document on the left, extracted fields on the right.
Keyboard navigation between fields.
Click-to-highlight: clicking a field highlights the source region on the document.
Bulk approval for the queue.
Per-vendor flagging ("always escalate this vendor's invoices").
Quick reject with structured reason templates.

We spend meaningful engineering on the reviewer UI. It's the difference between automation that saves the team time and automation that adds work.

Real numbers from a real engagement

From one client (anonymised), processing ~400 supplier invoices per week:

Metric	Before	After (week 8)
Human time per invoice	4–6 min	~30 sec (review only)
% keyed manually	100%	~13%
Cycle time	2–4 days	~4 hours
Visible error rate	~3% estimated	<1%
Per-invoice cost (loaded)	€3.80	€0.18

Payback hit at month 4. Full breakdown in the Document Intake Agent case study.

What we won't do

Skip the schema sprint. Spending the first week defining "what is an invoice for your business" is the highest-leverage step. Skip it and the whole pipeline misses.
Skip the reviewer UI. Without it, the eventual savings won't materialise.
Skip the shadow-mode period. Running parallel with the human team for 1–2 weeks catches a class of bugs no eval finds.
Promise specific accuracy numbers before the prototype phase. Anyone who quotes "99% accurate" pre-prototype is bluffing.

Where to go next

For the buyer's perspective on AP automation more broadly, see AI agents for accounts payable. For the full architecture of how document agents work under the hood, see How AI agents actually work.

If you have an AP volume problem, our Document Processing service page covers the engagement model. Or drop us a note — one paragraph is enough.
