How AI invoice processing actually works (and where it breaks)

Updated May 21, 2026 · 6 min read

The 90-second version

You receive a stack of invoices. Someone keys them into your ERP. We replace that with code.

A modern AI invoice pipeline:

  1. Ingest — email, portal, SFTP. Deduplicate via content hash.
  2. Extract — Claude / GPT-4o / Gemini vision against a typed schema (Zod or Pydantic).
  3. Validate — schema parses cleanly; line items sum to total within tolerance; tax rates valid.
  4. Match — query ERP for candidate POs; apply tolerance rules.
  5. Decide — high confidence → auto-post; medium → human review; low → reject with structured reason.
  6. Post — write to ERP with original PDF attached, audit log entry, source email marked.

The model is not the hard part. The hard parts are the schema, the reviewer UI, and the eval suite.

Where it breaks

Predictable failure modes, in roughly the order they bite:

Layout variation

Different vendors send different layouts. Some put line items in tables, others in free-form text. Some use abbreviations ("Qty" vs "Quantity"). Some have tax codes that need decoding.

How we handle it: general-purpose vision extraction with a strict Zod schema. The model figures out which field is which from semantic context, not position. For very high-volume vendors with consistent layouts, we layer per-vendor templates on top as an optimization.
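To make "strict schema" concrete, here is a minimal sketch of a typed invoice payload with a strict parser. It uses plain TypeScript as a dependency-free stand-in for a Zod schema, and the field names (vendorName, invoiceNumber, and so on) are illustrative assumptions, not the schema from any real engagement.

```typescript
// Hypothetical field names; the real list comes out of the schema sprint.
interface LineItem { description: string; quantity: number; unitPrice: number }
interface Invoice {
  vendorName: string;
  invoiceNumber: string;
  currency: string; // ISO 4217 code such as "EUR"
  lineItems: LineItem[];
  total: number;
}
type ParseResult = { ok: true; invoice: Invoice } | { ok: false; errors: string[] };

function isRecord(v: unknown): v is Record<string, unknown> {
  return typeof v === "object" && v !== null && !Array.isArray(v);
}

// Strict parse: any missing or mistyped field rejects the whole payload
// and reports which field failed, so nothing half-valid moves downstream.
function parseInvoice(raw: unknown): ParseResult {
  if (!isRecord(raw)) return { ok: false, errors: ["payload: expected object"] };
  const errors: string[] = [];

  const str = (obj: Record<string, unknown>, key: string, path: string): string => {
    const v = obj[key];
    if (typeof v === "string" && v.trim() !== "") return v;
    errors.push(`${path}: expected non-empty string`);
    return "";
  };
  const num = (obj: Record<string, unknown>, key: string, path: string): number => {
    const v = obj[key];
    if (typeof v === "number" && isFinite(v)) return v;
    errors.push(`${path}: expected finite number`);
    return 0;
  };

  const vendorName = str(raw, "vendorName", "vendorName");
  const invoiceNumber = str(raw, "invoiceNumber", "invoiceNumber");
  const currency = str(raw, "currency", "currency");
  if (currency !== "" && !/^[A-Z]{3}$/.test(currency)) {
    errors.push("currency: expected 3-letter uppercase code");
  }
  const total = num(raw, "total", "total");

  const lineItems: LineItem[] = [];
  const rawItems = raw["lineItems"];
  if (!Array.isArray(rawItems) || rawItems.length === 0) {
    errors.push("lineItems: expected non-empty array");
  } else {
    rawItems.forEach((item: unknown, i: number) => {
      const path = `lineItems[${i}]`;
      if (!isRecord(item)) {
        errors.push(`${path}: expected object`);
        return;
      }
      lineItems.push({
        description: str(item, "description", `${path}.description`),
        quantity: num(item, "quantity", `${path}.quantity`),
        unitPrice: num(item, "unitPrice", `${path}.unitPrice`),
      });
    });
  }

  if (errors.length > 0) return { ok: false, errors };
  return { ok: true, invoice: { vendorName, invoiceNumber, currency, lineItems, total } };
}
```

The point of the strictness is that a failed parse produces a field-level error list a reviewer can act on, instead of a partially filled record.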

Multi-page invoices

Long invoices often span many pages, with line items that continue across page breaks and totals split between pages.

How we handle it: pass all pages to the vision model in one call (most production models handle 50+ page documents reasonably). For very long documents, chunk and run a second-pass merge step.
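The chunk-and-merge fallback can be sketched as follows. This is one deterministic way to do the merge, under the assumption that each chunk has already been extracted into a partial result; the type PartialExtraction and the "last stated total wins" rule are illustrative choices, not a description of the actual second pass.

```typescript
// Hypothetical shape of what one chunk's extraction returns.
interface PartialExtraction {
  lineItems: { description: string; amount: number }[];
  total: number | null; // null when the chunk contains no totals block
}

// Split the page list into consecutive groups of at most `size` pages.
function chunkPages<T>(pages: T[], size: number): T[][] {
  if (size < 1) throw new Error("size must be >= 1");
  const chunks: T[][] = [];
  for (let i = 0; i < pages.length; i += size) {
    chunks.push(pages.slice(i, i + size));
  }
  return chunks;
}

// Merge partial results in page order: line items are concatenated,
// and the last chunk that states a total supplies the invoice total
// (assumption: the grand total sits on or near the final page).
function mergeExtractions(parts: PartialExtraction[]): PartialExtraction {
  const merged: PartialExtraction = { lineItems: [], total: null };
  for (const part of parts) {
    merged.lineItems = merged.lineItems.concat(part.lineItems);
    if (part.total !== null) merged.total = part.total;
  }
  return merged;
}
```

The merged result then goes through the same validation as a single-call extraction, so a line item lost at a chunk boundary shows up as a totals mismatch.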

Tax math that doesn't add up

A vendor invoice states a €100 subtotal, €19 tax, and €119 total, but the schema expects quantities × unit prices to sum to the subtotal, and the line items don't add up to €100. It could be rounding. It could be a missing line. It could be a discount we missed.

How we handle it: tolerance rules (€1 or 0.5% mismatch acceptable for rounding). Beyond tolerance, the invoice goes to review with the discrepancy highlighted.
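A minimal sketch of such a tolerance check. It assumes the two bounds are alternatives (a mismatch passes if it is within €1 or within 0.5% of the stated amount); the function names are hypothetical, and whether the bounds should combine with "or" or "and" is a business decision.

```typescript
const ABS_TOLERANCE = 1.0; // euros
const REL_TOLERANCE = 0.005; // 0.5% of the stated amount

// Assumption: either bound is sufficient on its own.
function withinTolerance(stated: number, computed: number): boolean {
  const diff = Math.abs(stated - computed);
  return diff <= ABS_TOLERANCE || diff <= Math.abs(stated) * REL_TOLERANCE;
}

interface Line { quantity: number; unitPrice: number }

interface SubtotalCheck {
  computed: number;
  discrepancy: number; // stated minus computed, in euros
  needsReview: boolean;
}

// Sum line items in integer cents to avoid floating-point drift,
// then compare against the subtotal printed on the invoice.
function checkSubtotal(lines: Line[], statedSubtotal: number): SubtotalCheck {
  const computedCents = lines.reduce(
    (sum, l) => sum + Math.round(l.quantity * l.unitPrice * 100),
    0,
  );
  const computed = computedCents / 100;
  const discrepancy = Math.round((statedSubtotal - computed) * 100) / 100;
  return { computed, discrepancy, needsReview: !withinTolerance(statedSubtotal, computed) };
}
```

Returning the discrepancy alongside the flag is what lets the review screen highlight the size of the gap, not just the fact that one exists.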

PO matching across small differences

The invoice says "Wireless Mouse v2" and the PO says "Mouse, wireless, model X-200." Same item. Different description.

How we handle it: vector similarity between invoice line descriptions and PO line descriptions, plus exact match on price within tolerance. The agent picks the most likely PO and surfaces alternatives if confidence is low.
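The matching step can be sketched like this, assuming the description embeddings have already been computed elsewhere (the embedding model is out of scope here). The 0.5% price tolerance and 0.8 similarity threshold are placeholder values, and all type and function names are hypothetical.

```typescript
interface PoLine { poNumber: string; embedding: number[]; unitPrice: number }
interface InvoiceLine { embedding: number[]; unitPrice: number }
interface Scored { poNumber: string; score: number }

// Cosine similarity between two equal-length vectors.
function cosine(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("embedding lengths differ");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

// Keep candidates whose price is within tolerance, rank the rest by
// description similarity, and surface alternatives when the best
// score falls below the confidence threshold.
function matchPo(
  line: InvoiceLine,
  candidates: PoLine[],
  priceTolerance = 0.005, // placeholder: 0.5% of the PO price
  minScore = 0.8, // placeholder similarity threshold
): { best: Scored | null; confident: boolean; alternatives: Scored[] } {
  const ranked = candidates
    .filter((c) => Math.abs(c.unitPrice - line.unitPrice) <= c.unitPrice * priceTolerance)
    .map((c) => ({ poNumber: c.poNumber, score: cosine(line.embedding, c.embedding) }))
    .sort((x, y) => y.score - x.score);
  const best = ranked[0] ?? null;
  const confident = best !== null && best.score >= minScore;
  return { best, confident, alternatives: confident ? [] : ranked.slice(1, 4) };
}
```

Filtering on price before ranking on similarity keeps a textually close but wrongly priced PO line from winning the match.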

Duplicate invoices

Same vendor, same invoice number, sent twice. Or sent once via email and once via portal.

How we handle it: hash the (vendor, invoice number, total) tuple. Duplicates get caught at ingestion and held with a "possible duplicate" flag for review.
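A minimal sketch of the tuple-based check. To stay dependency-free it keeps the normalized tuple as a plain string key; a production version would more likely hash that key (for example with SHA-256) before storing it. The normalization rules and the class name are assumptions for illustration.

```typescript
interface InvoiceKeyFields { vendorId: string; invoiceNumber: string; total: number }

// Normalize the (vendor, invoice number, total) tuple so trivial
// formatting differences between channels still collide.
function dedupKey(f: InvoiceKeyFields): string {
  const vendor = f.vendorId.trim().toLowerCase();
  // Drop case, spaces and punctuation: "INV-001" and "inv 001" match.
  const invoiceNo = f.invoiceNumber.toLowerCase().replace(/[^a-z0-9]/g, "");
  const cents = Math.round(f.total * 100);
  return `${vendor}|${invoiceNo}|${cents}`;
}

class DuplicateDetector {
  private seen = new Map<string, string>(); // dedup key -> first invoice id

  // Returns the id of the earlier invoice when this one looks like a
  // duplicate; otherwise records it and returns null.
  check(invoiceId: string, fields: InvoiceKeyFields): string | null {
    const key = dedupKey(fields);
    const earlier = this.seen.get(key);
    if (earlier !== undefined) return earlier;
    this.seen.set(key, invoiceId);
    return null;
  }
}
```

Because a hit only raises a "possible duplicate" flag for review, the normalization can afford to be aggressive: a false positive costs a reviewer a few seconds, while a false negative risks paying twice.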

Unknown vendors

A new supplier sends their first invoice. No vendor record exists.

How we handle it: route to a "new vendor" review queue. The reviewer either creates the vendor record (after which that vendor's invoices can auto-post) or rejects the invoice.

Currency and locale

Decimal separators differ (1,500.00 vs 1.500,00). Date formats vary (DD/MM/YYYY vs MM/DD/YYYY). Currency codes are often implicit.

How we handle it: explicit currency and locale fields in the schema. Vision LLMs are generally good at inferring from context, but we validate and ask for review on ambiguous cases.
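Two of these checks are easy to sketch: parsing an amount once the decimal style is known, and detecting a date that reads validly as both DD/MM and MM/DD. The function names and the DecimalStyle type are hypothetical, and the sketch assumes the locale field has already been extracted.

```typescript
type DecimalStyle = "dot" | "comma"; // "dot": 1,500.00   "comma": 1.500,00

// Parse an amount string given an explicit decimal style.
// Returns null instead of guessing when the text does not fit the style.
function parseAmount(text: string, style: DecimalStyle): number | null {
  const cleaned = text.replace(/[\s€$£]/g, "");
  const groupSeparator = style === "dot" ? "," : ".";
  const normalized = cleaned.split(groupSeparator).join("").replace(",", ".");
  return /^-?\d+(\.\d+)?$/.test(normalized) ? Number(normalized) : null;
}

// True when a numeric date is valid as both DD/MM/YYYY and MM/DD/YYYY
// with different meanings, e.g. 03/04/2026. Text that is not in this
// numeric form returns false here and must be handled by other checks.
function isAmbiguousDate(text: string): boolean {
  const m = /^(\d{1,2})[\/.-](\d{1,2})[\/.-](\d{4})$/.exec(text.trim());
  if (m === null) return false;
  const first = Number(m[1]);
  const second = Number(m[2]);
  return first >= 1 && first <= 12 && second >= 1 && second <= 12 && first !== second;
}
```

An ambiguous date is not necessarily wrong: vendor country or other dates on the same document may settle it. When nothing does, it is a natural trigger for the review path.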

Auth-gated supplier portals

Some vendors send invoices through their portal where you have to log in and download.

How we handle it: where supported, OAuth or API integration. Where not, a credential-managed scraper. Where impractical, manual upload by your AP team kicks off the rest of the pipeline.

Anatomy of a working pipeline

[Email inbox] [Portal upload] [SFTP]
            \   |   /
             [Firestore queue + dedup hash]
                       ↓
             [Claude vision extraction → Zod-typed payload]
                       ↓
             [Business rules: tax math, PO match, vendor whitelist, dup detect]
                       ↓
             [Confidence routing]
              ├─ high → auto-post to ERP
              ├─ med  → review queue (Next.js admin UI)
              └─ low  → reject with structured reason
                       ↓
             [Audit log + dashboard]

Every box observable. Every transition logged. Every decision reversible.
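The confidence-routing box can be sketched as a pure function over the rule results. The thresholds (0.9 and 0.5), the RuleResults shape, and the reason codes are all placeholder assumptions; real values would come out of the eval suite.

```typescript
type Route =
  | { action: "auto_post" }
  | { action: "review"; reasons: string[] }
  | { action: "reject"; reason: string };

// Hypothetical summary of what the business-rules stage produced.
interface RuleResults {
  confidence: number; // extraction confidence in [0, 1]
  mathReconciles: boolean;
  poMatched: boolean;
  vendorKnown: boolean;
  possibleDuplicate: boolean;
}

const HIGH_CONFIDENCE = 0.9; // placeholder threshold
const LOW_CONFIDENCE = 0.5; // placeholder threshold

// Low confidence rejects outright; any failed rule or medium
// confidence sends the invoice to review with structured reasons;
// only a clean, high-confidence invoice auto-posts.
function routeInvoice(r: RuleResults): Route {
  if (r.confidence < LOW_CONFIDENCE) {
    return { action: "reject", reason: "low_extraction_confidence" };
  }
  const reasons: string[] = [];
  if (!r.mathReconciles) reasons.push("totals_out_of_tolerance");
  if (!r.poMatched) reasons.push("no_confident_po_match");
  if (!r.vendorKnown) reasons.push("new_vendor");
  if (r.possibleDuplicate) reasons.push("possible_duplicate");
  if (r.confidence < HIGH_CONFIDENCE) reasons.push("medium_extraction_confidence");
  return reasons.length === 0 ? { action: "auto_post" } : { action: "review", reasons };
}
```

Keeping the router a pure function of its inputs makes every decision replayable from the audit log, which is what makes "reversible" practical.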

What you should measure

Don't measure "accuracy" without disaggregating. What we actually track:

Metric                                    Why
Auto-post rate                            % of invoices that skip human review
Per-field recall                          % of fields correctly extracted, by field
Per-field precision                       % of extracted fields that are correct
Review queue latency                      how fast humans clear the queue
Cost per invoice (LLM + reviewer time)    the real economic question
Cycle time (receipt → posted)             downstream business metric
Error escape rate                         % of posted invoices later corrected

Vanity numbers like "99% accuracy" without disaggregation are unfalsifiable.
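To show what disaggregation looks like in practice, here is a minimal sketch of per-field precision and recall over a labeled set. It assumes exact string equality counts as "correct" and that null means a field is absent; both are simplifications, and the names are hypothetical.

```typescript
type Fields = { [field: string]: string | null }; // null = field absent

interface LabeledDoc { truth: Fields; extracted: Fields }
interface FieldScore { precision: number | null; recall: number | null }

// For each field:
//   precision = correct extractions / all extractions of that field
//   recall    = correct extractions / documents where the field exists
// A ratio with a zero denominator is reported as null, not 0 or 1.
function scoreFields(docs: LabeledDoc[], fieldNames: string[]): { [field: string]: FieldScore } {
  const scores: { [field: string]: FieldScore } = {};
  for (const name of fieldNames) {
    let correct = 0;
    let extractedCount = 0;
    let presentCount = 0;
    for (const doc of docs) {
      const truth = doc.truth[name] ?? null;
      const extracted = doc.extracted[name] ?? null;
      if (extracted !== null) extractedCount++;
      if (truth !== null) presentCount++;
      if (truth !== null && extracted === truth) correct++;
    }
    scores[name] = {
      precision: extractedCount > 0 ? correct / extractedCount : null,
      recall: presentCount > 0 ? correct / presentCount : null,
    };
  }
  return scores;
}
```

A field such as a PO number can score well on precision and poorly on recall (the model is right when it answers but often leaves the field blank), which a single blended accuracy figure would hide.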

The reviewer UI matters more than the model

Reviewer ergonomics dictate whether the system actually saves time. A bad reviewer UI means a human spends 3 minutes per flagged invoice; a good one means 30 seconds.

What a good UI has:

  • Document on the left, extracted fields on the right.
  • Keyboard navigation between fields.
  • Click-to-highlight: clicking a field highlights the source region on the document.
  • Bulk approval for the queue.
  • Per-vendor flagging ("always escalate this vendor's invoices").
  • Quick reject with structured reason templates.

We spend meaningful engineering on the reviewer UI. It's the difference between automation that saves the team time and automation that adds work.

Real numbers from a real engagement

From one client (anonymised), processing ~400 supplier invoices per week:

Metric                       Before           After (week 8)
Human time per invoice       4–6 min          ~30 sec (review only)
% keyed manually             100%             ~13%
Cycle time                   2–4 days         ~4 hours
Visible error rate           ~3% estimated    <1%
Per-invoice cost (loaded)    €3.80            €0.18

Payback hit at month 4. Full breakdown in the Document Intake Agent case study.

What we won't do

  • Skip the schema sprint. Spending the first week defining "what is an invoice for your business" is the highest-leverage step. Skip it and the whole pipeline misses.
  • Skip the reviewer UI. Without it, the eventual savings won't materialise.
  • Skip the shadow-mode period. Running parallel with the human team for 1–2 weeks catches a class of bugs no eval finds.
  • Promise specific accuracy numbers before the prototype phase. Anyone who quotes "99% accurate" pre-prototype is bluffing.

Where to go next

For the buyer's perspective on AP automation more broadly, see AI agents for accounts payable. For the full architecture of how document agents work under the hood, see How AI agents actually work.

If you have an AP volume problem, our Document Processing service page covers the engagement model. Or drop us a note — one paragraph is enough.
