From 40¢ to Under a Nickel: The Real Cost Bridge for AI Document Review

DecoverAI Research & Engineering · A step-by-step cost bridge for running Responsiveness Review and Privilege/ACP Assessment on every document in a matter, plus a worksheet for pricing it against Relativity · July 2026

Here's the arithmetic behind a number we hear constantly from prospects: Relativity prices aiR's privilege review at roughly 30¢ a document — we've written about that comparison before. Add a comparably naive, one-shot Responsiveness Review pass through a frontier model — in our own nine-model benchmark, that's claude-opus-4-8 at $0.0813/doc — and you're at $0.3813. Call it 40 cents. And that's before privilege log drafting, quality control, or project management touch the number at all.

We don't think 40¢ is a technology ceiling. It's what a document review pipeline costs when nobody has gone through the exercise of removing waste from it. This post is that exercise, run in order: five decisions, each building on the last, that take the combined cost of clearing both gates (responsive or not, privileged or not) from 40¢ to under a nickel using nothing but off-the-shelf hosted models. Then we get to where we think the next unlock actually is: small, open-weight models, fine-tuned on labels your own review process is already producing for free.

40¢: naive baseline, one-shot, full-text flagship review on both checks

<5¢: target for both checks combined, using architecture alone; no self-hosting required

58%: share of a typical privilege-eligible collection cleared deterministically before any model runs

~45%: share of documents that actually need a second, full-text pass after a cheap first look

1.Lever One — Model Selection Beats Model Tier

The single biggest lever requires no new architecture at all: stop assuming the most expensive model is the most accurate one. We ran the same nine models — spanning Qwen, DeepSeek, MiniMax, Kimi, and the Claude family — against gold-labeled sets for both tasks, holding the document set and review logic constant and varying only the model. The pattern repeats almost exactly on both tasks:

Model	Privilege F1 · $/doc	Responsiveness F1 · $/doc
Qwen 3.6 Plus	0.868 · `$0.0205`	0.87 · `$0.0205`
DeepSeek V4 Pro	0.815 · `$0.0027`	0.81 · `$0.0027`
claude-haiku-4-5	0.824 · `$0.0195`	—
claude-opus-4-8	0.760 · `$0.0813`	0.76 · `$0.0813`
claude-sonnet-4-6	Precision 0.448 — disqualifying	F1 0.61 (recall 0.963, precision 0.448)

Two things jump out. First, claude-opus-4-8 posts the same headline price twice: $0.0813 a document whether you're checking responsiveness or privilege, because underneath the label it's the same operation both times: read a document, decide against a definition. Apart from claude-sonnet-4-6, whose precision disqualifies it, opus also posts the lowest F1 in the table on both tasks, despite being the most expensive model tested in both. Second, DeepSeek V4 Pro lands within about six F1 points of the best score on both tasks (0.815 vs. 0.868 on privilege, 0.81 vs. 0.87 on responsiveness) at a price 30× lower than opus. Just this swap (no pre-filtering, no cascade, still one-shot, still full text) takes the combined cost of both checks from $0.1626/doc with opus on both to $0.0054/doc.
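The swap arithmetic is easy to reproduce. The sketch below copies three rows from the table (the dictionary keys are informal shorthand, not provider model IDs) and ranks them by privilege F1 per dollar of per-document spend, one simple reading of "accuracy per dollar"; it is an illustration, not the benchmark's scoring method.

```python
# Privilege F1, responsiveness F1, and $/doc per check, copied from the table above.
# Keys are informal shorthand for the table rows, not provider model IDs.
BENCHMARK: dict[str, tuple[float, float, float]] = {
    "qwen-3.6-plus": (0.868, 0.87, 0.0205),
    "deepseek-v4-pro": (0.815, 0.81, 0.0027),
    "claude-opus-4-8": (0.760, 0.76, 0.0813),
}


def combined_cost(usd_per_doc: float, checks: int = 2) -> float:
    """One-shot, full-text cost when the same call runs once per check."""
    return usd_per_doc * checks


def f1_per_dollar(f1: float, usd_per_doc: float) -> float:
    """F1 points per dollar of per-document spend (a crude accuracy-per-dollar proxy)."""
    return f1 / usd_per_doc


opus_price = BENCHMARK["claude-opus-4-8"][2]
deepseek_price = BENCHMARK["deepseek-v4-pro"][2]

print(f"opus, both checks:     ${combined_cost(opus_price):.4f}/doc")      # $0.1626/doc
print(f"deepseek, both checks: ${combined_cost(deepseek_price):.4f}/doc")  # $0.0054/doc
print(f"price ratio: {opus_price / deepseek_price:.1f}x")                  # 30.1x

# Rank by privilege F1 per dollar, best first.
ranked = sorted(
    BENCHMARK,
    key=lambda m: f1_per_dollar(BENCHMARK[m][0], BENCHMARK[m][2]),
    reverse=True,
)
print(ranked)  # ['deepseek-v4-pro', 'qwen-3.6-plus', 'claude-opus-4-8']
```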

Accuracy-per-dollar leaders still need per-task validation

claude-sonnet-4-6 looks cheap and posts strong recall, but its 44.8% precision means more than half of the documents it flags as positive (about 55%) are actually negative. On responsiveness that's an expensive false-positive problem; on privilege it's a sanctions-exposure problem, since every improperly withheld document is a decision a judge can unwind. A benchmark has to be run per task, not borrowed from a leaderboard: a model's F1 on one classification question doesn't reliably transfer to a different one, even when both are phrased as "is this document X."
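The precision point is worth making concrete. The sketch below recomputes claude-sonnet-4-6's responsiveness F1 from its reported precision and recall, shows what 44.8% precision means for the flagged set, and adds a per-task acceptance gate. The gate thresholds in the last line are illustrative assumptions, not values from our benchmark.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def false_share_of_flags(precision: float) -> float:
    """Fraction of positive flags that are wrong (false discovery rate = 1 - precision)."""
    return 1.0 - precision


def passes_task_gate(precision: float, recall: float,
                     min_precision: float, min_recall: float) -> bool:
    """Accept a model for one task only if it clears that task's own floors."""
    return precision >= min_precision and recall >= min_recall


p, r = 0.448, 0.963  # claude-sonnet-4-6, responsiveness run
print(round(f1_score(p, r), 2))            # 0.61
print(round(false_share_of_flags(p), 3))   # 0.552
# Hypothetical privilege floors: strong recall alone does not clear the gate.
print(passes_task_gate(p, r, min_precision=0.80, min_recall=0.90))  # False
```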

2.Lever Two — Stop Paying to Classify What You Can Already Answer With a Lookup

Model selection cuts the price of a model call. The next lever cuts the number of model calls, by refusing to spend one on a question a deterministic check can already answer. Before any document reaches a model in our privilege pipeline, it passes through a pure rule-based screen: is a privileged party even present in the header roster, and has a confirmed non-privileged third party already broken confidentiality? No inference, no tokens — just set membership against a roster built from the matter's own custodian and counsel data.
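A minimal sketch of that kind of roster screen, in Python. The field names, the address normalization, and the return labels are hypothetical; a real roster would be built from the matter's custodian and counsel data and would likely need alias and domain matching that this sketch omits.

```python
from dataclasses import dataclass


def normalize(addresses) -> set[str]:
    """Case- and whitespace-insensitive address set."""
    return {a.strip().lower() for a in addresses}


@dataclass(frozen=True)
class Roster:
    """Hypothetical roster built from the matter's custodian and counsel data."""
    privileged: frozenset[str]             # in-house and outside counsel addresses
    confirmed_third_party: frozenset[str]  # confirmed non-privileged outsiders


def roster_screen(header_parties, roster: Roster) -> str:
    """Pure set membership: no model call, no tokens."""
    parties = normalize(header_parties)
    if not parties & roster.privileged:
        return "cleared:no_privileged_party"
    if parties & roster.confirmed_third_party:
        return "cleared:confidentiality_broken"
    return "needs_model"


roster = Roster(
    privileged=frozenset(normalize({"gc@client.example", "partner@lawfirm.example"})),
    confirmed_third_party=frozenset(normalize({"account@pr-agency.example"})),
)
print(roster_screen({"alice@client.example", "bob@client.example"}, roster))
# cleared:no_privileged_party
print(roster_screen({"GC@client.example", "account@pr-agency.example"}, roster))
# cleared:confidentiality_broken
print(roster_screen({"gc@client.example", "alice@client.example"}, roster))
# needs_model
```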

That single deterministic gate clears 58% of a privilege-eligible collection before a model is ever called. The same principle governs responsiveness ingestion: documents with no extractable text are caught at extraction, and exact duplicates already scored under a different file ID and known non-evidentiary system files are caught by hash lookup, so none of them ever generates a model call. None of this is a modeling technique. It's the recognition that a meaningful share of any collection can be resolved by a lookup table, and every document resolved that way costs exactly $0 in model spend, which pulls the blended cost across the whole collection well below the per-call price quoted in any benchmark table.
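The ingestion side can be sketched the same way. This is a minimal illustration, assuming SHA-256 content hashes, a dictionary of labels already assigned earlier in the scan, and a set of known system-file hashes (in practice something like a de-NISTing reference list); the function name and return labels are hypothetical.

```python
import hashlib
from typing import Optional


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def ingest_screen(
    raw: bytes,
    extracted_text: str,
    already_scored: dict[str, str],
    known_system_hashes: set[str],
) -> tuple[str, Optional[str]]:
    """Decide whether a document needs a model call at all.

    Returns (action, reused_label). `already_scored` maps content hashes to
    labels assigned earlier in the scan, under whichever file ID came first.
    """
    digest = sha256_hex(raw)
    if digest in known_system_hashes:
        return "skip:system_file", None
    if not extracted_text.strip():
        return "skip:no_text", None
    if digest in already_scored:
        return "reuse", already_scored[digest]
    return "needs_model", None


scored: dict[str, str] = {}
print(ingest_screen(b"Q3 forecast memo", "Q3 forecast memo", scored, set()))
# ('needs_model', None)
scored[sha256_hex(b"Q3 forecast memo")] = "responsive"  # label from that first model call
print(ingest_screen(b"Q3 forecast memo", "Q3 forecast memo", scored, set()))
# ('reuse', 'responsive')
print(ingest_screen(b"\x89PNG...", "", scored, set()))
# ('skip:no_text', None)
```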

3.Lever Three — Read the Metadata Before You Read the Document

Everything that survives the deterministic gate still doesn't need the same amount of model attention. Both pipelines run a cheap first pass over a compact, structured summary of the document — not the full text — and only escalate to a full-text pass with a stronger model when that first pass can't confidently resolve the question.

Deterministic: roster / dedup / no-text screen, zero model calls
→ Cheap pass: structured metadata + compact summary, small model
→ Escalation (∼45%): full document text, stronger model, only for unresolved cases

In practice, roughly 45% of documents need that deeper pass; the rest are confidently resolved from structured metadata and a compact summary alone, at a fraction of the full-text price. Confidence is calibrated, not self-reported — models cluster their own stated confidence near 1.0 regardless of whether they're actually right, so escalation is governed by a validated decision threshold and a hard floor below which a document is never auto-resolved either way. Ambiguity always escalates rather than resolves; cost discipline never gets to trade away recall. The privilege pipeline runs the identical shape one layer further: a coarse legal-purpose screen on a summary escalates anything with a legal signal, and full-text judgment is reserved for whatever survives every gate before it.
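The routing rule can be sketched as follows. We assume the cheap pass emits a raw score and that a Platt-scaling calibrator (a logistic fit of that score against gold labels) maps it to a probability; we read "validated decision threshold plus hard floor" as two cutoffs on the calibrated probability, with everything between them escalated. The coefficients, cutoffs, and the cheap-pass price are placeholders, not production values.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class PlattCalibrator:
    """Maps a raw cheap-pass score to a calibrated probability.

    a and b are assumed to be fit on a gold-labeled validation set,
    not taken from the model's self-reported confidence.
    """
    a: float
    b: float

    def prob(self, raw_score: float) -> float:
        return 1.0 / (1.0 + math.exp(-(self.a * raw_score + self.b)))


def route(raw_score: float, cal: PlattCalibrator,
          resolve_positive_at: float = 0.90,
          resolve_negative_at: float = 0.05) -> str:
    """Auto-resolve only past a cutoff; anything in between escalates.

    The negative cutoff is deliberately stricter, so clearing a document
    as negative demands more certainty: cost savings never trade away recall.
    """
    p = cal.prob(raw_score)
    if p >= resolve_positive_at:
        return "positive"
    if p <= resolve_negative_at:
        return "negative"
    return "escalate"


def expected_cost(cheap_per_doc: float, full_per_doc: float, escalation_rate: float) -> float:
    """Every document pays for the cheap pass; only escalated ones pay for full text."""
    return cheap_per_doc + escalation_rate * full_per_doc


cal = PlattCalibrator(a=1.0, b=0.0)
print(route(5.0, cal), route(-5.0, cal), route(0.5, cal))  # positive negative escalate
# Hypothetical $0.0005 cheap pass, $0.0027 full-text pass, 45% escalation.
print(round(expected_cost(0.0005, 0.0027, 0.45), 6))        # 0.001715
```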

4.Lever Four — Cache the Parts of the Prompt That Don't Change

A matter's tagging or privilege definition (the legal standard every document gets judged against) is often tens of thousands of characters, and it's identical across every document in the scan. Compiling it once per scan into a compact brief, rather than re-sending and re-billing the full definition on each of hundreds of thousands of calls, shrinks the largest repeated part of every prompt for the price of a single compile step. Layer provider-side prompt caching on top of that stable prefix and the repeated tokens are billed at a fraction of the standard input rate on every cache hit after the first call.
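To see why the brief and the cache compound, here is a back-of-the-envelope model of what the repeated prefix costs across one scan. The token counts (10,000 tokens for a full definition, roughly 40,000 characters at about four characters per token; 1,500 for the brief), the $1.00-per-million-token input price, and the cache write/read multipliers are hypothetical placeholders. Providers price caching differently, and this sketch ignores cache expiry and minimum-prefix rules.

```python
def prefix_cost(calls: int, prefix_tokens: int, usd_per_mtok: float,
                cache_write_mult: float = 1.0, cache_read_mult: float = 1.0) -> float:
    """Input-token cost of the repeated prompt prefix across a scan.

    The first call writes the prefix to the cache; every later call is
    assumed to hit it. Multipliers of 1.0 mean no caching discount.
    """
    if calls <= 0:
        return 0.0
    usd_per_token = usd_per_mtok / 1_000_000
    first_call = prefix_tokens * usd_per_token * cache_write_mult
    later_calls = (calls - 1) * prefix_tokens * usd_per_token * cache_read_mult
    return first_call + later_calls


CALLS = 250_000
full_definition = prefix_cost(CALLS, prefix_tokens=10_000, usd_per_mtok=1.00)
compact_brief = prefix_cost(CALLS, prefix_tokens=1_500, usd_per_mtok=1.00)
brief_cached = prefix_cost(CALLS, prefix_tokens=1_500, usd_per_mtok=1.00,
                           cache_write_mult=1.25, cache_read_mult=0.10)
print(f"{full_definition:,.2f}  {compact_brief:,.2f}  {brief_cached:,.2f}")
# 2,500.00  375.00  37.50
```

Under these placeholder numbers, the brief cuts the repeated prefix cost by about 85% and caching cuts what remains by about 90%; the real ratios depend on your definition length and your provider's price sheet.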

The gotcha that actually matters

Prompt caching only pays off if an identical prefix lands on the same backend twice. Some routing layers load-balance a single model name across multiple providers or regions that don't share a cache — so a stable, cacheable prefix keeps missing the cache, and every call quietly bills at full price. Pin the route to one deployment and the discount holds. It's a one-line fix with an outsized effect on a bill that's otherwise identical on paper.

5.Lever Five — Spend Attorney Hours on a Sample, Not the Whole Population

The cascade from Lever Three already tells you where the risk in the collection lives: the roughly 45% of documents that needed the deeper pass are, by construction, the hardest and most ambiguous share. Point full attorney review there. For the confidently resolved majority, a flat "read it all again" QC pass costs far more than the accuracy it adds. A calibrated statistical sample, sized off the model's own measured precision and recall on a gold set, estimates the error rate across the whole population, within a stated margin of error, without re-reviewing the whole population.

This is also where the AI cost curve and the human cost curve actually meet. Our own pricing benchmark put manual first-pass responsiveness review at roughly $1.50/document, nearly four times the entire 40¢ baseline. A 100% manual QC pass on top of an AI-assisted pipeline, billed at anything near that rate, reintroduces more cost than the first four levers just removed. Sampling, calibrated by the pipeline's own confidence bands, is what keeps the nickel a nickel end to end, not just on the model-inference line item.
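A minimal sketch of sizing that sample, using the standard normal-approximation formula for estimating a proportion, with a finite-population correction. The expected error rate would come from the gold-set evaluation; the 5% rate, the ±2-point margin, the 137,500-document population (55% of a 250,000-document matter, as if every document went through the cascade), and using the $1.50 first-pass rate as the QC rate are all illustrative assumptions. A stratified design, one sample per confidence band, applies the same arithmetic per stratum.

```python
import math


def qc_sample_size(population: int, expected_error_rate: float,
                   margin: float, z: float = 1.96) -> int:
    """Documents to sample so the estimated error rate lands within ±margin.

    z = 1.96 corresponds to ~95% confidence under a normal approximation.
    """
    p = expected_error_rate
    n0 = (z ** 2) * p * (1 - p) / margin ** 2   # infinite-population size
    n = n0 / (1 + (n0 - 1) / population)         # finite-population correction
    return math.ceil(n)


MANUAL_REVIEW_USD_PER_DOC = 1.50   # manual first-pass rate, used as a stand-in for QC
resolved = 137_500                 # hypothetical confidently resolved population

n = qc_sample_size(resolved, expected_error_rate=0.05, margin=0.02)
print(n)                                                          # 455
print(f"sampled QC: ${n * MANUAL_REVIEW_USD_PER_DOC:,.2f}")         # $682.50
print(f"100% QC:    ${resolved * MANUAL_REVIEW_USD_PER_DOC:,.2f}")  # $206,250.00
```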

6.The Bridge, Summarized

Stage	What changes	Combined cost, both checks
Baseline	aiR privilege at list price, plus one flagship model (claude-opus-4-8), one-shot, full text, for responsiveness	`$0.3813/doc`
+ Model selection	Swap both checks to the accuracy-per-dollar leader (DeepSeek V4 Pro)	`$0.0054/doc`
+ Deterministic pre-filter	58% of eligible documents cost $0 in model spend	`≈$0.003/doc` blended
+ Cascade & caching	Cheap metadata pass first; full text only for the ∼45% that need it; stable prefixes cached	<$0.05/doc, target met

Four architecture decisions, zero self-hosting, zero fine-tuning, and the combined AI cost of clearing both gates on a document lands comfortably under a nickel. This isn't a hypothetical target: it's consistent with the figure already baked into our public cost estimator, which plans AI review (classification, relevance, and privilege screening together) at a flat 1¢ per document.
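The first three rows of the bridge can be rechecked in a few lines. The prices and the 58% figure come from this post; treating every document as privilege-eligible and splitting the pre-filter savings between the two checks are our assumptions, so the sketch brackets the blended figure instead of pinning it. For responsiveness savings anywhere from zero up to the same 58%, the table's ≈$0.003 falls between the two bounds.

```python
AIR_PRIVILEGE = 0.30   # aiR privilege list price, $/doc
OPUS = 0.0813          # claude-opus-4-8, one-shot full text, $/doc per check
DEEPSEEK = 0.0027      # DeepSeek V4 Pro, one-shot full text, $/doc per check
PRIV_CLEARED = 0.58    # privilege-eligible share cleared with zero model calls

baseline = AIR_PRIVILEGE + OPUS        # aiR privilege + opus responsiveness
model_selection = 2 * DEEPSEEK         # same model on both checks

# Assumption A: the roster gate saves only privilege calls; responsiveness
# still pays for every document (no dedup or no-text savings at all).
prefilter_upper = DEEPSEEK * (1 - PRIV_CLEARED) + DEEPSEEK
# Assumption B: ingestion screening removes a similar 58% on responsiveness too.
prefilter_lower = 2 * DEEPSEEK * (1 - PRIV_CLEARED)

for label, cost in [("baseline", baseline),
                    ("model selection", model_selection),
                    ("pre-filter, upper", prefilter_upper),
                    ("pre-filter, lower", prefilter_lower)]:
    print(f"{label:<18} ${cost:.4f}/doc")
# baseline           $0.3813/doc
# model selection    $0.0054/doc
# pre-filter, upper  $0.0038/doc
# pre-filter, lower  $0.0023/doc
```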

7.A Worksheet: Pricing This Against Relativity for Your Own Matter

The formula underneath both platforms is the same, whether or not either vendor itemizes it that way:

Total AI review cost = (docs × privilege $/doc) + (docs × responsiveness $/doc) + (flagged docs × log $/entry, if billed separately)
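The same formula as a small Python function, for anyone who would rather script the comparison than fill in a spreadsheet. The parameter names are ours; leave the log terms at zero when drafting isn't billed separately.

```python
def total_ai_review_cost(docs: int,
                         privilege_usd_per_doc: float,
                         responsiveness_usd_per_doc: float,
                         flagged_docs: int = 0,
                         log_usd_per_entry: float = 0.0) -> float:
    """(docs × privilege $/doc) + (docs × responsiveness $/doc)
    + (flagged docs × log $/entry, if billed separately)."""
    return (docs * privilege_usd_per_doc
            + docs * responsiveness_usd_per_doc
            + flagged_docs * log_usd_per_entry)


# aiR privilege line only, at list price; responsiveness left at 0 until you
# have a quoted rate, and no separately billed log entries.
print(f"${total_ai_review_cost(250_000, 0.30, 0.0):,.2f}")  # $75,000.00
```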

Here's what to plug in on each side. The Relativity/aiR column below is only as good as the quote in front of you — we've sourced the one number that's publicly benchmarked (aiR's privilege pricing) and flagged everywhere else you need your own reseller's number, rather than guessing on your behalf:

Line item	Relativity + aiR (typical)	DecoverAI
Privilege review	aiR list price ≈ `$0.30/doc` — confirm your negotiated rate	`≈$0.003/doc` average across our ACP benchmark set
Responsiveness review	Not publicly benchmarked the way aiR is — get this figure from your vendor or reseller quote	Included at the same flat rate; publicly modeled at `1¢/doc`
Privilege log drafting	$10–$15/entry if billed as a distinct pass, on the flagged subset only	Included, no separate per-entry fee
Hosting	Varies by reseller; commonly $20–30/GB/month plus $200–800/user/month in seat fees	`$60/GB/month` flat, unlimited users
Billing model	Per-document, per-entry, and per-seat line items stack independently	One flat rate that scales with data volume, not headcount or document count

Worked at 250,000 documents / 100 GB (the same matter size we used in our pricing benchmark), the aiR privilege line alone runs $75,000 (250,000 × $0.30) before a single document is checked for responsiveness, before any log entries are drafted, and before hosting or seats are counted. On the DecoverAI side, both AI review passes and log generation are folded into that benchmark's $36,000 all-in total for the matter. You don't have to take our benchmark's word for either side of that: run your own document count and your own negotiated per-document rate through the formula above, or drop your numbers into the cost estimator for an instant side-by-side.
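The worked example, scripted with the same inputs. Two assumptions are ours and labeled in the code: the post doesn't state the hosting term, so we back it out of the $36,000 figure (100 GB × $60/GB/month × 6 months = $36,000), and the five-seat team on the Relativity side is hypothetical. The Relativity figure is a floor: it uses the low end of the hosting and seat ranges in the table above and still omits responsiveness and log drafting.

```python
DOCS = 250_000
GB = 100
MONTHS = 6   # assumption: 36,000 / (100 GB × $60/GB/month) = 6 months
SEATS = 5    # hypothetical review team size

# Relativity + aiR, low end of each range in the table above.
air_privilege = DOCS * 0.30                  # aiR list price, $/doc
relativity_hosting = GB * 20 * MONTHS        # $20/GB/month, low end
relativity_seats = SEATS * 200 * MONTHS      # $200/user/month, low end
relativity_floor = air_privilege + relativity_hosting + relativity_seats

# DecoverAI: flat $60/GB/month, unlimited users, AI review and logs included.
decoverai_all_in = GB * 60 * MONTHS

print(f"aiR privilege line: ${air_privilege:,.0f}")     # $75,000
print(f"Relativity floor:   ${relativity_floor:,.0f}")  # $93,000
print(f"DecoverAI all-in:   ${decoverai_all_in:,.0f}")  # $36,000
```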

8.Why We Think Small, Open Models Are the Next Lever

Everything above gets both checks under a nickel using hosted, proprietary APIs — no self-hosting, no training run, nothing beyond good architecture and honest benchmarking. We think the next real unlock isn't a better hosted model. It's a small, open-weight model, fine-tuned on your own matter's labels, that you can run yourself.

The case for it starts with what document review actually is: a narrow, repeatable classification task against a fixed definition, run over and over on structurally similar documents. That's a much better fit for supervised fine-tuning than open-ended chat is, and it means the training data isn't a new cost center. Your review process is already producing it for free. Every document the cascade escalates is a hard case the cheap pass couldn't resolve, paired with the stronger full-text model's answer; every attorney QC decision on a sampled document (Lever Five, above) is a gold-standard human label. Your pipeline and your team were already making both decisions; turning them into a fine-tuning dataset costs you the storage, not a new labeling operation.
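A minimal sketch of turning those review byproducts into supervised fine-tuning examples. The record fields, the label-precedence rule (an attorney's QC call overrides the full-text model's answer), and the prompt/completion JSON layout are our assumptions; adapt them to whatever your trainer expects.

```python
import json
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional


@dataclass
class ReviewRecord:
    doc_id: str
    summary: str                    # the compact summary the cheap pass saw
    escalated: bool                 # the cheap pass couldn't resolve it
    full_text_label: Optional[str]  # stronger model's answer, if escalated
    attorney_label: Optional[str]   # QC decision, if the doc was sampled


def sft_examples(records: Iterable[ReviewRecord], question: str) -> Iterator[dict]:
    """Yield examples only where the pipeline already produced a label
    stronger than the cheap pass's own guess."""
    for r in records:
        if r.attorney_label is not None:
            label, source = r.attorney_label, "attorney_qc"
        elif r.escalated and r.full_text_label is not None:
            label, source = r.full_text_label, "escalation"
        else:
            continue  # resolved cheaply and never checked: no independent label
        yield {
            "prompt": f"{question}\n\n{r.summary}",
            "completion": label,
            "source": source,
            "doc_id": r.doc_id,
        }


records = [
    ReviewRecord("D1", "Email from GC re: draft settlement terms", True, "privileged", None),
    ReviewRecord("D2", "Vendor invoice, no counsel on header", False, None, None),
    ReviewRecord("D3", "Board deck with one slide from outside counsel", True, "privileged", "not_privileged"),
]
for ex in sft_examples(records, "Is this document privileged?"):
    print(json.dumps(ex))  # D1 (escalation) and D3 (attorney_qc); D2 is skipped
```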

This is an active research track for us, not a shipped default: we're running supervised fine-tuning and DPO (direct preference optimization) experiments on open-weight 7–9B models — Qwen2.5-7B-Instruct, DeepSeek's 7B chat model, and GLM-4-9B — using LoRA adapters to distill a flagship model's privilege and responsiveness judgments into a fraction of the parameter count, and comparing the fine-tuned models' precision and recall against the base open-weight models and against the flagship teacher itself.
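Two small pieces of an experiment like that can be sketched in plain Python: building DPO preference pairs from cases where a gold label disagrees with a model's answer, and scoring any candidate (base, fine-tuned, or teacher) on precision and recall against the same gold set. The prompt/chosen/rejected field names follow the format used by common DPO tooling such as Hugging Face TRL; the bare label strings (real pairs would usually carry a rationale) and the disagreement rule are simplifying assumptions, not our experiments' exact setup.

```python
from typing import Iterable, Iterator, Sequence


def preference_pairs(examples: Iterable[dict]) -> Iterator[dict]:
    """One DPO pair per example where the model's answer disagrees with gold.

    Each example needs 'prompt', 'gold', and 'model_answer' keys.
    """
    for ex in examples:
        if ex["model_answer"] != ex["gold"]:
            yield {"prompt": ex["prompt"], "chosen": ex["gold"], "rejected": ex["model_answer"]}


def precision_recall(preds: Sequence[str], gold: Sequence[str],
                     positive: str) -> tuple[float, float]:
    """Precision and recall of `preds` against `gold` for one positive label."""
    if len(preds) != len(gold):
        raise ValueError("preds and gold must be the same length")
    tp = sum(p == positive and g == positive for p, g in zip(preds, gold))
    fp = sum(p == positive and g != positive for p, g in zip(preds, gold))
    fn = sum(p != positive and g == positive for p, g in zip(preds, gold))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


gold = ["privileged", "not_privileged", "privileged", "not_privileged"]
student = ["privileged", "privileged", "not_privileged", "not_privileged"]
print(precision_recall(student, gold, positive="privileged"))  # (0.5, 0.5)

examples = [{"prompt": f"doc {i}", "gold": g, "model_answer": s}
            for i, (g, s) in enumerate(zip(gold, student))]
print(len(list(preference_pairs(examples))))  # 2
```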

Why "cheaper" isn't the whole argument

The first four levers already push the hosted-API cost of both checks below a penny in the best case. A fine-tuned small model you run yourself isn't primarily about beating that number further; it's about what stops moving. Inference becomes a fixed compute cost you control, not a variable line item tied to a frontier lab's next price change, deprecation notice, or rate limit. And privileged content never has to leave your environment to be scored by someone else's model in the first place, which matters as much to a general counsel weighing waiver risk as it does to a CFO weighing a bill.

That's the framing we keep coming back to: outside counsel sets a litigation budget before document one gets reviewed. A review cost that scales with a third party's API pricing is a budget nobody can actually hold to for the life of a matter. A review cost that's a fixed, owned, self-hosted compute expense — trained on labels your own team already generates as a byproduct of doing the work — is one they can. That's the bet we're placing our research effort on.

9.Conclusion

Forty cents a document is what you get by default: one frontier model, one shot, full text, no pre-filtering, no cascade, no caching. None of that is a law of physics; it's a starting point nobody has optimized yet. Five ordered decisions get you under a nickel: pick the model that actually earns its price, refuse to spend a model call on what a lookup can already answer, read the metadata before the document, cache the prompt content that never changes, and sample your QC instead of duplicating it. A sixth, where the volume and the stakes justify it, is to own the model outright by fine-tuning a small open-weight one on labels you were already generating. The first five are available today. The sixth is where we're headed next.

Free Download

Cost Estimation Worksheet: the formula, a fill-in comparison table, and the worked 250,000-document example from this post, in one printable page — so you can run the DecoverAI-vs.-Relativity math on your own matter.

Want to run this math against your own matter — document count, current AI review spend, and what it would cost on a flat rate? Book a session with our technical team, or start with the cost estimator.