When a legal team asks which AI model to use for privilege review, the answer they usually get is the name of whatever flagship model their vendor happens to support. The implicit assumption is that the best model — the most capable, the most expensive — produces the best results.
Our benchmark data says otherwise. Across nine models spanning a 68-fold cost range, the most expensive model tested scored lower than models costing 30× less. The performance ceiling is not set by the model — it is set by how the review pipeline is architected. Once that structure is right, a more capable (and more expensive) model has little left to contribute.
This post explains what the data shows, why privilege review has a distinct accuracy calculus compared to responsiveness, and how to select a model that meets your accuracy requirements without leaving money on the table.
1. The Question You Actually Need to Answer
Most model selection conversations start from the wrong question. Teams ask: which model is most capable? The right question is: which model produces the best outcome for this specific classification task at acceptable cost?
Those two questions have different answers. General-purpose capability — measured by coding benchmarks, reasoning tests, and MMLU scores — does not translate directly to legal document classification accuracy. The privilege review task is structured, binary, and constrained by a written protocol. A model’s ability to reason about chemistry or write poetry does not predict its ability to apply attorney-client privilege criteria to a contract attachment.
For responsiveness review, the question is: does this document contain information related to the matter under a written definition? The cost of an error is primarily downstream review labor.
For privilege review, the question is: does this document contain attorney-client communications or work product that should be withheld? The cost of an error has legal consequences. A false negative — producing a document that should have been withheld — may constitute an inadvertent waiver. A false positive — withholding a document that is not actually privileged — can expose you to sanctions for improper withholding and inflates the privilege log unnecessarily.
This makes precision more consequential in privilege review than in most document classification tasks. When the model calls something privileged, it should be right. But recall still matters: missing a genuinely privileged document creates inadvertent waiver risk.
With that framing in place, here is what nine models actually produce on a controlled document classification benchmark.
2. What the Benchmark Shows
Decover’s Working Paper 2026-02 benchmarks nine large language models — from Alibaba, DeepSeek, MiniMax, Moonshot AI, and Anthropic — on a fixed 100-document responsiveness classification task using a gold-labeled controlled set. The same production review pipeline runs for each model; only the underlying model changes.
This is a responsiveness benchmark, not a privilege benchmark. Privilege review is a harder, more judgment-intensive task, and model rankings may differ. But we expect the structural finding to generalize: cost does not predict accuracy, and a well-designed pipeline reaches a performance plateau that model spend cannot break through. We are building a privilege-specific benchmark; until then, the results here are the best empirical evidence currently available.
Figure 1. Cost vs. F1 across nine models on a 100-document gold-labeled sample. Upper-left is the prize: higher accuracy at lower cost. The performance plateau is visible: eight of nine models cluster in an 11-point F1 band (0.76–0.87) across a 68× cost range. The most expensive model (Claude Opus, far right) scores below models costing 30× less. Source: Decover Research Working Paper 2026-02.
The headline result: eight of the nine models produce F1 scores between 0.76 and 0.87, an 11-point band, across a cost range from $0.0012 to $0.0813 per document. More expensive models do not land at the top of that band. The best performer overall (Qwen 3.6 Plus, F1 = 0.87) costs $0.0205/doc. The most expensive model tested (claude-opus-4-8, $0.0813/doc) scores F1 = 0.76, the lowest of the eight; only the outlier, claude-sonnet-4-6, scores lower.
| Model | F1 | Precision | Recall | Accuracy | $/doc | Cost: 100k docs |
|---|---|---|---|---|---|---|
| Qwen 3.6 Plus (Best F1) | 0.868 | 0.885 | 0.852 | 0.930 | $0.0205 | $2,050 |
| MiniMax M3 | 0.830 | 0.846 | 0.815 | 0.910 | $0.0196 | $1,960 |
| claude-haiku-4-5 | 0.824 | 0.875 | 0.778 | 0.910 | $0.0195 | $1,950 |
| DeepSeek V4 Pro (Best value) | 0.815 | 0.815 | 0.815 | 0.900 | $0.0027 | $270 |
| DeepSeek V4 Flash | 0.793 | 0.808 | 0.778 | 0.890 | $0.0012 | $120 |
| Kimi K2.6 | 0.769 | 0.800 | 0.741 | 0.880 | $0.0203 | $2,030 |
| Kimi K2.7 Code | 0.769 | 0.800 | 0.741 | 0.880 | $0.0195 | $1,950 |
| claude-opus-4-8 (Highest cost) | 0.760 | 0.826 | 0.704 | 0.880 | $0.0813 | $8,130 |
| claude-sonnet-4-6 (Precision risk) | 0.612 | 0.448 | 0.963 | 0.670 | $0.0385 | $3,850 |
n = 100 (27 positive, 73 negative). Single run — indicative, not audited. Cost is metered from real token usage at June 2026 prices. Source: Decover Research Working Paper 2026-02.
3. Why the Expensive Model Underperforms
The gap between claude-opus-4-8 ($0.0813/doc, F1 = 0.76) and DeepSeek V4 Pro ($0.0027/doc, F1 = 0.81) is not a fluke. It reflects a structural reality about how AI review accuracy works.
The accuracy ceiling on this type of task is set by the review architecture, not by model capability. Decover’s pipeline is a two-step process: Step 1 reads the document once and builds a structured understanding of its content. Step 2 evaluates that structured understanding against the classification criteria. Uncertain cases escalate for a more careful second look before any decision is committed.
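The shape of that two-step flow can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Decover's implementation: the names `review`, `Summary`, `Decision`, and `second_look` are hypothetical, the evaluator is assumed to return a confidence score between 0 and 1, and the 0.8 escalation threshold is an arbitrary placeholder. The toy functions stand in for model calls so the sketch runs on its own.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass
class Summary:
    """Step 1 output: a structured reading of one document."""
    doc_id: str
    facts: Dict[str, bool]


@dataclass
class Decision:
    doc_id: str
    label: bool        # True = meets the classification criteria
    confidence: float  # assumed 0.0-1.0 score returned by the evaluator
    escalated: bool


Verdict = Tuple[bool, float]


def review(
    doc_id: str,
    text: str,
    understand: Callable[[str, str], Summary],
    evaluate: Callable[[Summary], Verdict],
    second_look: Callable[[Summary], Verdict],
    threshold: float = 0.8,  # illustrative cutoff, not a published value
) -> Decision:
    summary = understand(doc_id, text)     # Step 1: read the document once
    label, confidence = evaluate(summary)  # Step 2: apply the criteria
    if confidence < threshold:             # uncertain: escalate before committing
        label, confidence = second_look(summary)
        return Decision(doc_id, label, confidence, escalated=True)
    return Decision(doc_id, label, confidence, escalated=False)


# Toy stand-ins so the sketch runs without any model behind it.
def toy_understand(doc_id: str, text: str) -> Summary:
    lowered = text.lower()
    return Summary(doc_id, {
        "mentions_counsel": "counsel" in lowered,
        "mentions_advice": "advice" in lowered,
    })


def toy_evaluate(summary: Summary) -> Verdict:
    hits = sum(summary.facts.values())
    # 0 signals: confident no; 2 signals: confident yes; 1 signal: unsure
    return {0: (False, 0.95), 1: (True, 0.55), 2: (True, 0.95)}[hits]


def toy_second_look(summary: Summary) -> Verdict:
    # The stricter pass requires every signal to be present.
    return all(summary.facts.values()), 0.85


clear = review("d1", "Advice from counsel on the merger",
               toy_understand, toy_evaluate, toy_second_look)
unsure = review("d2", "Lunch with outside counsel",
                toy_understand, toy_evaluate, toy_second_look)
print(clear)
print(unsure)
```

The point of the structure is that the model behind `understand` and `evaluate` can be swapped without touching the control flow, which is what allows a like-for-like comparison across models.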
Once this structure is in place, a frontier model has limited remaining signal to contribute. The architecture handles clear cases correctly regardless of model tier. The incremental value of a more capable model concentrates in genuinely ambiguous documents — a small fraction of any realistic collection. The result is the performance plateau visible in the chart: most models land in roughly the same accuracy band.
The implication for buyers is direct: you are not paying for accuracy when you choose a more expensive model. You are paying for a brand name on a review that will produce similar results at many times the cost.
4. The Precision Problem: One Model to Watch
One result in the benchmark stands apart and deserves specific attention before recommending any model for privilege review.
claude-sonnet-4-6 achieves recall = 96.3% — the highest in the study. It found 26 of 27 responsive documents. But it did so by flagging 58 documents as responsive when only 27 actually are. That is 32 false positives out of 73 non-responsive documents — a false positive rate of 43.8%.
In a responsiveness review, this means wasted human review time. In a privilege review, the implications are more serious. If the model were identifying privileged documents, the figure that matters is precision, not the false positive rate. The precision for this model is 0.448: fewer than half of its “privileged” calls would be correct, and the rest would be improperly withheld unless caught by attorney QC. A privilege log built on this output would contain a material proportion of entries that should not be there, creating sanctions exposure and inviting challenges from opposing counsel.
This is disqualifying for privilege review without robust attorney review of every call.
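The counts quoted above for claude-sonnet-4-6 (26 of 27 responsive documents found, 58 documents flagged, 32 false positives among 73 non-responsive documents) are enough to rebuild its row of the table. The short Python sketch below does that arithmetic and separates two figures that are easy to confuse: the false positive rate (share of negatives wrongly flagged) and the false discovery rate (share of positive calls that are wrong, which is 1 minus precision). The function name `confusion_metrics` is illustrative.

```python
def confusion_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Standard binary classification metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "false_positive_rate": fp / (fp + tn),   # wrong flags / all true negatives
        "false_discovery_rate": fp / (tp + fp),  # wrong flags / all flags
    }


# claude-sonnet-4-6 on the 100-document sample:
# 26 true positives, 32 false positives, 1 missed document, 41 true negatives.
sonnet = confusion_metrics(tp=26, fp=32, fn=1, tn=41)
for name, value in sonnet.items():
    print(f"{name}: {value:.3f}")
```

The false positive rate comes out at 0.438 and the false discovery rate at 0.552. The second number is the one a privilege log inherits, because it describes the documents the model actually flagged.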
This failure mode — extreme recall at the cost of precision — reflects a model that treats uncertainty as a reason to include rather than a reason to investigate. That instinct is reasonable for certain safety-critical tasks, but privilege review is not one of them. Over-withholding on privilege grounds draws scrutiny and sanctions; an imprecise privilege log is a liability, not a safeguard.
5. Reading the Precision-Recall Tradeoff for Privilege
Privilege review has a different error asymmetry than responsiveness review. The breakdown below maps the two error types to their legal consequences:
False Negative (missed privilege): A document that should be withheld as privileged is produced. Risk of inadvertent waiver under FRE 502. May require a clawback under a protective order, but waiver risk remains. This is a recall error.
False Positive (improper withholding): A document that is not privileged is withheld. Incorrect privilege log entries, potential sanctions for improper withholding, and opposing counsel challenges. This is a precision error.
The practical conclusion: Both errors are consequential, but the proportionality calculus favors precision. A reasonable recall rate (75–85%) with high precision produces a defensible privilege log. High recall at very low precision produces a privilege log that opposing counsel can attack entry by entry.
On this framing, the models that perform best for privilege review are those with high precision and reasonable recall — not those that maximize recall at any cost. Looking at the benchmark data through this lens:
- Qwen 3.6 Plus has the highest precision of any model (0.885) and the highest recall outside the claude-sonnet-4-6 outlier (0.852). Well-balanced and well-calibrated.
- claude-haiku-4-5 posts the second-highest precision at 0.875, with recall = 0.778. Strong precision profile at a mid-range cost.
- DeepSeek V4 Pro shows perfectly balanced precision and recall (both 0.815) at a fraction of the cost of the top models. This balance is a good signal for privilege review — the model is not systematically biased in either direction.
- claude-sonnet-4-6 should be avoided for privilege review without extensive attorney QC. Precision = 0.448 means more than half of its privilege calls are wrong.
- claude-opus-4-8 has reasonable precision (0.826) but the lowest recall of any model in the study (0.704), meaning it misses more privileged documents than the better-value options, at 30× the cost of DeepSeek V4 Pro.
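The same reading can be applied mechanically to the benchmark table. The Python sketch below is one possible screen, not a prescribed method: it keeps models with precision of at least 0.75 (the floor this post recommends for privilege review) and recall of at least 0.75 (the low end of the "reasonable recall" range above), then sorts the survivors by cost. The `shortlist` function and the choice to sort by cost are illustrative assumptions, and the figures are the responsiveness results from the table.

```python
from typing import List, NamedTuple


class ModelResult(NamedTuple):
    name: str
    precision: float
    recall: float
    usd_per_doc: float


# Precision, recall, and $/doc copied from the benchmark table.
BENCHMARK = [
    ModelResult("Qwen 3.6 Plus", 0.885, 0.852, 0.0205),
    ModelResult("MiniMax M3", 0.846, 0.815, 0.0196),
    ModelResult("claude-haiku-4-5", 0.875, 0.778, 0.0195),
    ModelResult("DeepSeek V4 Pro", 0.815, 0.815, 0.0027),
    ModelResult("DeepSeek V4 Flash", 0.808, 0.778, 0.0012),
    ModelResult("Kimi K2.6", 0.800, 0.741, 0.0203),
    ModelResult("Kimi K2.7 Code", 0.800, 0.741, 0.0195),
    ModelResult("claude-opus-4-8", 0.826, 0.704, 0.0813),
    ModelResult("claude-sonnet-4-6", 0.448, 0.963, 0.0385),
]


def shortlist(
    results: List[ModelResult],
    min_precision: float = 0.75,
    min_recall: float = 0.75,
) -> List[ModelResult]:
    """Keep models that clear both floors, cheapest first."""
    keep = [
        r for r in results
        if r.precision >= min_precision and r.recall >= min_recall
    ]
    return sorted(keep, key=lambda r: r.usd_per_doc)


candidates = shortlist(BENCHMARK)
for r in candidates:
    print(f"{r.name}: P={r.precision:.3f} R={r.recall:.3f} ${r.usd_per_doc:.4f}/doc")
```

With these floors, five models remain. claude-sonnet-4-6 drops out on precision, and claude-opus-4-8 and the two Kimi models drop out on recall. Tightening or loosening either floor is a judgment call for the matter at hand.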
6. A Decision Framework for Model Selection
Model selection has no single right answer. It depends on matter size, richness, QC budget, and risk tolerance. Here is a framework that maps those parameters to model choices:
claude-opus-4-8: F1 = 0.76 at $0.0813/doc. This is the worst accuracy-per-dollar in the study. It costs $8,130 per 100,000 documents while underperforming models that cost $270 for the same volume. Unless your platform vendor requires it and offers no alternative, the data does not justify the premium.
claude-sonnet-4-6 for privilege without QC: Precision = 0.448. Using this model for privilege classification without attorney review of every positive call creates an imprecise privilege log and substantial sanctions exposure.
7. What Model Spend Looks Like at Scale
The cost difference between models is not marginal — it is structural. At 100,000 documents, the spread between the value operating point and the most expensive option is $7,860 in model inference alone. That money either goes to a vendor’s margin or funds attorney QC time that meaningfully reduces waiver risk.
Extrapolated linearly from per-document metered costs at June 2026 prices. Higher F1 bar shown for context: Qwen leads at 0.87, claude-opus at 0.76.
The practical implication: if you are running 100,000 documents through privilege review, the $7,860 difference between DeepSeek V4 Pro and claude-opus-4-8 funds roughly 52 hours of contract attorney review at $150/hour — enough to QC every case the model escalated as uncertain, and then some. The savings do not reduce quality. At this scale, they improve it.
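The arithmetic behind those figures is simple enough to rerun at your own volume. The Python sketch below uses the per-document costs from the benchmark table and the $150/hour contract attorney rate quoted above; the function names are illustrative, and the linear extrapolation is the same assumption the post makes.

```python
def inference_cost(usd_per_doc: float, n_docs: int) -> float:
    """Model spend, extrapolated linearly from a metered per-document cost."""
    return usd_per_doc * n_docs


def qc_hours_funded(
    value_rate: float,
    premium_rate: float,
    n_docs: int,
    attorney_usd_per_hour: float,
) -> tuple:
    """Return (savings in USD, attorney QC hours those savings would buy)."""
    savings = inference_cost(premium_rate, n_docs) - inference_cost(value_rate, n_docs)
    return savings, savings / attorney_usd_per_hour


DEEPSEEK_V4_PRO = 0.0027   # $/doc, from the benchmark table
CLAUDE_OPUS_4_8 = 0.0813   # $/doc, from the benchmark table

savings, hours = qc_hours_funded(DEEPSEEK_V4_PRO, CLAUDE_OPUS_4_8, 100_000, 150.0)
print(f"100,000 docs: ${savings:,.0f} saved, about {hours:.1f} QC hours")

# The same calculation at other matter sizes.
for n_docs in (1_000, 500_000):
    s, h = qc_hours_funded(DEEPSEEK_V4_PRO, CLAUDE_OPUS_4_8, n_docs, 150.0)
    print(f"{n_docs:,} docs: ${s:,.0f} saved, about {h:.1f} QC hours")
```

At 100,000 documents this reproduces the $7,860 spread and roughly 52 hours of review time. Swapping in your own document count and attorney rate shows quickly whether the model choice matters for a given matter.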
8. The Architecture Floor
The performance plateau has an important corollary: the review pipeline architecture sets a floor that model selection cannot sink below, and a ceiling that model selection cannot break through.
If a review system reads each document from scratch for each question, applies no structured understanding, and uses no escalation mechanism for uncertain cases, accuracy will be lower than what the benchmark shows, regardless of which model is underneath it. The plateau exists because Decover’s two-step pipeline (document understanding followed by criteria evaluation, with escalation on uncertainty) creates a stable performance floor. Model selection then determines where within the plateau you land.
This is the correct way to think about AI privilege review: architecture first, model second. A well-designed system at the value operating point outperforms a poorly designed system running the most expensive model available.
If your platform can hit F1 ≈ 0.81–0.87 on a controlled validation sample, you are in the performance band where model selection is a cost decision, not an accuracy decision. Select the model with the cost profile that matches your matter economics, and allocate the savings to attorney QC on escalated cases.
If your platform cannot demonstrate performance in this band on your document type, the gap is in the architecture — no model upgrade will close it.
9. Applying This to Your Next Privilege Review
Here is what a practical model selection conversation looks like, applied to the benchmark data:
- Ask your vendor for a controlled validation run before committing to a model. Run the production pipeline over a gold-labeled sample of your actual document population. Any platform that cannot produce this validation has not done the work to know how well its system performs on your matter.
- Look at precision, not just recall. A vendor quoting 95% recall is a red flag if they do not also quote precision. claude-sonnet-4-6 achieves 96% recall at 44.8% precision, a combination that would produce a privilege log full of improper entries and the sanctions exposure that comes with them.
- Run the cost math at your volume. The difference between $0.003/doc and $0.08/doc is irrelevant on a 1,000-document matter and decisive on a 500,000-document set. Know your volume before accepting a vendor’s default model choice.
- Budget the model savings into attorney QC. Lower model cost is not an excuse to reduce quality assurance — it is a reason to fund more of it. The cases a well-designed system escalates as uncertain are exactly the ones that need attorney eyes. The savings fund that work.
- Re-run on each new matter type. Model performance varies with document collection, privilege definition specificity, and richness rate. A model that performs well on commercial litigation emails may behave differently on healthcare records or international matters. Validate each time.
10. Conclusion
The benchmark result is clear: on controlled document classification, model price does not predict performance. Eight of nine models land within an 11-point F1 band regardless of cost. The most expensive model tested scores below models costing 30× less. For privilege review specifically, the precision profile matters as much as the headline accuracy figure — and one model in the study (claude-sonnet-4-6) has a precision score that is effectively disqualifying for unsupervised privilege classification.
The operating recommendation is to use Qwen 3.6 Plus if you need maximum accuracy, DeepSeek V4 Pro if you are optimizing for cost at acceptable quality, and to avoid any model with precision below 0.75 for privilege review without attorney review of every positive call. At 100,000 documents, the savings between the value and the premium operating points fund roughly 52 hours of attorney QC time at $150/hour, which materially reduces waiver risk. That is a better use of the money than a model that underperforms despite its price.
The full benchmark methodology and working paper are available at decover.ai/research/model-benchmark-2026. To see a controlled validation run on your document type, book a session with our technical team.