Nine Models, One Benchmark: Emerging Frontier AI on Legal Responsiveness Review

DecoverAI Research · Internal Benchmark Study · June 2026 · Single run — indicative, not audited

Abstract

We evaluate the cost and accuracy of automated responsiveness review across nine large language models from Alibaba (Qwen), DeepSeek, MiniMax, Moonshot AI (Kimi), and Anthropic (Claude) on a fixed, gold-labeled 100-document sample drawn from a 500-document controlled set (27 responsive, 73 not responsive; 27% richness). Holding the document set, responsiveness definition, and review logic constant while varying only the underlying model, we find that F1 on the responsive class ranges from 0.61 to 0.87, with a 68× cost spread from $0.0012 to $0.0813 per document. The best performer (Qwen 3.6 Plus, F1 = 0.87) costs $0.0205/doc; the most expensive model tested (claude-opus-4-8, $0.0813/doc) scores F1 = 0.76. DeepSeek V4 Pro achieves F1 = 0.81 at $0.0027/doc, the best value among models scoring above F1 = 0.80; DeepSeek V4 Flash is cheaper still ($0.0012/doc) at F1 = 0.79. One model (claude-sonnet-4-6) exhibits extreme recall bias: recall = 0.963 at precision = 0.448, yielding F1 = 0.61. These results extend our prior 11-model benchmark and support its finding that model selection, not model tier, determines review accuracy for this task.

1. Background

The comfortable assumption in legal AI procurement is that capability tracks price. Buy the most capable model, get the best document review. It is intuitive, and on responsiveness review the data says it is wrong — consistently.

Our prior benchmark across eleven models from OpenAI, Google, and Anthropic on a gold-labeled 498-document set found that accuracy plateaus near F1 0.86 and does not improve past a certain cost threshold. The most expensive model tested (gpt-5.4-pro, ~$0.93/doc) scored lower than a model costing 54× less. That study covered models already well-known to legal technology buyers.

This study extends the benchmark to the emerging frontier: Alibaba’s Qwen, DeepSeek, MiniMax, and Moonshot AI’s Kimi. These models have moved rapidly from research artifacts to production-grade systems in the past year and are increasingly available via standard API. The question is whether they meet the accuracy bar for legal document review — and at what cost.

Responsiveness review is the highest-volume classification judgment in discovery: for each document, does it contain information responsive to the matter under a written definition? It is also where both cost and recall matter most. Miss a responsive document and you have a disclosure problem. Re-read every document with an expensive model and you have a budget problem.

2. Materials & Methods

2.1 Dataset. The benchmark runs over a fixed 100-document sample drawn from a 500-document gold-labeled controlled set (seed 42). The gold set contains 127 responsive and 373 not-responsive documents (25.4% richness). The sampled 100 documents include 27 responsive and 73 not-responsive documents. Every model’s output is scored against the same human-validated labels.

2.2 Task. A binary responsiveness decision per document, made against a single written responsiveness definition held identical across all nine runs.

2.3 System under test. We do not test a simplified or “benchmark” mode. The same production responsiveness review pipeline is pointed at the gold sample for each model run, and both of its steps (document understanding and responsiveness evaluation) use the same model within a run.

2.4 Scoring. We report precision, recall, F1, and accuracy on the 100-document sample. F1 on the responsive class is the headline metric. Recall is tracked separately because a missed responsive document is the costlier error in discovery.
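The four reported metrics follow directly from the confusion counts listed in the results table. A minimal sketch, assuming the responsive class is treated as the positive class; the `Confusion` class name is ours, not part of the production pipeline:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    """Confusion-matrix counts for the responsive (positive) class."""
    tp: int
    fp: int
    fn: int
    tn: int

    @property
    def precision(self) -> float:
        flagged = self.tp + self.fp
        return self.tp / flagged if flagged else 0.0

    @property
    def recall(self) -> float:
        positives = self.tp + self.fn
        return self.tp / positives if positives else 0.0

    @property
    def f1(self) -> float:
        # Algebraically equal to the harmonic mean of precision and recall.
        denom = 2 * self.tp + self.fp + self.fn
        return 2 * self.tp / denom if denom else 0.0

    @property
    def accuracy(self) -> float:
        total = self.tp + self.fp + self.fn + self.tn
        return (self.tp + self.tn) / total if total else 0.0


# Counts taken from the Qwen 3.6 Plus row of the results table.
qwen = Confusion(tp=23, fp=3, fn=4, tn=70)
print(f"P={qwen.precision:.3f} R={qwen.recall:.3f} "
      f"F1={qwen.f1:.3f} Acc={qwen.accuracy:.3f}")
# P=0.885 R=0.852 F1=0.868 Acc=0.930
```

Accuracy counts the 73 not-responsive documents as well, which is why it stays high even when the responsive-class F1 is weak.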

2.5 Cost measurement. Dollar figures are computed from metered token usage multiplied by live provider prices at the time of the run, with any caching discounts applied. The cost per document is the bill paid, not a list-price estimate.
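As a rough illustration of what "metered" means here, the sketch below prices each token class at its own rate. The field names, token counts, and prices are all placeholders chosen for the example; none of them come from the study or from any provider's price list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Usage:
    """Token counts metered for one document (hypothetical field names)."""
    input_tokens: int         # uncached prompt tokens
    cached_input_tokens: int  # prompt tokens billed at the cached rate
    output_tokens: int


@dataclass(frozen=True)
class Price:
    """USD per one million tokens. The values used below are placeholders."""
    input_per_m: float
    cached_input_per_m: float
    output_per_m: float


def cost_usd(usage: Usage, price: Price) -> float:
    """Billable cost: each token class multiplied by its own rate."""
    return (
        usage.input_tokens * price.input_per_m
        + usage.cached_input_tokens * price.cached_input_per_m
        + usage.output_tokens * price.output_per_m
    ) / 1_000_000


# Placeholder numbers for illustration only; not taken from the study.
usage = Usage(input_tokens=4_000, cached_input_tokens=6_000, output_tokens=500)
price = Price(input_per_m=1.00, cached_input_per_m=0.10, output_per_m=4.00)
print(f"${cost_usd(usage, price):.4f} per document")
```

Summing this quantity over both pipeline steps and dividing by the number of documents would give a per-document figure of the kind reported in the results table.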

2.6 Controls. Document set, responsiveness definition, and review logic are held constant. Only the underlying model varies. Every data point is the same test with a different engine.

One interpretive note: this is a single run from a 100-document sample. The sample provides approximately ±15-percentage-point uncertainty on recall estimates at the 95% confidence level (27 positive cases). Results are indicative of ordering and relative performance, not precision-audited population figures.
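The ±15-point figure can be checked with a standard binomial interval. The sketch below computes the normal-approximation half-width and, as an alternative we add here for comparison, the Wilson score interval, which tends to behave better at small n. The function names are ours.

```python
import math


def normal_half_width(p: float, n: int, z: float = 1.96) -> float:
    """Half-width of the normal-approximation interval for a proportion."""
    return z * math.sqrt(p * (1 - p) / n)


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


n_responsive = 27
# Example: a model that finds 22 of the 27 responsive documents (recall 0.815).
print(f"normal half-width: {normal_half_width(22 / 27, n_responsive):.3f}")
low, high = wilson_interval(22, n_responsive)
print(f"Wilson 95% interval: {low:.3f} to {high:.3f}")
```

At recall near 0.8 the normal-approximation half-width comes out just under 0.15, in line with the figure quoted above. Two models whose recall differs by one or two documents out of 27 are therefore not distinguishable on this sample.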

3. Results

Figure 1. Cost vs. F1 across nine models on the 100-document responsiveness sample. Upper-left is the prize: higher accuracy at lower cost. Qwen 3.6 Plus leads (F1 = 0.87, $0.020/doc). DeepSeek V4 Pro (F1 = 0.81, $0.003/doc) is the best value above F1 = 0.80. claude-sonnet-4-6 sits low and to the right, with the second-highest cost and the lowest F1, due to extreme recall bias (see Section 4). Cost is metered from real token usage.

0.87: peak F1 (Qwen 3.6 Plus at $0.020/doc)

30×: DeepSeek V4 Pro cost advantage over Claude Opus at comparable F1

0.45: claude-sonnet-4-6 precision; more than half of its “responsive” flags are wrong

The cluster picture is striking. Eight of nine models land in a narrow F1 band of 0.76–0.87, an 11-point spread across a 68× cost range. Two results stand out for different reasons: DeepSeek V4 Flash sits inside that band at F1 = 0.79 while costing only $0.0012/doc (the cheapest in the study, and still solidly accurate); claude-sonnet-4-6 falls well below it at F1 = 0.61 due to a qualitatively different failure mode (discussed in Section 4).

Model	F1	Precision	Recall	Accuracy	$/doc	TP	FP	FN	TN
Qwen 3.6 Plus (Best F1)	0.868	0.885	0.852	0.930	$0.0205	23	3	4	70
MiniMax M3	0.830	0.846	0.815	0.910	$0.0196	22	4	5	69
claude-haiku-4-5	0.824	0.875	0.778	0.910	$0.0195	21	3	6	70
DeepSeek V4 Pro (Best value)	0.815	0.815	0.815	0.900	$0.0027	22	5	5	68
DeepSeek V4 Flash	0.793	0.808	0.778	0.890	$0.0012	21	5	6	68
Kimi K2.6	0.769	0.800	0.741	0.880	$0.0203	20	5	7	68
Kimi K2.7 Code	0.769	0.800	0.741	0.880	$0.0195	20	5	7	68
claude-opus-4-8	0.760	0.826	0.704	0.880	$0.0813	19	4	8	69
claude-sonnet-4-6 (Outlier)	0.612	0.448	0.963	0.670	$0.0385	26	32	1	41

n = 100 (27 responsive, 73 not-responsive). Single run — indicative, not audited. Cost is metered from real token usage.
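One way to read the table against the "upper-left is the prize" framing is to compute which models are not dominated, meaning no other model is both cheaper and more accurate. The sketch below is our own illustration using the F1 and cost columns from the table; it is not part of the benchmark tooling.

```python
# (model, F1, USD per document) copied from the results table.
RESULTS = [
    ("Qwen 3.6 Plus", 0.868, 0.0205),
    ("MiniMax M3", 0.830, 0.0196),
    ("claude-haiku-4-5", 0.824, 0.0195),
    ("DeepSeek V4 Pro", 0.815, 0.0027),
    ("DeepSeek V4 Flash", 0.793, 0.0012),
    ("Kimi K2.6", 0.769, 0.0203),
    ("Kimi K2.7 Code", 0.769, 0.0195),
    ("claude-opus-4-8", 0.760, 0.0813),
    ("claude-sonnet-4-6", 0.612, 0.0385),
]


def pareto_frontier(rows):
    """Return models not dominated on (lower cost, higher F1), cheapest first."""
    frontier = []
    best_f1 = float("-inf")
    # Walk from cheapest to most expensive; at equal cost, higher F1 first.
    for name, f1, cost in sorted(rows, key=lambda r: (r[2], -r[1])):
        if f1 > best_f1:
            frontier.append((name, f1, cost))
            best_f1 = f1
    return frontier


for name, f1, cost in pareto_frontier(RESULTS):
    print(f"{name:<18} F1={f1:.3f}  ${cost:.4f}/doc")
```

On this sample the frontier runs from DeepSeek V4 Flash through DeepSeek V4 Pro, claude-haiku-4-5, and MiniMax M3 to Qwen 3.6 Plus. Given the sampling uncertainty, membership among the closely spaced models near two cents per document should be read as indicative only.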

4. The Sonnet Anomaly: Extreme Recall at the Cost of Precision

claude-sonnet-4-6 deserves its own section because it is not simply a poor performer — it is a qualitatively different kind of result.

The Sonnet Profile

Recall: 96.3%. The model found 26 of 27 responsive documents — the highest recall in the study. Only one responsive document was missed.

Precision: 44.8%. The model flagged 58 documents as responsive, and 32 of those flags were wrong. The sample contains only 27 responsive documents in total.

F1: 0.61. The harmonic mean of precision and recall. High recall cannot compensate for precision this low.

What this looks like in production: On a 10,000-document set with 27% richness, this model would flag approximately 5,800 documents for human review. About 2,600 of those would be genuinely responsive, and roughly 100 responsive documents would still be missed. Reviewers would process about 3,200 false positives (wasted review effort) to recover those 2,600. At $150/hour reviewer rates, that is a substantial cost to recover documents a higher-precision model would have surfaced more cleanly.
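The production projection is a linear scaling of the sample confusion matrix. A minimal sketch, assuming the larger collection has the same richness and the same error rates as the 100-document sample (the function name is ours):

```python
def project_counts(tp: int, fp: int, fn: int, tn: int, collection_size: int) -> dict[str, int]:
    """Scale sample confusion counts linearly to a larger collection.

    Assumes the collection has the same richness and error rates as the sample.
    """
    scale = collection_size / (tp + fp + fn + tn)
    return {
        "flagged": round((tp + fp) * scale),
        "true_positives": round(tp * scale),
        "false_positives": round(fp * scale),
        "missed": round(fn * scale),
    }


# claude-sonnet-4-6 counts from the results table, projected to 10,000 documents.
print(project_counts(tp=26, fp=32, fn=1, tn=41, collection_size=10_000))
# {'flagged': 5800, 'true_positives': 2600, 'false_positives': 3200, 'missed': 100}
```

Because the projection inherits the sample's uncertainty, these counts are order-of-magnitude guides rather than forecasts.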

The model appears to have maximized recall as a primary objective, treating uncertainty as a reason to flag rather than a reason to withhold judgment. This is a recognizable failure mode in binary classification: a model trained on “don’t miss anything” objectives becomes near-indiscriminate in its inclusions.

Whether high-recall/low-precision is a viable strategy depends on downstream review cost. If human review per document is cheap, 96% recall at 45% precision might be defensible. But if the review cost is meaningful, which it is in nearly every litigation context, a model delivering 85% recall at 88% precision (Qwen 3.6 Plus) produces far fewer false positives (3 vs. 32 on this sample) at the price of three more missed responsive documents (4 vs. 1).
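The trade-off can be made concrete by asking what each additional recovered document costs. The sketch below compares claude-sonnet-4-6 with Qwen 3.6 Plus on a 10,000-document collection. The per-document human review cost is a hypothetical placeholder (the study does not report one), and the linear scaling assumes the sample's richness and error rates carry over.

```python
def review_totals(tp, fp, fn, tn, model_cost_per_doc, review_cost_per_flag, collection_size):
    """Project total spend and missed documents, scaling sample counts linearly."""
    scale = collection_size / (tp + fp + fn + tn)
    flagged = (tp + fp) * scale
    return {
        "total_usd": model_cost_per_doc * collection_size + review_cost_per_flag * flagged,
        "missed": fn * scale,
    }


REVIEW_COST_PER_FLAG = 2.00  # hypothetical USD per human-reviewed document; not from the study

# Confusion counts and per-document model costs from the results table.
sonnet = review_totals(26, 32, 1, 41, 0.0385, REVIEW_COST_PER_FLAG, 10_000)
qwen = review_totals(23, 3, 4, 70, 0.0205, REVIEW_COST_PER_FLAG, 10_000)

extra_spend = sonnet["total_usd"] - qwen["total_usd"]
extra_found = qwen["missed"] - sonnet["missed"]
print(f"extra spend: ${extra_spend:,.0f}, extra responsive documents found: {extra_found:.0f}")
print(f"marginal cost per extra document found: ${extra_spend / extra_found:,.2f}")
```

Under that placeholder rate, the high-recall model recovers about 300 more responsive documents for roughly $22 each. Whether that is acceptable depends on the matter; the point is that the comparison is dominated by human review cost, not model cost.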

5. DeepSeek V4 Pro: The Value Argument

At the opposite end, the result that deserves equal attention is DeepSeek V4 Pro: F1 = 0.815 at $0.0027 per document.

For context: claude-opus-4-8 scores F1 = 0.760 at $0.0813/doc. DeepSeek V4 Pro outperforms it by 5.5 F1 points at 30× lower cost. DeepSeek V4 Flash, at $0.0012/doc, scores F1 = 0.793 — also higher than Opus at roughly 68× lower cost.

These are not corner-case results. The DeepSeek models produce balanced outcomes: both show precision and recall within about three points of each other (0.815/0.815 for Pro, 0.808/0.778 for Flash), suggesting a review process that neither over-includes nor over-excludes. That balance is harder to achieve than simply maximizing one metric.

The practical implication for large-volume review is significant. At 100,000 documents, the difference between $0.0027/doc and $0.0813/doc is $270 vs. $8,130 in model costs — while the lower-cost model actually performs better. The cost savings fund human review time for the cases the model escalates.
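The volume arithmetic is a straight multiplication of the per-document cost. A small sketch using the costs from the results table, and assuming the per-document cost stays flat at volume (no volume discounts or changes in cache hit rate):

```python
from decimal import Decimal

# Per-document costs from the results table, kept as Decimal for exact dollar arithmetic.
COST_PER_DOC = {
    "DeepSeek V4 Flash": Decimal("0.0012"),
    "DeepSeek V4 Pro": Decimal("0.0027"),
    "claude-opus-4-8": Decimal("0.0813"),
}


def model_spend(cost_per_doc: Decimal, documents: int) -> Decimal:
    """Projected inference spend, assuming per-document cost stays flat at volume."""
    return cost_per_doc * documents


for name, cost in COST_PER_DOC.items():
    print(f"{name:<18} ${model_spend(cost, 100_000):,.2f}")

ratio = COST_PER_DOC["claude-opus-4-8"] / COST_PER_DOC["DeepSeek V4 Pro"]
print(f"Opus / DeepSeek V4 Pro cost ratio: {ratio:.1f}x")
```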

6. Why Architecture Sets the Ceiling

Eight of nine models cluster between F1 0.76 and 0.87 despite a 68× cost spread within that cluster. That plateau is not a coincidence. It reflects something about the review architecture that the model can’t overcome by being more capable.

The core design: read each document once, build a structured understanding of it, then evaluate that understanding against the responsiveness criteria. Most AI review approaches re-read the document from scratch for each evaluation question, which produces inconsistency and expense. By separating document comprehension from criteria evaluation, the architecture creates a stable, reproducible understanding that multiple questions can query without re-incurring the full reading cost.

On top of this sits a review process that allocates effort proportionally: clear cases (confidently responsive or confidently not-responsive) resolve quickly; uncertain cases escalate for a full-document review before any decision is committed. The defaults lean toward recall — when uncertain, take a second look rather than guess — because in discovery, an over-inclusion a reviewer can drop is cheaper than a responsive document that never surfaces.
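The shape of that design (read once, decide clear cases, escalate uncertain ones) can be sketched in a few lines. Everything below is a hypothetical illustration: the function signatures, the idea of a numeric confidence score, and the 0.2/0.8 thresholds are our assumptions, not the production implementation.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Decision:
    responsive: bool
    escalated: bool


# Hypothetical signatures: a summarizer, a scorer returning a confidence in [0, 1],
# and a full-document reviewer used only for uncertain cases.
Summarize = Callable[[str], str]
Score = Callable[[str, str], float]
FullReview = Callable[[str, str], bool]


def review(document: str, definition: str, summarize: Summarize, score: Score,
           full_review: FullReview, low: float = 0.2, high: float = 0.8) -> Decision:
    """Read once, decide clear cases from the understanding, escalate uncertain ones."""
    understanding = summarize(document)    # step 1: single read of the document
    p = score(understanding, definition)   # step 2: evaluate against the criteria
    if p >= high:
        return Decision(responsive=True, escalated=False)
    if p <= low:
        return Decision(responsive=False, escalated=False)
    # Uncertain band: take a second look at the full document before committing.
    return Decision(responsive=full_review(document, definition), escalated=True)


# Toy stand-ins so the sketch runs; a real system would call a model here.
def toy_summarize(doc: str) -> str:
    return doc.lower()


def toy_score(understanding: str, definition: str) -> float:
    terms = definition.lower().split()
    hits = sum(term in understanding for term in terms)
    return hits / max(len(terms), 1)


def toy_full_review(doc: str, definition: str) -> bool:
    return True  # lean toward recall when still uncertain


print(review("Pricing memo for the merger", "merger pricing",
             toy_summarize, toy_score, toy_full_review))
```

The point of the structure is that the expensive second look is spent only on the uncertain band, so the share of documents that reach it, rather than raw model capability, drives most of the cost.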

Once this architecture is doing the work, a more capable model has little left to add. The structure already handles the easy cases correctly. A frontier model adds marginal signal on genuinely ambiguous documents, but ambiguous documents are a small fraction of any realistic collection. The result is the plateau: every model that follows the review logic lands in roughly the same accuracy band, and only the outlier that diverges from it (claude-sonnet-4-6) falls significantly below.

7. Practical Implications

The emerging frontier is production-ready for this task. Qwen 3.6 Plus, MiniMax M3, and DeepSeek V4 Pro all achieve accuracy competitive with established Claude and GPT models at comparable or lower cost. These are no longer experimental alternatives.
DeepSeek is the value operating point for high-volume review. F1 = 0.81 at $0.0027/doc means a 100k-document review costs roughly $270 in model inference. That is affordable at production volume, with a higher F1 than a model costing 30× more.
Avoid models with precision below 0.60 regardless of recall. Extreme recall without precision creates a different problem: the review team processes false positives at scale. In most litigation contexts, this shifts cost from model inference to human review time — a worse trade.
Model selection, not model tier, determines accuracy. The cheapest model in this study (DeepSeek Flash, $0.0012) scores F1 = 0.79. The most expensive (Claude Opus, $0.0813) scores F1 = 0.76. Spend less, get more. The ceiling is set by the architecture, not the frontier model underneath it.
The benchmark is reproducible. Because it runs the production pipeline over a fixed gold set with metered cost, it can be re-run as models evolve. Operating-point selection is an ongoing process, not a one-time decision.

8. Limitations

This study measures one document sample and one tag type — responsiveness — against an attorney-relevance standard. The 100-document sample provides approximately ±15-percentage-point uncertainty on recall estimates (27 positive cases), which means the rank ordering of closely-spaced models (e.g., MiniMax vs. Haiku, both at F1 ≈ 0.82) should be treated as indicative rather than definitive. A larger sample would narrow these intervals.

The cost figures are metered and transferable; they reflect what these models actually cost to run at the time of measurement. Absolute F1 will vary with the matter, the responsiveness definition, and the underlying document collection. The finding we draw is about the shape of the cost/accuracy relationship and the existence of the architecture plateau, which we expect to generalize.

Extending the benchmark across privilege, issue tags, and multiple matter types is on the roadmap. The prior 11-model study over 498 documents provides a complementary view of the GPT and Gemini families at full-set scale.

9. Conclusion

The emerging frontier has arrived for legal document review — and it has arrived cheaply. Qwen 3.6 Plus now leads our benchmark at F1 = 0.87. DeepSeek V4 Pro achieves F1 = 0.81 at $0.0027/doc. Both outperform or match established models costing far more.

The finding from our first benchmark holds: on responsiveness review, accuracy is bounded by how the review is architected, not by model spend. A well-structured review reaches F1 ≈ 0.87 at roughly two cents a document, and the most expensive model tested scores lower. For legal teams, the implication is concrete: high-recall responsiveness review is affordable at volume today, and the path to better results runs through review architecture, not a larger model budget.

Methodology questions or want to see a run in detail? We’re happy to walk a technical evaluator through it — book a session. The full working paper is available at decover.ai/research/model-benchmark-2026.
