We evaluate the cost and accuracy of automated responsiveness review across eleven large language models — spanning OpenAI, Google, and Anthropic — on a fixed, gold-labeled set of 498 documents. Holding the document set, the responsiveness definition, and the review logic constant while varying only the underlying model, we find that accuracy — F1 on the responsive class — plateaus near 0.86 and does not improve with model price. The best cost/accuracy operating point reaches F1 ≈ 0.86 and recall ≈ 0.90 at a metered cost of ~$0.017 per document, while the most expensive model evaluated costs roughly 54× more and scores lower (F1 ≈ 0.77). We conclude that, for responsiveness review, accuracy is bounded by how the review is architected rather than by model spend.
1. Background
High-recall responsiveness review costs under two cents per document when the review is architected to read each document once and evaluate it against a structured understanding, rather than re-reading it from scratch for every question. DecoverAI benchmarked eleven models on a gold-labeled 498-document set and found that accuracy plateaus near F1 0.86, a level reached at about $0.017 per document, while the most expensive model, at roughly 54 times that cost, scores lower. The ceiling is set by how the review is built, not by model spend.
Responsiveness review — deciding, document by document, whether each one is relevant to a matter under a written responsiveness definition — is the single highest-volume judgment in discovery. It is also where cost and recall both matter most: miss a responsive document and you have a legal problem; re-read every document with an expensive model and you have a budget problem. The practical question this study answers is narrow: for responsiveness, what is the best accuracy we can achieve, and at what real-dollar cost?
2. Methods
We wanted numbers we'd be comfortable putting in front of a technical evaluator, which meant the methodology had to be honest about what it measures. The benchmark was designed to isolate a single variable — the model — and to report costs that reflect the real bill rather than list-price estimates.
2.1 Dataset. The benchmark runs over a fixed set of 498 documents labeled for responsiveness against an attorney-relevance standard. That gold set is the answer key — every model's output is scored against the same human-validated labels.
2.2 Task. A binary responsiveness decision per document, made against a single written responsiveness definition held identical across all runs.
2.3 System under test. We do not test a simplified or special “benchmark mode.” The same responsiveness review pipeline that runs in production is pointed at the gold set, so the scores reflect how the system actually behaves — not an idealized version of it.
2.4 Scoring. We report precision, recall, and F1 on the documents tagged responsive, comparing the system's decisions against the gold labels. F1 is the headline metric; recall is tracked separately, because in discovery a missed responsive document is the costlier error.
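As a minimal sketch of the scoring described in 2.4, assume the system's decisions and the gold labels are available as parallel lists of booleans, with True meaning responsive. The function name and the toy labels below are illustrative and are not taken from the benchmark.

```python
def responsive_class_metrics(gold, predicted):
    """Return (precision, recall, F1) for the responsive (True) class."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g and p)        # responsive, tagged responsive
    fp = sum(1 for g, p in pairs if not g and p)    # not responsive, tagged responsive
    fn = sum(1 for g, p in pairs if g and not p)    # responsive, missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy labels for illustration only (not benchmark data).
gold = [True, True, True, False, False, True]
predicted = [True, True, False, True, False, True]
precision, recall, f1 = responsive_class_metrics(gold, predicted)
print(precision, recall, f1)
```

Recall is returned separately from F1 so that a run with a high F1 but many missed responsive documents is still visible.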
2.5 Cost measurement. The dollar figures are metered, not modeled. We record real token usage on each run and multiply by live provider prices, including any caching discounts actually applied. The cost per document is the bill we actually paid — which is what makes comparison across eleven models meaningful.
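The metering in 2.5 amounts to multiplying recorded token counts by per-token prices and dividing by the number of documents. The sketch below assumes cached input tokens are billed at a separate, discounted rate. The field names and all prices are hypothetical placeholders, not any provider's actual rates or the pipeline's actual schema.

```python
from dataclasses import dataclass


@dataclass
class Usage:
    """Token counts recorded for one model call (illustrative field names)."""
    input_tokens: int          # input tokens billed at the full rate
    cached_input_tokens: int   # input tokens billed at the cache-discounted rate
    output_tokens: int


@dataclass
class Price:
    """Hypothetical USD prices per million tokens."""
    input_per_m: float
    cached_input_per_m: float
    output_per_m: float


def call_cost(usage: Usage, price: Price) -> float:
    """Dollar cost of a single call from its recorded usage."""
    total = (
        usage.input_tokens * price.input_per_m
        + usage.cached_input_tokens * price.cached_input_per_m
        + usage.output_tokens * price.output_per_m
    )
    return total / 1_000_000


def cost_per_document(usages, price: Price, n_documents: int) -> float:
    """Total metered cost of a run divided by the number of documents reviewed."""
    return sum(call_cost(u, price) for u in usages) / n_documents


# Made-up prices and usage, for illustration only.
price = Price(input_per_m=1.00, cached_input_per_m=0.25, output_per_m=4.00)
usages = [Usage(2000, 6000, 300), Usage(1000, 7000, 200)]
per_doc = cost_per_document(usages, price, n_documents=2)
print(f"${per_doc:.6f} per document")
```

Summing over calls rather than documents means a document that takes several calls is charged for all of them.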
2.6 Controls. To isolate the effect of the model, we hold the document set, the responsiveness definition, and the review logic constant, and swap only the underlying model. Every point in the results is the same test with a different engine.
3. Results
Each point below is a full pass over all 498 documents, scored the same way, with its true cost measured the same way.
Figure 1. Cost vs. accuracy across the model roster, 498-document responsiveness benchmark. Eleven models, colored by provider; the upper-left is the prize. Accuracy plateaus well before peak model cost — the best operating point reaches F1 ≈ 0.86 at ~$0.017 per document, while a model roughly 54× more expensive scores lower. Cost is metered (real token usage at live provider prices, including cache discounts); accuracy figures are indicative, not an audited benchmark.
The shape of the curve is the result. Accuracy climbs quickly as you move from the cheapest models, then flattens into a plateau around F1 0.86. After that, paying more buys nothing measurable:
- Best operating point: F1 ≈ 0.86, recall ≈ 0.90, at a metered ~$0.017 per document.
- Most expensive model: ~$0.93 per document (~54× the best point) and a lower score, F1 ≈ 0.77.
- Cheapest model: still reaches F1 ≈ 0.72 at under a fifth of a cent per document.
- Mid-tier models cluster on the plateau — several score within a few points of the best while the frontier model sits below them.
In short: across nearly three orders of magnitude in cost, the accuracy spread is small, the top is reached early, and the single most expensive engine is not the most accurate.
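The multiples quoted above can be checked from the reported per-document costs. The cheapest model's cost is given only as "under a fifth of a cent", so the sketch uses $0.002 as an upper bound, which makes the computed cost spread a lower bound.

```python
import math

best_cost = 0.017          # USD per document, best operating point (reported)
frontier_cost = 0.93       # USD per document, most expensive model (reported)
cheapest_cost_max = 0.002  # upper bound implied by "under a fifth of a cent"

frontier_vs_best = frontier_cost / best_cost
min_spread = frontier_cost / cheapest_cost_max        # lower bound on the cost spread
min_orders_of_magnitude = math.log10(min_spread)      # lower bound, in powers of ten

print(f"most expensive vs. best operating point: {frontier_vs_best:.1f}x")
print(f"most expensive vs. cheapest: at least {min_spread:.0f}x")
print(f"orders of magnitude spanned: at least {min_orders_of_magnitude:.2f}")
```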
4. Discussion
If a bigger model isn't what closes the gap, what does? The short answer is that most of the accuracy — and almost all of the cost advantage — comes from how the review is structured, not from which model sits underneath it.
The core idea is simple to state: understand each document once, then evaluate everything against that understanding. Most AI review re-reads every document from scratch for every question, which is why it's both expensive and inconsistent. We read each document once, build a structured understanding of it, and answer each responsiveness question against that — one understanding, many answers.
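The read-once idea can be illustrated with a small cache. This is a toy sketch, not the production design: the expensive model call is replaced by a stand-in that reduces a document to a set of lowercase words, and every class and method name here is hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Understanding:
    """Hypothetical structured representation of one document."""
    doc_id: str
    terms: frozenset


@dataclass
class ReadOnceReviewer:
    reads: int = 0                               # number of full-document reads performed
    _cache: dict = field(default_factory=dict)   # doc_id -> Understanding

    def _build_understanding(self, doc_id: str, text: str) -> Understanding:
        # Stand-in for the expensive step (a model call in a real system).
        self.reads += 1
        return Understanding(doc_id, frozenset(text.lower().split()))

    def understand(self, doc_id: str, text: str) -> Understanding:
        # Read the document only if no understanding is cached yet.
        if doc_id not in self._cache:
            self._cache[doc_id] = self._build_understanding(doc_id, text)
        return self._cache[doc_id]

    def answer(self, doc_id: str, text: str, question_terms) -> bool:
        """Answer one question against the cached understanding."""
        understanding = self.understand(doc_id, text)
        return bool(understanding.terms & set(question_terms))


reviewer = ReadOnceReviewer()
text = "Pricing memo for the Acme merger"
questions = [{"merger"}, {"pricing"}, {"patent"}]
answers = [reviewer.answer("doc-1", text, q) for q in questions]
print(answers, "reads:", reviewer.reads)
```

Three questions are answered with a single read. In a design that re-reads per question, the expensive step would run three times.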
On top of that sits a review process designed to spend effort where it matters: clear-cut documents are settled quickly and cheaply, while genuinely uncertain ones get a closer, full-document read before any decision is committed. The defaults lean toward recall — when in doubt, the system takes a second look rather than guessing — because in discovery, an over-inclusion a reviewer can quickly drop is far cheaper than a responsive document that's missed entirely.
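A minimal sketch of that routing, assuming the first pass yields a responsiveness score between 0 and 1. The source does not say how uncertainty is measured, so the score, the thresholds, and the route names are all hypothetical. The band is deliberately asymmetric to reflect the recall lean: a document is dismissed as not responsive only at a very low score.

```python
def triage(score: float, dismiss_below: float = 0.1, accept_above: float = 0.7) -> str:
    """Route a document from a hypothetical first-pass score in [0, 1].

    Clear-cut documents are settled immediately; anything in between is
    sent for a closer, full-document read before a decision is committed.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0 and 1")
    if score >= accept_above:
        return "responsive"
    if score <= dismiss_below:
        return "not_responsive"
    return "full_read"


for s in (0.95, 0.40, 0.15, 0.05):
    print(s, triage(s))
```

With these placeholder thresholds, a score of 0.15 is still escalated rather than dismissed, which trades some extra reads for fewer missed responsive documents.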
This is what explains the plateau. Once the structure is doing the work, a more expensive model has very little left to add — it is answering questions that were already going to be answered correctly, just at a higher price. Accuracy is bounded by the architecture, and the architecture is largely model-agnostic.
5. Practical implications
- High-recall responsiveness review at under two cents a document. The metered cost figures are the real bill, and they're transferable — this is what the review actually costs to run at volume.
- Spending more on a “smarter” model is not the upgrade path. On this task, the frontier model was the most expensive and still not the most accurate. The accuracy ceiling is set by how the review is built.
- The system declines to guess. Documents it can't decide are escalated for a deeper look or routed to human review — attorneys keep control of the thresholds and the final calls.
- The result is reproducible. Because the benchmark runs the real pipeline over a fixed gold set with metered cost, the study can be re-run as models evolve and the operating point re-selected.
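Re-selecting the operating point after a re-run can be as simple as picking the cheapest run whose F1 is within a tolerance of the best. The rule and the tolerance below are assumptions, not the study's stated selection procedure. The three rows use only figures reported in the results, with $0.002 standing in as an upper bound for the cheapest model's cost.

```python
def select_operating_point(results, f1_tolerance=0.01):
    """Return the cheapest run whose F1 is within `f1_tolerance` of the best F1."""
    best_f1 = max(r["f1"] for r in results)
    candidates = [r for r in results if r["f1"] >= best_f1 - f1_tolerance]
    return min(candidates, key=lambda r: r["cost_per_doc"])


results = [
    {"label": "best operating point", "f1": 0.86, "cost_per_doc": 0.017},
    {"label": "most expensive model", "f1": 0.77, "cost_per_doc": 0.93},
    {"label": "cheapest model", "f1": 0.72, "cost_per_doc": 0.002},  # cost is an upper bound
]

chosen = select_operating_point(results)
print(chosen["label"], chosen["f1"], chosen["cost_per_doc"])
```

A wider tolerance moves the choice toward cheaper models. A team that prioritizes recall could apply the same rule to recall instead of F1, given per-model recall figures.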
6. Limitations
We want to be precise about what these results are. The study measures one document set and one tag type, responsiveness, against an attorney-relevance standard. The cost figures are metered and broadly transferable; the accuracy figures are indicative of how the architecture behaves on responsiveness review, not a multi-matter, independently audited industry benchmark. Extending the measured set across privilege and issue tags is on our roadmap. As with any single-set evaluation, absolute F1 will vary with the matter, the definition, and the underlying collection. The finding we draw is about the shape of the cost/accuracy relationship, which we expect to generalize.
7. Conclusion
On responsiveness review, accuracy does not scale with price — it scales with architecture. A well-structured review reaches F1 ≈ 0.86 at roughly 1.7 cents per document, and the most expensive model we tested could not beat it at 54× the cost. For legal teams, the implication is concrete: high-recall responsiveness review is affordable at volume today, and the path to better results runs through how the review is built, not through a bigger model.
