Cost Analysis

Processing Surcharges: The $75–$150/GB Fee That Shouldn’t Exist in 2026

Text extraction, OCR, and deduplication cost a few dollars per gigabyte to run on modern cloud infrastructure. eDiscovery vendors still charge $75–$150/GB as a one-time processing surcharge. Here’s what the fee actually covers, and why it persists.

May 12, 2026

What Processing Actually Is

In eDiscovery, “processing” refers to the set of operations that prepare raw collected data for review. The operations themselves are well-defined: text extraction from PDFs and Office documents, OCR (optical character recognition) for scanned documents, deduplication to remove exact and near-exact copies, email threading to group message chains, metadata extraction (sender, recipient, date, subject), indexing for full-text search, and format conversion to the platform’s native document format. Each operation is a well-understood step in a pipeline that, on modern cloud infrastructure, runs largely automatically once data has been ingested.

These operations are computationally intensive but not particularly expensive on modern cloud infrastructure. The reason a processing surcharge exists at all traces to the early 2000s, when these operations required specialized software licenses — Nuix, Relativity Processing — that cost tens of thousands of dollars per seat, and dedicated hardware capable of running those workloads at scale without timing out on a 100 GB data set. Vendors who acquired those licenses and built those rack configurations had a genuine cost to recover. Neither of those conditions is still true. Cloud compute, open-source OCR libraries, and commodity object storage have made the underlying infrastructure available to any operator willing to build on it. The surcharge survived the infrastructure shift. The justification for it did not.

What Processing Actually Costs to Run

Text extraction from a standard Office document — a Word file, an Excel spreadsheet, a PowerPoint deck — takes milliseconds on commodity compute. The operation is essentially a library call: a parser reads the file format, pulls the text layer, and writes it to a search index. The compute required is negligible. OCR for a scanned PDF is more intensive because it involves image preprocessing and character recognition, but it runs at scale for approximately $0.001–$0.003 per page on cloud OCR APIs (AWS Textract, Google Vision). A 100-page scanned document costs between ten and thirty cents to process. A gigabyte of scanned PDFs, assuming roughly 1,500 pages per GB, costs $1.50–$4.50 to OCR at commercial API rates.
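The per-page arithmetic above can be reproduced in a few lines. This is a minimal sketch, not a pricing tool: the function names are hypothetical, the per-page rates are the approximate figures quoted above rather than a current price list, and 1,500 pages per GB is the same rough assumption used in the text.

```python
# Illustrative only: rates and page density are the approximate figures
# quoted in the text, not a published vendor price list.
LOW_RATE_PER_PAGE = 0.001   # USD per page, low end of the quoted range
HIGH_RATE_PER_PAGE = 0.003  # USD per page, high end of the quoted range
PAGES_PER_GB = 1_500        # rough assumption for scanned PDFs


def ocr_cost_range(pages: int) -> tuple[float, float]:
    """Return the (low, high) OCR cost in USD for a given page count."""
    low = round(pages * LOW_RATE_PER_PAGE, 2)
    high = round(pages * HIGH_RATE_PER_PAGE, 2)
    return low, high


def ocr_cost_range_for_gb(gigabytes: float) -> tuple[float, float]:
    """Return the (low, high) OCR cost in USD for a volume of scanned PDFs."""
    return ocr_cost_range(int(gigabytes * PAGES_PER_GB))


if __name__ == "__main__":
    print("100-page document:", ocr_cost_range(100))
    print("1 GB of scans:", ocr_cost_range_for_gb(1))
```

Changing `PAGES_PER_GB` is the quickest way to see how sensitive the per-GB figure is to scan resolution and page density.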

Deduplication over 250,000 documents requires hashing and comparison operations — SHA-256 or MD5 hashes of file content, fuzzy matching for near-duplicates — that complete in minutes on modern distributed systems. Email threading is a graph traversal problem: build a graph from message IDs and In-Reply-To headers, identify connected components, sort by timestamp. For a 100 GB data set, the full deduplication and threading pipeline runs in under an hour on a reasonably provisioned cluster. The all-in compute cost for processing a 100 GB matter on cloud infrastructure is measured in hundreds of dollars, not thousands.
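Both steps are simple to sketch. The example below is a simplified illustration, not any vendor's pipeline: it covers exact duplicates only (near-duplicate fuzzy matching is omitted), and the message dictionary keys `id`, `in_reply_to`, and `sent` are hypothetical stand-ins for parsed email headers. Threads are found as connected components using a small union-find structure.

```python
import hashlib
from collections import defaultdict


def dedupe_exact(files: dict[str, bytes]) -> dict[str, list[str]]:
    """Group file names by the SHA-256 hash of their content."""
    groups: dict[str, list[str]] = defaultdict(list)
    for name, content in files.items():
        groups[hashlib.sha256(content).hexdigest()].append(name)
    return dict(groups)


def thread_emails(messages: list[dict]) -> list[list[str]]:
    """Group message IDs into threads, each sorted by send time.

    Each message is a dict with hypothetical keys 'id', 'in_reply_to'
    (a message ID or None), and 'sent' (any sortable timestamp).
    """
    parent = {m["id"]: m["id"] for m in messages}

    def find(x: str) -> str:
        # Walk up to the component root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Link each reply to the message it answers, if that message was collected.
    for m in messages:
        reply_to = m.get("in_reply_to")
        if reply_to in parent:
            parent[find(m["id"])] = find(reply_to)

    # Visiting messages in time order keeps each thread chronologically sorted.
    threads: dict[str, list[str]] = defaultdict(list)
    for m in sorted(messages, key=lambda msg: msg["sent"]):
        threads[find(m["id"])].append(m["id"])
    return list(threads.values())
```

A reply whose parent was never collected simply starts its own thread, which is a design choice for the sketch; production threading typically also consults the References header and subject lines.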

The $10,000 processing surcharge ($100/GB) on a 100 GB matter represents roughly an 18–45× markup on the underlying compute cost of $220–$550, before any consideration of platform software overhead, engineering amortization, or support. The table below makes the gap concrete:

Operation       | What it does                      | Actual cloud cost (100 GB) | Vendor charge (100 GB)
Text extraction | Pulls text from Office/PDF        | ~$50–$100                  | Bundled in processing
OCR             | Converts scans to searchable text | ~$100–$300                 | Bundled in processing
Deduplication   | Removes duplicates, threads email | ~$50–$100                  | Bundled in processing
Indexing        | Full-text search index            | ~$20–$50                   | Bundled in processing
Total compute   |                                   | ~$220–$550                 | $7,500–$15,000
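Dividing a vendor charge by the table's compute estimates gives the implied markup. The sketch below hard-codes the table's approximate figures; the dictionary and function names are hypothetical and exist only for illustration.

```python
# Approximate (low, high) cloud compute cost in USD for a 100 GB matter,
# copied from the table above.
COMPUTE_COST_100GB = {
    "text_extraction": (50, 100),
    "ocr": (100, 300),
    "deduplication": (50, 100),
    "indexing": (20, 50),
}


def total_compute() -> tuple[int, int]:
    """Sum the low and high ends of every operation's cost range."""
    low = sum(lo for lo, _ in COMPUTE_COST_100GB.values())
    high = sum(hi for _, hi in COMPUTE_COST_100GB.values())
    return low, high


def markup_range(vendor_charge: float) -> tuple[float, float]:
    """Return the (smallest, largest) multiple of compute cost charged."""
    low_cost, high_cost = total_compute()
    # The smallest markup pairs the charge with the highest compute estimate.
    return vendor_charge / high_cost, vendor_charge / low_cost


if __name__ == "__main__":
    for charge in (7_500, 10_000, 15_000):
        smallest, largest = markup_range(charge)
        print(f"${charge:,}: {smallest:.1f}x to {largest:.1f}x compute cost")
```

On these figures, a $10,000 charge works out to roughly 18× to 45× the compute cost.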

The Rate Range: $75–$150/GB

Processing surcharges at legacy vendors range from $75/GB at the low end of mid-market to $150/GB for full-service platforms with dedicated project management. On a 100 GB matter, that is $7,500–$15,000 as a one-time charge before a single document has been reviewed. On a 500 GB matter — which is not unusual for a commercial dispute with significant email volume or a government investigation spanning multiple years — that surcharge runs $37,500–$75,000.

The charge is typically billed at ingestion, as the first invoice item the client sees, which frames every subsequent cost comparison against a baseline that already includes an infrastructure markup of roughly 18–45×. This framing is not incidental. Once the client has paid $10,000 to process the data, a $1.50/document review fee feels incremental by comparison. The processing surcharge functions as an anchor. It sets the psychological floor for what the matter costs and makes subsequent line items appear smaller than they would if evaluated independently. A $20,000 first-pass review on top of a $10,000 processing fee is twice the processing cost, yet against a bill that already stands at $10,000 it reads as the next increment rather than as two-thirds of the $30,000 total. The same $20,000 review evaluated on its own, as the only line item, invites more careful scrutiny of whether it is justified.

The practical consequence for in-house counsel and litigation support managers is that the total matter cost is not visible at engagement. The processing surcharge is the one line item that is fully calculable upfront — data volume times per-GB rate — but it is rarely presented as the significant percentage of total matter cost that it is. On a 100 GB matter with a $10,000 processing fee and $40,000 in subsequent review costs, processing represents 20% of the total. That is a material line item. It is rarely discussed as one.

Why It Persists

The processing surcharge persists for three reasons, and none of them is that it reflects actual 2026 infrastructure economics. First, it provides front-loaded revenue that covers vendor overhead before the matter is complete. Project management, client service, platform onboarding, and account management costs are real, and vendors who absorb them across a monthly hosting fee are exposed to matters that end earlier than projected. A front-loaded processing surcharge converts uncertain future revenue into certain current revenue at ingestion. From the vendor’s cash-flow perspective, that is a meaningful structural advantage.

Second, it creates an artificial distinction between “processing” and “hosting” that allows vendors to charge for both separately, even though modern platforms run both as part of the same ingestion pipeline. Data that has been ingested for processing is, by definition, already on the vendor’s infrastructure. The decision to treat that ingestion as a separate billable event from the ongoing storage and access fees that follow is a pricing architecture decision, not an operational one.

Third, it insulates the vendor from competition on hosting rates. A vendor who charges $5/GB/month for hosting but $150/GB for processing costs more overall, on any matter hosted for less than seven and a half months, than one who charges $25/GB/month for hosting with processing bundled. The headline hosting rate nonetheless looks better in an RFP comparison table. The only way to evaluate the true cost of any two vendors is to model the all-in number for your specific matter: data volume times processing rate (one-time), plus data volume times hosting rate times estimated months, plus per-document review fees, plus project management, plus any other line items. That calculation is rarely performed at RFP stage, which is exactly why the surcharge persists.
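The crossover point between those two pricing structures can be worked out directly. The following is a minimal sketch under simplifying assumptions: only processing and hosting are modeled, rates are flat per GB, and the function names are hypothetical.

```python
from typing import Optional


def processing_plus_hosting(gigabytes: float, months: float,
                            hosting_rate: float,
                            processing_rate: float = 0.0) -> float:
    """One-time processing fee plus monthly hosting, in USD."""
    return gigabytes * processing_rate + gigabytes * hosting_rate * months


def breakeven_months(proc_a: float, host_a: float,
                     proc_b: float, host_b: float) -> Optional[float]:
    """Months of hosting at which vendors A and B cost the same.

    Returns None when the two cost lines never cross at a positive duration.
    Data volume cancels out because every rate is per GB.
    """
    if host_a == host_b:
        return None
    months = (proc_a - proc_b) / (host_b - host_a)
    return months if months > 0 else None


if __name__ == "__main__":
    # Vendor A: $150/GB processing, $5/GB/month hosting.
    # Vendor B: processing bundled, $25/GB/month hosting.
    print("Breakeven (months):", breakeven_months(150, 5, 0, 25))
    for months in (4, 12):
        a = processing_plus_hosting(100, months, hosting_rate=5, processing_rate=150)
        b = processing_plus_hosting(100, months, hosting_rate=25)
        print(f"{months} months on 100 GB: A=${a:,.0f}  B=${b:,.0f}")
```

Because the breakeven does not depend on data volume, the deciding variable in this simplified comparison is how long the matter stays hosted.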

What Included Processing Actually Means

Modern AI-augmented platforms include processing as part of the base rate because the incremental cost of processing is low enough to bundle without breaking the platform economics. The math is not complicated: if OCR costs $1.50–$4.50 per GB and text extraction costs another $0.50–$1.00 per GB, an all-in platform priced at $60/GB/month can absorb the full processing cost in the first month of a matter with margin to spare. The processing line item disappears not because it is being subsidized, but because it was never large enough to require separate billing in the first place.

DecoverAI’s $60/GB/month all-in rate covers ingestion, OCR, deduplication, near-duplicate identification, email threading, metadata extraction, and full-text indexing — all the operations that legacy vendors charge separately as processing surcharges. It also covers AI relevance and privilege classification, Bates numbering, redaction, and full production delivery. The absence of a processing line item is not a pricing trick or a promotional offer that will expire. It is a reflection of what cloud infrastructure actually costs to operate in 2026, applied to a pricing model built around that cost reality rather than around the rate cards that made sense when Nuix licenses cost $40,000 per year.

The practical test for any vendor quote is simple: ask for the all-in cost on your specific matter, modeled through to production. Include processing, hosting at the expected duration, per-document review, privilege log generation, project management, and production fees. That number is the one to compare. A vendor with a $5/GB/month hosting rate and a $100/GB processing surcharge on a 100 GB matter held for four months charges $10,000 (processing) + $2,000 (hosting) = $12,000 before any review costs. A vendor with a $60/GB/month all-in rate on the same matter charges $24,000 for four months — which includes review, privilege log generation, and production. The comparison looks different from the all-in perspective than it does from the headline hosting rate alone.
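The same comparison can be modeled with every line item included. This is an illustrative sketch only: the `Quote` class and its field names are hypothetical, and the $20,000 review figure is borrowed from the earlier anchoring example rather than taken from any actual quote.

```python
from dataclasses import dataclass, field


@dataclass
class Quote:
    """One vendor quote; the field names are hypothetical."""
    hosting_per_gb_month: float
    processing_per_gb: float = 0.0
    # Flat fees such as review, privilege log, project management, production.
    other_fees: dict[str, float] = field(default_factory=dict)

    def all_in(self, gigabytes: float, months: int) -> float:
        """Total cost in USD through the end of the hosting period."""
        processing = gigabytes * self.processing_per_gb
        hosting = gigabytes * self.hosting_per_gb_month * months
        return processing + hosting + sum(self.other_fees.values())


legacy = Quote(hosting_per_gb_month=5, processing_per_gb=100)
legacy_with_review = Quote(hosting_per_gb_month=5, processing_per_gb=100,
                           other_fees={"first_pass_review": 20_000})
all_in_vendor = Quote(hosting_per_gb_month=60)

if __name__ == "__main__":
    print("Legacy, before review:", legacy.all_in(100, 4))
    print("Legacy, with review:  ", legacy_with_review.all_in(100, 4))
    print("All-in rate:          ", all_in_vendor.all_in(100, 4))
```

Under these assumptions, the legacy quote stays cheaper only if its review, privilege log, project management, and production fees together come in under the $12,000 gap between the two totals.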

On a 100 GB matter, traditional vendors charge $7,500–$15,000 as a one-time processing surcharge. The actual cloud compute cost to run those operations is $220–$550. The gap is margin, not infrastructure.
