Four chat platforms, one review workflow, and a decision to make before writing a single line of ingestion code: do we build a separate pipeline for each platform, or do we build one pipeline that all four platforms feed into? We have already written about the review-side decisions behind message-level tagging, thread reconstruction, and redaction for Discord and Slack. This post is about the layer underneath all of that — the internal data model that makes it possible to run the same review, tagging, and production logic over iMessage, Slack, Teams, and Discord data without four parallel codebases silently drifting apart.
The short version: chat data from every platform shares a common shape — messages, senders, threads, attachments, reactions — but each platform expresses that shape differently. So we normalize every platform into one canonical schema at ingestion, run every downstream capability against that single schema, and only convert to an RFC 5322 / RSMF-style container at the production boundary, following the precedent set by Relativity's RSMF (Relativity Short Message Format) — itself an RFC 5322 (EML) file carrying normalized short-message data as an attachment.
1. Why Two Layers, Not One
The obvious shortcut is to make the internal working format itself RFC 5322-native — store everything as EML from the start, since that's the format eDiscovery platforms already expect. We didn't do that, for three reasons:
- EML is a poor working format for search, tagging, and redaction at scale. It's optimized for message transport, not query performance. Running relevance classification or full-text search against a store of individual email-shaped files is slower and more awkward than running it against a proper schema.
- RFC 5322 earns its keep at the boundary, not in the middle. It's a strong interoperability format precisely because existing eDiscovery platforms — Relativity, Everlaw, Reveal, and others — already know how to ingest RSMF-style EML productions. That value is entirely about the handoff to a downstream platform, not about how we store data internally.
- Review state shouldn't have to round-trip through email semantics. Tags, redactions, and confidence scores are attributes of a canonical message, not of an email. Keeping them there means the review layer never has to encode or decode them through From/To/Subject headers.
So the pipeline has three stages: platform-specific adapters that map raw exports into the canonical schema, the canonical message model itself — which is what search, tagging, and redaction actually operate against — and a conversation-document batching layer that groups messages into RFC 5322 / RSMF containers only at production time.
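The first two stages can be sketched as a small dispatch table. This is a minimal illustration under stated assumptions, not the production code: the `ADAPTERS` registry, `normalize`, `search`, and both toy adapters are hypothetical names, and the raw record shapes are heavily simplified.

```python
from typing import Any, Callable

# Hypothetical canonical record: a plain dict keeps the sketch short.
Canonical = dict[str, Any]

def slack_adapter(raw: dict[str, Any]) -> Canonical:
    # Assumed raw shape: {"user": ..., "text": ...}
    return {"platform": "slack", "sender": raw["user"], "body_text": raw["text"]}

def discord_adapter(raw: dict[str, Any]) -> Canonical:
    # Assumed raw shape: {"author": {"id": ...}, "content": ...}
    return {"platform": "discord", "sender": raw["author"]["id"],
            "body_text": raw["content"]}

# Stage 1: one thin adapter per platform, looked up by name.
ADAPTERS: dict[str, Callable[[dict[str, Any]], Canonical]] = {
    "slack": slack_adapter,
    "discord": discord_adapter,
}

def normalize(platform: str, raw: dict[str, Any]) -> Canonical:
    return ADAPTERS[platform](raw)

# Stage 2: downstream logic sees only canonical fields, never raw exports.
def search(messages: list[Canonical], term: str) -> list[Canonical]:
    return [m for m in messages if term.lower() in m["body_text"].lower()]

messages = [
    normalize("slack", {"user": "U123", "text": "Send the Q3 forecast"}),
    normalize("discord", {"author": {"id": "42"}, "content": "forecast is attached"}),
]
hits = search(messages, "forecast")
```

Supporting a fifth platform in this sketch means adding one entry to `ADAPTERS`; `search` does not change.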
2. Architecture Overview
Everything above the last arrow — adapters, canonical model, conversation documents — is internal. Only the final stage has to speak a format another platform understands, and it only has to speak it once, at the very end.
3. Layer 1 — The Canonical Message Model
Every platform adapter maps its raw export into this shared, platform-agnostic schema. This is the single data model that search, tagging, redaction, and AI classification all run against, regardless of which platform a given message came from:
| Field group | Fields | Why |
|---|---|---|
| Identity | message_id (native + internal UUID), platform, sender.platform_user_id, sender.resolved_identity_id, sender.display_name_at_send_time, sender.role_at_send_time, sender.confidence_score | Handles nickname/username drift consistently across all four platforms, with an audit trail and confidence score exposed to the reviewer |
| Conversation structure | conversation_id, conversation_type (channel / DM / group DM / thread), parent_message_id, thread_id | Separates a formal thread (Discord Threads, Slack thread_ts, Teams reply chains) from a simple reply-to/quote, which is structurally different on every platform |
| Timing | timestamp_utc, display_timezone | All time is stored internally as UTC; the reviewer sees it in the custodian's local timezone |
| Content | body_text, attachments[] (type, hash, vision-model metadata, OCR text), reactions[], voice_transcript | One shape for text, files, images, embeds/stickers/emoji, and transcribed audio, regardless of source platform |
| Lifecycle | edit_history[], deletion_status (none / content-unavailable / recovered) | Generalizes Slack, Teams, iMessage, and Discord edit/delete/unsend behavior into one shape |
| Review state | tags[] (field, value, coder, confidence, overturned_by), redaction_state[] (scope: word / message / attachment) | Tagging and redaction granularity live on the canonical message, independent of the production container |
| Provenance | content_hash, near_dup_cluster_id, chain_of_custody (collection method, source hash manifest, processing exceptions) | Supports defensible collection and deduplication across all platforms |
| Platform extension | platform_extensions {} | An open bag for platform-specific fields that don't map cleanly to the core schema |
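A subset of this table could be expressed as Python dataclasses along the following lines. The field names come from the table; the types, defaults, and the UTC check in `__post_init__` are assumptions made for illustration, and most field groups (attachments, review state, provenance) are left out for brevity.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import Any, Optional

@dataclass
class Sender:
    platform_user_id: str
    display_name_at_send_time: str
    resolved_identity_id: Optional[str] = None
    confidence_score: Optional[float] = None

@dataclass
class CanonicalMessage:
    message_id: str
    platform: str
    sender: Sender
    conversation_id: str
    conversation_type: str      # "channel", "dm", "group_dm", or "thread"
    timestamp_utc: datetime     # stored as timezone-aware UTC
    display_timezone: str       # IANA zone name used only for display
    body_text: str = ""
    parent_message_id: Optional[str] = None
    thread_id: Optional[str] = None
    deletion_status: str = "none"
    platform_extensions: dict[str, Any] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Reject naive datetimes and normalize aware ones to UTC.
        if self.timestamp_utc.tzinfo is None:
            raise ValueError("timestamp_utc must be timezone-aware")
        self.timestamp_utc = self.timestamp_utc.astimezone(timezone.utc)

# An iMessage sent at 09:30 local time (UTC-5) is stored as 14:30 UTC.
msg = CanonicalMessage(
    message_id="m-1",
    platform="imessage",
    sender=Sender("+15550100", "Alex"),
    conversation_id="c-1",
    conversation_type="dm",
    timestamp_utc=datetime(2026, 7, 1, 9, 30, tzinfo=timezone(timedelta(hours=-5))),
    display_timezone="America/Chicago",
    platform_extensions={"effect": "slam"},
)
```

Normalizing at construction time means no downstream feature has to ask which timezone a given adapter used.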
4. Layer 2 — Four Thin Adapters
Each adapter's only job is to absorb its platform's quirks so the canonical schema stays clean. None of them carry review logic — that all lives above them, against the canonical model:
- iMessage: […] reactions[]. Message effects (e.g. slam, invisible ink) go into platform_extensions. The service field distinguishes SMS from iMessage. Typically parsed from the local chat.db SQLite store via forensic tooling rather than a public API.
- Slack: thread_ts maps directly to thread_id. Block Kit rich formatting maps to body_text plus a structured extension. App/bot messages are flagged via sender.is_bot.
- Teams: […] chain_of_custody.
- Discord: Guild → channel → thread forms a hierarchy. message_reference distinguishes an inline reply from a true Thread (thread_id). Roles and nicknames are scoped per server, so sender.role_at_send_time must be guild-scoped. Voice channels have no native message object; voice content only exists if a bot or logging integration captured a transcript.

If an adapter starts making review decisions (deciding what's noise, deciding how to display something), that logic has to be reimplemented, correctly, four times. Every platform quirk that can be absorbed into a structural mapping instead of a judgment call stays out of the adapter and becomes a property of the canonical schema that every downstream feature already knows how to handle.
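For Slack, a purely structural mapping of this kind might look like the sketch below. It assumes a simplified export record using Slack's `ts`, `thread_ts`, `user`, `text`, `bot_id`, and `blocks` keys; the function name `slack_to_canonical`, the `channel:ts` ID scheme, and the choice to point every reply's parent_message_id at the thread root are assumptions for illustration.

```python
from datetime import datetime, timezone
from typing import Any, Optional

def slack_to_canonical(raw: dict[str, Any], channel_id: str) -> dict[str, Any]:
    ts: str = raw["ts"]
    thread_ts: Optional[str] = raw.get("thread_ts")
    # A reply carries the root message's ts in thread_ts;
    # the thread root itself has thread_ts == ts.
    is_reply = thread_ts is not None and thread_ts != ts
    return {
        "platform": "slack",
        # Slack timestamps are unique per channel, so qualify them (assumed scheme).
        "message_id": f"{channel_id}:{ts}",
        "conversation_id": channel_id,
        "thread_id": f"{channel_id}:{thread_ts}" if thread_ts else None,
        "parent_message_id": f"{channel_id}:{thread_ts}" if is_reply else None,
        "timestamp_utc": datetime.fromtimestamp(float(ts), tz=timezone.utc),
        "body_text": raw.get("text", ""),
        "sender": {
            "platform_user_id": raw.get("user") or raw.get("bot_id"),
            "is_bot": "bot_id" in raw,
        },
        # Rich formatting is preserved untouched rather than interpreted.
        "platform_extensions": {"blocks": raw.get("blocks", [])},
    }

root = slack_to_canonical(
    {"ts": "1782916200.000100", "thread_ts": "1782916200.000100",
     "user": "U1", "text": "Kickoff"}, "C9")
reply = slack_to_canonical(
    {"ts": "1782916260.000200", "thread_ts": "1782916200.000100",
     "user": "U2", "text": "On it"}, "C9")
bot = slack_to_canonical({"ts": "1782916300.000300", "bot_id": "B1", "text": "Build passed"}, "C9")
```

Nothing here decides whether a bot message is noise; the adapter only records `is_bot` and leaves that call to the review layer.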
5. The Production Boundary — RFC 5322 / RSMF Wrapper
RFC 5322 already covers two things chat threading needs: the In-Reply-To header links a message to its parent, and the References header carries its ancestor chain. That is exactly what parent_message_id and thread_id need to express at the production layer.
| Canonical field | RFC 5322 header |
|---|---|
| sender.display_name_at_send_time | From |
| conversation_id participants | To |
| timestamp_utc | Date |
| message_id | Message-ID |
| parent_message_id | In-Reply-To |
| thread_id ancestor chain | References |
| Conversation name / date range | Subject (synthesized, e.g. “#general 2026-07-01”) |
| Platform-specific extras | X-Discord-GuildId, X-Discord-StickerIds, X-iMessage-Effect, X-Slack-BlockKit, etc. |
That last row is what makes RFC 5322 a durable choice rather than a lossy one. The format permits optional header fields beyond the standard set, and the long-standing custom X- header convention gives us a way to carry platform-specific data without breaking generic EML parsers, the same way RSMF does today. We don't need a bespoke format to preserve a Discord sticker ID or an iMessage effect; we need one more header.
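The header mapping can be sketched with Python's standard `email` package. The helper names `msgid` and `to_eml_headers`, the flat input dict, and the synthetic `@<platform>.invalid` address and message-ID scheme are all assumptions for illustration; the To and Subject rows are omitted to keep the example short.

```python
from datetime import datetime, timezone
from email.message import EmailMessage
from email.utils import format_datetime, formataddr

def msgid(native_id: str, platform: str) -> str:
    # Hypothetical scheme: wrap a canonical ID in RFC 5322 msg-id syntax.
    return f"<{native_id}@{platform}.invalid>"

def to_eml_headers(msg: dict, ancestors: list[str],
                   x_headers: dict[str, str]) -> EmailMessage:
    eml = EmailMessage()
    platform = msg["platform"]
    # Chat users have no email address, so synthesize one (assumption).
    eml["From"] = formataddr(
        (msg["display_name"], f"{msg['user_id']}@{platform}.invalid"))
    eml["Date"] = format_datetime(msg["timestamp_utc"])
    eml["Message-ID"] = msgid(msg["message_id"], platform)
    if msg.get("parent_message_id"):
        eml["In-Reply-To"] = msgid(msg["parent_message_id"], platform)
    if ancestors:
        # Oldest ancestor first, immediate parent last.
        eml["References"] = " ".join(msgid(a, platform) for a in ancestors)
    # Platform-specific extras ride along as extension headers.
    for name, value in x_headers.items():
        eml[name] = value
    return eml

eml = to_eml_headers(
    {"platform": "discord", "display_name": "Alex", "user_id": "42",
     "message_id": "m-3", "parent_message_id": "m-2",
     "timestamp_utc": datetime(2026, 7, 1, 14, 30, tzinfo=timezone.utc)},
    ancestors=["m-1", "m-2"],
    x_headers={"X-Discord-GuildId": "1234"},
)
```

A generic EML parser that has never heard of Discord still reads the threading headers correctly and simply ignores `X-Discord-GuildId`.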
6. Document Boundary and Batching
The production “document” is not a single message. Following the same convention RSMF uses for Slack and Teams conversations, messages are grouped into a conversation document by channel or thread, batched by time period (for example, one file per day), and capped at a fixed message count (for example, N = 10,000) for performance.
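The grouping rule above can be sketched as follows. The function name `batch_messages` and the dict-shaped messages are hypothetical, and splitting days on the UTC date is an assumption; a real pipeline might split on the custodian's local day instead.

```python
from collections import defaultdict
from datetime import datetime, timezone

def batch_messages(messages, max_per_doc=10_000):
    """Group by (conversation, thread, UTC day), then split groups over the cap."""
    groups = defaultdict(list)
    for m in messages:
        key = (m["conversation_id"], m.get("thread_id"), m["timestamp_utc"].date())
        groups[key].append(m)

    docs = []
    for key, group in groups.items():
        group.sort(key=lambda m: m["timestamp_utc"])
        # Slice each group into chunks of at most max_per_doc messages.
        for start in range(0, len(group), max_per_doc):
            docs.append((key, group[start:start + max_per_doc]))
    return docs

# Five messages on 1 July and one on 2 July, all in the same channel.
sample = [
    {"conversation_id": "general", "thread_id": None,
     "timestamp_utc": datetime(2026, 7, 1, 9, i, tzinfo=timezone.utc)}
    for i in range(5)
]
sample.append({"conversation_id": "general", "thread_id": None,
               "timestamp_utc": datetime(2026, 7, 2, 9, 0, tzinfo=timezone.utc)})

docs = batch_messages(sample, max_per_doc=2)
sizes = [len(batch) for _, batch in docs]
```

With a cap of 2, the five messages from 1 July become three documents and the single message from 2 July becomes a fourth.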
Each conversation-document file receives one Message-ID and one Bates number, and carries the full canonical JSON payload — all messages in that window plus their attachments — as the attachment body, mirroring RSMF's own convention of an often-empty EML body with the real data attached separately.
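A conversation document of this shape could be assembled with the standard `email` package as sketched below. The function name `build_conversation_document`, the `X-Bates-Number` header, the `conversation.json` filename, and the payload layout are assumptions for illustration and are not the RSMF manifest format itself.

```python
import json
from email import message_from_bytes, policy
from email.message import EmailMessage
from email.utils import make_msgid

def build_conversation_document(subject: str, messages: list[dict],
                                bates_number: str) -> bytes:
    eml = EmailMessage()
    eml["Subject"] = subject
    eml["Message-ID"] = make_msgid(domain="chat.invalid")  # one ID per document
    eml["X-Bates-Number"] = bates_number                   # hypothetical header name
    # No text body is set: the canonical payload travels as the attachment.
    payload = json.dumps({"messages": messages}, ensure_ascii=False).encode("utf-8")
    eml.add_attachment(payload, maintype="application", subtype="json",
                       filename="conversation.json")
    return eml.as_bytes()

batch = [
    {"message_id": "m-1", "timestamp_utc": "2026-07-01T14:30:00+00:00", "body_text": "Kickoff"},
    {"message_id": "m-2", "timestamp_utc": "2026-07-01T14:31:00+00:00", "body_text": "On it"},
]
raw = build_conversation_document("#general 2026-07-01", batch, "ABC0000123")

# Round trip: a receiving system parses the EML and reads the JSON back out.
parsed = message_from_bytes(raw, policy=policy.default)
attachment = next(parsed.iter_attachments())
recovered = json.loads(attachment.get_content())
```

The round trip at the end is the property that matters at the boundary: everything review produced is recoverable from the attachment, whatever the receiving platform does with the headers.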
7. Net Result
- One canonical schema and four thin adapters, instead of four separate platform-specific pipelines.
- Review, tagging, redaction, and AI-assisted classification all operate against the canonical model, independent of any production format.
- The RFC 5322 / RSMF output at the production boundary means any existing eDiscovery platform that already ingests RSMF can ingest this without a bespoke connector.
- The platform_extensions bag and X- header convention allow new platform-specific fields to be added without a schema rewrite, in the same way RFC 5322 has supported extension headers for decades.
None of the four platforms are treated as first-class citizens in the architecture — the canonical schema is. That's a deliberate bet: the next chat platform that shows up in a matter, whatever it is, needs one more thin adapter, not a new pipeline.
8. Conclusion
The reason this design holds up is that it separates two concerns that are easy to accidentally merge: how you work with the data and how you hand it off. A canonical schema optimized for query performance and review-state tracking is a bad fit for interoperability; a transport format optimized for interoperability is a bad fit for search and redaction at scale. Keeping them as two distinct layers — and only crossing from one to the other once, at the production boundary — is what lets iMessage, Slack, Teams, and Discord all run through the same review workflow without four parallel implementations quietly drifting apart.
To see how this schema handles your own iMessage, Slack, Teams, or Discord export, book a session with our technical team.