Product Architecture · Chat Data · July 2026

One Schema, Four Platforms: How DecoverAI Built a Unified Internal Format for iMessage, Slack, Teams, and Discord

iMessage, Slack, Teams, and Discord all share the same underlying shape — messages, senders, threads, attachments, reactions — but every platform expresses that shape differently. Rather than building four separate ingestion pipelines, we normalize all four into one canonical message schema, and only wrap it into an RFC 5322 / RSMF-style container at the production boundary. Here's the design.

DecoverAI Product & Engineering  ·  Design record for the cross-platform chat ingestion schema underlying iMessage, Slack, Teams, and Discord review  ·  July 2026

Four chat platforms, one review workflow, and a decision to make before writing a single line of ingestion code: do we build a separate pipeline for each platform, or do we build one pipeline that all four platforms feed into? We have already written about the review-side decisions behind message-level tagging, thread reconstruction, and redaction for Discord and Slack. This post is about the layer underneath all of that — the internal data model that makes it possible to run the same review, tagging, and production logic over iMessage, Slack, Teams, and Discord data without four parallel codebases silently drifting apart.

The short version: chat data from every platform shares a common shape — messages, senders, threads, attachments, reactions — but each platform expresses that shape differently. So we normalize every platform into one canonical schema at ingestion, run every downstream capability against that single schema, and only convert to an RFC 5322 / RSMF-style container at the production boundary, following the precedent set by Relativity's RSMF (Relativity Short Message Format) — itself an RFC 5322 (EML) file carrying normalized short-message data as an attachment.

1. Why Two Layers, Not One

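The obvious shortcut is to make the internal working format itself RFC 5322-native: store everything as EML from the start, since that's the format eDiscovery platforms already expect. We didn't do that. A transport format optimized for interoperability is a bad fit for search, tagging, and redaction at scale, and a schema optimized for query performance and review-state tracking is a bad fit for handing data to another platform.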

So the pipeline has three stages: platform-specific adapters that map raw exports into the canonical schema, the canonical message model itself — which is what search, tagging, and redaction actually operate against — and a conversation-document batching layer that groups messages into RFC 5322 / RSMF containers only at production time.

2. Architecture Overview

iMessage   Slack   Teams   Discord
↓ ↓ ↓ ↓
Canonical message model (platform-agnostic schema)
↓
Conversation document (batching & Bates boundary)
↓
RFC 5322 / RSMF wrapper (interop production layer)

Figure: Four platform adapters feed one canonical model, which batches into conversation documents, which wrap into an RFC 5322 / RSMF container only at production.

Everything above the last arrow — adapters, canonical model, conversation documents — is internal. Only the final stage has to speak a format another platform understands, and it only has to speak it once, at the very end.

3. Layer 1: The Canonical Message Model

Every platform adapter maps its raw export into this shared, platform-agnostic schema. This is the single data model that search, tagging, redaction, and AI classification all run against, regardless of which platform a given message came from:

Field group | Fields | Why
Identity | message_id (native + internal UUID), platform, sender.platform_user_id, sender.resolved_identity_id, sender.display_name_at_send_time, sender.role_at_send_time, sender.confidence_score | Handles nickname/username drift consistently across all four platforms, with an audit trail and confidence score exposed to the reviewer
Conversation structure | conversation_id, conversation_type (channel / DM / group DM / thread), parent_message_id, thread_id | Separates a formal thread (Discord Threads, Slack thread_ts, Teams reply chains) from a simple reply-to/quote, which is structurally different on every platform
Timing | timestamp_utc, display_timezone | All time is stored internally as UTC; the reviewer sees it in the custodian's local timezone
Content | body_text, attachments[] (type, hash, vision-model metadata, OCR text), reactions[], voice_transcript | One shape for text, files, images, embeds/stickers/emoji, and transcribed audio, regardless of source platform
Lifecycle | edit_history[], deletion_status (none / content-unavailable / recovered) | Generalizes Slack, Teams, iMessage, and Discord edit/delete/unsend behavior into one shape
Review state | tags[] (field, value, coder, confidence, overturned_by), redaction_state[] (scope: word / message / attachment) | Tagging and redaction granularity live on the canonical message, independent of the production container
Provenance | content_hash, near_dup_cluster_id, chain_of_custody (collection method, source hash manifest, processing exceptions) | Supports defensible collection and deduplication across all platforms
Platform extension | platform_extensions {} | An open bag for platform-specific fields that don't map cleanly to the core schema
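To make the table concrete, here is a minimal Python sketch of a subset of the canonical model. The field names come from the table above; the class names, the types, the choice of dataclasses, and the UTC normalization in `__post_init__` are illustrative assumptions, not DecoverAI's actual implementation. Lifecycle, review-state, and provenance fields are omitted for brevity.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import Any, Optional
from zoneinfo import ZoneInfo
import uuid


@dataclass
class Sender:
    platform_user_id: str
    display_name_at_send_time: str
    resolved_identity_id: Optional[str] = None
    role_at_send_time: Optional[str] = None
    confidence_score: Optional[float] = None


@dataclass
class CanonicalMessage:
    native_message_id: str          # the platform's own message ID
    platform: str                   # "imessage", "slack", "teams", or "discord"
    sender: Sender
    conversation_id: str
    conversation_type: str          # "channel", "dm", "group_dm", or "thread"
    timestamp_utc: datetime
    display_timezone: str           # IANA zone name (assumed representation)
    body_text: str = ""
    parent_message_id: Optional[str] = None
    thread_id: Optional[str] = None
    attachments: list[dict[str, Any]] = field(default_factory=list)
    reactions: list[dict[str, Any]] = field(default_factory=list)
    deletion_status: str = "none"
    platform_extensions: dict[str, Any] = field(default_factory=dict)
    internal_id: str = field(default_factory=lambda: str(uuid.uuid4()))

    def __post_init__(self) -> None:
        # Reject naive datetimes, then normalize whatever offset arrived to UTC.
        if self.timestamp_utc.tzinfo is None:
            raise ValueError("timestamp_utc must be timezone-aware")
        self.timestamp_utc = self.timestamp_utc.astimezone(timezone.utc)

    def display_time(self) -> datetime:
        # Reviewer-facing time in the custodian's zone; storage stays UTC.
        return self.timestamp_utc.astimezone(ZoneInfo(self.display_timezone))


example = CanonicalMessage(
    native_message_id="1001",
    platform="discord",
    sender=Sender(platform_user_id="42", display_name_at_send_time="Ana"),
    conversation_id="general",
    conversation_type="channel",
    timestamp_utc=datetime(2026, 7, 1, 8, 0, tzinfo=timezone(timedelta(hours=-4))),
    display_timezone="America/New_York",
)
print(example.timestamp_utc.isoformat())  # 2026-07-01T12:00:00+00:00
```

Keeping `platform_extensions` as a plain dictionary is what lets the core fields stay fixed while adapters still preserve anything that has no canonical home.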

4. Layer 2: Four Thin Adapters

Each adapter's only job is to absorb its platform's quirks so the canonical schema stays clean. None of them carry review logic — that all lives above them, against the canonical model:

iMessage: Tapbacks & Effects
Tapbacks map to reactions[]. Message effects (e.g. slam, invisible ink) go into platform_extensions. The service field distinguishes SMS from iMessage. Typically parsed from the local chat.db SQLite store via forensic tooling rather than a public API.

Slack: Native Threads & Block Kit
Native threading via thread_ts maps directly to thread_id. Block Kit rich formatting maps to body_text plus a structured extension. App/bot messages are flagged via sender.is_bot.

Teams: Reply Chains & Adaptive Cards
Reply chains are nested under a channel or chat. Adaptive cards map to the extension bag. Retention and legal hold status, typically surfaced via Microsoft Purview, feed chain_of_custody.

Discord: Guild-Scoped Everything
Guild → channel → thread forms a hierarchy. message_reference distinguishes an inline reply from a true Thread (thread_id). Roles and nicknames are scoped per server, so sender.role_at_send_time must be guild-scoped. Voice channels have no native message object, so voice content only exists if a bot or logging integration captured a transcript.

The point of keeping adapters thin

If an adapter starts making review decisions — deciding what's noise, deciding how to display something — that logic has to be reimplemented, correctly, four times. Every platform quirk that can be absorbed into a structural mapping instead of a judgment call stays out of the adapter and becomes a property of the canonical schema that every downstream feature already knows how to handle.
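As an illustration of what "thin" can mean in practice, the sketch below maps one Slack export message into a canonical-style dictionary using structural rules only. The function name and the output dictionary layout are hypothetical; the input keys (`ts`, `thread_ts`, `user`, `bot_id`, `text`, `reactions`, `blocks`) follow the shape of Slack's message JSON. Treating the thread root as `parent_message_id` for every reply is an assumption based on Slack threads being flat.

```python
from datetime import datetime, timezone
from typing import Any


def slack_to_canonical(raw: dict[str, Any], channel_id: str) -> dict[str, Any]:
    """Map one Slack message to a canonical-style dict; no review logic here."""
    ts = raw["ts"]
    thread_ts = raw.get("thread_ts")
    # A thread root carries thread_ts == ts; a reply carries the root's ts.
    is_reply = thread_ts is not None and thread_ts != ts
    return {
        "message_id": ts,
        "platform": "slack",
        "conversation_id": channel_id,
        "thread_id": thread_ts,
        "parent_message_id": thread_ts if is_reply else None,
        # Slack timestamps are epoch seconds encoded as a string.
        "timestamp_utc": datetime.fromtimestamp(float(ts), tz=timezone.utc),
        "sender": {
            "platform_user_id": raw.get("user") or raw.get("bot_id"),
            "is_bot": "bot_id" in raw,
        },
        "body_text": raw.get("text", ""),
        "reactions": [
            {"emoji": r["name"], "user_ids": r.get("users", [])}
            for r in raw.get("reactions", [])
        ],
        # Block Kit structure is preserved verbatim rather than interpreted.
        "platform_extensions": {"blocks": raw["blocks"]} if "blocks" in raw else {},
    }


reply = slack_to_canonical(
    {
        "ts": "1782864060.000300",
        "thread_ts": "1782864000.000200",
        "user": "U1",
        "text": "Agreed.",
        "reactions": [{"name": "thumbsup", "users": ["U2"], "count": 1}],
    },
    channel_id="C1",
)
print(reply["thread_id"], reply["parent_message_id"])
```

Nothing in the function decides whether a message matters or how it should be displayed; it only renames and reshapes, which is the property the paragraph above argues for.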

5. The Production Boundary: RFC 5322 / RSMF Wrapper

RFC 5322 already solves two problems that chat threading has. The In-Reply-To header links a message to its parent, and the References header carries its ancestor chain, which is exactly what parent_message_id and thread_id need to express at the production layer.

Canonical field | RFC 5322 header
sender.display_name_at_send_time | From
conversation_id participants | To
timestamp_utc | Date
message_id | Message-ID
parent_message_id | In-Reply-To
thread_id ancestor chain | References
Conversation name / date range | Subject (synthesized, e.g. “#general 2026-07-01”)
Platform-specific extras | X-Discord-GuildId, X-Discord-StickerIds, X-iMessage-Effect, X-Slack-BlockKit, etc.

That last row is what makes RFC 5322 a durable choice rather than a lossy one. The format's own extension mechanism, custom X- headers, is a sanctioned way to carry platform-specific data without breaking generic EML parsers, the same way RSMF does today. We don't need a bespoke format to preserve a Discord sticker ID or an iMessage effect; we need one more header.
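The header mapping can be sketched with Python's standard `email` package. This is an illustration under stated assumptions, not the production code: the function name, the input dictionary keys, and the placeholder domain `discord.chat.invalid` used to form address-shaped values are all hypothetical, and only Discord extension headers are shown.

```python
from datetime import datetime, timezone
from email.headerregistry import Address
from email.message import EmailMessage
from email.utils import format_datetime

PSEUDO_DOMAIN = "discord.chat.invalid"  # hypothetical placeholder domain


def _msgid(native_id: str) -> str:
    # RFC 5322 message identifiers take the form <left@right>.
    return f"<{native_id}@{PSEUDO_DOMAIN}>"


def build_headers(msg: dict, participants: list[dict],
                  ancestors: list[str], subject: str) -> EmailMessage:
    eml = EmailMessage()
    eml["From"] = Address(display_name=msg["sender_display_name"],
                          username=msg["sender_id"], domain=PSEUDO_DOMAIN)
    eml["To"] = [Address(display_name=p["display_name"], username=p["id"],
                         domain=PSEUDO_DOMAIN) for p in participants]
    eml["Date"] = format_datetime(msg["timestamp_utc"])
    eml["Message-ID"] = _msgid(msg["message_id"])
    if msg.get("parent_message_id"):
        eml["In-Reply-To"] = _msgid(msg["parent_message_id"])
    if ancestors:
        # Oldest ancestor first, direct parent last.
        eml["References"] = " ".join(_msgid(a) for a in ancestors)
    eml["Subject"] = subject
    for name, value in msg.get("platform_extensions", {}).items():
        eml[f"X-Discord-{name}"] = str(value)
    return eml


headers = build_headers(
    {
        "message_id": "1003",
        "parent_message_id": "1002",
        "sender_id": "42",
        "sender_display_name": "Ana",
        "timestamp_utc": datetime(2026, 7, 1, 12, 0, tzinfo=timezone.utc),
        "platform_extensions": {"GuildId": 7},
    },
    participants=[{"id": "43", "display_name": "Ben"}],
    ancestors=["1001", "1002"],
    subject="#general 2026-07-01",
)
print(headers["In-Reply-To"], headers["X-Discord-GuildId"])
```

Because the extras travel as ordinary header fields, a parser that knows nothing about Discord can still read the message and simply ignore them.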

6. Document Boundary and Batching

The production “document” is not a single message. Following the same convention RSMF uses for Slack and Teams conversations, messages are grouped into a conversation document by channel or thread, batched by time period (for example, one file per day), and capped at a fixed message count (for example, N = 10,000) for performance.
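The grouping rule described above can be sketched as follows. The function name and the dictionary layout are hypothetical, and using the UTC calendar day as the time window is an assumption; the text only says "one file per day" without specifying the zone.

```python
from collections import defaultdict
from datetime import datetime, timezone


def batch_conversation_documents(messages: list[dict], max_messages: int = 10_000) -> list[dict]:
    """Group by conversation/thread and UTC day, then split at the message cap."""
    groups: dict[tuple, list[dict]] = defaultdict(list)
    for m in messages:
        # "" stands in for "no thread" so keys stay sortable.
        key = (m["conversation_id"], m.get("thread_id") or "", m["timestamp_utc"].date())
        groups[key].append(m)

    documents = []
    for key in sorted(groups):
        ordered = sorted(groups[key], key=lambda m: m["timestamp_utc"])
        for start in range(0, len(ordered), max_messages):
            documents.append({
                "conversation_id": key[0],
                "thread_id": key[1] or None,
                "day": key[2],
                "part": start // max_messages + 1,
                "messages": ordered[start:start + max_messages],
            })
    return documents


sample = [
    {"conversation_id": "C1", "timestamp_utc": datetime(2026, 7, 1, 9, i, tzinfo=timezone.utc)}
    for i in range(5)
] + [{"conversation_id": "C1", "timestamp_utc": datetime(2026, 7, 2, 9, 0, tzinfo=timezone.utc)}]

docs = batch_conversation_documents(sample, max_messages=2)
print([len(d["messages"]) for d in docs])  # [2, 2, 1, 1]
```

With a cap of 2, the five messages from 1 July split into three documents and the single message from 2 July gets its own, so each output entry corresponds to one file and one Bates number.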

Each conversation-document file receives one Message-ID and one Bates number, and carries the full canonical JSON payload — all messages in that window plus their attachments — as the attachment body, mirroring RSMF's own convention of an often-empty EML body with the real data attached separately.
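A rough sketch of that wrapping step, using Python's standard `email` package: the EML carries no body part, and the canonical JSON travels as an attachment. This is not a conforming RSMF file, which has its own packaging requirements; the function names, the `conversation.json` filename, and the `X-Bates-Number` header are hypothetical choices made only for illustration.

```python
import json
from email import message_from_bytes, policy
from email.message import EmailMessage


def wrap_conversation_document(doc: dict, message_id: str, bates_number: str) -> bytes:
    """Wrap one conversation document as an EML with a JSON attachment."""
    eml = EmailMessage()
    eml["Message-ID"] = message_id
    eml["Subject"] = f'{doc["conversation_name"]} {doc["day"]}'
    eml["X-Bates-Number"] = bates_number  # hypothetical header name
    # No body part is set, so the body stays empty; the data is the attachment.
    payload = json.dumps(doc, sort_keys=True).encode("utf-8")
    eml.add_attachment(payload, maintype="application", subtype="json",
                       filename="conversation.json")
    return eml.as_bytes()


def read_payload(raw: bytes) -> dict:
    """Parse the EML back and return the attached canonical JSON."""
    parsed = message_from_bytes(raw, policy=policy.default)
    attachment = next(parsed.iter_attachments())
    return json.loads(attachment.get_content())


document = {
    "conversation_name": "#general",
    "day": "2026-07-01",
    "messages": [
        {"message_id": "1001", "body_text": "Kickoff at 10."},
        {"message_id": "1002", "parent_message_id": "1001", "body_text": "Works for me."},
    ],
}
raw_eml = wrap_conversation_document(document, "<doc-0001@chat.invalid>", "ABC0000001")
print(read_payload(raw_eml) == document)  # True
```

The round trip through `read_payload` is the property that matters here: whatever the receiving platform does with the headers, the full canonical payload survives intact.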

7. Net Result

None of the four platforms are treated as first-class citizens in the architecture — the canonical schema is. That's a deliberate bet: the next chat platform that shows up in a matter, whatever it is, needs one more thin adapter, not a new pipeline.

8. Conclusion

The reason this design holds up is that it separates two concerns that are easy to accidentally merge: how you work with the data and how you hand it off. A canonical schema optimized for query performance and review-state tracking is a bad fit for interoperability; a transport format optimized for interoperability is a bad fit for search and redaction at scale. Keeping them as two distinct layers — and only crossing from one to the other once, at the production boundary — is what lets iMessage, Slack, Teams, and Discord all run through the same review workflow without four parallel implementations quietly drifting apart.

To see how this schema handles your own iMessage, Slack, Teams, or Discord export, book a session with our technical team.
