D079: Voice-Text Bridge — Speech-to-Text Captions, Text-to-Voice Synthesis, and AI Voice Personas
| Status | Draft |
| Phase | Phase 5 (basic STT captions + basic TTS), Phase 6a (AI voice personas, pluggable backends), Phase 7 (cross-language translation) |
| Depends on | D059 (communication system — VoIP, text chat, muting), D034 (SQLite for mute preferences), D052 (community servers — relay forwarding) |
| Driver | Xbox shipped party chat STT/TTS in 2021 (Xbox Accessibility Guideline 119). Google Meet shipped AI voice mimicry in 2025. VRChat’s mute community (~30% of users choose not to speak) proved the demand for text-based voice alternatives. No RTS has integrated STT/TTS as a first-class communication mode. IC’s existing Opus VoIP pipeline (D059) and pluggable LLM/AI backend pattern (player-flow/llm-setup-guide.md) make this architecturally natural. |
Philosophy & Scope Note
This decision is Draft — experimental, requires community validation before Accepted.
What is proven: Platform-level STT/TTS is an accessibility standard (Xbox Guideline 119, Sea of Thieves, Forza Horizon 5). Auto-generated captions for voice chat are a top community request across Discord, VRChat, and competitive gaming communities.
What is experimental: Per-player AI voice personas and the “shy player pipeline” (type → hear a unique AI voice) are novel as integrated game features. VRChat community mods prove demand exists; Google Meet proves the technology works. But no game ships this as a first-class feature — community validation is needed.
Philosophy alignment: This is accessibility-first design (Philosophy Principle 10: “Build with the community”). STT captions and basic TTS directly address the needs of deaf/HoH players and non-verbal players. AI voice personas are experimental flavor on top.
Decision Capsule (LLM/RAG Summary)
- Status: Draft
- Phase: Phase 5 (STT captions + basic TTS) → Phase 6a (AI voice personas, pluggable backends) → Phase 7 (cross-language translation)
- Canonical for: Voice-to-text transcription, text-to-voice synthesis, AI voice personas, pluggable STT/TTS backends, voice accessibility in multiplayer
- Scope:
ic-audio(STT/TTS processing),ic-ui(caption overlay, voice persona settings),ic-net(SynthChatMessageorder routing,is_syntheticVoicePacket flag),ic-game(backend orchestration) - Decision: IC provides a bidirectional voice-text bridge with two independent formats: (1) auto-generated captions that transcribe incoming voice into text overlays, and (2) a text-to-voice pipeline where typed chat is synthesized into per-player AI voices. Both are opt-in per player, independently controllable, and pluggable (local or cloud backends). Muting operates on three independent channels: voice, synthesized voice, and text.
- Why:
- Accessibility: deaf/HoH players need captions; non-verbal players need a voice alternative (Xbox Guideline 119)
- Social comfort: shy players, players in noisy environments, or players uncomfortable with their voice can participate in voice-culture lobbies without speaking
- Cross-language play: STT + translation + TTS enables communication across language barriers (Phase 7)
- IC’s VoIP pipeline (D059 Opus encoding, relay forwarding) and LLM provider pattern already provide the infrastructure
- Non-goals: Real-time voice moderation via STT (explicitly rejected in D059 § Alternatives Rejected — compute-intensive, privacy-invasive, unreliable across accents). Voice cloning of other players without consent. Replacing voice chat — this augments it.
- Out of current scope: Emotion detection in voice. Voice style transfer (making your voice sound like someone else’s in real-time). Real-time voice translation during ranked matches (latency concern).
- Invariants preserved: Deterministic sim unaffected (STT is listener-local cosmetic processing; TTS synthesis is client-side and never affects sim state).
SynthChatMessageis aPlayerOrdervariant in the order stream (same lane asChatMessageper D059), not a separate message lane. VoIP relay architecture unchanged for Mode B (default); Mode A addsis_syntheticbit toVoicePacket. Muting model (D059) extended, not replaced. - Defaults / UX behavior: Both STT and TTS are off by default. Players opt in via Settings → Communication → Voice-Text Bridge. Default synthesis mode is receiver-side (Mode B) — no player hears an AI voice unless they personally enable TTS playback on their own client. Sender-side synthesis (Mode A) requires explicit opt-in from both sender AND receiver (
accept_synthetic_voiceflag). Local models are downloadable model packs (~250MB), not bundled with the base install. - Security / Trust impact: STT runs on the listener’s client (the speaker’s voice is not transcribed server-side — privacy preserved). Default TTS mode (receiver-side) keeps text on the wire — synthesis is local to the listener. Sender-side TTS (Mode A) adds an
is_syntheticbit toVoicePacketfor independent mute routing.SynthChatMessageis a newPlayerOrdervariant (protocol change, replay-visible). - Performance impact: Local STT (Whisper turbo): ~50-200ms per utterance, ~200MB downloadable model pack. Local TTS (Piper/ONNX): <100ms, ~50MB downloadable model pack. Cloud backends: ~75-300ms round-trip. Native platforms: processing on background threads — no frame budget impact. WASM: local models unavailable (no
std::thread), cloud-only via async fetch on single-threaded executor. - Public interfaces / types / commands:
VoiceTextBridge,SttBackend,TtsBackend,VoicePersonaId(namespacedResourceId),VoiceBridgeCapability(session advertisement),SynthChatMessage(order variant),PlayerMuteState(extended),CaptionOverlay,/captions,/tts,/voice-persona,/player <name> mute - Affected docs:
decisions/09g/D059-communication.md(muting model extension,VoicePacketis_syntheticflag,PlayerOrder::SynthChatMessagevariant — protocol version bump),player-flow/multiplayer.md(VoiceBridgeCapabilitylobby exchange during ready-up),player-flow/settings.md(voice-text bridge settings panel),player-flow/llm-setup-guide.md(backend configuration pattern),formats/save-replay-formats.md(replay protocol version forSynthChatMessage) - Keywords: speech to text, text to speech, STT, TTS, captions, voice accessibility, AI voice, voice persona, voice synthesis, shy player, mute player, cross-language, pluggable backend, Whisper, ElevenLabs, Piper
Problem
IC’s communication system (D059) provides text chat channels and relay-forwarded VoIP. Players can mute voice, mute text, or mute both per player. But there is no bridge between the two modalities:
- A deaf player cannot read what a speaking player says (no captions)
- A non-verbal player cannot be “heard” in voice-culture lobbies (no text-to-voice)
- Players who are shy about their voice, in noisy environments, or speaking a different language have no alternative to raw voice chat
- Cross-language teams cannot communicate via voice at all
These gaps are solved by other platforms:
- Xbox Party Chat (2021) ships STT + TTS as platform-level features
- Google Meet (2025) ships AI voice mimicry for translation
- VRChat’s community built TTS Voice Wizard because ~30% of users choose to be mute
Prior Art
| Platform | STT (Voice→Text) | TTS (Text→Voice) | Personalized Voice | Cross-Language | Notes |
|---|---|---|---|---|---|
| Xbox Party Chat (2021) | Yes — real-time overlay | Yes — typed text read to party | No — generic system voices | English (US) only | Accessibility gold standard. Uses Azure Cognitive Services via PlayFab Party. Ships as platform feature, not per-game |
| Sea of Thieves | Yes — voice→text overlay | Yes — text→voice for typed chat | No | English (US) only | Built on Xbox’s PlayFab Party API |
| Google Meet (2025) | Yes — 80+ languages | Yes — AI speech translation | Yes — synthetic voice mimics speaker’s tone, rhythm, pacing | English↔Spanish (expanding) | Uses Gemini AI. The personalized voice feature is the closest precedent to IC’s AI voice personas |
| Microsoft Teams (2025) | Yes — 34 languages | Yes — Interpreter Agent | Partial | 9 languages | Interpreter Agent translates speech in real-time |
| VRChat (community) | Yes — TTS Voice Wizard, TaSTT | Yes — 100+ voices | Yes — voice cloning, character voices | 50+ languages | Community-built tools, not official. Proves demand from mute/shy players |
| Discord | Not native (bot-only) | /tts (basic) | No | Limited | Most-requested accessibility feature, still not native |
| ElevenLabs (API) | No | Yes — 10K+ voices | Yes — clone from 10s audio | 32+ languages | ~75ms latency. Leading personalized TTS API |
| OpenAI Whisper (open-source) | Yes — state-of-the-art accuracy | No | N/A | 99 languages | Open-source, runs locally. Turbo model: 216x real-time speed. Privacy-safe (no cloud). ~200MB model |
Decision
IC implements a Voice-Text Bridge — a bidirectional, opt-in system with two independent formats and pluggable backends.
Format 1: Auto-Generated Captions (STT — Voice→Text)
Incoming voice chat is transcribed into text and displayed as a caption overlay on the listener’s screen. Processing happens entirely on the listener’s client — the speaker’s voice is never transcribed server-side.
UX Flow
- Listener enables captions in Settings → Communication → Voice-Text Bridge → Captions: On
- When a teammate speaks, the listener’s STT backend transcribes the Opus audio stream
- Transcribed text appears as a caption overlay at the bottom of the screen (configurable position)
- Captions are attributed to the speaker (player name + faction color)
- Captions fade after 5 seconds (configurable)
Caption Overlay
┌──────────────────────────────────────────────────────┐
│ GAME VIEW │
│ │
│ │
│ │
│ ┌────────────────────────────────────────────┐ │
│ │ [Alice] Attack the north bridge │ │
│ │ [Bob] I'll send tanks to support │ │
│ └────────────────────────────────────────────┘ │
│ ▲ Caption overlay (position/size configurable) │
└──────────────────────────────────────────────────────┘
Multi-Language Captions (Phase 7)
When cross-language translation is enabled, captions are displayed in the listener’s preferred language:
- STT transcribes voice → source-language text
- Translation engine translates source text → listener’s language
- Caption shows translated text with a language indicator and machine-translation trust label (per D068 § Machine-Translated Content Labeling):
[Alice 🇫🇷→🇬🇧 ⚙️MT] Attack the north bridge
The ⚙️MT (machine-translated) label follows the same pattern D068 uses for machine-translated UI strings and mod descriptions — the player must always know when text has been machine-translated rather than human-authored. The label is non-dismissible (always visible when translation is active).
Caption Interaction with Chat
Captions are not injected into the text chat channel — they appear only in the caption overlay. D059 treats the chat panel as deterministic order-stream data preserved for replay/review (D059-overview-text-chat-voip-core.md § Chat Architecture). STT transcripts are listener-local, non-deterministic artifacts (different listeners may get different transcriptions of the same speech) and must never be mixed into the canonical chat stream.
If the player wants scrollback review of transcripts, they are displayed in a separate transcript panel (a secondary tab or collapsible section below the chat panel), visually distinct from chat:
- Transcript entries are tagged with a
🎤 STTprefix and rendered in italic or a muted color - They are not stored in the replay’s chat log (they are ephemeral, listener-local)
- They are not visible to other players (each listener’s transcription is private to their client)
Format 2: Text-to-Voice Pipeline (TTS — Text→Voice)
A player types in text chat, and the receiving player hears it as synthesized speech. The sender’s text is transmitted as a chat order; the receiver’s client synthesizes it into audio locally.
Two Synthesis Modes
Mode B — Receiver-Side Synthesis (default):
The flow: Player A configures voice parameters → sends chat message → Player B has TTS enabled → Player B’s TTS engine reads Player A’s text using Player A’s voice parameters, after preprocessing the text for the engine.
- Player A configures their voice persona (voice model, pitch, speed, style) in Settings. These parameters are exchanged with all players during the lobby waiting room (ready-up phase) via
VoiceBridgeCapability— before the game starts. The receiver’s client can pre-load the correct TTS voice model during the loading screen - Player A types a message in text chat. Their client sends a
SynthChatMessageorder containing the text + Player A’spersona_id - Player B receives the
SynthChatMessage. Player B’s client checks: does Player B have TTS playback enabled? If no → display as normal text, done. If yes → continue - Player B’s client looks up Player A’s voice parameters (received earlier via
VoiceBridgeCapability): persona ID, pitch, speed, style - Player B’s TTS engine preprocesses the text (rule-based + spell correction — see § TTS Text Preprocessing) and synthesizes it using Player A’s configured voice parameters — not Player B’s own
- Player B hears the message spoken in Player A’s chosen voice. Player B controls whether to hear TTS, but the voice characteristics are Player A’s
This is the default because it respects the stated UX rule: “no player hears an AI voice unless they personally enable TTS playback.” The sender configures how they sound; the receiver decides whether to listen.
Mode A — Sender-Side Synthesis (opt-in, requires receiver consent):
- Player types a message in text chat
- Player’s client synthesizes the text into audio using their configured TTS backend and voice persona
- The synthesized Opus audio is transmitted via the VoIP relay lane with
is_synthetic: truein the voice packet (see § Voice Packet Extension below) - Receivers who have “Accept synthesized voice” enabled hear it as voice
- Receivers who have NOT opted in, or who have the sender’s synth muted, silently drop the packet — the message still arrives as text via the parallel
SynthChatMessageorder
Mode A is opt-in for both sender and receiver. The sender must enable “Send as voice” and the receiver must enable “Accept synthesized voice.” This prevents unsolicited AI voice injection into lobbies.
D059 competitive voice restrictions apply fully to synthetic voice. Mode A synthesized packets inherit all VoiceTarget routing rules from D059:
- Ranked:
VoiceTarget::Allis disabled — synthetic voice is team-only, same as real voice (D059 § Ranked voice channel restrictions) - Observers: Synthetic voice from observers is never forwarded to players during live games (D059 § Anti-coaching)
- Tournament: Organizer controls whether synthetic voice is allowed via
TournamentConfig - Synthetic voice packets are subject to the same relay mute/moderation pipeline as real voice (admin mute, abuse penalties, etc.)
Synthetic voice does not bypass these restrictions. The relay enforces routing rules identically for is_synthetic: true and is_synthetic: false packets — the flag is used only for per-player synth muting, not for routing policy.
Trade-offs:
| Aspect | Sender-Side (Mode A) | Receiver-Side (Mode B, default) |
|---|---|---|
| Network | Uses VoIP bandwidth (Opus audio) + text | Uses text bandwidth only (minimal) |
| Latency | Sender’s TTS latency + network | Network + receiver’s TTS latency |
| Voice consistency | All opted-in receivers hear the same voice | Each receiver uses their own TTS backend/quality |
| Privacy | Synthesized audio leaves sender’s machine | Only text leaves sender’s machine (better) |
| Opt-in guarantee | Requires receiver consent flag | Preserved by design (receiver controls synthesis) |
| Replay | Heard as voice in replay (if voice-in-replay enabled) | Text in order stream + voice params in replay metadata (see below) |
| Muting | Requires is_synthetic packet flag for independent synth muting | Receiver controls locally — trivial to mute |
Replay Voice Reproduction (Mode B)
Mode B records SynthChatMessage orders (with persona_id) in the replay’s tick order stream, but the sender’s VoiceParams (pitch, speed) are exchanged at lobby time, not per-message. To enable replay viewers to synthesize the same voice:
- All players’
VoiceBridgeCapabilitydata is stored in the replay metadata JSON (alongside player names, factions, and other session info). This adds ~50 bytes per player — negligible. - On replay playback, the viewer reads the capability set from metadata, resolves each player’s persona + voice params, and can synthesize
SynthChatMessagetext in the correct voice if the viewer has TTS enabled. - If the replay viewer does not have TTS enabled or lacks the persona’s model,
SynthChatMessageorders display as normal text — the text is always preserved regardless of TTS capability.
{
"players": [
{ "slot": 0, "name": "Alice", "faction": "allies", ... },
{ "slot": 1, "name": "Bob", "faction": "soviet", ... }
],
"voice_bridge_capabilities": [
{ "slot": 0, "tts_enabled": true, "persona_id": "official/commander-alpha", "pitch": 0.9, "speed": 1.1 },
{ "slot": 1, "tts_enabled": false, "persona_id": null, "pitch": 1.0, "speed": 1.0 }
]
}
Chat Order Extension
The current PlayerOrder::ChatMessage { channel, text } has no metadata field. D079 adds a new order variant for TTS-eligible messages:
#![allow(unused)]
fn main() {
/// Extension to PlayerOrder for voice-text bridge (D079).
/// When the sender has TTS enabled, SynthChatMessage is sent INSTEAD OF
/// ChatMessage (not alongside — one canonical order per message, no
/// duplicates in the order stream or replay). Receivers who don't have
/// TTS enabled display the text field as a normal chat message and
/// ignore persona_id.
pub enum PlayerOrder {
// ... existing variants ...
ChatMessage { channel: ChatChannel, text: String },
/// TTS-eligible chat message. Same as ChatMessage but includes
/// the sender's voice persona ID for receiver-side synthesis.
SynthChatMessage {
channel: ChatChannel,
text: String,
/// Sender's voice persona ID (namespaced ResourceId).
/// Receivers use this to select the TTS voice for synthesis.
/// e.g., "official/commander-alpha", "community/modx-voices/tactical".
/// "local/*" IDs trigger fallback to a built-in persona on receivers
/// who don't have the local definition.
persona_id: Option<VoicePersonaId>,
},
}
}
This is a replay-visible protocol change requiring a protocol version bump. SynthChatMessage is a new PlayerOrder discriminant. Per api-misuse-defense.md § O4, unknown PlayerOrder discriminants are deserialization errors — not silently skipped — and protocol version mismatch terminates the handshake before orders flow. There is no graceful mixed-version decoding. Clients must match protocol versions to play together. The version bump is reflected in the replay header’s version field and the wire format’s protocol version header.
Voice Packet Extension (Mode A only)
For sender-side synthesis, the VoicePacket struct (D059) gains a is_synthetic flag to enable independent muting:
#![allow(unused)]
fn main() {
pub struct VoicePacket {
pub speaker: PlayerId,
pub sequence: u32,
pub is_synthetic: bool, // D079: true if this packet is TTS-synthesized, not real voice
// ... existing fields ...
}
}
This is a protocol change to D059’s voice packet format — the is_synthetic flag occupies one bit in the packet header. Relay routing uses it: if a receiver has synth_voice_muted for this speaker, the relay skips forwarding synthetic packets (same bandwidth-saving pattern as D059’s mute hint). This change is part of the same protocol version bump as SynthChatMessage — voice packets are versioned alongside the order protocol. No mixed-version voice decoding.
TTS Text Preprocessing (Optional)
Raw chat text often contains typos, abbreviations, gaming slang, special symbols, emoji, and non-standard grammar that TTS engines handle poorly — producing garbled, robotic, or nonsensical speech. A preprocessing step normalizes the text for the TTS engine before synthesis, while leaving the original chat text untouched for display.
What it fixes:
| Input Pattern | Problem for TTS | Preprocessed Output |
|---|---|---|
atk north w/ tanks asap | Abbreviations read literally (“ay tee kay”, “double-you slash”) | “Attack north with tanks, ASAP” |
gg wp | Read as individual letters | “Good game, well played” |
lol ur base is ded 😂 | Emoji read as unicode name, slang mispronounced | “Ha, your base is dead” |
need $$ for mamoth tank | Symbols read literally, typo mispronounced | “Need money for mammoth tank” |
!!!!! HELP HELP | Excessive punctuation causes stutter/pause spam | “Help! Help!” |
сука блять (Cyrillic in English chat) | TTS engine for English can’t pronounce Cyrillic | Transliterated or skipped with notification |
<script>alert('xss')</script> | Not a TTS issue, but sanitization | Stripped (sanitized before display anyway) |
Implementation tiers (lightest to heaviest):
| Tier | Approach | Latency | Size | Quality | Phase |
|---|---|---|---|---|---|
| 1. Rule-based (default) | YAML dictionary of abbreviations/slang → expansions. Regex symbol stripping. Emoji → description mapping. Punctuation collapsing. WFST-style number/currency/unit expansion (“$5k” → “five thousand dollars”). Moddable per game module | <1ms | ~50KB dictionary | Good for common cases | Phase 5 |
| 2. Spell correction (optional) | SymSpell algorithm (symmetric delete, edit distance 2). Pre-computed dictionary of valid words. 1M× faster than Norvig’s approach, O(1) lookup. Catches typos like “mamoth” → “mammoth”, “atack” → “attack”. Language-independent | <1ms | ~5MB dictionary | Excellent for typos | Phase 5 |
| 3. Statistical normalization (optional) | Ekphrasis-style word segmentation + normalization using word frequency statistics (from gaming corpora). Handles hashtag splitting, elongation normalization (“nooooo” → “no”), social-media-style text. No neural network | ~5ms | ~20MB frequency tables | Good for slang/shorthand | Phase 6a |
| 4. WFST text normalization (optional) | Weighted Finite State Transducer grammars (NeMo-style). Deterministic, rule-compiled. Handles complex number formats, dates, addresses, unit conversions. Production-grade TTS preprocessing used by NVIDIA and Google | <5ms | ~10MB compiled grammar | Excellent for structured text | Phase 6a |
| 5. LLM/SLM-enhanced (optional, advanced) | Small local model or cloud LLM for context-aware normalization. Handles novel abbreviations, ambiguous intent, mixed-language input | 50-500ms | ~500MB+ (local) | Best for edge cases | Phase 7 |
Default stack: Tier 1 (rules) + Tier 2 (spell correction) ship in Phase 5. Both are <1ms, zero neural dependencies, fully offline. Tiers 3-5 are opt-in upgrades for players who want higher quality preprocessing.
Rule-based dictionary (YAML, moddable):
# tts-preprocessing/en.yaml — shipped with IC, moddable per game module
abbreviations:
gg: "good game"
wp: "well played"
glhf: "good luck, have fun"
asap: "as soon as possible"
atk: "attack"
def: "defend"
w/: "with"
ur: "your"
u: "you"
plz: "please"
thx: "thanks"
omw: "on my way"
brb: "be right back"
afk: "away from keyboard"
ez: "easy"
imo: "in my opinion"
tbh: "to be honest"
nvm: "never mind"
# Game-specific terms (ra1 module)
game_terms:
mamoth: "mammoth" # common typo
mig: "MiG" # pronunciation hint
mcv: "M.C.V." # spell out abbreviation
apc: "A.P.C."
gdi: "G.D.I."
nod: "Nod"
emoji_map:
"😂": "(laughing)"
"💀": "(skull)"
"🔥": "(fire)"
"👍": "(thumbs up)"
"❤️": "(heart)"
rules:
collapse_repeated_punctuation: true # "!!!!" → "!"
collapse_repeated_letters: true # "nooooo" → "nooo" (preserve some emphasis)
max_repeated_chars: 3
strip_markdown: true # remove **bold**, *italic* markers
strip_html_tags: true # sanitization
Tier 5 — LLM/SLM-enhanced preprocessing (Phase 7, optional, advanced):
For players who have a local LLM configured (same provider pattern as D047/llm-setup-guide.md), text can pass through a context-aware normalization prompt. This is the heaviest tier and is not recommended as the default — Tiers 1-2 cover the vast majority of cases. Use only when dealing with highly novel slang, mixed-language input, or ambiguous abbreviations that rule-based and statistical tiers can’t handle.
System: You are a text normalizer for a text-to-speech engine in a military RTS game.
Convert the input chat message into clean, natural spoken English.
Fix typos, expand abbreviations, remove excessive punctuation, describe relevant emoji.
Keep it brief. Do not add content. Do not censor. Output only the normalized text.
Input: "atk north w/ tanks asap, ur base is ded lol 😂"
Output: "Attack north with tanks, ASAP. Your base is dead. Ha!"
Settings:
[voice_text_bridge.tts.preprocessing]
enabled = true # on by default when TTS is enabled
tier = "rules+spell" # "rules" | "rules+spell" (default) | "statistical" | "wfst" | "llm" | "disabled"
dictionary = "tts-preprocessing/en.yaml" # Tier 1 rules dictionary (moddable per game module / language)
spell_dictionary = "tts-preprocessing/en-spell.txt" # Tier 2 SymSpell word frequency list
llm_provider = "" # only used when tier = "llm"
The original chat text is never modified. Preprocessing produces a parallel normalized string that is fed to the TTS engine only. Other players see the original text in the chat panel. The sender sees their original text too — the preprocessing is invisible to everyone except the TTS engine.
AI Voice Personas (Phase 6a)
Each player can be assigned (or choose) a unique AI-generated voice persona that is used for their TTS synthesis. This ensures that when multiple players use text-to-voice, each sounds distinct — not the same generic robot voice.
Persona Assignment
# Voice persona definition (stored in <data_dir>/voice-personas/ or shipped in packs)
voice_persona:
id: "official/commander-alpha" # namespaced ResourceId
display_name: "Commander Alpha"
pitch: 0.9 # relative pitch adjustment (0.5 = deep, 1.0 = neutral, 1.5 = high)
speed: 1.1 # speaking rate (0.5 = slow, 1.0 = normal, 1.5 = fast)
language: en-US
tts_model: "piper:en_US-lessac-medium" # local model reference
# OR: tts_model: "elevenlabs:voice_id_abc123" # cloud model reference
Unified Voice Identity (ArmA Pattern)
Inspired by ArmA’s player voice system — where players pick a voice type and pitch that is used for all radio commands. In IC, the voice persona is a unified voice identity: whenever any text is synthesized for a player (typed chat, chat wheel phrase, beacon alert, quick command), the TTS engine uses that player’s configured voice parameters.
The key insight: the TTS engine doesn’t care whether the input text is a player’s typed message or a system-generated “Affirmative” string. It’s all text → voice. The player’s persona parameters (voice model, pitch, speed, style) are applied uniformly to all synthesis for that player.
| Text Source | Example Text | TTS Treatment |
|---|---|---|
| Player types in chat | “atk north with tanks” | Synthesized with player’s persona voice |
| Chat wheel phrase (D059) | “Affirmative” / “Need backup” | Same persona voice — the phrase text is just fed to the TTS engine |
| Beacon alert (D059) | “Enemy spotted at north bridge” | Same persona voice |
| Quick command | “Moving out” / “Understood” | Same persona voice |
This means each player in a match sounds distinct — not because they picked from pre-recorded audio samples, but because the TTS engine produces different-sounding output based on each player’s voice/pitch/speed configuration. Two players saying “Affirmative” sound different because their personas have different parameters.
Player-facing settings (Settings → Communication → Voice Persona):
| Setting | Range | Default | Description |
|---|---|---|---|
| Voice | Built-in library or custom | Auto-assigned | Base TTS model/voice — the “raw material” the engine synthesizes from |
| Pitch | 0.5 – 1.5 slider | 1.0 (neutral) | Lower = deeper, higher = lighter |
| Speed | 0.5 – 1.5 slider | 1.0 (normal) | Speaking rate |
| Preview | [Play Sample] button | — | Hear your persona say a sample phrase before committing |
Note on style/post-processing: D059’s voice effects rule is that the listener controls audio post-processing (radio filter, reverb, clean audio — see D059-voip-effects-ecs.md § Voice Effects). The sender controls voice identity (which voice, pitch, speed); the receiver controls how they hear it (effects, spatial audio, filter). Style post-processing is therefore not part of VoiceParams — it stays receiver-side, consistent with D059.
These parameters are stored in the player’s local settings and advertised as part of the VoiceBridgeCapability session advertisement (see § Persistence Model). The receiver’s TTS engine uses them when synthesizing any text attributed to that player.
Persona Sources
| Source | How It Works | Phase |
|---|---|---|
| Built-in voices | IC ships 8-12 built-in personas per language (varied gender, age, style). Assigned round-robin to players who don’t choose one | Phase 5 |
| Player-selected | Player picks from built-in library or configures a custom persona in Settings | Phase 6a |
| Faction-themed | Allied voices sound Western/NATO; Soviet voices sound Eastern European. Automatic based on faction | Phase 6a |
| Workshop voices | Community-created voice persona packs (D050). Must pass content moderation (D037) | Phase 6a |
| Cloud-cloned | Player uses a cloud TTS API (ElevenLabs) to create a voice from a short recording. The cloud voice ID is stored in the player’s local settings. Receiver-side limitation: cloud-cloned voices only work in sender-side synthesis (Mode A), because the receiver cannot access the sender’s cloud API credentials. In receiver-side synthesis (Mode B, default), cloud-cloned personas fall back to the nearest matching built-in persona (matched by pitch/speed/style parameters). The sender hears their cloud voice locally in preview; teammates hear the built-in fallback unless Mode A is used | Phase 6a |
Persona Transmission
When using sender-side synthesis (Mode A), the synthesized audio already carries the persona’s characteristics — no additional metadata needed beyond is_synthetic in the voice packet. When using receiver-side synthesis (Mode B, default), the sender’s persona_id is included in the SynthChatMessage order (see § Chat Order Extension) so the receiver’s client can select the correct TTS voice.
Pluggable Backend Architecture
STT and TTS backends are independently pluggable, following the same provider pattern as the LLM setup guide (player-flow/llm-setup-guide.md).
Persistence Model — What Lives Where
Three systems, three purposes — no overlap:
| Data | Home | Why |
|---|---|---|
| Backend config (which STT/TTS engine, API keys, model paths, caption position) | settings.toml § [voice_text_bridge] | Local machine preferences. Not shared. Same pattern as [llm] and [audio] settings |
Voice persona handle (selected persona ID — a namespaced ResourceId, not the full definition) | settings.toml + session capability advertisement | Local preference stored in settings. Transmitted to teammates via lobby capability exchange (not via D053 profile — see rationale below). ~40 bytes |
| Custom persona definitions (full TTS model refs, pitch/speed/style, cloud vendor refs) | Local data dir (<data_dir>/voice-personas/) | Too large and too private for the shared profile. D053 assumes ~2KB profile responses. Custom blobs may include vendor API refs that shouldn’t be advertised |
| Per-player mute state (voice/synth/text mute decisions) | Local SQLite (D034), keyed by Ed25519 public key | Local preference — same pattern as D059’s existing mute persistence. Not synced |
Settings IA note: The current settings panel (settings.md) has Audio and Social tabs but no Communication tab. Voice-text bridge settings should be added as a new Communication sub-section under Social (where voice/chat settings naturally belong), or as a new top-level tab if the surface area warrants it. This is a settings.md update deferred until D079 moves to Accepted.
Voice persona is NOT in D053 player profile. D053 requires every profile field to have visibility controls (D053 § Design Principles: “Privacy by default. Every profile field has visibility controls”). But teammates must always receive the persona ID for receiver-side TTS to work — a privacy toggle would break synthesis. This makes it a poor fit for the privacy-controlled profile model.
Instead, the voice persona ID is transmitted as a session capability advertisement during lobby join — the same mechanism used for engine version, mod fingerprint, and other per-session metadata. It’s ephemeral (valid for this session only), always visible to participants, and not part of the persistent public profile.
Namespaced ResourceId format: Persona IDs follow the project’s standard namespaced resource pattern (D068, D062):
official/commander-alpha # built-in persona shipped with IC
official/soviet-officer # built-in faction-themed persona
community/modx-custom-voices/tactical # Workshop persona pack
local/my-custom-voice # locally-defined persona (not resolvable by others — fallback to built-in)
Bare strings like "custom_42" are rejected — the namespace prefix is mandatory. This prevents collisions across packs/machines and makes resolution unambiguous: official/ resolves from built-in data, community/ resolves from installed Workshop packs, local/ triggers fallback on receivers who don’t have it.
#![allow(unused)]
fn main() {
/// Voice persona ID — namespaced ResourceId, not a freeform string.
/// Transmitted in SynthChatMessage.persona_id and lobby capability exchange.
/// NOT part of D053 PlayerProfile (see rationale above).
pub struct VoicePersonaId(pub ResourceId); // e.g., "official/commander-alpha"
/// Lobby/waiting room capability advertisement for voice-text bridge.
/// Exchanged during the lobby ready-up phase (before the game starts),
/// alongside engine version, mod fingerprint, and other session metadata.
/// Each player's voice configuration is available to all participants
/// before the first tick — the receiver's TTS engine can pre-load the
/// correct voice model during the loading screen.
/// See player-flow/multiplayer.md § Lobby for the ready-up flow.
pub struct VoiceBridgeCapability {
/// Whether this player sends SynthChatMessage orders (has TTS enabled).
pub tts_enabled: bool,
/// The player's selected voice persona (namespaced ResourceId).
/// Determines which base TTS model/voice the receiver's engine loads.
pub persona_id: Option<VoicePersonaId>,
/// The player's voice parameters — applied ON TOP of the base persona
/// by the receiver's TTS engine. These are the sender's configured
/// pitch/speed/style, not the receiver's preferences.
pub voice_params: Option<VoiceParams>,
}
/// Player-configured voice parameters. Advertised to teammates via
/// VoiceBridgeCapability so the receiver's TTS engine can reproduce
/// the sender's chosen voice characteristics.
pub struct VoiceParams {
pub pitch: f32, // 0.5 (deep) – 1.5 (high), default 1.0
pub speed: f32, // 0.5 (slow) – 1.5 (fast), default 1.0
// NOTE: No `style` field. Post-processing (radio filter, reverb, etc.)
// is receiver-controlled per D059 § Voice Effects. The sender controls
// voice identity (model + pitch + speed); the receiver controls effects.
}
}
Full custom persona definitions remain in <data_dir>/voice-personas/ (local data, shareable via Workshop packs). accept_synthetic_voice remains in settings.toml (local receive preference).
Reconnect and late-observer rule: VoiceBridgeCapability is exchanged at lobby ready-up, but players may reconnect mid-match (D059 § desync recovery) and observers may join after launch (D059 § spectator). For these cases, the relay includes all active players’ VoiceBridgeCapability data in the reconnection snapshot metadata — the same mechanism used for player names, factions, and other session info that a reconnecting client needs to reconstruct the game state. Late-joining observers receive the capability set as part of the mid-game join handshake. No separate resend protocol is needed — it piggybacks on existing reconnection infrastructure.
Settings Configuration
# settings.toml — local machine config (NOT social/profile state)
[voice_text_bridge]
enabled = false # master toggle (off by default)
[voice_text_bridge.stt]
enabled = false # captions off by default
backend = "local" # "local" | "cloud" | "disabled"
local_model = "whisper-turbo" # model pack name (downloaded via D068, not bundled)
cloud_provider = "" # "azure" | "google" | "deepgram" (requires API key)
cloud_api_key_ref = "" # reference to credential in ic-paths keystore
language = "auto" # auto-detect or explicit language code
caption_position = "bottom" # "bottom" | "top" | "left" | "right"
caption_duration_sec = 5.0
show_transcript_panel = false # show separate transcript panel (scrollback review, never in chat log)
[voice_text_bridge.tts]
enabled = false # text-to-voice off by default
backend = "local" # "local" | "cloud" | "disabled"
local_model = "piper:en_US-lessac-medium" # model pack name (downloaded via D068)
cloud_provider = "" # "elevenlabs" | "azure" | "google"
cloud_api_key_ref = ""
synthesis_mode = "receiver" # "receiver" (Mode B, default) | "sender" (Mode A, requires receiver consent)
accept_synthetic_voice = false # opt-in to hear Mode A synthesized voice from other players
[voice_text_bridge.tts.preprocessing]
enabled = true # on by default when TTS is enabled
tier = "rules+spell" # "rules" | "rules+spell" (default) | "statistical" | "wfst" | "llm" | "disabled"
dictionary = "tts-preprocessing/en.yaml"
spell_dictionary = "tts-preprocessing/en-spell.txt"
[voice_text_bridge.translation]
enabled = false # Phase 7
target_language = "" # translate captions to this language
Backend Options
| Backend | STT | TTS | Latency | Privacy | Quality | Delivery |
|---|---|---|---|---|---|---|
| Local Whisper (ONNX) | Yes | No | ~100-500ms | Full (on-device) | High | Downloadable model pack (~200MB) |
| Local Piper | No | Yes | <100ms | Full (on-device) | Medium | Downloadable model pack (~50MB) |
| Azure Cognitive Services | Yes | Yes | ~200-400ms | Cloud (Microsoft) | Very high | API key (no local model) |
| Google Cloud Speech/TTS | Yes | Yes | ~200-400ms | Cloud (Google) | Very high | API key (no local model) |
| ElevenLabs | No | Yes | ~75ms | Cloud (ElevenLabs) | Excellent | API key (no local model) |
| Deepgram | Yes | No | ~100-300ms | Cloud (Deepgram) | Very high | API key (no local model) |
Model delivery: Local STT/TTS models are not bundled with the base IC install — they are downloadable model packs following the same pattern as LLM model packs (llm-setup-guide.md § Quickest Start) and optional content packs (D068 selective install, D069 install wizard). When a player enables voice-text bridge for the first time:
- Settings → Communication → Voice-Text Bridge → Enable
- IC prompts: “Voice-text bridge requires a model pack (~250MB for STT + TTS). Download now?”
- Model pack downloaded from Workshop (D050) or IC CDN — installed into
<data_dir>/models/voice-bridge/ - Done — local STT/TTS active, works offline from this point
This respects the project’s install philosophy: the base binary is small; large optional content (models, HD assets, campaigns) is downloaded on demand via managed packs. Players who only use cloud backends (API key) need no local models at all.
Default experience (after model download): Local Whisper for STT + Local Piper for TTS. Works offline, no cloud dependency, acceptable quality. Players who want higher quality or personalized voices configure a cloud provider via the same setup flow as LLM providers.
Extended Muting Model
D059’s existing mute model supports per-player voice and text muting. D079 extends this with a third independent channel — synthesized voice:
#![allow(unused)]
fn main() {
/// Extended per-player mute state (D059 + D079).
pub struct PlayerMuteState {
/// Mute the player's real voice (Opus VoIP stream)
pub voice_muted: bool,
/// Mute the player's text chat messages
pub text_muted: bool,
/// Mute the player's synthesized voice (TTS playback of their text)
/// Independent from voice_muted — a player might mute someone's real
/// voice but still hear their typed messages as synthesized speech,
/// or vice versa.
pub synth_voice_muted: bool,
}
}
Mute combinations and their effects:
| voice_muted | text_muted | synth_voice_muted | Effect |
|---|---|---|---|
| false | false | false | Hear everything (real voice + text + synth) |
| true | false | false | No real voice; see text; hear synth of their text |
| false | true | false | Hear real voice; no text; no synth (nothing to synthesize) |
| false | false | true | Hear real voice; see text; no synth playback |
| true | false | true | No voice at all; see text only |
| true | true | true | Fully muted — no communication received |
Console commands (D058):
Namespace note:
/muteis already overloaded across three systems — D058 uses/mute <player>for admin chat mute and/mute [master|music|sfx|voice]for audio mixing; D059 uses/mute <player>for local player mute. D079 does not add another/mutevariant. Instead, it uses the/playernamespace for per-player communication control, keeping/mutefor its existing admin and audio uses.
/captions on|off -- toggle STT captions
/captions language <code> -- set caption language (or 'auto')
/tts on|off -- toggle TTS for your outgoing text (mid-match: toggles SynthChatMessage emission)
/tts mode sender|receiver -- set synthesis mode (lobby only — locked after match start)
/voice-persona list -- show available personas
/voice-persona set <id> -- set your voice persona (lobby only — locked after match start)
/voice-persona preview <id> -- hear a sample of a persona (lobby only)
/player <name> mute voice|text|synth|all -- mute specific channels for a player
/player <name> unmute voice|text|synth|all -- unmute specific channels
/player <name> mute-status -- show current mute state for a player
D058 currently has /players (list connected players) but no /player <name> ... command family. D079 introduces the /player <name> namespace for per-player communication control, avoiding further overload of /mute (which D058 uses for both admin chat mute and audio mixing). D058 should be updated to formalize this namespace when D079 moves to Accepted. The existing D059 /mute <player> shorthand (which mutes voice + text) continues to work as a convenience alias for /player <name> mute all.
Session-lock rule: Voice persona and synthesis mode are locked after match start. VoiceBridgeCapability is exchanged once during lobby ready-up — there is no runtime capability-update path. Changing persona mid-match would require re-broadcasting to all participants and potentially re-loading TTS models on every receiver, which is disruptive and complex. /tts on|off remains toggleable mid-match (it only controls whether the local client emits SynthChatMessage orders — no capability re-broadcast needed). /voice-persona set and /tts mode are rejected with a message after match start: “Voice persona and synthesis mode can only be changed in the lobby.”
Implementation Architecture
Processing Pipeline
CAPTION PATH (speaker → caption on listener's screen):
Mic → Opus encode → Relay → Opus decode → [Listener's STT] → Caption overlay
↓
(optional) Chat log
TTS PATH (typist → voice on listener's speakers):
Keyboard → Text message → [Preprocess] → [Sender's TTS] → Opus encode → Relay → Speaker
(Mode A: sender-side) ↑ ↓
Rule-based or Listener hears voice
LLM normalize
(original text
untouched in chat)
OR:
Keyboard → Text message → Relay → [Preprocess] → [Listener's TTS] → Speaker
(Mode B: receiver-side) ↑ ↓
Rule-based or Listener hears voice
LLM normalize
Threading Model
STT and TTS run on background threads on native platforms — never on the game loop or render thread:
Native (Windows/macOS/Linux/Steam Deck):
Main thread (GameLoop): Orders, sim, render — unaffected
Audio thread: Opus encode/decode, jitter buffer (existing D059)
STT thread: Whisper inference (background, feeds captions to UI via channel)
TTS thread: Piper/cloud synthesis (background, feeds Opus frames to audio thread)
Browser (WASM):
Single-threaded executor (crate-graph.md § WASM: no std::thread).
Local STT/TTS models are NOT available (Whisper/Piper require background threads).
Cloud-only backends work via async fetch (non-blocking HTTP calls on the main executor).
If no cloud backend is configured, voice-text bridge is disabled on WASM with a
settings note: "Local voice models require the native app. Use a cloud provider
for browser play."
Per-platform policy:
| Platform | Local STT/TTS | Cloud STT/TTS | Threading |
|---|---|---|---|
| Windows, macOS, Linux, Steam Deck | Yes (background threads) | Yes (async HTTP on background thread) | std::thread |
| Mobile (iOS/Android) | Yes (background threads, but battery/thermal concern — default off, cloud preferred) | Yes | Platform threads |
| Browser (WASM) | No (no std::thread, models too large for browser) | Yes (async fetch on single-threaded executor) | Single-threaded async |
Relay Impact
Two protocol changes (require protocol version bump — no mixed-version decoding per api-misuse-defense.md § O4):
-
SynthChatMessageorder variant — a newPlayerOrdervariant alongsideChatMessage. Addspersona_id: Option<VoicePersonaId>(namespacedResourceId). Relay routes it identically toChatMessage(same lane, same filtering). -
is_syntheticbit inVoicePacket(Mode A only) — one bit in the packet header. Relay reads it for mute-hint routing: if a receiver hassynth_voice_mutedfor this speaker, the relay skips forwarding (same bandwidth-saving pattern as D059’s per-player mute hint). Older clients that don’t understand the flag treat the packet as normal voice.
For the default Mode B (receiver-side synthesis), the relay sees only a SynthChatMessage order — no voice traffic generated. The receiver’s client synthesizes locally.
Implementation Estimate
| Component | Crate(s) | Est. Lines | Phase |
|---|---|---|---|
| STT backend trait + local Whisper | ic-audio | ~400 | Phase 5 |
| TTS backend trait + local Piper | ic-audio | ~350 | Phase 5 |
| Caption overlay UI | ic-ui | ~250 | Phase 5 |
| Sender-side TTS pipeline | ic-audio, ic-net | ~200 | Phase 5 |
| TTS text preprocessing (rule-based) | ic-audio | ~150 | Phase 5 |
| Extended mute model | ic-game | ~80 | Phase 5 |
| Console commands | ic-game | ~100 | Phase 5 |
| Settings UI (voice-text bridge panel) | ic-ui | ~200 | Phase 5 |
| Cloud backend adapters (Azure, Google, ElevenLabs, Deepgram) | ic-audio | ~300 | Phase 6a |
| Voice persona system + Workshop packs | ic-audio, ic-game | ~350 | Phase 6a |
| Faction-themed auto-assignment | ic-game | ~100 | Phase 6a |
| TTS preprocessing (spell + statistical + WFST) | ic-audio | ~200 | Phase 6a |
| TTS preprocessing (LLM/SLM tier, optional) | ic-audio | ~100 | Phase 7 |
| Receiver-side TTS pipeline (Mode B) | ic-audio, ic-net | ~150 | Phase 5 |
| Cross-language translation pipeline | ic-audio | ~250 | Phase 7 |
| Total | ~3,180 |
Alternatives Considered
| Alternative | Verdict | Reason |
|---|---|---|
| Platform-level only (rely on Xbox/OS STT/TTS) | Rejected | Only works on Xbox/Windows. Linux, macOS, WASM, Steam Deck have no platform STT/TTS. IC must be cross-platform |
| Always-on captions (opt-out instead of opt-in) | Rejected | STT processing costs CPU. Default-off respects players who don’t need it and avoids surprise resource usage |
| Single TTS voice for all players | Rejected | Multiple players using the same voice is confusing — “who said that?” Per-player personas solve attribution |
| Cloud-only backends | Rejected | Privacy concern (voice data leaves device). Offline play broken. Local backends must be the default; cloud is optional upgrade |
| Voice cloning of other players | Rejected | Consent and safety concern. Players can only clone/customize their own voice persona. Impersonation is explicitly blocked |
| Real-time voice style transfer (voice changer) | Deferred | Different from TTS — this is modifying live voice, not synthesizing from text. Interesting but orthogonal to the voice-text bridge. Could be a future D059 extension |
Cross-References
- Communication system: D059 (VoIP architecture, text chat, muting model, relay forwarding, Opus codec)
- Player profile: D053 (reference only — D079 explicitly does NOT store voice persona in D053 profile; uses session capability advertisement instead)
- Community servers: D052 (relay forwarding, moderation)
- LLM provider setup:
player-flow/llm-setup-guide.md(pluggable backend configuration pattern) - Workshop: D050 (community voice persona packs)
- Settings:
player-flow/settings.md(voice-text bridge settings panel) - Console commands: D058 (
/captions,/tts,/voice-persona,/player <name> mute)