D031: Observability & Telemetry — OTEL Across Engine, Servers, and AI Pipeline

Decision Capsule (LLM/RAG Summary)

  • Status: Accepted
  • Phase: Multi-phase (instrumentation foundation + server ops + advanced analytics/AI training pipelines)
  • Canonical for: Unified telemetry/observability architecture, local-first telemetry storage, and optional OTEL export policy
  • Scope: game client, relay/tracking/workshop servers, telemetry schema/storage, tracing/export pipeline, debugging and analytics tooling
  • Decision: All components record structured telemetry to local SQLite as the primary sink using a shared schema; OpenTelemetry is optional export infrastructure for operators who want dashboards/traces.
  • Why: Works offline, supports both players and operators, enables cross-component debugging (including desync analysis), and unifies gameplay/debug/ops/AI data collection under one instrumentation model.
  • Non-goals: Requiring external collectors (Prometheus/OTEL backends) for normal operation; separate incompatible telemetry formats per component.
  • Invariants preserved: Local-first data philosophy (D034/D061), offline-capable components, and mod/game agnosticism at the schema level.
  • Defaults / UX behavior: Telemetry is recorded locally with retention/rotation; operators may optionally enable OTEL export for live dashboards.
  • Security / Trust impact: Structured telemetry is designed for analysis without making external infrastructure mandatory; privacy-sensitive usage depends on the telemetry policy and field discipline in event payloads.
  • Performance / Ops impact: Unified schema simplifies tooling and reduces operational complexity; tracing/puffin stack is chosen for low disabled overhead and production viability.
  • Public interfaces / types / commands: shared telemetry.db schema, tracing instrumentation, optional OTEL exporters, analytics export/query tooling (see body)
  • Affected docs: src/06-SECURITY.md, src/03-NETCODE.md, src/decisions/09e-community.md (D034/D061), src/15-SERVER-GUIDE.md
  • Revision note summary: None
  • Keywords: telemetry, observability, OTEL, OpenTelemetry, SQLite telemetry.db, tracing, puffin, local-first analytics, desync debugging

Decision: All components — game client, relay server, tracking server, workshop server — record structured telemetry to local SQLite as the primary sink. Every component runs fully offline; no telemetry depends on external infrastructure. OTEL (OpenTelemetry) is an optional export layer for server operators who want Grafana dashboards — it is never a requirement. The instrumentation layer is unified across all components, enabling operational monitoring, gameplay debugging, GUI usage analysis, pattern discovery, and AI/LLM training data collection.

Rationale:

  • Backend servers (relay, tracking, workshop) are production infrastructure — they need health metrics, latency histograms, error rates, and distributed traces, just like any microservice
  • The game engine already has rich internal state (per-tick state_hash(), snapshots, system execution times) but no structured way to export it for analysis
  • Replay files capture what happened but not why — telemetry captures the engine’s decision-making process (pathfinding time, order validation outcomes, combat resolution details) that replays miss
  • Behavioral analysis (V12 anti-cheat) already collects APM, reaction times, and input entropy on the relay — OTEL is the natural export format for this data
  • AI/LLM development needs training data: game telemetry (unit movements, build orders, engagement outcomes) is exactly the training corpus for ic-ai and ic-llm
  • Bevy already integrates with Rust’s tracing crate — OTEL export is a natural extension, not a foreign addition
  • Stack validated by production Rust game infrastructure: Embark Studios’ Quilkin (production game relay) uses the exact tracing + prometheus + OTEL stack IC targets, confirming it handles real game traffic at scale. Puffin (Embark’s frame-based profiler) complements OTEL for per-tick instrumentation with ~1ns disabled overhead. IC’s “zero cost when disabled” requirement is satisfied by puffin’s AtomicBool guard and tracing’s compile-time level filtering. See research/embark-studios-rust-gamedev-analysis.md
  • Desync debugging needs cross-client correlation — distributed tracing (trace IDs) lets you follow an order from input → network → sim → render across multiple clients and the relay server
  • A single instrumentation approach (OTEL) avoids the mess of ad-hoc logging, custom metrics files, separate debug protocols, and incompatible formats

Key Design Elements:

Unified Local-First Storage

Every component records telemetry to a local SQLite file. No exceptions. This is the same principle as D034 (SQLite as embedded storage) and D061 (local-first data) applied to telemetry. The game client, relay server, tracking server, and workshop server all write to their own telemetry.db using an identical schema. No component depends on an external collector, dashboard, or aggregation service to function.

-- Identical schema on every component (client, relay, tracking, workshop)
CREATE TABLE telemetry_events (
    id            INTEGER PRIMARY KEY,
    timestamp     TEXT    NOT NULL,        -- ISO 8601 with microsecond precision
    session_id    TEXT    NOT NULL,        -- random per-process-lifetime
    component     TEXT    NOT NULL,        -- 'client', 'relay', 'tracking', 'workshop'
    game_module   TEXT,                    -- 'ra1', 'td', 'ra2', custom; per session on clients, per game on relays (NULL when no game context applies)
    mod_fingerprint TEXT,                  -- D062 SHA-256 mod profile fingerprint — updated on profile switch
    category      TEXT    NOT NULL,        -- event domain (see taxonomy below)
    event         TEXT    NOT NULL,        -- specific event name
    severity      TEXT    NOT NULL DEFAULT 'info',  -- 'trace','debug','info','warn','error'
    data          TEXT,                    -- JSON payload (structured, no PII)
    duration_us   INTEGER,                -- for events with measurable duration
    tick          INTEGER,                -- sim tick (gameplay/sim events only)
    correlation   TEXT                     -- trace ID for cross-component correlation
);

CREATE INDEX idx_telemetry_ts          ON telemetry_events(timestamp);
CREATE INDEX idx_telemetry_cat_event   ON telemetry_events(category, event);
CREATE INDEX idx_telemetry_session     ON telemetry_events(session_id);
CREATE INDEX idx_telemetry_game_module ON telemetry_events(game_module) WHERE game_module IS NOT NULL;
CREATE INDEX idx_telemetry_mod_fp      ON telemetry_events(mod_fingerprint) WHERE mod_fingerprint IS NOT NULL;
CREATE INDEX idx_telemetry_severity    ON telemetry_events(severity) WHERE severity IN ('warn', 'error');
CREATE INDEX idx_telemetry_correlation ON telemetry_events(correlation) WHERE correlation IS NOT NULL;

Why one schema everywhere? Aggregation scripts, debugging tools, and community analysis all work identically regardless of source. A relay operator can run the same /analytics export command as a player. Exported files from different components can be imported into a single SQLite database for cross-component analysis (desync debugging across client + relay). The aggregation tooling is a handful of SQL queries, not a specialized backend.

Mod-agnostic by design, mod-aware by context. The telemetry schema contains zero game-specific or mod-specific columns. Unit types, weapon names, building names, and resource types flow through as opaque strings — whatever the active mod’s YAML defines. A total conversion mod’s custom vocabulary (e.g., unit_type: "Mammoth Mk.III") passes through unchanged without schema modification. The two denormalized context columns — game_module and mod_fingerprint — are set once per session on the client (updated on ic profile activate if the player switches mod profiles mid-session). On servers, these columns are populated per-game from lobby metadata. This means every analytical query can be trivially filtered by game module or mod combination without JOINing through session.start’s JSON payload:

-- Direct mod filtering — no JOINs needed
SELECT event, COUNT(*) FROM telemetry_events
WHERE game_module = 'ra1' AND category = 'input'
GROUP BY event ORDER BY COUNT(*) DESC;

-- Compare behavior across mod profiles
SELECT mod_fingerprint, AVG(json_extract(data, '$.apm')) AS avg_apm
FROM telemetry_events WHERE event = 'match.pace'
GROUP BY mod_fingerprint;

Relay servers set game_module and mod_fingerprint per-game from the lobby’s negotiated settings — all events for that game inherit the context. When the relay hosts multiple concurrent games with different mods, each game’s events carry the correct mod context independently.
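
The per-game context handling described above can be sketched as a small lookup keyed by game ID. This is a minimal illustration under stated assumptions, not the relay's actual code: `GameContext`, `RelayContexts`, `StampedEvent`, and the use of a `u64` game ID are all hypothetical names and types chosen for the example.

```rust
use std::collections::HashMap;

/// Hypothetical per-game context taken from the lobby's negotiated settings.
#[derive(Debug, Clone, PartialEq)]
pub struct GameContext {
    pub game_module: String,
    pub mod_fingerprint: String,
}

/// Hypothetical event shape carrying the two denormalized context columns.
#[derive(Debug, PartialEq)]
pub struct StampedEvent {
    pub event: String,
    pub game_module: Option<String>,
    pub mod_fingerprint: Option<String>,
}

/// Tracks the context of every game the relay currently hosts.
#[derive(Default)]
pub struct RelayContexts {
    by_game: HashMap<u64, GameContext>,
}

impl RelayContexts {
    /// Called once when a lobby starts a game (assumed hook).
    pub fn register_game(&mut self, game_id: u64, ctx: GameContext) {
        self.by_game.insert(game_id, ctx);
    }

    /// Stamps an event with the context of its game, or with NULLs
    /// (None) when the event does not belong to any game.
    pub fn stamp(&self, game_id: Option<u64>, event: &str) -> StampedEvent {
        let ctx = game_id.and_then(|id| self.by_game.get(&id));
        StampedEvent {
            event: event.to_string(),
            game_module: ctx.map(|c| c.game_module.clone()),
            mod_fingerprint: ctx.map(|c| c.mod_fingerprint.clone()),
        }
    }
}
```

With two concurrent games registered under different modules, `stamp(Some(1), ...)` and `stamp(Some(2), ...)` each return their own game's context, while server-wide events stamped with `None` leave both columns empty.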

OTEL is an optional export layer, not the primary sink. Server operators who want real-time dashboards (Grafana, Prometheus, Jaeger) can enable OTEL export — but this is a planned optional operations enhancement (M7 operator usability baseline with deeper M11 scale hardening), not a deployment dependency. A community member running a relay server on a spare machine doesn’t need to set up Prometheus. They get full telemetry in a SQLite file they can query with any SQL tool.

Retention and rotation: Each component’s telemetry.db has a configurable maximum size (default: 100 MB for clients, 500 MB for servers). When the limit is reached, the oldest events are pruned first. /analytics export can write a date range to a separate file before those events are pruned. Servers can also configure time-based retention (e.g., telemetry.retention_days = 30).
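
The two pruning rules can be illustrated with an in-memory stand-in for the table. This is a sketch only: `StoredEvent`, `RetentionPolicy`, and `prune` are hypothetical names, timestamps are simplified to epoch seconds, and a real implementation would presumably issue DELETE statements against telemetry.db rather than pop from a queue.

```rust
use std::collections::VecDeque;

/// Hypothetical in-memory stand-in for one row of telemetry_events.
#[derive(Debug, Clone, PartialEq)]
pub struct StoredEvent {
    pub id: u64,
    /// Seconds since the Unix epoch (the real column is ISO 8601 text).
    pub timestamp_s: u64,
    /// Approximate storage cost of the row in bytes.
    pub size_bytes: u64,
}

/// Hypothetical retention settings; field names are illustrative.
pub struct RetentionPolicy {
    pub max_bytes: u64,
    pub retention_days: Option<u64>,
}

/// Prunes oldest-first until both limits hold.
/// `events` must be ordered oldest first. Returns the number of rows removed.
pub fn prune(events: &mut VecDeque<StoredEvent>, policy: &RetentionPolicy, now_s: u64) -> usize {
    let mut pruned = 0;

    // Time-based retention: drop everything older than the cutoff.
    if let Some(days) = policy.retention_days {
        let cutoff = now_s.saturating_sub(days.saturating_mul(86_400));
        while events.front().map_or(false, |e| e.timestamp_s < cutoff) {
            events.pop_front();
            pruned += 1;
        }
    }

    // Size-based retention: drop the oldest rows until under the size limit.
    let mut total: u64 = events.iter().map(|e| e.size_bytes).sum();
    while total > policy.max_bytes {
        match events.pop_front() {
            Some(e) => {
                total -= e.size_bytes;
                pruned += 1;
            }
            None => break,
        }
    }
    pruned
}
```

Applying the time limit before the size limit means an operator who sets both gets the stricter of the two outcomes, and the newest events always survive.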

Three Telemetry Signals (OTEL Standard)

| Signal | What It Captures | Export Format |
|---|---|---|
| Metrics | Counters, histograms, gauges (numeric time series) | OTLP → Prometheus |
| Traces | Distributed request flows: an order’s journey through the system | OTLP → Jaeger/Zipkin |
| Logs | Structured events with severity, context, correlation IDs | OTLP → Loki/stdout |

Backend Server Telemetry (Relay, Tracking, Workshop)

Standard operational observability — same patterns used by any production Rust service. All servers record to local SQLite (telemetry.db) using the unified schema above. The OTEL metric names below double as the event field in the SQLite table — operators can query locally via SQL or optionally export to Prometheus/Grafana.

Relay server metrics:

relay.games.active                    # gauge: concurrent games
relay.games.total                     # counter: total games hosted
relay.orders.received                 # counter: orders received per tick
relay.orders.forwarded                # counter: orders broadcast
relay.orders.dropped                  # counter: orders missed (lag switch)
relay.tick.latency_ms                 # histogram: tick processing time
relay.player.rtt_ms                   # histogram: per-player round-trip time
relay.player.suspicion_score          # gauge: behavioral analysis score (V12)
relay.desync.detected                 # counter: desync events
relay.match.completed                 # counter: matches finished
relay.match.duration_s                # histogram: match duration

Tracking server metrics:

tracking.listings.active              # gauge: current game listings
tracking.heartbeats.received          # counter: heartbeats processed
tracking.heartbeats.expired           # counter: listings expired (TTL)
tracking.queries.total                # counter: browse/search requests
tracking.queries.latency_ms           # histogram: query latency

Workshop server metrics:

workshop.resources.total              # gauge: total published resources
workshop.resources.downloads          # counter: download events
workshop.resources.publishes          # counter: publish events
workshop.resolve.latency_ms           # histogram: dependency resolution time
workshop.resolve.conflicts            # counter: version conflicts detected
workshop.search.latency_ms            # histogram: search query time

Server-Side Structured Events (SQLite)

Beyond counters and gauges, each server records detailed structured events to telemetry.db. These are the events that actually enable troubleshooting and pattern analysis:

Relay server events:

| Event | JSON data Fields | Troubleshooting Value |
|---|---|---|
| relay.game.start | game_id, map, player_count, settings_hash, balance_preset, game_module, mod_profile_fingerprint | Which maps/settings/mods are popular? |
| relay.game.end | game_id, duration_s, ticks, outcome, player_count | Match length distribution, completion vs. abandonment rates |
| relay.player.join | game_id, slot, rtt_ms, mod_profile_fingerprint | Connection quality at join time, mod compatibility |
| relay.player.leave | game_id, slot, reason (quit/disconnect/kicked/timeout), match_time_s | Why and when players leave: early ragequit vs. end-of-game |
| relay.tick.process | game_id, tick, order_count, process_us, stall_detected | Per-tick performance, stall diagnosis |
| relay.order.forward | game_id, player, tick, order_type, sub_tick_us, size_bytes | Order volume, sub-tick fairness verification |
| relay.desync | game_id, tick, diverged_players[], hash_expected, hash_actual | Desync diagnosis: which tick, which players |
| relay.lag_switch | game_id, player, gap_ms, orders_during_gap | Cheating detection audit trail |
| relay.suspicion | game_id, player, score, contributing_factors{} | Behavioral analysis transparency |
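
As an illustration of how the relay.desync payload fields relate to each other, the sketch below derives `hash_expected` and `diverged_players` from the per-player state hashes reported for one tick. It assumes a majority vote decides the expected hash; the source does not say how the relay actually picks it, so treat that rule, along with the names `DesyncReport` and `check_tick`, as hypothetical.

```rust
use std::collections::HashMap;

/// Hypothetical subset of the relay.desync payload.
#[derive(Debug, PartialEq)]
pub struct DesyncReport {
    pub tick: u64,
    pub hash_expected: u64,
    pub diverged_players: Vec<u32>,
}

/// Compares the state hashes reported by each player for one tick.
/// `reports` holds (player, state_hash) pairs. Returns None when all agree.
/// Assumption: the most common hash is treated as the expected one,
/// with ties broken in favor of the numerically smaller hash.
pub fn check_tick(tick: u64, reports: &[(u32, u64)]) -> Option<DesyncReport> {
    // Count how many players reported each hash.
    let mut votes: HashMap<u64, usize> = HashMap::new();
    for &(_, hash) in reports {
        *votes.entry(hash).or_insert(0) += 1;
    }
    if votes.len() <= 1 {
        return None; // zero or one distinct hash: no divergence
    }

    // Most votes wins; the reversed hash comparison makes ties deterministic.
    let (&expected, _) = votes
        .iter()
        .max_by(|a, b| a.1.cmp(b.1).then(b.0.cmp(a.0)))?;

    let mut diverged: Vec<u32> = reports
        .iter()
        .filter(|&&(_, h)| h != expected)
        .map(|&(p, _)| p)
        .collect();
    diverged.sort_unstable();

    Some(DesyncReport { tick, hash_expected: expected, diverged_players: diverged })
}
```

In a two-player game a majority vote cannot identify which side diverged, so the tie-break here only keeps the output deterministic; it says nothing about which client is correct.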

Tracking server events:

| Event | JSON data Fields | Troubleshooting Value |
|---|---|---|
| tracking.listing.create | game_id, map, host_hash, settings_summary | Game creation patterns |
| tracking.listing.expire | game_id, age_s, reason (TTL/host_departed) | Why games disappear from the browser |
| tracking.query | query_type (browse/search/filter), params, results_count, latency_ms | Search effectiveness, popular filters |

Workshop server events:

| Event | JSON data Fields | Troubleshooting Value |
|---|---|---|
| workshop.publish | resource_id, type, version, size_bytes, dep_count | Publishing patterns, resource sizes |
| workshop.download | resource_id, version, requester_hash, latency_ms | Download volume, popular resources |
| workshop.resolve | root_resource, dep_count, conflicts, latency_ms | Dependency hell frequency, resolution performance |
| workshop.search | query, filters, results_count, latency_ms | What people are looking for, search quality |

Server export and analysis: Every server supports the same commands as the client — ic-server analytics export, ic-server analytics inspect, ic-server analytics clear. A relay operator troubleshooting laggy matches runs a SQL query against their local telemetry.db — no Grafana required. The exported SQLite file can be attached to a bug report or shared with the project team, identical workflow to the client.

Distributed traces: A multiplayer game session gets a trace ID (the correlation field). Every order, tick, and desync event references this trace ID. Debug a desync by searching for the game’s trace ID across the relay’s telemetry.db and the affected clients’ exported telemetry.db files — correlate events that crossed component boundaries. For operators with OTEL enabled, the same trace ID routes to Jaeger for visual timeline inspection.
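
The cross-component lookup described above amounts to filtering rows from several exported databases by trace ID and ordering them in time. The sketch below shows that step on in-memory rows. `Row` and `timeline` are hypothetical names, and the ordering assumes every component writes timestamps in the same fixed-width ISO 8601 UTC format, so that string order equals time order (clock skew between machines is ignored).

```rust
/// Hypothetical subset of a telemetry_events row.
#[derive(Debug, Clone, PartialEq)]
pub struct Row {
    pub component: String,
    pub timestamp: String,
    pub event: String,
    pub correlation: Option<String>,
}

/// Merges rows from several components into one timeline for a trace ID.
/// Assumes fixed-width ISO 8601 UTC timestamps, which sort lexicographically.
pub fn timeline<'a>(sources: &[&'a [Row]], trace_id: &str) -> Vec<&'a Row> {
    let mut out: Vec<&'a Row> = sources
        .iter()
        .copied()
        .flat_map(|rows| rows.iter())
        // Keep only rows tagged with the requested trace ID.
        .filter(|r| r.correlation.as_deref() == Some(trace_id))
        .collect();
    // Stable sort: rows with equal timestamps keep their source order.
    out.sort_by(|a, b| a.timestamp.cmp(&b.timestamp));
    out
}
```

Called with the relay's rows and one client's rows, the result interleaves both sources, which is the view needed to see where an order's path diverged.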

Health endpoints: Every server exposes /healthz (already designed) and /readyz, plus a Prometheus scrape endpoint at /metrics when OTEL export is enabled. These endpoints are standard and compose with the existing Kubernetes deployment (Helm charts already designed in 03-NETCODE.md).

Game Engine Telemetry (Client-Side)

The engine emits structured telemetry for debugging, profiling, and AI training — but only when enabled. Hot paths remain zero-cost when telemetry is disabled (compile-time feature flag telemetry).
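
One way to keep a disabled path cheap at runtime is to check a flag before building any event payload, in the spirit of the AtomicBool guard mentioned in the rationale. The sketch below is an assumption-laden illustration, not the engine's API: `Recorder` and its methods are hypothetical, and the in-memory `Vec` stands in for the real sink. A compile-time `telemetry` feature gate would go one step further and remove even the flag check from the binary.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Mutex;

/// Hypothetical recorder with a runtime on/off switch.
pub struct Recorder {
    enabled: AtomicBool,
    /// Stand-in for the real sink (the source writes to telemetry.db).
    sink: Mutex<Vec<String>>,
}

impl Recorder {
    pub fn new(enabled: bool) -> Self {
        Self { enabled: AtomicBool::new(enabled), sink: Mutex::new(Vec::new()) }
    }

    pub fn set_enabled(&self, on: bool) {
        self.enabled.store(on, Ordering::Relaxed);
    }

    /// Records an event. The payload closure runs only when telemetry is
    /// enabled, so a disabled call costs one relaxed atomic load and a branch.
    #[inline]
    pub fn record<F: FnOnce() -> String>(&self, build: F) {
        if !self.enabled.load(Ordering::Relaxed) {
            return;
        }
        self.sink.lock().unwrap().push(build());
    }

    /// Number of events recorded so far.
    pub fn len(&self) -> usize {
        self.sink.lock().unwrap().len()
    }
}
```

Passing a closure rather than a finished payload is the important part: string formatting and allocation are skipped entirely on the disabled path.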

Performance Instrumentation

Per-tick system timing, already needed for the benchmark suite (10-PERFORMANCE.md), is exported as OTEL metrics when enabled:

sim.tick.duration_us                  # histogram: total tick time
sim.system.apply_orders_us            # histogram: per-system time
sim.system.production_us
sim.system.harvesting_us
sim.system.movement_us
sim.system.combat_us
sim.system.death_us
sim.system.triggers_us
sim.system.fog_us
sim.entities.total                    # gauge: entity count
sim.entities.by_type                  # gauge: per-component-type count
sim.memory.scratch_bytes              # gauge: TickScratch buffer usage
sim.pathfinding.requests              # counter: pathfinding queries per tick
sim.pathfinding.cache_hits            # counter: flowfield cache reuse
sim.pathfinding.duration_us           # histogram: pathfinding computation time
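
The histogram metrics above can be pictured as fixed buckets of microsecond durations, filled by timing each system call. The following is a self-contained sketch, not the engine's instrumentation: `Histogram` and its bucket bounds are hypothetical, and a real build would more likely rely on an existing metrics or tracing layer.

```rust
use std::time::Instant;

/// Hypothetical fixed-bucket histogram for durations in microseconds.
pub struct Histogram {
    /// Inclusive upper bounds of each bucket, in ascending order.
    bounds_us: Vec<u64>,
    /// One count per bound, plus a final overflow bucket.
    counts: Vec<u64>,
    sum_us: u64,
}

impl Histogram {
    pub fn new(bounds_us: Vec<u64>) -> Self {
        let n = bounds_us.len() + 1;
        Self { bounds_us, counts: vec![0; n], sum_us: 0 }
    }

    /// Adds one observation to the first bucket whose bound it fits under.
    pub fn observe(&mut self, value_us: u64) {
        let idx = self
            .bounds_us
            .iter()
            .position(|&b| value_us <= b)
            .unwrap_or(self.bounds_us.len());
        self.counts[idx] += 1;
        self.sum_us += value_us;
    }

    /// Runs `system`, records how long it took, and returns its result.
    pub fn time<T>(&mut self, system: impl FnOnce() -> T) -> T {
        let start = Instant::now();
        let out = system();
        self.observe(start.elapsed().as_micros() as u64);
        out
    }

    pub fn counts(&self) -> &[u64] {
        &self.counts
    }

    pub fn total(&self) -> u64 {
        self.counts.iter().sum()
    }

    pub fn sum_us(&self) -> u64 {
        self.sum_us
    }
}
```

One histogram per system (for example one for movement and one for combat) would correspond to the sim.system.* entries, with `time` wrapped around each system's per-tick call.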

Gameplay Event Stream

Structured events emitted during simulation — the raw material for AI training and replay enrichment:

/// Gameplay events emitted by the sim when telemetry is enabled.
/// These are structured, not printf-style; each field is queryable.
pub enum GameplayEvent {
    UnitCreated { tick: u64, entity: EntityId, unit_type: String, owner: PlayerId },
    UnitDestroyed { tick: u64, entity: EntityId, killer: Option<EntityId>, cause: DeathCause },
    CombatEngagement { tick: u64, attacker: EntityId, target: EntityId, weapon: String, damage: i32, remaining_hp: i32 },
    BuildingPlaced { tick: u64, entity: EntityId, structure_type: String, owner: PlayerId, position: WorldPos },
    HarvestDelivered { tick: u64, harvester: EntityId, resource_type: String, amount: i32, total_credits: i32 },
    OrderIssued { tick: u64, player: PlayerId, order: PlayerOrder, validated: bool, rejection_reason: Option<String> },
    PathfindingCompleted { tick: u64, entity: EntityId, from: WorldPos, to: WorldPos, path_length: u32, compute_time_us: u32 },
    DesyncDetected { tick: u64, expected_hash: u64, actual_hash: u64, player: PlayerId },
    StateSnapshot { tick: u64, state_hash: u64, entity_count: u32 },
}

These events are:

  • Emitted as OTEL log records with structured attributes (not free-text — every field is filterable)
  • Collected locally into a SQLite gameplay event log alongside replays (D034) — queryable with ad-hoc SQL without an OTEL stack
  • Optionally exported to a collector for batch analysis (tournament servers, AI training pipelines)
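
To show how such events could land in the shared schema, the sketch below maps a three-variant subset of the enum onto the category, event, severity, tick, and data columns. Everything beyond the variant shapes is an assumption made for illustration: the `u32` ID aliases, the `EventRow` type, the category and event names (the real taxonomy lives in D031-analytics.md), the severities, and the choice to write 64-bit hashes as hex strings so they survive JSON tooling that cannot hold a full u64.

```rust
// Hypothetical stand-ins for the engine's ID types.
pub type EntityId = u32;
pub type PlayerId = u32;

/// Subset of the gameplay events, enough to show the mapping.
pub enum GameplayEvent {
    UnitCreated { tick: u64, entity: EntityId, unit_type: String, owner: PlayerId },
    DesyncDetected { tick: u64, expected_hash: u64, actual_hash: u64, player: PlayerId },
    StateSnapshot { tick: u64, state_hash: u64, entity_count: u32 },
}

/// Hypothetical row shape matching a few telemetry_events columns.
#[derive(Debug, PartialEq)]
pub struct EventRow {
    pub category: &'static str,
    pub event: &'static str,
    pub severity: &'static str,
    pub tick: u64,
    pub data: String,
}

/// Minimal JSON string escaping (quotes, backslashes, control characters).
fn json_string(s: &str) -> String {
    let mut out = String::with_capacity(s.len() + 2);
    out.push('"');
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out.push('"');
    out
}

/// Maps an event to a row. Category and event names are illustrative only.
pub fn to_row(ev: &GameplayEvent) -> EventRow {
    match ev {
        GameplayEvent::UnitCreated { tick, entity, unit_type, owner } => EventRow {
            category: "gameplay",
            event: "unit.created",
            severity: "info",
            tick: *tick,
            // unit_type is an opaque, mod-defined string and is passed through as-is.
            data: format!(
                "{{\"entity\":{},\"unit_type\":{},\"owner\":{}}}",
                entity, json_string(unit_type), owner
            ),
        },
        GameplayEvent::DesyncDetected { tick, expected_hash, actual_hash, player } => EventRow {
            category: "gameplay",
            event: "desync.detected",
            severity: "error",
            tick: *tick,
            data: format!(
                "{{\"expected_hash\":\"{:016x}\",\"actual_hash\":\"{:016x}\",\"player\":{}}}",
                expected_hash, actual_hash, player
            ),
        },
        GameplayEvent::StateSnapshot { tick, state_hash, entity_count } => EventRow {
            category: "gameplay",
            event: "state.snapshot",
            severity: "debug",
            tick: *tick,
            data: format!(
                "{{\"state_hash\":\"{:016x}\",\"entity_count\":{}}}",
                state_hash, entity_count
            ),
        },
    }
}
```

Because the schema has no game-specific columns, a mod's custom unit name such as "Mammoth Mk.III" ends up only inside the JSON payload, where it stays queryable with json_extract.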

State Inspection (Development & Debugging)

A debug overlay (via bevy_egui, already in the architecture) reads live telemetry and displays:

  • Per-system tick time breakdown (bar chart)
  • Entity count by type
  • Network: RTT, order latency, jitter
  • Memory: scratch buffer usage, component storage
  • Pathfinding: active flowfields, cache hit rate
  • Fog: cells updated this tick, stagger bucket
  • Sim state hash (for manual desync comparison)

This is the “game engine equivalent of a Kubernetes dashboard”: tournament server operators and mod developers can inspect the engine’s internal state in real time.

AI / LLM Training Data Pipeline

The primary ML training pipeline is replay-first — deterministic replay files are the source of truth. Telemetry enriches replay-derived data with contextual signals not present in the order stream (see research/ml-training-pipeline-design.md):

| Consumer | Data Source | Purpose |
|---|---|---|
| ic-ai (skirmish AI) | Replay-derived training pairs + telemetry enrichment | Learn build orders, engagement timing, micro patterns |
| ic-llm (missions) | Replay-derived training pairs + telemetry enrichment | Learn what makes missions fun (engagement density, pacing, flow) |
| ic-editor (replay→scenario) | Replay event log (SQLite) | Direct extraction of waypoints, combat zones, build timelines into editor |
| ic-llm (replay→scenario) | Replay event log + context | Generate narrative, briefings, dialogue for replay-to-scenario pipeline |
| Behavioral analysis | Relay-side player profiles | APM, reaction time, input entropy → suspicion scoring (V12) |
| Balance analysis | Aggregated match outcomes | Win rates by faction/map/preset → balance tuning |
| Adaptive difficulty | Per-player gameplay patterns | Build speed, APM, unit composition → difficulty calibration |
| Community analytics | Workshop + match metadata | Popular resources, play patterns, mod adoption → recommendations |

Privacy: Gameplay events are associated with anonymized player IDs (hashed). No PII in telemetry. Players opt in to telemetry export (default: local-only for debugging). Tournament/ranked play may require telemetry for anti-cheat and certified results. See 06-SECURITY.md.

Data format: Gameplay events export as structured OTEL log records, which can be collected into Parquet/Arrow columnar format for operational analytics and balance analysis. Training pairs (fog-filtered state + orders + outcome labels) still come from deterministic replay files; telemetry only adds contextual signals that the order stream lacks (camera attention, input habits, pacing snapshots). See research/ml-training-pipeline-design.md and D031/D031-analytics.md § “AI Training Data Export.”


Sub-Pages

| Section | Topic | File |
|---|---|---|
| Analytics & Architecture | Product analytics client event taxonomy (10 categories), analytical power, architecture, implementation approach, self-hosting observability | D031-analytics.md |