Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

10 — Performance

Core Principle: Efficiency, Not Brute Force

Performance goal: a 2012 laptop with 2 cores and 4GB RAM runs a 500-unit battle smoothly. A modern machine handles 3000 units without sweating.

We don’t achieve this by throwing threads at the problem. We achieve it by wasting almost nothing — like Datadog Vector’s pipeline or Tokio’s runtime. Every cycle does useful work. Every byte of memory is intentional. Multi-core is a bonus that emerges naturally, not a crutch the engine depends on.

This is a first-class project goal and a primary differentiator over OpenRA.

Keywords: performance, efficiency-first, 2012 laptop target, 500 units, low-end hardware, Bevy/wgpu compatibility tiers, zero-allocation hot paths, ECS cache layout, simulation LOD, profiling

The Efficiency Pyramid

Ordered by impact. Each layer works on a single core. Only the top layer requires multiple cores.

                    ┌──────────────┐
                    │ Work-stealing │  Bonus: scales to N cores
                    │ (rayon/Bevy)  │  (automatic, zero config)
                  ┌─┴──────────────┴─┐
                  │  Zero-allocation  │  No heap churn in hot paths
                  │  hot paths        │  (scratch buffers, reuse)
                ┌─┴──────────────────┴─┐
                │  Amortized work       │  Spread cost across ticks
                │  (staggered updates)  │  (1/4 of units per tick)
              ┌─┴──────────────────────┴─┐
              │  Simulation LOD           │  Skip work that doesn't
              │  (adaptive detail)        │  affect the outcome
            ┌─┴──────────────────────────┴─┐
            │  Cache-friendly ECS layout    │  Data access patterns
            │  (hot/warm/cold separation)   │  that respect the hardware
          ┌─┴──────────────────────────────┴─┐
          │  Algorithmic efficiency            │  Better algorithms beat
          │  (O(n) beats O(n²) on any CPU)    │  more cores every time
          └────────────────────────────────────┘
              ▲ MOST IMPACT — start here

Layer 1: Algorithmic Efficiency

Better algorithms on one core beat bad algorithms on eight cores. This is where 90% of the performance comes from.

Pathfinding: Multi-Layer Hybrid Replaces Per-Unit A* (RA1 Pathfinder Implementation)

The RA1 game module implements the Pathfinder trait with IcPathfinder — a multi-layer hybrid combining JPS, flow field tiles, and local avoidance (see research/pathfinding-ic-default-design.md). The gains come from multiple layers:

JPS vs. A (small groups, <8 units):* JPS (Jump Point Search) prunes symmetric paths that A* explores redundantly. On uniform-cost grids (typical of open terrain in RA), JPS explores 10–100× fewer nodes than A*.

Flow field tiles vs. per-unit A (mass movement, ≥8 units sharing destination):* When 50 units move to the same area, OpenRA computes 50 separate A* paths.

OpenRA (per-unit A*):
  50 units × ~200 nodes explored × ~10 ops/node = ~100,000 operations

Flow field tile:
  1 field × ~2000 cells × ~5 ops/cell              = ~10,000 operations
  50 units × 1 lookup each                          =       50 operations
  Total                                             = ~10,050 operations

10x reduction. No threading involved.

The 51st unit ordered to the same area costs zero — the field already exists. Flow field tiles amortize across all units sharing a destination. The adaptive threshold (configurable, default 8 units) ensures flow fields are only computed when the amortization benefit exceeds the generation cost.

Hierarchical sector graph: O(1) reachability check (flood-fill domain IDs) eliminates pathfinding for unreachable destinations entirely. Coarse sector-level routing reduces the search space for detailed pathfinding.

Spatial Indexing: Grid Hash Replaces Brute-Force Range Checks (RA1 SpatialIndex Implementation)

“Which enemies are in range of this turret?”

Brute force: 1000 units × 1000 units = 1,000,000 distance checks/tick
Spatial hash: 1000 units × ~8 nearby   =     8,000 distance checks/tick

125x reduction. No threading involved.

A spatial hash divides the world into buckets. Each entity registers in its bucket. Range queries only check nearby buckets. O(1) lookup per bucket, O(k) per query where k is the number of nearby entities (typically < 20). The bucket size is a tunable parameter independent of any game grid — the same spatial hash structure works for grid-based and continuous-space games.

Hierarchical Pathfinding: Coarse Then Fine

IcPathfinder’s Layer 2 breaks the map into ~32×32 cell sectors. Path between sectors first (few nodes, fast), then path within the current sector only. Most of the map is never pathfinded at all. Units approaching a new sector compute the next fine-grained path just before entering. Combined with JPS (Layer 3), this reduces pathfinding cost by orders of magnitude compared to flat A*.

Layer 2: Cache-Friendly Data Layout

ECS Archetype Storage (Bevy provides this)

OOP (cache-hostile, typical C# pattern):
  Unit objects on heap: [pos, health, vel, name, sprite, audio, ...]
  Iterating 1000 positions touches 1000 scattered memory locations
  Cache miss rate: high — each unit object spans multiple cache lines

ECS archetype storage (cache-friendly):
  Positions:  [p0, p1, p2, ... p999]   ← 8KB contiguous, fits in L1 cache
  Healths:    [h0, h1, h2, ... h999]   ← 4KB contiguous, fits in L1 cache
  Movement system reads positions sequentially → perfect cache utilization

1000 units × 8-byte positions = 8KB. L1 cache on any CPU since ~2008 is at least 32KB. The entire position array fits in L1. Movement for 1000 units runs from the fastest memory on the chip.

Hot / Warm / Cold Separation

HOT (every tick, must be contiguous):
  Position (8B), Velocity (8B), Health (4B), SimLOD (1B), FogVisible (1B)
  → ~22 bytes per entity × 1000 = 22KB — fits in L1

WARM (some ticks, when relevant):
  Armament (16B), PathState (32B), BuildQueue (24B), HarvesterCargo (8B)
  → Separate archetype arrays, pulled into cache only when needed

COLD (rarely accessed, lives in Resources):
  UnitDef (name, icon, prereqs), SpriteSheet refs, AudioClip refs
  → Loaded once, referenced by ID, never iterated in hot loops

Design components to be small. A Position is 2 integers, not a struct with name, description, and sprite reference. The movement system pulls only positions and velocities — 16 bytes per entity, 16KB for 1000 units, pure L1.

Layer 3: Simulation LOD (Adaptive Detail)

Not all units need full processing every tick. A harvester driving across an empty map with no enemies nearby doesn’t need per-tick pathfinding, collision detection, or animation state updates.

#![allow(unused)]
fn main() {
pub enum SimLOD {
    /// Full processing: pathfinding, collision, precise targeting
    Full,
    /// Reduced: simplified pathing, broadphase collision only
    Reduced,
    /// Minimal: advance along pre-computed path, check arrival
    Minimal,
}

fn assign_sim_lod(
    unit_pos: WorldPos,
    in_combat: bool,
    near_enemy: bool,
    near_friendly_base: bool,  // deterministic — same on all clients
) -> SimLOD {
    if in_combat || near_enemy { SimLOD::Full }
    else if near_friendly_base { SimLOD::Reduced }
    else { SimLOD::Minimal }
}
}

Determinism requirement: LOD assignment must be based on game state (not camera position), so all clients assign the same LOD. “Near enemy” and “near base” are deterministic queries.

Impact: In a typical game, only 20-30% of units are in active combat at any moment. The other 70-80% use Reduced or Minimal processing. Effective per-tick cost drops proportionally.

Layer 4: Amortized Work (Staggered Updates)

Expensive systems don’t need to process all entities every tick. Spread the cost evenly.

#![allow(unused)]
fn main() {
fn pathfinding_system(
    tick: Res<CurrentTick>,
    query: Query<(Entity, &Position, &MoveTarget, &SimLOD), With<NeedsPath>>,
    pathfinder: Res<Box<dyn Pathfinder>>,  // D013/D045 trait seam
) {
    let group = tick.0 % 4;  // 4 groups, each updated every 4 ticks

    for (entity, pos, target, lod) in &query {
        let should_update = match lod {
            SimLOD::Full    => entity.index() % 4 == group,    // every 4 ticks
            SimLOD::Reduced => entity.index() % 8 == (group * 2) % 8,  // every 8 ticks
            SimLOD::Minimal => false,  // never replan, just follow existing path
        };

        if should_update {
            recompute_path(entity, pos, target, &*pathfinder);
        }
    }
}
}

API note: This is pseudocode for scheduling/amortization. The exact Pathfinder resource type depends on the game module’s dispatch strategy (D013/D045). Hot-path batch queries should prefer caller-owned scratch (*_into APIs) over allocation-returning helpers.

Result: Pathfinding cost per tick drops 75% for Full-LOD units, 87.5% for Reduced, 100% for Minimal. Combined with SimLOD, a 1000-unit game might recompute ~50 paths per tick instead of 1000.

Stagger Schedule

SystemFull LODReduced LODMinimal LOD
Pathfinding replanEvery 4 ticksEvery 8 ticksNever (follow path)
Fog visibilityEvery tickEvery 2 ticksEvery 4 ticks
AI re-evaluationEvery 2 ticksEvery 4 ticksEvery 8 ticks
Collision detectionEvery tickEvery 2 ticksBroadphase only

Determinism preserved: The stagger schedule is based on entity ID and tick number — deterministic on all clients.

AI Computation Budget

AI runs on the same stagger/amortization principles as the rest of the sim. The default PersonalityDrivenAi (D043) uses a priority-based manager hierarchy where each manager runs on its own tick-gated schedule — cheap decisions run often, expensive decisions run rarely (pattern used by EA Generals, 0 A.D. Petra, and MicroRTS). Full architectural detail in D043 (decisions/09d-gameplay.md); survey analysis in research/rts-ai-implementation-survey.md.

AI ComponentFrequencyTarget TimeApproach
Harvester assignmentEvery 4 ticks< 0.1msNearest-resource lookup
Defense responseEvery tick (reactive)< 0.1msEvent-driven, not polling
Unit productionEvery 8 ticks< 0.2msPriority queue evaluation
Building placementOn demand< 1.0msInfluence map lookup
Attack planningEvery 30 ticks< 2.0msComposition check + timing
Strategic reassessmentEvery 60 ticks< 5.0msFull state evaluation
Total per tick (amortized)< 0.5msBudget for 500 units

All AI working memory (influence maps, squad rosters, composition tallies, priority queues) is pre-allocated in AiScratch — analogous to TickScratch (Layer 5). Zero per-tick heap allocation. Influence maps are fixed-size arrays, cleared and rebuilt on their evaluation schedule. The AiStrategy::tick_budget_hint() method (D041) provides a hard microsecond cap — if the budget is exhausted mid-evaluation, the AI returns partial results and uses cached plans from the previous complete evaluation.

Layer 5: Zero-Allocation Hot Paths

Heap allocation is expensive: the allocator touches cold memory, fragments the heap, and (in C#) creates GC pressure. Rust eliminates GC, but allocation itself still costs cache misses.

#![allow(unused)]
fn main() {
/// Pre-allocated scratch space reused every tick.
/// Initialized once at game start, never reallocated.
/// Pathfinder and SpatialIndex implementations maintain their own scratch buffers
/// internally — pathfinding scratch is not in this struct.
pub struct TickScratch {
    damage_events: Vec<DamageEvent>,       // capacity: 4096
    spatial_results: Vec<EntityId>,        // capacity: 2048 (reused by SpatialIndex queries)
    visibility_dirty: Vec<EntityId>,       // capacity: 1024 (entities needing fog update)
    validated_orders: Vec<ValidatedOrder>,  // capacity: 256
    combat_pairs: Vec<(Entity, Entity)>,   // capacity: 2048
}

impl TickScratch {
    fn reset(&mut self) {
        // .clear() sets length to 0 but keeps allocated memory
        // Zero bytes allocated on heap during the hot loop
        self.damage_events.clear();
        self.spatial_results.clear();
        self.visibility_dirty.clear();
        self.validated_orders.clear();
        self.combat_pairs.clear();
    }
}
}

Per-tick allocation target: zero bytes. All temporary data goes into pre-allocated scratch buffers. clear() resets without deallocating. The hot loop touches only warm memory.

This is a fundamental advantage of Rust over C# for games. Idiomatic C# allocates many small objects per tick (iterators, LINQ results, temporary collections, event args), each of which contributes to GC pressure. Our engine targets zero per-tick allocations.

String Interning (Compile-Time Resolution for Runtime Strings)

IC is string-heavy by design — YAML keys, trait names, mod identifiers, weapon names, locomotor types, condition names, asset paths, Workshop package IDs. Comparing these strings at runtime (byte-by-byte, potentially cache-cold) in every tick is wasteful when the set of valid strings is known at load time.

String interning resolves all YAML/mod strings to integer IDs once during loading. All runtime comparisons use the integer — a single CPU instruction instead of a variable-length byte scan.

#![allow(unused)]
fn main() {
/// Interned string handle — 4 bytes, Copy, Eq is a single integer comparison.
/// Stable across save/load (the intern table is part of snapshot state, D010).
#[derive(Clone, Copy, PartialEq, Eq, Hash, Serialize, Deserialize)]
pub struct InternedId(u32);

/// String intern table — built during YAML rule loading, immutable during gameplay.
/// Part of the sim snapshot for deterministic save/resume.
pub struct StringInterner {
    id_to_string: Vec<String>,                  // index → string (display, debug, serialization)
    string_to_id: HashMap<String, InternedId>,  // string → index (used at load time only)
}

impl StringInterner {
    /// Resolve a string to its interned ID. Called during YAML loading — never in hot paths.
    pub fn intern(&mut self, s: &str) -> InternedId {
        if let Some(&id) = self.string_to_id.get(s) {
            return id;
        }
        let id = InternedId(self.id_to_string.len() as u32);
        self.id_to_string.push(s.to_owned());
        self.string_to_id.insert(s.to_owned(), id);
        id
    }

    /// Look up the original string for display/debug. Not used in hot paths.
    pub fn resolve(&self, id: InternedId) -> &str {
        &self.id_to_string[id.0 as usize]
    }
}
}

Where interning eliminates runtime string work:

SystemWithout interningWith interning
Condition checks (D028)String compare per condition per unit per tickInternedId == InternedId (1 instruction)
Trait alias resolution (D023/D027)HashMap lookup by string at rule evaluationPre-resolved at load time to canonical InternedId
WASM mod API boundaryString marshaling across host/guest (allocation + copy)u32 type IDs — already designed this way in 04-MODDING.md
Mod stacking namespace (D062)String-keyed path lookups in the virtual namespaceInternedId-keyed flat table
Versus table keysArmor/weapon type strings per damage calculationInternedId indices into flat [i32; N] array (already done for VersusTable)
Notification dedupString comparison for cooldown checksInternedId comparison

Interning generalizes the VersusTable principle. The VersusTable flat array (documented above in Layer 2) already converts armor/weapon type enums to integer indices for O(1) lookup. String interning extends this approach to every string-keyed system — conditions, traits, mod paths, asset names — without requiring hardcoded enums. The VersusTable uses compile-time enum indices; StringInterner provides the same benefit for data-driven strings loaded from YAML.

What NOT to intern: Player-facing display strings (chat messages, player names, localization text). These are genuinely dynamic and not used in hot-path comparisons. Interning targets the engine vocabulary — the fixed set of identifiers that YAML rules, conditions, and mod APIs reference repeatedly.

Snapshot integration (D010): The StringInterner is part of the sim snapshot. When saving/loading, the intern table serializes alongside game state, ensuring that InternedId values remain stable across save/resume. Replays record the intern table at keyframes. This is the same approach Factorio uses for its prototype string IDs — resolved once during data loading, stable for the session lifetime.

Global Allocator: mimalloc

The engine uses mimalloc (Microsoft, MIT license) as the global allocator on desktop and mobile targets. WASM uses Rust’s built-in dlmalloc (the default for wasm32-unknown-unknown).

Why mimalloc:

FactormimallocSystem allocatorjemalloc
Small-object speed5x faster than glibcBaselineGood but slower than mimalloc
Multi-threaded (Bevy/rayon)Per-thread free lists, single-CAS cross-thread freeContended on LinuxGood but higher RSS
Fragmentation (60+ min sessions)Good (temporal cadence, periodic coalescing)Varies by platformBest, but not enough to justify trade-offs
RSS overheadLow (~50% reduction vs glibc in some workloads)Platform-dependentModerate (arena-per-thread)
Windows supportNativeNativeWeak (caveats)
WASM supportNoYes (dlmalloc)No
LicenseMITN/ABSD 2-clause

Alternatives rejected:

  • jemalloc: Better fragmentation resistance but weaker Windows support, no WASM, higher RSS on many-core machines, slower for small objects (Bevy’s dominant allocation pattern). Only advantage is profiling, which mimalloc’s built-in stats + the counting wrapper replicate.
  • tcmalloc (Google): Modern version is Linux-only. Does not meet cross-platform requirements.
  • rpmalloc (Embark Studios): Viable but Embark wound down operations. Less community momentum. No WASM support.
  • System allocator: 5x slower on Linux multi-threaded workloads. Unacceptable for Bevy’s parallel ECS scheduling.

Per-target allocator selection:

TargetAllocatorRationale
Windows / macOS / LinuxmimallocBest small-object perf, low RSS, native cross-platform
WASMdlmalloc (Rust default)Built-in, adequate for single-threaded WASM context
iOS / Androidmimalloc (fallback: system)mimalloc builds for both; system is safe fallback if build issues arise
CI / Debug buildsCountingAllocator<MiMalloc>Wraps mimalloc with per-tick allocation counting (feature-gated)

Implementation pattern:

#![allow(unused)]
fn main() {
// ic-game/src/main.rs (or ic-app entry point)
#[cfg(not(target_arch = "wasm32"))]
#[global_allocator]
static GLOBAL: mimalloc::MiMalloc = mimalloc::MiMalloc;
// WASM targets fall through to Rust's default dlmalloc — no override needed.
}

Allocation-counting wrapper for CI regression detection:

In CI/debug builds (behind a counting-allocator feature flag), a thin wrapper around mimalloc tracks per-tick allocation counts:

#![allow(unused)]
fn main() {
/// Wraps the inner allocator with atomic counters.
/// Reset counters at tick boundary; assert both are 0 after tick_system() completes.
/// Enabled only in CI/debug builds via feature flag.
pub struct CountingAllocator<A: GlobalAlloc> {
    inner: A,
    alloc_count: AtomicU64,
    dealloc_count: AtomicU64,
}
}

This catches regressions where new code introduces heap allocations in the sim hot path. The benchmark bench_tick_zero_allocations() asserts that alloc_count == 0 after a full tick with 1000 units — if it fails, someone added a heap allocation to a hot path.

Why the allocator matters less than it seems for IC: The sim (ic-sim) targets zero allocations during tick processing (Layer 5). The allocator’s impact is primarily on the loading phase (asset parsing, ECS setup, mod compilation), Bevy internals (archetype storage, system scheduling, renderer), menu/UI, and networking buffers. None of these affect simulation determinism. The allocator is not deterministic (pointer values vary across runs), but since ic-sim performs zero allocations during ticks, this is irrelevant for lockstep determinism.

mimalloc built-in diagnostics: Enable via MI_STAT=2 environment variable for per-thread allocation statistics, peak RSS, segment usage. Useful for profiling the loading phase and identifying memory bloat without external tools.

Layer 6: Work-Stealing Parallelism (Bonus Scaling)

After layers 1-5, the engine is already fast on a single core. Parallelism scales it further on better hardware.

How Bevy + rayon Work-Stealing Operates

Rayon (used internally by Bevy) creates exactly one thread per CPU core. No more, no less. Work is distributed via lock-free work-stealing queues:

2-core laptop:
  Thread 0: [pathfind units 0-499]
  Thread 1: [pathfind units 500-999]
  → Both busy, no waste

8-core desktop:
  Thread 0: [pathfind units 0-124]
  Thread 1: [pathfind units 125-249]
  ...
  Thread 7: [pathfind units 875-999]
  → All busy, 4x faster than laptop

16-core workstation:
  → Same code, 16 threads, even faster
  → No configuration change

No thread is ever idle if work exists. No thread is ever created or destroyed during gameplay. This is the Tokio/Vector model applied to CPU-bound game logic.

Where Parallelism Actually Helps

Only systems where per-entity work is independent and costly:

#![allow(unused)]
fn main() {
// YES — pathfinding is expensive and independent per unit
fn pathfinding_system(query: Query<...>, pathfinder: Res<Box<dyn Pathfinder>>) {
    let results: Vec<_> = query.par_iter()
        .filter(|(_, _, _, lod)| lod.should_update_path(tick))
        .map(|(entity, pos, target, _)| {
            (entity, pathfinder.find_path(pos, &target.dest))
        })
        .collect();

    // Sort for determinism, then apply sequentially
    apply_sorted(results);
}

// NO — movement is cheap per unit, parallelism overhead not worth it
fn movement_system(mut query: Query<(&mut Position, &Velocity)>) {
    // Just iterate. Adding and subtracting integers.
    // Parallelism overhead would exceed the computation itself.
    for (mut pos, vel) in &mut query {
        pos.x += vel.dx;
        pos.y += vel.dy;
    }
}
}

API note: This parallel example illustrates where parallelism helps, not the exact final pathfinder interface. In IC, parallel work may happen either inside IcPathfinder or in a pathfinding system that batches deterministic requests/results through the selected Pathfinder implementation. In both cases, caller-owned scratch and deterministic result ordering still apply.

Rule of thumb: Only parallelize systems where per-entity work exceeds ~1 microsecond. Simple arithmetic on components is faster to iterate sequentially than to distribute.