Performance Simulator

Concept

Network Flow Simulator uses analytic queuing models and Monte Carlo simulation to evaluate network performance without packet-level discrete event simulation. Given a topology and traffic demands, it pushes billions of flow iterations through queuing models in seconds, identifying congestion bottlenecks probabilistically and projecting capacity headroom across carrier-scale networks (100k+ nodes).

The core tradeoff: sacrifice per-packet fidelity for an orders-of-magnitude speedup. An M/G/1 queuing model with Pollaczek-Khinchine mean value analysis produces utilization, delay, and loss estimates that are analytically exact for the modeled traffic class (that is, under the model's arrival and service-time assumptions), and it runs in seconds where a packet simulator would take hours.
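
To make the mean value analysis concrete, here is a minimal sketch of the Pollaczek-Khinchine formula for a single M/G/1 queue. The struct and function names are hypothetical and are not taken from the netflowsim codebase; the inputs are the arrival rate, the mean service time, and the squared coefficient of variation (CV²) of the service time.

```rust
/// Mean-value metrics for one M/G/1 queue (hypothetical struct, for illustration only).
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Mg1Metrics {
    /// Utilization rho = lambda * E[S].
    pub utilization: f64,
    /// Mean time spent waiting in the queue (Wq).
    pub mean_wait: f64,
    /// Mean time in the system: waiting plus service (W = Wq + E[S]).
    pub mean_sojourn: f64,
}

/// Pollaczek-Khinchine mean value formula:
///   Wq = rho * E[S] * (1 + CV^2) / (2 * (1 - rho))
/// Returns None for invalid inputs or for an unstable queue (rho >= 1).
pub fn mg1_metrics(arrival_rate: f64, mean_service: f64, cv2: f64) -> Option<Mg1Metrics> {
    let inputs_ok = arrival_rate.is_finite()
        && arrival_rate >= 0.0
        && mean_service.is_finite()
        && mean_service > 0.0
        && cv2.is_finite()
        && cv2 >= 0.0;
    if !inputs_ok {
        return None;
    }
    let rho = arrival_rate * mean_service;
    if rho >= 1.0 {
        return None; // the queue grows without bound; no steady state exists
    }
    let mean_wait = rho * mean_service * (1.0 + cv2) / (2.0 * (1.0 - rho));
    Some(Mg1Metrics {
        utilization: rho,
        mean_wait,
        mean_sojourn: mean_wait + mean_service,
    })
}
```

With CV² = 1 the formula reduces to M/M/1, and with CV² = 0 it reduces to M/D/1, whose mean wait is half the M/M/1 value at the same load.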

Usage

# Run a Monte Carlo simulation
netflowsim simulate --topology network.graphml --iterations 1000

# Compare queuing models side by side
netflowsim compare --topology network.graphml --models mm1,md1,mg1-pareto

# Generate routing matrix from FIBs
netflowsim generate-routing --fibs routing-tables/

# Run N-1 failure analysis
netflowsim n1-analysis --topology network.graphml

# Generate report with all analysis modules
netflowsim report --config simulation.json

Configuration is driven by JSON config files (--config), with CLI flags overriding config values, and defaults filling the rest.
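
The precedence described above (CLI flag, then config file, then default) can be sketched with optional fields. This is an illustration only: the struct names, the seed field, and the default values are assumptions, not the actual netflowsim schema.

```rust
/// Settings as they may appear in one source (CLI or JSON config).
/// Every field is optional because a source may leave it unset.
/// Field names are hypothetical.
#[derive(Debug, Clone, Default, PartialEq)]
pub struct PartialSettings {
    pub topology: Option<String>,
    pub iterations: Option<u64>,
    pub seed: Option<u64>,
}

/// Fully resolved settings used by a run.
#[derive(Debug, Clone, PartialEq)]
pub struct Settings {
    pub topology: Option<String>,
    pub iterations: u64,
    pub seed: u64,
}

// Assumed defaults, chosen for this example only.
pub const DEFAULT_ITERATIONS: u64 = 1000;
pub const DEFAULT_SEED: u64 = 0;

/// Resolve each field independently: a CLI value wins, then the config value,
/// then the built-in default.
pub fn resolve(cli: &PartialSettings, config: &PartialSettings) -> Settings {
    Settings {
        topology: cli.topology.clone().or_else(|| config.topology.clone()),
        iterations: cli
            .iterations
            .or(config.iterations)
            .unwrap_or(DEFAULT_ITERATIONS),
        seed: cli.seed.or(config.seed).unwrap_or(DEFAULT_SEED),
    }
}
```

Resolving per field rather than per file means a config can supply the topology while a single CLI flag overrides only the iteration count.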

Quick Facts


Status	Recently Updated
Stack	Rust

Core Value

netflowsim provides rapid, massive-scale network performance analysis by using analytic queuing models and Monte Carlo simulations instead of packet-level discrete event simulation. It enables network engineers to validate topologies and routing strategies against billions of flow iterations in seconds, identify bottlenecks probabilistically, test network resilience under failure scenarios, and project capacity headroom for carrier-scale networks (100k+ nodes).

Primary Objectives

Performance: Utilize Rust and Rayon to maximize multi-core hardware utilization.
Scalability: Handle massive carrier-scale topologies (100k+ nodes) via Petgraph and efficient data structures.
Decoupling: Clearly separate the Routing Matrix generation (packet-sim logic) from the Flow Simulation (queuing logic).
Visibility: Provide high-performance geographic visualization via MVT/Martin with multi-region support.

Milestones

✅ v1.0 MVP — Shipped 2026-02-20
✅ v1.1 Enhanced Queuing Models — Shipped 2026-02-22
✅ v2.0 Production-Grade Scale — Shipped 2026-03-01

See .planning/MILESTONES.md for full milestone history.

Current Milestone: v2.1 Performance Optimization

Goal: Reduce memory footprint and runtime overhead, and restore throughput, while maintaining all v2.0 features.

Target features:

Memory optimization to <6GB at 100k nodes (from 8.37GB)
Feature overhead reduction to < vs baseline (from )
Throughput recovery to >500k flows/sec with all features (from 53.2k)

Current State

Version: v2.0 (shipped 2026-03-01)
Codebase: 14,813 lines of Rust (+ from v1.1)
Tech Stack: Rust, Petgraph (StableGraph), Rayon, Serde, Plotters, Criterion, approx, kolmogorov_smirnov, statrs
CLI Commands: simulate, compare, generate-routing, report, n1-analysis

Shipped Features (v1.0 + v1.1 + v2.0):

High-performance Monte Carlo simulation (4.5M flows/sec baseline at 100k nodes)
M/G/1 queuing theory with Pollaczek-Khinchine formula
General service time distributions (Pareto, LogNormal, Weibull, Exponential, Deterministic)
Automatic CV² calculation from distribution parameters
Parallel comparison mode across multiple queuing models
Routing matrix generation from distributed FIBs
Path tracing with ECMP and loop detection
Advanced statistical analysis (percentiles, CDFs)
CDF plot rendering with theoretical overlays (PNG + SVG)
[v2.0] Time-series behavior tracking with bounded memory and adaptive sampling
[v2.0] Bottleneck detection and ranking with utilization threshold
[v2.0] Correlation analysis with time-lagged causality detection
[v2.0] Capacity planning projections with linear regression and saturation timeline
[v2.0] Cascading failure detection with BFS depth/breadth metrics
[v2.0] Multi-region geographic topology modeling with latency zones
[v2.0] Region-aware traffic generation with configurable locality bias
[v2.0] Region-filtered statistics and bottlenecks
[v2.0] GeoJSON export with region metadata for netvis visualization
[v2.0] Comprehensive validation suite (30 scenario feature matrix, 16 Little’s Law tests)
Dynamic simulation with events
N-1 failure analysis
Validation suite (Little’s Law, K-S tests, regression benchmarks)
Config-driven execution with --config flag
JSON error envelopes for CLI failures
Schema evolution with v1.0/v1.1 backward compatibility
Reproducible results (embedded + sidecar run_config)
Incremental routing cache for topology changes
Checkpointing system for long-running simulations
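
As an illustration of the capacity planning feature listed above, the sketch below fits a least-squares line to (time, utilization) samples and extrapolates when the trend reaches a limit. The function names and the tuple layout are assumptions made for this example; the real module may differ.

```rust
/// Ordinary least-squares fit of y = slope * x + intercept.
/// Returns None with fewer than two points or when all x values are equal.
pub fn fit_line(points: &[(f64, f64)]) -> Option<(f64, f64)> {
    if points.len() < 2 {
        return None;
    }
    let n = points.len() as f64;
    let mean_x = points.iter().map(|p| p.0).sum::<f64>() / n;
    let mean_y = points.iter().map(|p| p.1).sum::<f64>() / n;
    let sxx: f64 = points.iter().map(|p| (p.0 - mean_x).powi(2)).sum();
    let sxy: f64 = points
        .iter()
        .map(|p| (p.0 - mean_x) * (p.1 - mean_y))
        .sum();
    if sxx == 0.0 {
        return None;
    }
    let slope = sxy / sxx;
    Some((slope, mean_y - slope * mean_x))
}

/// Time at which the fitted trend reaches `limit` (for example 1.0 for full utilization).
/// Returns None when the trend is flat or declining, since it never saturates.
/// If the fitted line is already above `limit`, the result lies in the past.
pub fn saturation_time(points: &[(f64, f64)], limit: f64) -> Option<f64> {
    let (slope, intercept) = fit_line(points)?;
    if slope <= 0.0 {
        return None;
    }
    Some((limit - intercept) / slope)
}
```

A straight-line projection is only as good as the assumption that growth stays linear, so the saturation time is best read as a rough planning horizon.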

Performance Characteristics (100k nodes, 1000 iterations, all v2.0 features):

Runtime: 187.93s (well under 10min target)
Memory: 8.37GB (2.1× the 4GB target - feature overhead documented)
Throughput: 53.2k flows/sec ( overhead from v2.0 features vs baseline)

Known Limitations:

Dynamic simulation results use different schema than static (by design)
No interactive visualization (plots are static images)
v2.0 feature overhead: runtime increase, throughput reduction vs baseline (optimization opportunity for v2.1)
Placeholder link IDs in bottleneck/capacity analysis (infrastructure ready for per-link tracking)

User Feedback Themes:

Functional correctness: validation success (feature matrix, Little’s Law, backward compatibility)
Performance at scale: runtime target met; memory and throughput gaps documented and non-blocking

Technical Debt:

Memory optimization opportunities (8.37GB vs 4GB target at 100k nodes)
Feature overhead reduction ( runtime increase from time-series + analysis modules)
Per-link tracking enhancement (replace placeholder link IDs)
Statistical equivalence test requires v1.0 baseline archive

Requirements

# Validated (v1.0)

✓ Simulate 1M+ flows through 10k+ nodes in under 1 second — v1.0
✓ Support standard graph formats (GraphML/JSON) — v1.0
✓ Accurate analytic modeling (M/M/1, M/D/1) validated against theoretical benchmarks — v1.0
✓ Real-time visualization of congestion hotspots — v1.0
✓ FIB ingestion and routing matrix generation — v1.0
✓ Path tracing with ECMP support and loop detection — v1.0
✓ Integration with topogen and packet simulators — v1.0
✓ Statistical analysis with percentiles (p50/p90/p95/p99) — v1.0
✓ CDF plot generation for latency, throughput, queueing delay, link utilization — v1.0
✓ Automated bottleneck detection with Top-K ranking — v1.0
✓ Dynamic simulation with link/node failure events — v1.0
✓ Convergence tracking and success rate monitoring — v1.0
✓ N-1 failure analysis for critical component identification — v1.0
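
The percentile requirement above pairs with the nearest-rank method recorded under Key Decisions. A minimal sketch of that method follows; the function name and signature are hypothetical.

```rust
/// Nearest-rank percentile: sort the samples, take rank = ceil(p / 100 * n),
/// and return the value at that 1-based rank. No interpolation is performed,
/// so the result is always one of the observed samples.
/// Returns None for an empty input or for p outside (0, 100].
pub fn nearest_rank(samples: &[f64], p: f64) -> Option<f64> {
    if samples.is_empty() || !(p > 0.0 && p <= 100.0) {
        return None;
    }
    let mut sorted = samples.to_vec();
    // total_cmp gives a total order over f64, so sorting never panics on NaN.
    sorted.sort_by(|a, b| a.total_cmp(b));
    let rank = ((p / 100.0) * sorted.len() as f64).ceil() as usize;
    // Clamp guards against floating-point rounding at the edges.
    let rank = rank.clamp(1, sorted.len());
    Some(sorted[rank - 1])
}
```

For the five samples 15, 20, 35, 40, 50, the p50 rank is ceil(2.5) = 3, which selects 35.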

# Validated (v1.1)

✓ M/G/1 queuing model with Pollaczek-Khinchine formula — v1.1
✓ Heavy-tailed distributions (Pareto, LogNormal, Weibull) — v1.1
✓ Custom distribution framework with automatic CV² calculation — v1.1
✓ Queuing model validation suite (Little’s Law, K-S tests, regression benchmarks) — v1.1
✓ Performance comparison across queuing models (parallel execution) — v1.1
✓ Schema evolution with backward compatibility (v1.0 → v1.1) — v1.1
✓ CLI config-driven execution with reproducible results — v1.1
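
The automatic CV² calculation can be illustrated with closed-form expressions for a few of the supported distributions. This sketch uses a hypothetical enum rather than the project's distribution framework, and it omits Weibull because its CV² needs the gamma function, which the Rust standard library does not provide.

```rust
/// Service-time distributions with a closed-form squared coefficient of
/// variation (CV^2 = variance / mean^2). Hypothetical type, for illustration.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum ServiceDist {
    Exponential,
    Deterministic,
    /// Pareto (type I) with shape `alpha`; the scale cancels out of CV^2.
    Pareto { alpha: f64 },
    /// Log-normal with log-scale standard deviation `sigma`; mu cancels out.
    LogNormal { sigma: f64 },
}

impl ServiceDist {
    /// Returns CV^2, or None when it is undefined or infinite.
    pub fn cv2(&self) -> Option<f64> {
        match *self {
            ServiceDist::Exponential => Some(1.0),
            ServiceDist::Deterministic => Some(0.0),
            // Variance is finite only for alpha > 2, giving CV^2 = 1 / (alpha * (alpha - 2)).
            ServiceDist::Pareto { alpha } if alpha > 2.0 => Some(1.0 / (alpha * (alpha - 2.0))),
            ServiceDist::Pareto { .. } => None,
            // CV^2 = exp(sigma^2) - 1.
            ServiceDist::LogNormal { sigma } if sigma.is_finite() => {
                Some((sigma * sigma).exp() - 1.0)
            }
            ServiceDist::LogNormal { .. } => None,
        }
    }
}
```

Returning None for Pareto with alpha ≤ 2 is a choice made for this sketch; the project itself records a decision to warn in that case rather than reject the input.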

# Validated (v2.0)

✓ 100k+ nodes analyzed in <10min — v2.0
✓ Multi-scale profiling infrastructure (25k-125k nodes) — v2.0
✓ Incremental routing cache for topology changes — v2.0
✓ Post-execution checkpointing with deterministic resume — v2.0
✓ Time-series behavior tracking over simulation duration — v2.0
✓ Bounded memory streaming with adaptive sampling — v2.0
✓ Rolling window metrics (mean/max/count) — v2.0
✓ Trend plot generation (utilization, bottlenecks, latency percentiles) — v2.0
✓ Bottleneck detection with utilization threshold — v2.0
✓ Correlation analysis with time-lagged causality detection — v2.0
✓ Capacity planning projections with saturation timeline — v2.0
✓ Cascading failure detection with BFS depth/breadth metrics — v2.0
✓ Geographic zone definitions (region_id, availability_zone) — v2.0
✓ Inter-region latency zones (WAN link classification) — v2.0
✓ Region-aware traffic generation with locality bias — v2.0
✓ Region-filtered statistics and bottlenecks — v2.0
✓ GeoJSON export with region metadata — v2.0
✓ Feature matrix validation (30/30 scenarios passed) — v2.0
✓ Queuing theory correctness (16/16 Little’s Law tests) — v2.0
✓ Backward compatibility (v1.0/v1.1 → v2.0 schema migration) — v2.0
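
The time-lagged correlation analysis listed above can be sketched with a hand-written Pearson coefficient evaluated at several lags. The function names are hypothetical, and `max_lag` is passed in directly here rather than derived adaptively. A strong correlation at a nonzero lag indicates a lead-lag relationship between two series, which is suggestive of causality but does not prove it.

```rust
/// Pearson correlation coefficient of two equally long series.
/// Returns None for mismatched lengths, fewer than two points,
/// or a constant series (zero variance).
pub fn pearson(x: &[f64], y: &[f64]) -> Option<f64> {
    if x.len() != y.len() || x.len() < 2 {
        return None;
    }
    let n = x.len() as f64;
    let mean_x = x.iter().sum::<f64>() / n;
    let mean_y = y.iter().sum::<f64>() / n;
    let (mut sxy, mut sxx, mut syy) = (0.0, 0.0, 0.0);
    for (a, b) in x.iter().zip(y) {
        let (dx, dy) = (a - mean_x, b - mean_y);
        sxy += dx * dy;
        sxx += dx * dx;
        syy += dy * dy;
    }
    if sxx == 0.0 || syy == 0.0 {
        return None;
    }
    Some(sxy / (sxx * syy).sqrt())
}

/// Correlation between x[t] and y[t + lag]: does x lead y by `lag` samples?
pub fn lagged_pearson(x: &[f64], y: &[f64], lag: usize) -> Option<f64> {
    if x.len() != y.len() || lag >= x.len() {
        return None;
    }
    pearson(&x[..x.len() - lag], &y[lag..])
}

/// Lag in 0..=max_lag with the largest absolute correlation, if any is defined.
pub fn best_lag(x: &[f64], y: &[f64], max_lag: usize) -> Option<(usize, f64)> {
    let mut best: Option<(usize, f64)> = None;
    for lag in 0..=max_lag {
        if let Some(r) = lagged_pearson(x, y, lag) {
            if best.map_or(true, |(_, b)| r.abs() > b.abs()) {
                best = Some((lag, r));
            }
        }
    }
    best
}
```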

# Active (v2.1)

Memory optimization to <6GB at 100k nodes (currently 8.37GB)
Feature overhead reduction to < vs baseline (currently )
Throughput recovery to >500k flows/sec with features (currently 53.2k)

# Active (Future Work)

Per-link tracking to replace placeholder link IDs
Mid-execution checkpoint/resume (post-execution checkpointing is complete)
Enhanced visualization options (interactive plots, dashboards)
Additional M/G/1 distributions (Gamma, Beta, custom user-defined)

# Out of Scope

Real-time packet-level simulation — use dedicated packet simulators
GUI interface — CLI-first approach with programmatic access
Network configuration management — focus on analysis, not orchestration
4GB memory target — revised to 8-10GB for full feature set (architectural constraint)
1M flows/sec with all features — baseline achievable, feature overhead optimization deferred to v2.1

Tech Stack

Language: Rust
Graph Library: Petgraph (StableGraph for dynamic mutations)
Parallelism: Rayon (parallel Monte Carlo execution)
Serialization: Serde (JSON), Bincode (checkpoints), Postcard (future)
Visualization: Plotters (PNG/SVG), Martin (Tileserver), MVT (Mapbox Vector Tiles)
Profiling: DHAT (heap profiling with feature flag)
Math: statrs (distributions), manual implementations (Pearson correlation, linear regression)
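
Parallel Monte Carlo runs stay reproducible when every iteration derives its own seed from a base seed and its index, so the result does not depend on thread scheduling. The sketch below shows the idea with standard-library threads and a small SplitMix64 generator standing in for Rayon and the project's actual random number generator; all names are hypothetical, and `draws` is assumed to be greater than zero.

```rust
use std::thread;

/// Seed for one iteration: base seed plus iteration index (wrapping on overflow).
pub fn iteration_seed(base_seed: u64, index: u64) -> u64 {
    base_seed.wrapping_add(index)
}

/// Minimal SplitMix64 generator, used here only to keep the example dependency-free.
pub struct SplitMix64(u64);

impl SplitMix64 {
    pub fn new(seed: u64) -> Self {
        Self(seed)
    }

    pub fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }

    /// Uniform value in [0, 1) built from the top 53 bits.
    pub fn next_f64(&mut self) -> f64 {
        (self.next_u64() >> 11) as f64 / (1u64 << 53) as f64
    }
}

/// One stand-in "iteration": the mean of `draws` uniform samples (draws > 0).
pub fn run_iteration(base_seed: u64, index: u64, draws: usize) -> f64 {
    let mut rng = SplitMix64::new(iteration_seed(base_seed, index));
    (0..draws).map(|_| rng.next_f64()).sum::<f64>() / draws as f64
}

/// Runs every iteration on its own thread and collects results in index order.
pub fn run_parallel(base_seed: u64, iterations: u64, draws: usize) -> Vec<f64> {
    let handles: Vec<_> = (0..iterations)
        .map(|i| thread::spawn(move || run_iteration(base_seed, i, draws)))
        .collect();
    handles
        .into_iter()
        .map(|h| h.join().expect("worker thread panicked"))
        .collect()
}
```

Because each iteration owns its generator, a sequential run and a parallel run with the same base seed produce identical per-iteration results.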

Ecosystem Context

This project is part of a seven-tool network automation ecosystem. netflowsim provides flow-based traffic analysis — the “analyze” stage of the pipeline.

Role: Validate network capacity and performance at scale using analytic queuing models and Monte Carlo simulation. Consume topologies and traffic demands from topogen; consume FIBs from netsim for path tracing.

Key integration points:

Consumes GeoJSON topology and traffic CSV from topogen
Consumes (planned) FIB routing matrices from netsim for post-simulation traffic analysis
Exports GeoJSON with link utilization statistics for netvis geographic rendering
CLI: netflowsim simulate | generate-routing | verify-routing

Architecture documents:

Ecosystem Architecture Overview — full ecosystem design, data flow, workflows
Ecosystem Critical Review — maturity assessment, integration gaps, strategic priorities

Key Decisions

Decision	Rationale	Outcome	Status
Recursive path tracing with ECMP	Handles multi-path routing correctly	Works well, cycle detection robust	✓ Good
Interface-to-link resolution via subnet matching	Automates FIB-to-topology mapping	Eliminates manual configuration	✓ Good
Nearest-rank percentiles	Avoids interpolation complexity	Simple, robust, accurate	✓ Good
Node bottleneck scoring: 1.0 - ∏(1-p)	Captures “at least one incident link congested”	Identifies aggregate hotspots	✓ Good
StableGraph for dynamic mutations	Enables runtime topology changes	Zero breaking changes to earlier phases	✓ Good
Separate schemas for static/dynamic	Different modes track different metrics	Clean separation, documented limitation	✓ Good
Warn for Pareto α ≤ 2 (infinite variance)	Retains user flexibility for heavy-tailed exploration	Allows analysis with caveats	✓ Good
Automatic CV² calculation via distribution methods	Eliminates manual input errors	Correct queuing theory application	✓ Good
Deterministic seeded traffic for comparison mode	Ensures fair cross-model results	Reproducible performance comparisons	✓ Good
Config-first merge (config → CLI → defaults)	Deterministic merge order for errors	Clear validation feedback	✓ Good
Dual persistence (embedded + sidecar run_config)	Self-contained results + easy extraction	Perfect reproducibility	✓ Good
Additive v1.1 schema with serde defaults	v1.0 backward compatibility	Seamless version migration	✓ Good
[v2.0] DHAT profiling via feature flag	Avoids production overhead while enabling allocation tracking	No runtime penalty, profiling when needed	✓ Good
[v2.0] Tick(u64) as unified time index	Single time axis works across Monte Carlo and dynamic modes	Simplifies time-series collection	✓ Good
[v2.0] Adaptive sampling + fixed point budget	Min-interval plus change-triggered emission with downsampling	Prevents memory explosion at scale	✓ Good
[v2.0] Bottleneck threshold util>=0.80	Industry standard threshold balances sensitivity with actionability	Effective bottleneck detection	✓ Good
[v2.0] Manual Pearson correlation (~30 LOC)	Avoids external dependency bloat (linfa/polars)	Lightweight, maintainable	✓ Good
[v2.0] Adaptive max_lag using 2× median interval	Prevents false positives in correlation analysis	Robust causality detection	✓ Good
[v2.0] Manual linear regression (least squares)	Avoids linfa/polars for simple use case	~40 LOC, no external deps	✓ Good
[v2.0] RegionLocalityConfig with locality_factor (0.0-1.0)	Controls same-region traffic bias	Flexible geo-distributed patterns	✓ Good
[v2.0] Graceful degradation for nodes without region_id	Enables parallel plan execution and backward compatibility	Seamless v1.x → v2.0 migration	✓ Good
[v2.0] LatencyZone enum for WAN links	Type-safe representation prevents invalid values	Clear inter-region link classification	✓ Good
[v2.0] Optimized pairwise coverage (30 scenarios vs 540)	Keeps validation time under 40 minutes	Comprehensive validation without exhaustive tests	✓ Good
[v2.0] Document failures honestly	SCALE-02 and SCALE-03 marked FAILED based on empirical evidence	Transparent performance characteristics	⚠️ Revisit (v2.1 optimization)
[v2.0] Accept partial phase goal achievement	1/3 performance targets met (runtime), 2/3 failed (memory, throughput)	Functional completeness prioritized over perf	⚠️ Revisit (v2.1 optimization)
[v2.0] Arc for immutable flow fields	Eliminates 300k allocations in Monte Carlo hot path	Improved baseline performance	✓ Good
[v2.0] HashMap::with_capacity() for known sizes	Eliminates reallocation overhead at scale	Reduced allocation churn	✓ Good
[v2.0] Iteration-specific seed derivation (base_seed + index)	Deterministic parallel execution without rayon hooks	Reproducible parallel Monte Carlo	✓ Good
[v2.0] Restart-with-seed resume semantics	Simpler than incremental, avoids rayon hooks	Post-execution checkpoints complete	— Pending (mid-execution deferred)
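
The node bottleneck score in the table, 1.0 - ∏(1-p), is the probability that at least one incident link is congested, assuming the per-link congestion probabilities are independent. A minimal sketch with a hypothetical function name:

```rust
/// Node bottleneck score: 1 - product over incident links of (1 - p_i),
/// where p_i is the congestion probability of link i.
/// Inputs are clamped to [0, 1]; a node with no links scores 0.
/// Assumes link congestion events are independent of each other.
pub fn node_bottleneck_score(link_congestion_probs: &[f64]) -> f64 {
    let none_congested: f64 = link_congestion_probs
        .iter()
        .map(|p| 1.0 - p.clamp(0.0, 1.0))
        .product();
    1.0 - none_congested
}
```

Two links that are each congested half of the time give a node score of 0.75, higher than either link alone, which is how the score surfaces aggregate hotspots.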

Last updated: 2026-03-01 after v2.1 milestone start

Current Status

2026-03-08: tests pass