Brownfield Ingestion
Concept
A vendor translation layer that decouples network configuration from vendor-specific syntax. It uses LLM-powered RAG to extract network-level intent and topology relationships from vendor documentation and CLI configurations, then normalizes them into a vendor-neutral topology graph model. The intermediate representation is topology-centric (protocol adjacencies, link roles, VLAN membership) rather than device-centric like YANG, which is what enables genuine vendor abstraction.
Architecture
The pipeline has four stages:
- Document ingestion: PDF/HTML vendor manuals are converted to Markdown and indexed into a vector database (ChromaDB) for RAG retrieval. Parsing is dual-engine: pymupdf4llm for fast extraction, MinerU as a layout-aware fallback.
- Intent extraction: LLM + RAG extracts topology-level relationships from unstructured documentation and CLI configurations. Every extraction carries a confidence score and an evidence citation.
- Human-in-the-loop review: Low-confidence extractions are routed to a web UI for operator review. Corrections feed back to improve model accuracy.
- Configuration generation: The topology model is compiled to vendor-specific CLI (Cisco IOS, Arista EOS). Batfish validates semantic correctness: compiled configs are simulated to verify that they produce the intended forwarding behavior.
Built with Python 3.12, FastAPI, ChromaDB, LangChain.
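As a rough illustration of the configuration-generation stage, the sketch below renders one OSPF-enabled link from a vendor-neutral record into Cisco IOS and Arista EOS syntax. The `OspfLink` type, the renderer functions, and the `RENDERERS` table are hypothetical names invented for this example and are not the project's actual compiler. A real compiler would also emit interfaces, router IDs, and other settings, and the Batfish validation step is not shown.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class OspfLink:
    """Hypothetical IR fragment: one routed subnet participating in OSPF."""
    subnet: str          # e.g. "10.0.0.0/30"
    area: int
    process_id: int = 1


def render_cisco_ios(link: OspfLink) -> str:
    net = ipaddress.ip_network(link.subnet)
    # IOS network statements take a wildcard (inverted) mask.
    return "\n".join([
        f"router ospf {link.process_id}",
        f" network {net.network_address} {net.hostmask} area {link.area}",
    ])


def render_arista_eos(link: OspfLink) -> str:
    net = ipaddress.ip_network(link.subnet)
    # EOS network statements take a prefix; the area is written in dotted form here.
    dotted_area = ipaddress.IPv4Address(link.area)
    return "\n".join([
        f"router ospf {link.process_id}",
        f"   network {net} area {dotted_area}",
    ])


# One vendor-neutral record, several vendor-specific renderings.
RENDERERS = {
    "cisco_ios": render_cisco_ios,
    "arista_eos": render_arista_eos,
}


def compile_link(link: OspfLink, vendor: str) -> str:
    return RENDERERS[vendor](link)


link = OspfLink(subnet="10.0.0.0/30", area=0)
for vendor in RENDERERS:
    print(f"! {vendor}")
    print(compile_link(link, vendor))
```

The same `OspfLink` produces a wildcard-mask statement for IOS and a prefix-length statement for EOS. That difference is the kind of device-level detail the topology model is meant to keep out of the intermediate representation.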
Quick Facts
| Status | Recently Updated |
| Stack | Python |
What This Is
A network automation framework that decouples network configuration from vendor-specific syntax. It uses LLM-powered RAG to extract network-level intent and topology relationships from vendor documentation and CLI configurations, normalizing them into a vendor-neutral topology graph model inspired by AutoNetKit. The system enables cross-vendor configuration generation and validation through semantic simulation.
Core Value
Extract network-level topology relationships (protocol adjacencies, link roles, VLAN membership) from vendor-specific CLI and documentation with high accuracy, enabling truly vendor-independent network configuration.
Current Milestone: v2.0 Production-Grade Translation Layer
Goal: Become the universal, production-ready vendor translation layer for the network automation ecosystem
Target features:
- Broader vendor coverage (Juniper, Nokia, F5, Palo Alto) and protocols (MPLS, SR, SRv6)
- Higher extraction accuracy with reduced HITL burden
- Ecosystem integration (RFC-01 compliance, autonetkit interop)
- Production hardening (observability, batch processing, error handling)
Requirements
# Validated
v1.0 Core Pipeline (shipped 2026-02-22):
- ✓ PDF/HTML to Markdown conversion for vendor manuals — v1.0 Phase 1
- ✓ Vector database indexing for RAG retrieval — v1.0 Phase 1
- ✓ Layout-aware parsing for CLI syntax tables — v1.0 Phase 1
- ✓ Extract network-level intent and relationships from natural language using RAG — v1.0 Phase 2
- ✓ Parse vendor-specific CLI into vendor-neutral topology model — v1.0 Phase 2
- ✓ Validate LLM outputs against topology schema to prevent hallucinations — v1.0 Phase 2
- ✓ Calculate confidence scores for every extraction — v1.0 Phase 2
- ✓ Flag low-confidence extractions for human review — v1.0 Phase 3
- ✓ Web-based UI for viewing, editing, and approving extracted topology — v1.0 Phase 3
- ✓ Feedback loop from human corrections to improve model performance — v1.0 Phase 3
- ✓ Generate vendor-specific CLI from topology model — v1.0 Phase 4
- ✓ Support cross-vendor config generation (Cisco IOS, Arista EOS) — v1.0 Phase 4
- ✓ Integrate Batfish for semantic validation — v1.0 Phase 4
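The schema-validation requirement above can be pictured with a minimal sketch like the following, which checks one LLM-produced record before it is allowed into the topology model. The field names (`relation`, `endpoints`, `confidence`, `evidence`) and the set of allowed relations are assumptions made for illustration; the project's real schema is not shown in this document.

```python
import json

# Assumed relation vocabulary, taken from the examples named in this document.
ALLOWED_RELATIONS = {"ospf_adjacency", "bgp_peering", "vlan_membership"}


def validate_extraction(raw: str, known_devices: set[str]) -> list[str]:
    """Return a list of problems; an empty list means the record passes."""
    try:
        record = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(record, dict):
        return ["top-level value must be an object"]

    problems = []

    relation = record.get("relation")
    if not isinstance(relation, str) or relation not in ALLOWED_RELATIONS:
        problems.append(f"unknown relation: {relation!r}")

    endpoints = record.get("endpoints")
    if not (isinstance(endpoints, list) and len(endpoints) == 2):
        problems.append("endpoints must be a list of two device names")
    else:
        for name in endpoints:
            # A device the model invented is a typical hallucination symptom.
            if not isinstance(name, str) or name not in known_devices:
                problems.append(f"unknown device: {name!r}")

    confidence = record.get("confidence")
    is_number = isinstance(confidence, (int, float)) and not isinstance(confidence, bool)
    if not is_number or not 0.0 <= confidence <= 1.0:
        problems.append("confidence must be a number in [0, 1]")

    if not record.get("evidence"):
        problems.append("evidence citation is missing")

    return problems


good = json.dumps({
    "relation": "ospf_adjacency",
    "endpoints": ["r1", "r2"],
    "confidence": 0.93,
    "evidence": "r1 running-config, router ospf stanza",  # made-up citation
})
print(validate_extraction(good, known_devices={"r1", "r2"}))
```

Returning a list of problems instead of raising on the first one lets a reviewer see everything that is wrong with a record in one pass.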
# Active
v2.0 (in progress):
- (To be defined during requirements gathering)
# Out of Scope
- Real-time streaming telemetry integration — defer to future milestone
- On-device configuration deployment/rollback — compilation only, not deployment
- Network discovery/topology mapping from live networks — focuses on configuration abstraction, not discovery
- Support for legacy/EoL vendor platforms — focuses on modern platforms with good documentation
Context
v1.0 Status (shipped 2026-02-22): The full pipeline works end-to-end. v1.0 showed that LLM-powered extraction with RAG and human-in-the-loop review can translate vendor CLI to and from the topology IR, and the system was validated with real-world configs. Built with Python 3.12, FastAPI, ChromaDB, LangChain, and LLM APIs (Claude/GPT-4).
Ecosystem Position: This tool is the vendor translation layer in a larger network automation ecosystem (automationarch). It consumes vendor documentation and CLI, produces topology IR that feeds into autonetkit for modeling/simulation/visualization. Complementary to (not overlapping with) tools like autonetkit-config (design/compilation), netsim (protocol simulation), and netvis (visualization).
Key architectural insight: The intermediate representation is a topology-centric graph model, NOT a device-centric model like YANG. Network-level relationships (OSPF adjacencies, BGP peerings) are genuinely vendor-independent, while device-level configuration varies wildly across vendors. This enables true vendor abstraction.
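To make the topology-centric idea concrete, here is a minimal sketch of what such a graph model could look like: relationships between devices are first-class edges, and no vendor syntax appears anywhere. The `Relation`, `Edge`, and `Topology` names are hypothetical and do not reflect the project's actual IR.

```python
from dataclasses import dataclass, field
from enum import Enum


class Relation(Enum):
    # Hypothetical relation types, taken from the examples in the text.
    OSPF_ADJACENCY = "ospf_adjacency"
    BGP_PEERING = "bgp_peering"
    VLAN_MEMBERSHIP = "vlan_membership"


@dataclass(frozen=True)
class Edge:
    a: str
    b: str
    relation: Relation
    attrs: tuple[tuple[str, str], ...] = ()


@dataclass
class Topology:
    nodes: set[str] = field(default_factory=set)
    edges: list[Edge] = field(default_factory=list)

    def relate(self, a: str, b: str, relation: Relation, **attrs) -> Edge:
        """Record a network-level relationship between two nodes."""
        self.nodes.update((a, b))
        frozen = tuple(sorted((k, str(v)) for k, v in attrs.items()))
        edge = Edge(a, b, relation, frozen)
        self.edges.append(edge)
        return edge

    def neighbors(self, node: str, relation: Relation) -> set[str]:
        """Nodes that share the given relationship with `node`."""
        found = set()
        for edge in self.edges:
            if edge.relation is relation and node in (edge.a, edge.b):
                found.add(edge.b if edge.a == node else edge.a)
        return found


topo = Topology()
topo.relate("r1", "r2", Relation.OSPF_ADJACENCY, area=0)
topo.relate("r2", "r3", Relation.OSPF_ADJACENCY, area=0)
topo.relate("r1", "r3", Relation.BGP_PEERING, local_as=65001, remote_as=65002)

print(sorted(topo.neighbors("r2", Relation.OSPF_ADJACENCY)))
```

Queries such as "who is r2 adjacent to over OSPF" are answered from the graph alone. In a device-centric model, the same answer would have to be reassembled from each device's own configuration.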
v1.0 Learnings: LLM + RAG is viable for extracting topology-level intent from unstructured documentation. Human-in-the-loop is essential to manage hallucination risks. Batfish provides semantic validation to ensure compiled configs behave correctly. Confidence scoring and evidence citation are critical for production use.
Constraints
- Tech stack: Python 3.12, FastAPI, ChromaDB, LangChain, MinerU — established in Phase 1
- LLM costs: Token usage must be monitored; consider local models for high-volume extraction
- Hallucination risk: LLM outputs must include confidence scores and require validation before use in production
- Vendor coverage: Initial focus on Cisco IOS/IOS-XE and Arista EOS (most common enterprise platforms)
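The hallucination-risk constraint is usually handled by splitting extractions into an auto-accepted set and a human-review queue. The sketch below shows one way that routing could work. The `Extraction` type, the `route` function, and the 0.85 threshold are assumptions for illustration only; the source does not state a cut-off value, and the sample records are made up.

```python
from dataclasses import dataclass

REVIEW_THRESHOLD = 0.85  # assumed cut-off; not taken from the project


@dataclass(frozen=True)
class Extraction:
    relation: str
    endpoints: tuple[str, str]
    confidence: float
    evidence: str  # citation into a document or config; empty if none was found


def route(extractions, threshold=REVIEW_THRESHOLD):
    """Split extractions into (accepted, needs_review)."""
    accepted, needs_review = [], []
    for item in extractions:
        # A missing citation forces review no matter how confident the model is.
        if item.confidence >= threshold and item.evidence.strip():
            accepted.append(item)
        else:
            needs_review.append(item)
    # Show operators the least certain extractions first.
    needs_review.sort(key=lambda e: e.confidence)
    return accepted, needs_review


items = [
    Extraction("ospf_adjacency", ("r1", "r2"), 0.97, "r1.cfg, router ospf stanza"),
    Extraction("bgp_peering", ("r1", "r3"), 0.62, "vendor manual, BGP chapter"),
    Extraction("vlan_membership", ("sw1", "vlan10"), 0.91, ""),
]
accepted, needs_review = route(items)
print(len(accepted), "accepted,", len(needs_review), "queued for review")
```

Treating an empty evidence field as an automatic review trigger keeps a high score from letting an uncited claim through. This fits the constraint that outputs need validation before production use.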
Key Decisions
| Decision | Rationale | Outcome |
|---|---|---|
| Topology-centric IR (not YANG) | YANG is device-centric; network relationships are truly vendor-independent | ✓ Good — enables genuine vendor abstraction |
| RAG + LLM for extraction | Handles diverse, unstructured vendor documentation better than rule-based parsers | ✓ Good — v1.0 validated with real-world configs |
| Batfish for validation | Industry-standard network simulator, validates semantic correctness | ✓ Good — v1.0 integration working, optional behind flag |
| Dual-engine PDF parsing | pymupdf4llm fast path + MinerU layout-aware fallback | ✓ Good — handles diverse PDF formats |
| ChromaDB vector store | Lightweight, embedded, good for RAG workloads | ✓ Good — fast retrieval, stable |
| HITL for quality assurance | LLMs hallucinate; human review essential for production | ✓ Good — v1.0 demonstrated viability with review UI |
| Confidence + evidence citation | Every extraction needs confidence score and doc/config evidence | ✓ Good — enables intelligent routing to HITL |
| Ecosystem integration focus | Translation layer only, not orchestration/intent/deployment | ✓ Good — clear boundaries with automationarch tools |
Last updated: 2026-02-22 after v1.0 completion and v2.0 milestone initialization
Current Status
2026-03-05 — Completed 07-01-PLAN.md