System Architecture
The Semantic Router implements a Mixture-of-Models (MoM) architecture built on Envoy Proxy, with an External Processor (ExtProc) service that supplies the intelligent routing logic. The design targets the performance, scalability, and maintainability required for production LLM deployments.
High-Level Architecture Overview
Envoy Proxy sits in front of the model backends and consults the ExtProc service over gRPC for every request; the sections below describe each component in turn.
Core Components
1. Envoy Proxy - Traffic Management Layer
Role: Acts as the entry point and traffic director for all LLM requests.
Key Responsibilities:
- Load Balancing: Distributes requests across backend model endpoints
- Health Checking: Monitors backend model availability and health
- Request/Response Processing: Handles HTTP protocol management
- Header Management: Manages routing headers set by the ExtProc service
- Timeout Management: Configures appropriate timeouts for different model types
Configuration Highlights:
# Envoy listener configuration
listeners:
- name: listener_0
  address:
    socket_address:
      address: 0.0.0.0
      port_value: 8801  # Main entry point
http_filters:
- name: envoy.filters.http.ext_proc
  typed_config:
    grpc_service:
      envoy_grpc:
        cluster_name: extproc_service
    processing_mode:
      request_header_mode: "SEND"      # Send headers for routing decisions
      response_header_mode: "SEND"     # Process response headers  
      request_body_mode: "BUFFERED"    # Analyze request content
      response_body_mode: "BUFFERED"   # Process response content
2. Semantic Router ExtProc Service - Intelligence Layer
Role: The brain of the system that makes intelligent routing decisions.
Architecture:
type OpenAIRouter struct {
    Config               *config.RouterConfig
    CategoryDescriptions []string
    Classifier           *classification.Classifier    // ModernBERT-based
    PIIChecker           *pii.PolicyChecker           // Privacy protection
    Cache                *cache.SemanticCache         // Performance optimization
    ToolsDatabase        *tools.ToolsDatabase         // Tool selection
    
    pendingRequests     map[string][]byte             // Request tracking
    pendingRequestsLock sync.Mutex                    // Thread safety
}
Processing Pipeline:
Each buffered request passes through the components above before a routing decision is returned to Envoy:
- Security screening: the jailbreak guard and PII checker evaluate the prompt
- Cache lookup: the semantic cache can answer similar queries without hitting a backend
- Classification: the category classifier determines the query type
- Tool selection: relevant tools are chosen from the tools database
- Routing: the selected endpoint is communicated back to Envoy via headers
3. Classification System - Decision Engine
The classification system uses ModernBERT models for multiple classification tasks:
Category Classification
Incoming queries are assigned to one of ten categories (math, creative, code, and so on), and the category drives endpoint selection.
Multi-Task Architecture
# Conceptual model architecture
class SemanticRouter:
    def __init__(self):
        self.category_classifier = ModernBERTForSequenceClassification(
            num_labels=10  # Math, Creative, Code, etc.
        )
        self.pii_detector = ModernBERTForTokenClassification(
            num_labels=6   # PERSON, EMAIL, PHONE, SSN, LOCATION, NO_PII
        )
        self.jailbreak_guard = ModernBERTForSequenceClassification(
            num_labels=2   # Benign, Jailbreak
        )
        
    def route_request(self, query):
        # Multi-task inference
        category = self.category_classifier(query)
        pii_entities = self.pii_detector(query)  
        safety_score = self.jailbreak_guard(query)
        
        return self.make_routing_decision(category, pii_entities, safety_score)
Data Flow Architecture
Request Processing Flow
A request enters through the Envoy listener, which sends the headers and buffered body to the ExtProc service. The router runs its security checks, cache lookup, and classification, then returns routing headers that Envoy uses to forward the request to the selected backend endpoint.
Response Processing Flow
The backend response flows back through Envoy to the ExtProc service, which can record metrics and update the semantic cache before the response reaches the client.
Threading and Concurrency Model
Go ExtProc Server Concurrency
// Server handles multiple concurrent connections
func (s *Server) Start() error {
    lis, err := net.Listen("tcp", fmt.Sprintf(":%d", s.port))
    if err != nil {
        return fmt.Errorf("failed to listen on port %d: %w", s.port, err)
    }
    s.server = grpc.NewServer()
    ext_proc.RegisterExternalProcessorServer(s.server, s.router)
    
    // gRPC handles concurrency automatically:
    // each Process stream runs in its own goroutine
    return s.server.Serve(lis)
}
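// For completeness, a sketch of an entry point wiring the server together.
// The field names mirror the struct usage above; the port value and router
// initialization are assumptions, not from the source.
func main() {
    srv := &Server{
        port:   50051,
        router: &OpenAIRouter{},
    }
    if err := srv.Start(); err != nil {
        log.Fatalf("ext_proc server exited: %v", err)
    }
}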
// Process handles individual request streams
func (r *OpenAIRouter) Process(stream ext_proc.ExternalProcessor_ProcessServer) error {
    // Each stream runs in its own goroutine
    ctx := &RequestContext{
        Headers: make(map[string]string),
    }
    
    for {
        req, err := stream.Recv()
        if err == io.EOF {
            return nil // Client closed the stream
        }
        if err != nil {
            return err
        }

        // Process request with thread-safe operations
        switch req.Request.(type) {
        case *ext_proc.ProcessingRequest_RequestHeaders:
            // Handle request headers (populate ctx.Headers)
        case *ext_proc.ProcessingRequest_RequestBody:
            // Handle request body - where classification happens
        case *ext_proc.ProcessingRequest_ResponseHeaders:
            // Handle response headers
        }
    }
}
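Routing decisions travel back to Envoy as header mutations on the same stream. A hedged sketch of what the request-body case might send, using the Envoy ext_proc and core v3 types (the x-selected-model header name and selectedModel value are illustrative, not from the source):
// Sketch: respond to the buffered request body with a routing header.
// core refers to the Envoy config core v3 package
// (github.com/envoyproxy/go-control-plane/envoy/config/core/v3).
selectedModel := "endpoint1" // In practice, the output of the classification pipeline
resp := &ext_proc.ProcessingResponse{
    Response: &ext_proc.ProcessingResponse_RequestBody{
        RequestBody: &ext_proc.BodyResponse{
            Response: &ext_proc.CommonResponse{
                HeaderMutation: &ext_proc.HeaderMutation{
                    SetHeaders: []*core.HeaderValueOption{{
                        Header: &core.HeaderValue{Key: "x-selected-model", Value: selectedModel},
                    }},
                },
            },
        },
    },
}
if err := stream.Send(resp); err != nil {
    return err
}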
Thread Safety Considerations
type OpenAIRouter struct {
    // Thread-safe components
    Classifier           *classification.Classifier    // Read-only after init
    PIIChecker           *pii.PolicyChecker           // Read-only after init
    Cache                *cache.SemanticCache         // Internally synchronized
    
    // Mutable state with protection
    pendingRequests     map[string][]byte
    pendingRequestsLock sync.Mutex                    // Protects pendingRequests
}
// Thread-safe request tracking
func (r *OpenAIRouter) trackRequest(id string, body []byte) {
    r.pendingRequestsLock.Lock()
    defer r.pendingRequestsLock.Unlock()
    r.pendingRequests[id] = body
}
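On the response path, the tracked request is retrieved and removed under the same lock. A sketch (the method name is illustrative, not from the source):
// takeRequest returns and clears the tracked request body for id,
// reporting whether an entry existed (hypothetical helper).
func (r *OpenAIRouter) takeRequest(id string) ([]byte, bool) {
    r.pendingRequestsLock.Lock()
    defer r.pendingRequestsLock.Unlock()
    body, ok := r.pendingRequests[id]
    if ok {
        delete(r.pendingRequests, id)
    }
    return body, ok
}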
Performance Characteristics
Latency Analysis
| Component | Typical Latency | Optimization | 
|---|---|---|
| Envoy Routing | 0.5-2ms | Optimized configuration | 
| ExtProc gRPC | 1-3ms | Local network communication | 
| PII Detection | 5-15ms | ModernBERT token classification | 
| Jailbreak Guard | 3-8ms | ModernBERT binary classification | 
| Category Classification | 8-20ms | ModernBERT sequence classification | 
| Cache Lookup | 0.1-0.5ms | Redis/in-memory cache | 
| Total Overhead | 15-50ms | Acceptable for most use cases | 
Throughput Optimization
// Batch processing for efficiency
type BatchProcessor struct {
    batchSize    int
    batchTimeout time.Duration
    classifier   *classification.Classifier
}
func (bp *BatchProcessor) processBatch(queries []string) []Classification {
    // Process multiple queries together for better GPU utilization
    return bp.classifier.ClassifyBatch(queries)
}
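The struct above leaves the batching trigger implicit. One way to drive it, flushing when the batch fills or the timeout elapses (the run method and channel names are illustrative, not from the source):
// run collects queries until the batch is full or the timeout fires,
// then classifies them together for better GPU utilization.
func (bp *BatchProcessor) run(queries <-chan string, results chan<- []Classification) {
    batch := make([]string, 0, bp.batchSize)
    timer := time.NewTimer(bp.batchTimeout)
    defer timer.Stop()
    for {
        select {
        case q, ok := <-queries:
            if !ok {
                if len(batch) > 0 {
                    results <- bp.processBatch(batch) // Flush the remainder on shutdown
                }
                return
            }
            batch = append(batch, q)
            if len(batch) >= bp.batchSize {
                results <- bp.processBatch(batch) // Full batch: flush immediately
                batch = batch[:0]
            }
        case <-timer.C:
            if len(batch) > 0 {
                results <- bp.processBatch(batch) // Timeout: flush a partial batch
                batch = batch[:0]
            }
            timer.Reset(bp.batchTimeout)
        }
    }
}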
Memory Usage
| Component | Memory Usage | Notes | 
|---|---|---|
| ModernBERT Models | ~400MB each | Loaded once, shared across requests | 
| Envoy Process | ~100-200MB | Depends on configuration | 
| Go ExtProc Server | ~50-100MB | Scales with concurrent requests | 
| Semantic Cache | ~500MB-2GB | Configurable, depends on cache size | 
| Total System | ~1.5-3GB | Reasonable for production deployment | 
Configuration Management
Router Configuration Structure
# config/config.yaml
router:
  # Model endpoints configuration
  endpoints:
    endpoint1:
      url: "http://127.0.0.1:11434"
      model_type: "math"
      cost_per_token: 0.002
      max_tokens: 4096
      
    endpoint2:
      url: "http://127.0.0.1:11434" 
      model_type: "creative"
      cost_per_token: 0.003
      max_tokens: 8192
      
    endpoint3:
      url: "http://127.0.0.1:11434"
      model_type: "general"
      cost_per_token: 0.01
      max_tokens: 4096
  # Classification thresholds
  classification:
    confidence_threshold: 0.7
    fallback_model: "general"
    
  # Security settings
  security:
    enable_pii_detection: true
    enable_jailbreak_guard: true
    pii_action: "block"  # block, mask, or allow
    
  # Caching configuration
  cache:
    enabled: true
    similarity_threshold: 0.85
    ttl_seconds: 3600
    max_entries: 10000
  # Tools configuration
  tools:
    auto_selection: true
    max_tools: 5
    relevance_threshold: 0.6
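A minimal sketch of loading this file into Go, assuming gopkg.in/yaml.v3 and struct fields that mirror the YAML above (the real RouterConfig definition is not shown in this section):
import (
    "os"

    "gopkg.in/yaml.v3"
)

// RouterConfig mirrors config/config.yaml (abbreviated for illustration).
type RouterConfig struct {
    Router struct {
        Endpoints map[string]struct {
            URL          string  `yaml:"url"`
            ModelType    string  `yaml:"model_type"`
            CostPerToken float64 `yaml:"cost_per_token"`
            MaxTokens    int     `yaml:"max_tokens"`
        } `yaml:"endpoints"`
        Classification struct {
            ConfidenceThreshold float64 `yaml:"confidence_threshold"`
            FallbackModel       string  `yaml:"fallback_model"`
        } `yaml:"classification"`
    } `yaml:"router"`
}

// LoadConfig reads and parses the YAML configuration file.
func LoadConfig(path string) (*RouterConfig, error) {
    data, err := os.ReadFile(path)
    if err != nil {
        return nil, err
    }
    var cfg RouterConfig
    if err := yaml.Unmarshal(data, &cfg); err != nil {
        return nil, err
    }
    return &cfg, nil
}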
Dynamic Configuration Updates
// Configuration hot-reloading
type ConfigManager struct {
    config     *RouterConfig
    configLock sync.RWMutex
    watchers   []ConfigWatcher
}
func (cm *ConfigManager) UpdateConfig(newConfig *RouterConfig) error {
    cm.configLock.Lock()
    defer cm.configLock.Unlock()
    
    // Validate new configuration
    if err := newConfig.Validate(); err != nil {
        return err
    }
    
    // Apply configuration
    cm.config = newConfig
    
    // Notify all watchers
    for _, watcher := range cm.watchers {
        watcher.OnConfigUpdate(newConfig)
    }
    
    return nil
}
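Readers of the hot-reloaded configuration take the read side of the lock, so many requests can consult the config concurrently while updates remain exclusive. A small illustrative accessor:
// GetConfig returns the current configuration under a read lock
// (hypothetical helper; not shown in the source).
func (cm *ConfigManager) GetConfig() *RouterConfig {
    cm.configLock.RLock()
    defer cm.configLock.RUnlock()
    return cm.config
}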
Error Handling and Resilience
Circuit Breaker Pattern
type CircuitBreaker struct {
    maxFailures   int
    resetTimeout  time.Duration
    state         CircuitState
    failures      int
    lastFailTime  time.Time
    mutex         sync.Mutex
}
func (cb *CircuitBreaker) Call(operation func() error) error {
    cb.mutex.Lock()
    defer cb.mutex.Unlock()
    
    if cb.state == StateOpen {
        if time.Since(cb.lastFailTime) > cb.resetTimeout {
            cb.state = StateHalfOpen
        } else {
            return errors.New("circuit breaker is open")
        }
    }
    
    err := operation()
    if err != nil {
        cb.onFailure()
    } else {
        cb.onSuccess()
    }
    
    return err
}
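The state constants and transition helpers referenced above are not shown in the source; one plausible completion:
type CircuitState int

const (
    StateClosed CircuitState = iota
    StateOpen
    StateHalfOpen
)

func (cb *CircuitBreaker) onFailure() {
    cb.failures++
    cb.lastFailTime = time.Now()
    if cb.failures >= cb.maxFailures {
        cb.state = StateOpen // Trip the breaker after repeated failures
    }
}

func (cb *CircuitBreaker) onSuccess() {
    cb.failures = 0
    cb.state = StateClosed // Any success, including a half-open probe, closes the breaker
}
Note that this simplified breaker holds the mutex for the duration of operation(), which serializes all protected calls; a production version would release the lock around the call itself.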
Fallback Strategies
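When classification confidence falls below the configured threshold, the router falls back to the general-purpose model rather than guessing at a specialty endpoint. A minimal sketch of that decision, assuming the confidence_threshold and fallback_model settings from the configuration above (all names here are illustrative):
// selectModel is a hypothetical helper showing threshold-based fallback.
func selectModel(category string, confidence, threshold float64,
    categoryModels map[string]string, fallback string) string {
    if model, ok := categoryModels[category]; ok && confidence >= threshold {
        return model // Confident classification: use the specialized endpoint
    }
    return fallback // Low confidence or unknown category: e.g. the "general" endpoint
}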
Monitoring and Observability
Metrics Collection
// Prometheus metrics
var (
    requestsTotal = prometheus.NewCounterVec(
        prometheus.CounterOpts{
            Name: "semantic_router_requests_total",
            Help: "Total number of requests processed",
        },
        []string{"endpoint", "category", "status"},
    )
    
    routingLatency = prometheus.NewHistogramVec(
        prometheus.HistogramOpts{
            Name: "semantic_router_routing_duration_seconds", 
            Help: "Time spent on routing decisions",
            Buckets: prometheus.DefBuckets,
        },
        []string{"component"},
    )
    
    cacheHitRatio = prometheus.NewGaugeVec(
        prometheus.GaugeOpts{
            Name: "semantic_router_cache_hit_ratio",
            Help: "Cache hit ratio for semantic cache",
        },
        []string{"cache_type"},
    )
)
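These collectors still need to be registered and driven from the request path. A brief sketch (the label values and helper name are illustrative):
func init() {
    // Register the collectors with the default Prometheus registry
    prometheus.MustRegister(requestsTotal, routingLatency, cacheHitRatio)
}

// recordRouting is a hypothetical helper called after each routing decision.
func recordRouting(endpoint, category, status string, start time.Time) {
    requestsTotal.WithLabelValues(endpoint, category, status).Inc()
    routingLatency.WithLabelValues("classification").Observe(time.Since(start).Seconds())
}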
Structured Logging
type RequestLogger struct {
    logger *logrus.Logger
}
func (rl *RequestLogger) LogRouting(ctx context.Context, decision *RoutingDecision) {
    rl.logger.WithFields(logrus.Fields{
        "request_id":         ctx.Value("request_id"),
        "category":          decision.Category,
        "confidence":        decision.Confidence,
        "selected_model":    decision.SelectedModel,
        "routing_time_ms":   decision.ProcessingTime.Milliseconds(),
        "pii_detected":      decision.PIIDetected,
        "jailbreak_risk":    decision.JailbreakRisk,
        "cache_hit":         decision.CacheHit,
        "tools_selected":    len(decision.SelectedTools),
    }).Info("Request routed")
}
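A hypothetical call site, with made-up values for illustration:
// Log the outcome of a routing decision; field values are examples only.
rl := &RequestLogger{logger: logrus.New()}
ctx := context.WithValue(context.Background(), "request_id", "req-123")
rl.LogRouting(ctx, &RoutingDecision{
    Category:       "math",
    Confidence:     0.92,
    SelectedModel:  "endpoint1",
    ProcessingTime: 18 * time.Millisecond,
})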
This architecture provides a robust, scalable, and maintainable foundation for intelligent LLM routing. The next section covers the Envoy ExtProc Integration in detail, explaining how the ExtProc protocol works and how our router implements it.