ADR-004: Class-Based LLMRouter
Status: Accepted
Date: 2024-12-01
Deciders: Platform architect

Context

The engine routes LLM requests to different providers based on agent tier (strategy → xAI, writing → Anthropic, analytical → OpenAI, etc.). The original single-user system used module-level functions for routing, which created global state that was difficult to test and impossible to configure per-request.

The multi-tenant platform needed LLM routing that:

  • Supports per-tenant BYOK (Bring Your Own Key) API key overrides
  • Is independently testable without making real LLM calls
  • Handles provider fallback chains (e.g., xAI unavailable → fall back to Claude Sonnet)
  • Can be instantiated multiple times in parallel tests without state conflicts

Decision

Implement the LLM router as a class (LLMRouter) that is instantiated with configuration and optional API key overrides. Each test creates its own router instance. In production, a singleton instance is shared via state.get_router().
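
The production singleton might be implemented along these lines (only `state.get_router()` appears in this ADR; the locking details are an assumption sketched here for illustration):

```python
import threading

class LLMRouter:
    """Stand-in for the real router class; construction details are assumed."""

_router = None
_lock = threading.Lock()

def get_router() -> LLMRouter:
    # Lazy, thread-safe singleton: the first caller constructs the router,
    # every later caller gets the same instance.
    global _router
    if _router is None:
        with _lock:
            if _router is None:  # double-checked: another thread may have won
                _router = LLMRouter()
    return _router
```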

The router accepts api_key_overrides: dict to support BYOK — when a tenant has their own API keys, the router uses those instead of platform keys.
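A minimal sketch of that shape: only the class name `LLMRouter` and the `api_key_overrides` parameter come from this ADR; the routing and fallback tables, the other parameters, and the method names are illustrative assumptions.

```python
# Hypothetical sketch, not the real implementation.
class LLMRouter:
    DEFAULT_ROUTES = {"strategy": "xai", "writing": "anthropic", "analytical": "openai"}
    DEFAULT_FALLBACKS = {"xai": ["anthropic"]}  # e.g. xAI down -> Claude Sonnet

    def __init__(self, platform_keys=None, api_key_overrides=None,
                 routes=None, fallbacks=None):
        self.platform_keys = platform_keys or {}
        self.api_key_overrides = api_key_overrides or {}  # tenant BYOK keys
        self.routes = routes or dict(self.DEFAULT_ROUTES)
        self.fallbacks = fallbacks or dict(self.DEFAULT_FALLBACKS)

    def api_key(self, provider: str):
        # BYOK override wins; otherwise use the platform key, if any.
        return self.api_key_overrides.get(provider, self.platform_keys.get(provider))

    def providers_for(self, tier: str) -> list:
        # Primary provider for the tier, then its fallback chain in order.
        primary = self.routes[tier]
        return [primary, *self.fallbacks.get(primary, [])]
```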

Alternatives Considered

1. Module-level functions with global configuration

The original pattern. Works for single-user but creates problems in multi-tenant: global API keys can't be overridden per-tenant, tests pollute shared state, and parallel execution risks race conditions on global config.

2. Dependency injection framework (e.g., python-inject, dependency-injector)

Would formalize the pattern but adds a dependency for a problem solved by basic class instantiation. Rejected as over-engineering.

3. Configuration-based routing (YAML/JSON)

Route configuration lives in a file, loaded at startup. Doesn't support per-request API key overrides or dynamic fallback chains. Rejected because BYOK requires runtime configuration.

Consequences

Positive:

  • Each test creates its own LLMRouter with mock providers — zero global state pollution
  • BYOK support is clean: pass api_key_overrides at instantiation, router uses them transparently
  • Fallback chains are configurable per-instance
  • Production singleton via state.get_router() is lazy-loaded and thread-safe
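
The isolation property might look like this in a test (the fake provider and the constructor shape are illustrative, not the real API):

```python
class FakeProvider:
    """Test double that records prompts instead of calling a real LLM."""
    def __init__(self):
        self.calls = []

    def complete(self, prompt: str) -> str:
        self.calls.append(prompt)
        return "stub"

# Minimal router stand-in, just enough to show the isolation property.
class LLMRouter:
    def __init__(self, providers: dict):
        self.providers = providers

    def complete(self, tier: str, prompt: str) -> str:
        return self.providers[tier].complete(prompt)

def test_routers_do_not_share_state():
    a = LLMRouter({"writing": FakeProvider()})
    b = LLMRouter({"writing": FakeProvider()})
    a.complete("writing", "draft the intro")
    # b's provider saw nothing: no globals, safe for parallel test runs
    assert b.providers["writing"].calls == []
    assert a.providers["writing"].calls == ["draft the intro"]
```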

Negative:

  • Every module that needs LLM access must receive a router instance; this parameter threading runs through ~14 engine files, which is more plumbing than module-level functions required
  • Singleton pattern in production means the shared router doesn't support per-request BYOK; instead, BYOK overrides are applied at the run level before execution starts

Risks:

  • Router instance lifecycle: if a router is created with BYOK keys and then reused for a different tenant's run, keys could leak. Mitigated by creating a fresh router per run when BYOK keys are present.