Runtime Architecture

LlmGateway owns the runtime components behind an internal shared state:

flowchart TD
    client[SDK Client] --> gateway[LlmGateway]
    gateway --> retry[RetryPipeline]
    retry --> cache[PromptCache]
    cache --> scheduler[ExecutionScheduler]
    scheduler --> runtime[InferenceRuntime]
    runtime --> cloud[autoagents-llm Provider]
    runtime --> local[autoagents-llamacpp Runtime]

Request Flow

1. A chat call enters RetryPipeline.
2. PromptCache checks a canonical key built from the route, messages, tools, schema, and output-affecting parameters.
3. ExecutionScheduler applies queue capacity, per-route concurrency, and deadline checks.
4. The request dispatches to a registered runtime or provider.
5. Successful non-streaming responses are stored in PromptCache.
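
The canonical key used by the cache step could be sketched as follows. This is a minimal illustration, not the gateway's actual implementation: the `ChatRequest` struct, its field names, and the `canonical_key` function are hypothetical, and the real key derivation may use different types or a different hash. The point it shows is that only fields that can change the model output feed the key, and that parameters are hashed in a deterministic order.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::BTreeMap;
use std::hash::{Hash, Hasher};

/// Hypothetical, simplified request shape (assumption, not the real gateway type).
#[derive(Debug, Clone)]
pub struct ChatRequest {
    pub route: String,
    /// (role, content) pairs in conversation order.
    pub messages: Vec<(String, String)>,
    /// Tool names offered to the model.
    pub tools: Vec<String>,
    /// Structured-output schema, if any.
    pub schema: Option<String>,
    /// Output-affecting parameters such as temperature. A BTreeMap iterates
    /// in key order, so insertion order does not change the key.
    pub params: BTreeMap<String, String>,
    /// Assumed not to affect output, so it is left out of the key.
    pub request_id: String,
}

/// Builds an in-process cache key from the output-affecting fields only.
/// DefaultHasher output is not guaranteed stable across Rust releases, so a
/// key like this is only suitable for a cache that lives inside one process.
pub fn canonical_key(req: &ChatRequest) -> u64 {
    let mut hasher = DefaultHasher::new();
    req.route.hash(&mut hasher);
    req.messages.hash(&mut hasher);
    req.tools.hash(&mut hasher);
    req.schema.hash(&mut hasher);
    req.params.hash(&mut hasher);
    hasher.finish()
}
```

Under these assumptions, two requests that differ only in `request_id` map to the same key, while changing a parameter such as `temperature` produces a different one.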

chat_stream currently returns normalized stream events from a completed chat response. It does not yet expose provider-native incremental streaming.
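
To make the current chat_stream behavior concrete, here is a rough sketch of turning a completed response into a short sequence of normalized events. The `StreamEvent` and `ChatResponse` types and the `events_from_response` function are assumptions made for illustration; the gateway's real event types are not shown in this document and likely carry more detail.

```rust
/// Hypothetical normalized event type (assumption, not the real gateway enum).
#[derive(Debug, Clone, PartialEq)]
pub enum StreamEvent {
    /// A chunk of generated text.
    Delta(String),
    /// Terminal event carrying the finish reason.
    Done { finish_reason: String },
}

/// Hypothetical completed, non-streaming chat response.
#[derive(Debug, Clone)]
pub struct ChatResponse {
    pub text: String,
    pub finish_reason: String,
}

/// Replays a completed response as stream events: one Delta holding the
/// full text (skipped when the text is empty), followed by a single Done.
/// Nothing is emitted incrementally, because the response is already complete.
pub fn events_from_response(resp: ChatResponse) -> impl Iterator<Item = StreamEvent> {
    let delta = if resp.text.is_empty() {
        None
    } else {
        Some(StreamEvent::Delta(resp.text))
    };
    delta.into_iter().chain(std::iter::once(StreamEvent::Done {
        finish_reason: resp.finish_reason,
    }))
}
```

A consumer written against events like these would not need to change if provider-native incremental streaming is added later; it would simply receive more than one `Delta`.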

Runtime Boundary

InferenceRuntime is the scheduling boundary. Cloud providers use ProviderRuntime. Local llama.cpp models use the feature-gated LlamaCppRuntime.

This keeps scheduling, cache, retry, and telemetry independent from provider implementation details.
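
The boundary described above can be pictured as a trait that the scheduler dispatches through. The sketch below is illustrative only: the method signatures, the `Registry` type, the struct fields, and the `llamacpp` feature name are all assumptions, and the real trait is probably asynchronous. Only the names InferenceRuntime, ProviderRuntime, and LlamaCppRuntime come from this document.

```rust
use std::collections::HashMap;

/// Sketch of the boundary: the scheduler only sees this trait.
/// Method names and the synchronous signature are assumptions.
pub trait InferenceRuntime {
    fn name(&self) -> &str;
    fn chat(&self, prompt: &str) -> Result<String, String>;
}

/// Stand-in for a cloud-backed runtime; a real one would call a provider.
pub struct ProviderRuntime {
    pub provider: String,
}

impl InferenceRuntime for ProviderRuntime {
    fn name(&self) -> &str {
        &self.provider
    }
    fn chat(&self, prompt: &str) -> Result<String, String> {
        // Placeholder behavior: echo the prompt tagged with the provider name.
        Ok(format!("[{}] {}", self.provider, prompt))
    }
}

/// Compiled only when the (hypothetically named) feature is enabled.
#[cfg(feature = "llamacpp")]
pub struct LlamaCppRuntime {
    pub model_path: String,
}

#[cfg(feature = "llamacpp")]
impl InferenceRuntime for LlamaCppRuntime {
    fn name(&self) -> &str {
        "llamacpp"
    }
    fn chat(&self, prompt: &str) -> Result<String, String> {
        // Placeholder behavior: a real runtime would run local inference.
        Ok(format!("[local:{}] {}", self.model_path, prompt))
    }
}

/// Hypothetical route registry: maps a route name to a registered runtime.
pub struct Registry {
    runtimes: HashMap<String, Box<dyn InferenceRuntime>>,
}

impl Registry {
    pub fn new() -> Self {
        Registry { runtimes: HashMap::new() }
    }

    pub fn register(&mut self, route: &str, runtime: Box<dyn InferenceRuntime>) {
        self.runtimes.insert(route.to_string(), runtime);
    }

    /// Dispatches without knowing which concrete runtime serves the route.
    pub fn dispatch(&self, route: &str, prompt: &str) -> Result<String, String> {
        match self.runtimes.get(route) {
            Some(runtime) => runtime.chat(prompt),
            None => Err(format!("no runtime registered for route '{}'", route)),
        }
    }
}
```

Because `dispatch` depends only on the trait, retry, cache, and scheduling code in a design like this never needs to branch on whether a route is served by a cloud provider or a local model.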
