Runtime Architecture
LlmGateway owns the runtime components behind an internal shared state:
flowchart TD
client[SDK Client] --> gateway[LlmGateway]
gateway --> retry[RetryPipeline]
retry --> cache[PromptCache]
cache --> scheduler[ExecutionScheduler]
scheduler --> runtime[InferenceRuntime]
runtime --> cloud[autoagents-llm Provider]
runtime --> local[autoagents-llamacpp Runtime]
Request Flow
- chat enters RetryPipeline.
- The cache checks a canonical key built from route, messages, tools, schema, and output-affecting parameters.
- ExecutionScheduler applies queue capacity, per-route concurrency, and deadline checks.
- The request dispatches to a registered runtime or provider.
- Successful non-streaming responses are stored in PromptCache.
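The canonical key in the cache step can be sketched as follows. This is a minimal illustration, not the gateway's actual implementation: the ChatRequest struct, its field names, and the cache_key function are hypothetical, and sorting tool names before hashing is an assumption about what "canonical" means here.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical request shape; field names are illustrative only.
#[derive(Clone)]
pub struct ChatRequest {
    pub route: String,
    pub messages: Vec<(String, String)>, // (role, content) pairs
    pub tools: Vec<String>,              // tool names
    pub schema: Option<String>,          // structured-output schema as JSON text
    pub temperature_milli: u32,          // output-affecting; integer so it can be hashed
    pub max_tokens: u32,                 // output-affecting
    pub request_id: String,              // tracing only; deliberately left out of the key
}

/// Hash only the fields that can change the model output.
pub fn cache_key(req: &ChatRequest) -> u64 {
    // Assumption: tool order does not affect output, so sort to canonicalize.
    let mut tools = req.tools.clone();
    tools.sort();

    let mut hasher = DefaultHasher::new();
    req.route.hash(&mut hasher);
    req.messages.hash(&mut hasher); // message order is significant and is kept
    tools.hash(&mut hasher);
    req.schema.hash(&mut hasher);
    req.temperature_milli.hash(&mut hasher);
    req.max_tokens.hash(&mut hasher);
    hasher.finish()
}
```

Two requests that differ only in request_id or in tool ordering map to the same key, while a change to max_tokens produces a different one. DefaultHasher output is not guaranteed to be stable across Rust releases, so a cache that persists across processes would probably want a fixed algorithm such as SHA-256 instead.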
chat_stream currently returns normalized stream events from a completed chat
response. It does not yet expose provider-native incremental streaming.
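To illustrate what replaying a completed response as stream events might look like, here is a small sketch. The StreamEvent enum, its variant names, and the fixed-size chunking are all assumptions made for the example; the gateway may emit a single delta or use a different event shape.

```rust
/// Hypothetical normalized stream event; variant names are illustrative.
#[derive(Debug, Clone, PartialEq)]
pub enum StreamEvent {
    Start,
    Delta(String),
    Done,
}

/// Replays an already completed response text as a sequence of events.
/// No tokens arrive incrementally: the full text exists before the first event.
pub fn events_from_completed(text: &str, chunk_chars: usize) -> Vec<StreamEvent> {
    let size = chunk_chars.max(1); // guard against a zero chunk size
    let chars: Vec<char> = text.chars().collect();

    let mut events = vec![StreamEvent::Start];
    for chunk in chars.chunks(size) {
        // Chunk by characters so multi-byte UTF-8 sequences are never split.
        events.push(StreamEvent::Delta(chunk.iter().collect()));
    }
    events.push(StreamEvent::Done);
    events
}
```

A caller consuming these events sees the same interface it would see with true streaming, but time-to-first-token equals the full response latency.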
Runtime Boundary
InferenceRuntime is the scheduling boundary. Cloud providers use
ProviderRuntime. Local llama.cpp models use the feature-gated
LlamaCppRuntime.
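The boundary described above can be pictured as a trait with one implementation per backend. The sketch below is a simplified, synchronous stand-in: the trait's method, the Registry type, and the stub bodies are hypothetical, and the real InferenceRuntime is likely async with richer request and response types.

```rust
use std::collections::HashMap;

/// Hypothetical trait shape; only the names of the types come from the text above.
pub trait InferenceRuntime {
    fn infer(&self, prompt: &str) -> Result<String, String>;
}

/// Stand-in for a cloud provider backend.
pub struct ProviderRuntime;

/// Stand-in for a local llama.cpp backend.
/// In the real crate this type sits behind a Cargo feature.
pub struct LlamaCppRuntime;

impl InferenceRuntime for ProviderRuntime {
    fn infer(&self, prompt: &str) -> Result<String, String> {
        Ok(format!("cloud:{}", prompt)) // stub output for illustration
    }
}

impl InferenceRuntime for LlamaCppRuntime {
    fn infer(&self, prompt: &str) -> Result<String, String> {
        Ok(format!("local:{}", prompt)) // stub output for illustration
    }
}

/// Hypothetical route-to-runtime registry used by the dispatch step.
pub struct Registry {
    runtimes: HashMap<String, Box<dyn InferenceRuntime>>,
}

impl Registry {
    pub fn new() -> Self {
        Registry { runtimes: HashMap::new() }
    }

    pub fn register(&mut self, route: &str, runtime: Box<dyn InferenceRuntime>) {
        self.runtimes.insert(route.to_string(), runtime);
    }

    /// Dispatch sees only the trait object, never the concrete backend.
    pub fn dispatch(&self, route: &str, prompt: &str) -> Result<String, String> {
        let runtime = self
            .runtimes
            .get(route)
            .ok_or_else(|| format!("no runtime registered for route {}", route))?;
        runtime.infer(prompt)
    }
}
```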
This keeps scheduling, caching, retry, and telemetry independent of provider implementation details.
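As an illustration of that independence, the admission checks listed under Request Flow (queue capacity, per-route concurrency, deadlines) can be written with no reference to any provider type. The Scheduler struct, the Rejection enum, and the choice to reject rather than queue are assumptions for this sketch; the actual ExecutionScheduler may queue requests and will differ in detail.

```rust
use std::collections::HashMap;
use std::time::Instant;

/// Hypothetical reasons a request is not admitted.
#[derive(Debug, PartialEq)]
pub enum Rejection {
    QueueFull,
    RouteBusy,
    DeadlineExceeded,
}

/// Simplified admission control: tracks in-flight requests per route.
pub struct Scheduler {
    queue_capacity: usize,  // cap on total admitted requests
    per_route_limit: usize, // cap on concurrent requests for one route
    in_flight: HashMap<String, usize>,
}

impl Scheduler {
    pub fn new(queue_capacity: usize, per_route_limit: usize) -> Self {
        Scheduler { queue_capacity, per_route_limit, in_flight: HashMap::new() }
    }

    fn total(&self) -> usize {
        self.in_flight.values().sum()
    }

    /// `now` is passed in so the deadline check is easy to test.
    pub fn admit(&mut self, route: &str, deadline: Instant, now: Instant) -> Result<(), Rejection> {
        if now >= deadline {
            return Err(Rejection::DeadlineExceeded);
        }
        if self.total() >= self.queue_capacity {
            return Err(Rejection::QueueFull);
        }
        let slot = self.in_flight.entry(route.to_string()).or_insert(0);
        if *slot >= self.per_route_limit {
            return Err(Rejection::RouteBusy);
        }
        *slot += 1;
        Ok(())
    }

    /// Call when a request finishes so its slot becomes available again.
    pub fn release(&mut self, route: &str) {
        if let Some(count) = self.in_flight.get_mut(route) {
            *count = count.saturating_sub(1);
        }
    }
}
```

The scheduler here knows only route names and deadlines, so swapping a cloud provider for a local model would not change any of this logic.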