ADR-0012: Streaming support

Status: Accepted — 2026-05-23

Context

go-llm's provider interface has a Stream() method, and Ollama's native /api/chat streams token-by-token by default. The synchronous passthrough (ADR-0003) must not break streaming clients. Separately, the async /jobs surface (ADR-0004) reports progress via discrete state webhooks, which is a different granularity than token streaming.

Decision

Sync passthrough: support streaming. When a /api/chat request sets stream: true, foreman streams the target's token deltas back to the caller (SSE/chunked, matching Ollama's native streaming). A streamed job still moves through the queue; streaming begins once the job reaches working, so a job waiting behind the drain-by-model queue (ADR-0009) simply starts streaming when its turn comes. go-llm's Stream() works against foreman unchanged.
Async /jobs surface: no token streaming in v1. Webhooks carry coarse state transitions (ADR-0005) and the final result/artifacts, not per-token deltas. Token-level streaming over a fire-and-forget webhook job is deliberately deferred — it adds a transport (persistent connection or chunked webhook) whose complexity isn't justified yet.

Consequences

Interactive go-llm usage gets real streaming through the transparent surface.
Orchestration callers get state + final artifacts, which is what they need; they can use the sync streaming surface directly if they want tokens.
The job state machine and webhook protocol stay simple (no streaming transport to design or operate).

Alternatives considered

Stream tokens over the async surface too. Deferred: requires either a long-lived connection (defeats the point of async) or chunked-delta webhooks (complex, rarely needed). Revisit only on a concrete need.
No streaming at all. Would break go-llm's Stream() and interactive use on the very path that is the primary goal. Rejected.
