ADR-0012: Streaming support
Status: Accepted — 2026-05-23
Context
go-llm's provider interface has a Stream() method, and Ollama's native
/api/chat streams token-by-token by default. The synchronous passthrough
(ADR-0003) must not break streaming clients. Separately, the async /jobs
surface (ADR-0004) reports progress via discrete state webhooks, which is a
different granularity than token streaming.
Decision
- Sync passthrough: support streaming. When a /api/chat request sets stream: true, foreman streams the target's token deltas back to the caller as they arrive, over a chunked response in the same newline-delimited JSON format as Ollama's native streaming. A streamed job still moves through the queue: streaming begins once the job reaches the working state, so a job waiting behind the drain-by-model queue (ADR-0009) simply starts streaming when its turn comes. go-llm's Stream() works against foreman unchanged.
- Async /jobs surface: no token streaming in v1. Webhooks carry coarse state transitions (ADR-0005) and the final result/artifacts, not per-token deltas. Token-level streaming over a fire-and-forget webhook job is deliberately deferred: it adds a transport (a persistent connection or chunked webhooks) whose complexity isn't justified yet.
Consequences
- Interactive go-llm usage gets real streaming through the transparent surface.
- Orchestration callers get state + final artifacts, which is what they need; they can use the sync streaming surface directly if they want tokens.
- The job state machine and webhook protocol stay simple (no streaming transport to design or operate).
Alternatives considered
- Stream tokens over the async surface too. Deferred: requires either a long-lived connection (defeats the point of async) or chunked-delta webhooks (complex, rarely needed). Revisit only on a concrete need.
- No streaming at all. Would break go-llm's Stream() and interactive use on the very path that is the primary goal. Rejected.