foreman/docs/adr/0012-streaming.md
steve 0526bada90 docs: land prior ADR + prompt updates
Commit pre-existing uncommitted working-tree changes that predate the
license/public-readiness work — NOT authored in this session, just flushed so
they're not lost: ADR-0003/0005/0009/0012 edits, the new ADR-0013
(embeddings-bypass + two-slot residency, already referenced by CLAUDE.md), and
the phase-0..3 prompt revisions + prompts/README.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-26 20:33:39 -04:00


ADR-0012: Streaming support

Status: Accepted — 2026-05-23

Context

go-llm's provider interface has a Stream() method, and Ollama's native /api/chat streams token-by-token by default. The synchronous passthrough (ADR-0003) must not break streaming clients. Separately, the async /jobs surface (ADR-0004) reports progress via discrete state webhooks, which is a coarser granularity than token streaming.

Decision

  • Sync passthrough: support streaming. When a /api/chat request sets stream: true, foreman streams the target's token deltas back to the caller as NDJSON (application/x-ndjson: newline-delimited JSON chunks, Ollama's native streaming wire format, which go-llm reads with a bufio.Scanner). This is not SSE (text/event-stream). A streamed job still moves through the queue, and streaming begins only once the job reaches the working state, so a job waiting behind the drain-by-model queue (ADR-0009) starts streaming when its turn comes. go-llm's Stream() works against foreman unchanged.
  • Async /jobs surface: no token streaming in v1. Webhooks carry coarse state transitions (ADR-0005) and the final result/artifacts, not per-token deltas. Token-level streaming over a fire-and-forget webhook job is deliberately deferred — it adds a transport (persistent connection or chunked webhook) whose complexity isn't justified yet.

Consequences

  • Interactive go-llm usage gets real streaming through the transparent surface.
  • Orchestration callers get state + final artifacts, which is what they need; they can use the sync streaming surface directly if they want tokens.
  • The job state machine and webhook protocol stay simple (no streaming transport to design or operate).

Alternatives considered

  • Stream tokens over the async surface too. Deferred: requires either a long-lived connection (defeats the point of async) or chunked-delta webhooks (complex, rarely needed). Revisit only on a concrete need.
  • No streaming at all. Would break go-llm's Stream() and interactive use on the very path that is the primary goal. Rejected.