steve 0526bada90 docs: land prior ADR + prompt updates

Commit pre-existing uncommitted working-tree changes that predate the
license/public-readiness work — NOT authored in this session, just flushed so
they're not lost: ADR-0003/0005/0009/0012 edits, the new ADR-0013
(embeddings-bypass + two-slot residency, already referenced by CLAUDE.md), and
the phase-0..3 prompt revisions + prompts/README.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

2026-06-26 20:33:39 -04:00

2.1 KiB


ADR-0012: Streaming support

Status: Accepted — 2026-05-23

Context

go-llm's provider interface has a Stream() method, and Ollama's native /api/chat streams token-by-token by default. The synchronous passthrough (ADR-0003) must not break streaming clients. Separately, the async /jobs surface (ADR-0004) reports progress via discrete state webhooks, which is a different granularity than token streaming.

Decision

Sync passthrough: support streaming. When a /api/chat request sets stream: true, foreman streams the target's token deltas back to the caller as NDJSON (application/x-ndjson: newline-delimited JSON chunks, which is Ollama's native streaming wire format and which go-llm reads with a bufio.Scanner). This is not SSE (text/event-stream). A streamed job still moves through the queue, and streaming begins only once the job reaches working, so a job waiting behind the drain-by-model queue (ADR-0009) simply starts streaming when its turn comes. go-llm's Stream() works against foreman unchanged.
Async /jobs surface: no token streaming in v1. Webhooks carry coarse state transitions (ADR-0005) and the final result/artifacts, not per-token deltas. Token-level streaming over a fire-and-forget webhook job is deliberately deferred — it adds a transport (persistent connection or chunked webhook) whose complexity isn't justified yet.

Consequences

Interactive go-llm usage gets real streaming through the transparent surface.
Orchestration callers get state + final artifacts, which is what they need; they can use the sync streaming surface directly if they want tokens.
The job state machine and webhook protocol stay simple (no streaming transport to design or operate).

Alternatives considered

Stream tokens over the async surface too. Deferred: requires either a long-lived connection (defeats the point of async) or chunked-delta webhooks (complex, rarely needed). Revisit only on a concrete need.
No streaming at all. Would break go-llm's Stream() and interactive use on the very path that is the primary goal. Rejected.
