2.3 KiB
ADR-0004: Async job surface, job IDs, and queued execution
Status: Accepted — 2026-05-23
Context
The transparent passthrough (ADR-0003) is synchronous: the caller holds an HTTP connection until the completion returns. That is fine for interactive-length work and for go-llm, but two needs aren't served by it:
- Long-running jobs held open through Traefik risk idle-connection timeouts.
- Orchestration callers (mort/ratchet/werk-style) want fire-and-forget: submit, get an ID back immediately, and be told asynchronously when the work is done.
Decision
Add a distinct async surface: POST /jobs.
- The body carries a chat payload (native-Ollama-shaped, mirroring
/api/chat) plus optional extension fields, notablystate_webhook_url(ADR-0005). - foreman enqueues the job, assigns it a ULID (sortable, timestamped), and
immediately returns
202 Acceptedwith{ "job_id": "<ulid>" }. - The caller correlates later webhook callbacks to its request via
job_id. GET /jobs/{id}returns current state, result, and artifact references for polling-style callers or for recovery after a missed webhook.
Every unit of work is a row in the queue (ADR-0008) regardless of which surface
created it; the synchronous passthrough is simply a /jobs submission whose
handler blocks on the job's completion instead of returning the ID.
Job lifecycle
queued → loading → working → done, plus terminal failed. A job whose target
is unreachable re-enters queued with a backoff (it is retryable, never
auto-failed on a connection error — the target is a laptop, ADR-0002). A bounded
retry count guards against poison jobs; exceeding it moves the job to failed
with the last error recorded.
Consequences
- One queue, one execution engine, two entry points (sync passthrough, async
/jobs). - Job IDs are stable, sortable, and meaningful to correlate webhooks.
GET /jobs/{id}gives at-least-once webhook delivery a recovery path.
Alternatives considered
- Reuse the OpenAI response
idfield instead of a separate/jobssurface. Workable for sync, but doesn't give async callers an immediate handle before completion. The explicit/jobssurface is clearer. - UUIDv4 for IDs. Rejected in favor of ULID for natural time-ordering in the queue and logs.