# ADR-0004: Async job surface, job IDs, and queued execution **Status:** Accepted — 2026-05-23 ## Context The transparent passthrough (ADR-0003) is synchronous: the caller holds an HTTP connection until the completion returns. That is fine for interactive-length work and for go-llm, but two needs aren't served by it: - Long-running jobs held open through Traefik risk idle-connection timeouts. - Orchestration callers (mort/ratchet/werk-style) want fire-and-forget: submit, get an ID back immediately, and be told asynchronously when the work is done. ## Decision Add a distinct async surface: `POST /jobs`. - The body carries a chat payload (native-Ollama-shaped, mirroring `/api/chat`) plus optional extension fields, notably `state_webhook_url` (ADR-0005). - foreman enqueues the job, assigns it a **ULID** (sortable, timestamped), and immediately returns `202 Accepted` with `{ "job_id": "" }`. - The caller correlates later webhook callbacks to its request via `job_id`. - `GET /jobs/{id}` returns current state, result, and artifact references for polling-style callers or for recovery after a missed webhook. Every unit of work is a row in the queue (ADR-0008) regardless of which surface created it; the synchronous passthrough is simply a `/jobs` submission whose handler blocks on the job's completion instead of returning the ID. ### Job lifecycle `queued → loading → working → done`, plus terminal `failed`. A job whose target is unreachable re-enters `queued` with a backoff (it is retryable, never auto-failed on a connection error — the target is a laptop, ADR-0002). A bounded retry count guards against poison jobs; exceeding it moves the job to `failed` with the last error recorded. ## Consequences - One queue, one execution engine, two entry points (sync passthrough, async `/jobs`). - Job IDs are stable, sortable, and meaningful to correlate webhooks. - `GET /jobs/{id}` gives at-least-once webhook delivery a recovery path. ## Alternatives considered - **Reuse the OpenAI response `id` field instead of a separate `/jobs` surface.** Workable for sync, but doesn't give async callers an immediate handle before completion. The explicit `/jobs` surface is clearer. - **UUIDv4 for IDs.** Rejected in favor of ULID for natural time-ordering in the queue and logs.