Files
foreman/docs/adr/0004-async-job-surface.md
T
2026-05-23 16:41:20 -04:00

53 lines
2.3 KiB
Markdown

# ADR-0004: Async job surface, job IDs, and queued execution
**Status:** Accepted — 2026-05-23
## Context
The transparent passthrough (ADR-0003) is synchronous: the caller holds an HTTP
connection until the completion returns. That is fine for interactive-length work
and for go-llm, but two needs aren't served by it:
- Long-running jobs held open through Traefik risk idle-connection timeouts.
- Orchestration callers (mort/ratchet/werk-style) want fire-and-forget: submit,
get an ID back immediately, and be told asynchronously when the work is done.
## Decision
Add a distinct async surface: `POST /jobs`.
- The body carries a chat payload (native-Ollama-shaped, mirroring `/api/chat`)
plus optional extension fields, notably `state_webhook_url` (ADR-0005).
- foreman enqueues the job, assigns it a **ULID** (sortable, timestamped), and
immediately returns `202 Accepted` with `{ "job_id": "<ulid>" }`.
- The caller correlates later webhook callbacks to its request via `job_id`.
- `GET /jobs/{id}` returns current state, result, and artifact references for
polling-style callers or for recovery after a missed webhook.
Every unit of work is a row in the queue (ADR-0008) regardless of which surface
created it; the synchronous passthrough is simply a `/jobs` submission whose
handler blocks on the job's completion instead of returning the ID.
### Job lifecycle
`queued → loading → working → done`, plus terminal `failed`. A job whose target
is unreachable re-enters `queued` with a backoff (it is retryable, never
auto-failed on a connection error — the target is a laptop, ADR-0002). A bounded
retry count guards against poison jobs; exceeding it moves the job to `failed`
with the last error recorded.
## Consequences
- One queue, one execution engine, two entry points (sync passthrough, async
`/jobs`).
- Job IDs are stable, sortable, and meaningful to correlate webhooks.
- `GET /jobs/{id}` gives at-least-once webhook delivery a recovery path.
## Alternatives considered
- **Reuse the OpenAI response `id` field instead of a separate `/jobs` surface.**
Workable for sync, but doesn't give async callers an immediate handle before
completion. The explicit `/jobs` surface is clearer.
- **UUIDv4 for IDs.** Rejected in favor of ULID for natural time-ordering in the
queue and logs.