53 lines
2.3 KiB
Markdown
53 lines
2.3 KiB
Markdown
# ADR-0004: Async job surface, job IDs, and queued execution
|
|
|
|
**Status:** Accepted — 2026-05-23
|
|
|
|
## Context
|
|
|
|
The transparent passthrough (ADR-0003) is synchronous: the caller holds an HTTP
|
|
connection until the completion returns. That is fine for interactive-length work
|
|
and for go-llm, but two needs aren't served by it:
|
|
|
|
- Long-running jobs held open through Traefik risk idle-connection timeouts.
|
|
- Orchestration callers (mort/ratchet/werk-style) want fire-and-forget: submit,
|
|
get an ID back immediately, and be told asynchronously when the work is done.
|
|
|
|
## Decision
|
|
|
|
Add a distinct async surface: `POST /jobs`.
|
|
|
|
- The body carries a chat payload (native-Ollama-shaped, mirroring `/api/chat`)
|
|
plus optional extension fields, notably `state_webhook_url` (ADR-0005).
|
|
- foreman enqueues the job, assigns it a **ULID** (sortable, timestamped), and
|
|
immediately returns `202 Accepted` with `{ "job_id": "<ulid>" }`.
|
|
- The caller correlates later webhook callbacks to its request via `job_id`.
|
|
- `GET /jobs/{id}` returns current state, result, and artifact references for
|
|
polling-style callers or for recovery after a missed webhook.
|
|
|
|
Every unit of work is a row in the queue (ADR-0008) regardless of which surface
|
|
created it; the synchronous passthrough is simply a `/jobs` submission whose
|
|
handler blocks on the job's completion instead of returning the ID.
|
|
|
|
### Job lifecycle
|
|
|
|
`queued → loading → working → done`, plus terminal `failed`. A job whose target
|
|
is unreachable re-enters `queued` with a backoff (it is retryable, never
|
|
auto-failed on a connection error — the target is a laptop, ADR-0002). A bounded
|
|
retry count guards against poison jobs; exceeding it moves the job to `failed`
|
|
with the last error recorded.
|
|
|
|
## Consequences
|
|
|
|
- One queue, one execution engine, two entry points (sync passthrough, async
|
|
`/jobs`).
|
|
- Job IDs are stable, sortable, and meaningful to correlate webhooks.
|
|
- `GET /jobs/{id}` gives at-least-once webhook delivery a recovery path.
|
|
|
|
## Alternatives considered
|
|
|
|
- **Reuse the OpenAI response `id` field instead of a separate `/jobs` surface.**
|
|
Workable for sync, but doesn't give async callers an immediate handle before
|
|
completion. The explicit `/jobs` surface is clearer.
|
|
- **UUIDv4 for IDs.** Rejected in favor of ULID for natural time-ordering in the
|
|
queue and logs.
|