Files
foreman/docs/adr/0004-async-job-surface.md
T
2026-05-23 16:41:20 -04:00

2.3 KiB

ADR-0004: Async job surface, job IDs, and queued execution

Status: Accepted — 2026-05-23

Context

The transparent passthrough (ADR-0003) is synchronous: the caller holds an HTTP connection until the completion returns. That is fine for interactive-length work and for go-llm, but two needs aren't served by it:

  • Long-running jobs held open through Traefik risk idle-connection timeouts.
  • Orchestration callers (mort/ratchet/werk-style) want fire-and-forget: submit, get an ID back immediately, and be told asynchronously when the work is done.

Decision

Add a distinct async surface: POST /jobs.

  • The body carries a chat payload (native-Ollama-shaped, mirroring /api/chat) plus optional extension fields, notably state_webhook_url (ADR-0005).
  • foreman enqueues the job, assigns it a ULID (sortable, timestamped), and immediately returns 202 Accepted with { "job_id": "<ulid>" }.
  • The caller correlates later webhook callbacks to its request via job_id.
  • GET /jobs/{id} returns current state, result, and artifact references for polling-style callers or for recovery after a missed webhook.

Every unit of work is a row in the queue (ADR-0008) regardless of which surface created it; the synchronous passthrough is simply a /jobs submission whose handler blocks on the job's completion instead of returning the ID.

Job lifecycle

queued → loading → working → done, plus terminal failed. A job whose target is unreachable re-enters queued with a backoff (it is retryable, never auto-failed on a connection error — the target is a laptop, ADR-0002). A bounded retry count guards against poison jobs; exceeding it moves the job to failed with the last error recorded.

Consequences

  • One queue, one execution engine, two entry points (sync passthrough, async /jobs).
  • Job IDs are stable, sortable, and meaningful to correlate webhooks.
  • GET /jobs/{id} gives at-least-once webhook delivery a recovery path.

Alternatives considered

  • Reuse the OpenAI response id field instead of a separate /jobs surface. Workable for sync, but doesn't give async callers an immediate handle before completion. The explicit /jobs surface is clearer.
  • UUIDv4 for IDs. Rejected in favor of ULID for natural time-ordering in the queue and logs.