Benson Wong 02e015fa49 Introduce new routing backend (#790)

This is a huge backend change that essentially started with rewriting
the concurrency handling for processes and blew up to a refactor of the
entire application. In short these are the improvements:

**Better state and life cycle management:** 

Life cycle management of processes has always been the trickiest part of
the code. Juggling mutex locks between multiple locations to reduce race
conditions was complex. Too complex for my feeble brain to build a
simple mental model around as llama-swap gained more features. All of
that has been refactored. Most of the locks are gone, replaced with a
single run() that owns all state changes. There is one place to start
from now to understand and extend routing logic.

The improved life cycle management makes it easier to implement more
complex swap optimization strategies in the future like #727.
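
To illustrate the single-owner idea described above, here is a minimal sketch of a goroutine that owns all state changes, so callers send commands over a channel instead of taking locks. The `process`, `command`, and `setState` names are hypothetical and only show the pattern; this is not the actual llama-swap code.

```go
package main

import "fmt"

type state string

// command asks the owner goroutine to change state and report the result.
type command struct {
	next  state
	reply chan state
}

// process is a hypothetical stand-in for a managed upstream process.
type process struct {
	cmds chan command
	done chan struct{}
}

func newProcess() *process {
	p := &process{cmds: make(chan command), done: make(chan struct{})}
	go p.run()
	return p
}

// run is the only goroutine that reads or writes current, so no mutex is needed.
func (p *process) run() {
	current := state("stopped")
	for {
		select {
		case c := <-p.cmds:
			current = c.next
			c.reply <- current
		case <-p.done:
			return
		}
	}
}

// setState sends a request to the owner and waits for the new state.
func (p *process) setState(s state) state {
	reply := make(chan state, 1)
	p.cmds <- command{next: s, reply: reply}
	return <-reply
}

// stop ends the owner goroutine; setState must not be called afterwards.
func (p *process) stop() { close(p.done) }

func main() {
	p := newProcess()
	defer p.stop()
	fmt.Println(p.setState("starting")) // starting
	fmt.Println(p.setState("ready"))    // ready
}
```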

**Collation of requests:**

llama-swap previously handled requests and swapping in the order they
came in. For example requests for models in this order ABCABC would
result in 5 swaps. Now those requests are handled in this order AABBCC.
The result is less time waiting for swap under a high churn request
queue. This fixes #588 #612.
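
The swap arithmetic above can be checked with a small sketch. It assumes one model is loaded at a time and that the very first load is not counted as a swap. `countSwaps` and `collate` are hypothetical helpers written for this example, not functions from llama-swap.

```go
package main

import "fmt"

// countSwaps returns how many times the loaded model changes while serving
// requests in order. The first load is not counted as a swap (assumption).
func countSwaps(reqs []string) int {
	swaps := 0
	for i := 1; i < len(reqs); i++ {
		if reqs[i] != reqs[i-1] {
			swaps++
		}
	}
	return swaps
}

// collate groups queued requests by model, keeping the order in which each
// model first appeared and the arrival order within each model.
func collate(reqs []string) []string {
	var order []string
	byModel := map[string][]string{}
	for _, m := range reqs {
		if _, seen := byModel[m]; !seen {
			order = append(order, m)
		}
		byModel[m] = append(byModel[m], m)
	}
	out := make([]string, 0, len(reqs))
	for _, m := range order {
		out = append(out, byModel[m]...)
	}
	return out
}

func main() {
	queue := []string{"A", "B", "C", "A", "B", "C"}
	fmt.Println(countSwaps(queue))          // 5
	fmt.Println(collate(queue))             // [A A B B C C]
	fmt.Println(countSwaps(collate(queue))) // 2
}
```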

A possible future enhancement is to support a starvation parameter so
swap can be forced when models have been waiting too long.

**Shared base implementation for groups and swap matrix:** 

During the refactor it became clear that much of the swapping logic was
shared between these two implementations. That is not surprising
considering the swap matrix was added many moons after groups. Now they
share a common base and their specific swap strategies are implemented
into the swapPlanner interface.

Requests for bespoke or specific swapping scenarios are a common theme in
the issues. Now users can implement whatever bespoke and weird swapping
strategy they want in their own fork. Just ask your agent of choice to
implement swapPlanner. I'll still remain more conservative about what
actually lands in core llama-swap and will continue to evaluate whether
the changes in a PR are good for everyone or just one specific use case.

**AI / Agentic Disclosure:** 

I paid very close attention to the low level swap concurrency design and
implementation. It's important to keep that essential part reliable,
boring and free of surprises. Backwards compatibility was also maintained,
even the one way non-exclusive group model loading behaviour that people
have rightly pointed out to be a weird design decision.

With the underlying swap core done, the web server, API and UI sitting on
top were largely ported over with Claude Code and Opus 4.7 in multiple
phases. If you're curious I kept the changes in docs/newrouter-todo.md.
I did several passes to make sure things weren't left behind.

However, even frontier LLMs at the time of this PR still make small
decisions that don't make a lot of sense. They get shit wrong all the
time, just in small subtle ways.

That said, there are likely to be some new bugs introduced with this
massive refactor. I'm fairly confident that there are no major
architectural flaws that would cause goal seeking agents to make dumb,
ugly code decisions.

For a little while the legacy llama-swap will be available under
cmd/legacy/llama-swap. The plan is to eventually delete that entry point
as well as the proxy package.

On a bit of a personal note, this PR is exciting and a bit sad for me. I
hand wrote much of the original code and this PR ultimately replaces
much of it. While the old code served as a good reference for the agent
to implement the new stuff, it is still a bit sad to eventually delete it
all.

2026-05-28 21:47:01 -07:00

15 KiB


New Router Migration TODO

This document tracks the work needed for cmd/newrouter/main.go and internal/router/ to reach feature parity with the legacy entrypoint at llama-swap.go plus proxy/proxymanager.go.

The work is split into phases so each can land and be tested independently. Earlier phases unblock later ones.

Current state (newrouter)

cmd/newrouter already supports:

Loading config via -config
Selecting Matrix vs Group router based on config
Peer routing fallback
Plain HTTP listen (-listen)
Graceful shutdown on SIGINT / SIGTERM
Model extraction from JSON body, query string, and form bodies (see router.go:88)
Server.ServeHTTP dispatches a single request to peer or local router based on the requested model

Everything below is missing or only partially implemented.

Phase 1 — Package relocation -- Completed.

Goal: move shared infrastructure packages out from under proxy/ so the new router does not depend on the legacy proxy tree. This is a prerequisite for retiring proxy/ in Phase 8.

Phase 2 — Server lifecycle parity -- Completed.

Goal: make cmd/newrouter a drop-in replacement for the legacy binary's process model, without yet adding any extra HTTP endpoints.

Phase 3 — `internal/chain` package -- Completed.

API: chain.New(mws...).Then(final) for ServeMux registration; Append returns an extended Chain without mutating the receiver, so a base stack (auth/CORS) can be reused across many routes with per-route additions.
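
As a rough illustration of the API shape described above, the sketch below reimplements a tiny middleware chain with `New`, `Append`, and `Then`. It is an assumption about how such a package could look, not the actual `internal/chain` source; the `tag`, `trace`, and `noop` helpers exist only for the demo.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

type Middleware func(http.Handler) http.Handler

type Chain struct{ mws []Middleware }

// New copies the given middlewares so later changes to the argument slice
// cannot affect the chain.
func New(mws ...Middleware) Chain {
	return Chain{mws: append([]Middleware(nil), mws...)}
}

// Append returns a new Chain; the receiver's slice is copied, never mutated.
func (c Chain) Append(mws ...Middleware) Chain {
	out := make([]Middleware, 0, len(c.mws)+len(mws))
	out = append(out, c.mws...)
	out = append(out, mws...)
	return Chain{mws: out}
}

// Then wraps final so the first middleware in the chain runs outermost.
func (c Chain) Then(final http.Handler) http.Handler {
	h := final
	for i := len(c.mws) - 1; i >= 0; i-- {
		h = c.mws[i](h)
	}
	return h
}

// tag is a demo middleware that appends its name to a response header.
func tag(name string) Middleware {
	return func(next http.Handler) http.Handler {
		return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			w.Header().Add("X-Trace", name)
			next.ServeHTTP(w, r)
		})
	}
}

// noop is a final handler that does nothing.
func noop() http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {})
}

// trace runs one request through h and returns the middleware names seen.
func trace(h http.Handler) []string {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/", nil))
	return rec.Header().Values("X-Trace")
}

func main() {
	base := New(tag("auth"), tag("cors"))
	mux := http.NewServeMux()
	mux.Handle("/v1/models", base.Then(noop()))
	mux.Handle("/v1/chat/completions", base.Append(tag("filter")).Then(noop()))

	fmt.Println(trace(base.Append(tag("filter")).Then(noop()))) // [auth cors filter]
	fmt.Println(trace(base.Then(noop())))                       // [auth cors]
}
```

Copying the slice in `Append` is what lets one base stack be shared across routes: a per-route addition can never leak into another route's handler.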

Phase 4 — `internal/server` package scaffolding (ProxyManager replacement) -- Completed.

Goal: build the internal/server package so it can stand in for proxy.ProxyManager — the mux, lifecycle, model dispatch, custom endpoints, request filters, auth/CORS, and upstream passthrough. After this phase, cmd/newrouter/main.go constructs a server.Server instead of a bare router.Server.

The legacy ProxyManager collapses three concerns into one struct: the HTTP mux, the model→process router, and the cross-cutting services (loggers, metrics, perf, inflight counter, version). The new layout keeps the router.Router implementations focused on model dispatch and lets internal/server.Server own the mux and all cross-cutting middleware. server.Server builds the local and peer routers directly and dispatches between them itself, so it fully supersedes internal/router.Server — see the cleanup item below.

The phase is split into sub-phases that can land and be tested independently:

| Sub-phase | Scope |
| --- | --- |
| 4a | package scaffolding — struct, `New`, `ServeHTTP`, `Shutdown`, model routes |
| 4b | custom (non-model-dispatched) HTTP endpoints |
| 4c | request-body filter middleware |
| 4d | auth & CORS middleware |
| 4e | upstream passthrough |

The package is split by concern across stub files already in place:

| File | Responsibility | Filled in by |
| --- | --- | --- |
| `server.go` | `Server` struct, `New`, `ServeHTTP`, `Shutdown` | 4a |
| `log.go` | `muxlog` combined logger; `/logs` handlers | 4a |
| `auth.go` | `CreateAuthMiddleware` | 4d |
| `filters.go` | request-body filter middleware | 4c |
| `api.go` | llama-swap-specific API handlers | 4b / Phase 5 / Phase 6 |
| `ui.go` | embedded UI serving | Phase 7 |

Phase 4a — package scaffolding -- Completed.

server.Server owns the mux, the local/peer routers, muxlog, and a shutdown context. New builds the routers, registers all model-dispatched routes on a stdlib http.ServeMux, and wraps the mux with the global CORS middleware. localPeerHandler resolves the model once via router.FetchModel and dispatches to local or peer. Shutdown stops both routers in parallel and is idempotent. cmd/newrouter/main.go now constructs server.New(...); internal/router/server.go and server_test.go were removed as dead code.
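
The "parallel and idempotent" shutdown behaviour can be sketched with `sync.Once` and a `sync.WaitGroup`. The `stopper` interface and `countingRouter` type below are hypothetical stand-ins for the real routers, whose method names are not shown in this document.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// stopper is a hypothetical stand-in for a router's shutdown method.
type stopper interface{ Stop() }

// countingRouter records how many times Stop was called.
type countingRouter struct{ stops atomic.Int32 }

func (r *countingRouter) Stop() { r.stops.Add(1) }

type Server struct {
	local, peer stopper
	once        sync.Once
}

// Shutdown stops both routers concurrently and waits for both to finish.
// sync.Once turns every call after the first into a no-op.
func (s *Server) Shutdown() {
	s.once.Do(func() {
		var wg sync.WaitGroup
		for _, r := range []stopper{s.local, s.peer} {
			wg.Add(1)
			go func(r stopper) {
				defer wg.Done()
				r.Stop()
			}(r)
		}
		wg.Wait()
	})
}

func main() {
	local, peer := &countingRouter{}, &countingRouter{}
	s := &Server{local: local, peer: peer}
	s.Shutdown()
	s.Shutdown() // second call does nothing
	fmt.Println(local.stops.Load(), peer.stops.Load()) // 1 1
}
```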

Phase 4b — Custom HTTP endpoints -- Completed.

GET /v1/models (local + peer models, aliases, metadata), GET /health, GET /wol-health, and GET / → /ui are registered. GET /favicon.ico is deferred to Phase 7 since it requires the embedded UI filesystem.

Phase 4c — Request-body filters -- Completed.

CreateFilterMiddleware (in filters.go) applies UseModelName, StripParams, SetParams, and SetParamsByID to JSON requests, then re-attaches the body with Content-Length / Transfer-Encoding cleanup.
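
A minimal sketch of the filter-then-reattach step follows. It assumes StripParams removes top-level JSON keys and SetParams overrides them; `applyFilters` here takes plain arguments and `reattachBody` is a hypothetical helper, so the real signatures in `filters.go` may differ.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strconv"
	"strings"
)

// applyFilters removes the strip keys and then applies the set overrides
// to a JSON object body (assumed semantics of StripParams / SetParams).
func applyFilters(body []byte, strip []string, set map[string]any) ([]byte, error) {
	var payload map[string]any
	if err := json.Unmarshal(body, &payload); err != nil {
		return nil, err
	}
	for _, k := range strip {
		delete(payload, k)
	}
	for k, v := range set {
		payload[k] = v
	}
	return json.Marshal(payload)
}

// reattachBody replaces the request body and fixes the length metadata so
// the upstream sees a consistent, non-chunked request.
func reattachBody(r *http.Request, body []byte) {
	r.Body = io.NopCloser(bytes.NewReader(body))
	r.ContentLength = int64(len(body))
	r.Header.Set("Content-Length", strconv.Itoa(len(body)))
	r.Header.Del("Transfer-Encoding")
	r.TransferEncoding = nil
}

func main() {
	in := `{"model":"a","temperature":0.9,"top_p":1}`
	r := httptest.NewRequest(http.MethodPost, "/v1/chat/completions", strings.NewReader(in))

	raw, err := io.ReadAll(r.Body) // the body is read exactly once
	if err != nil {
		panic(err)
	}
	out, err := applyFilters(raw, []string{"temperature"}, map[string]any{"top_p": 0.5})
	if err != nil {
		panic(err)
	}
	reattachBody(r, out)
	fmt.Println(string(out)) // {"model":"a","top_p":0.5}
	fmt.Println(r.ContentLength == int64(len(out)))
}
```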

Phase 4d — Auth & CORS -- Completed.

CreateAuthMiddleware validates API keys (Bearer / Basic / x-api-key) and strips the headers before upstream. CreateCORSMiddleware answers OPTIONS preflight; /v1/models echoes the Origin.
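
The key extraction and header stripping could look roughly like the sketch below. Two details are assumptions made for the example: the order in which the three header styles are checked, and that the Basic auth password carries the key. `newReq` is a demo helper only.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// extractAPIKey looks for a key in x-api-key, then Bearer, then Basic auth.
// The lookup order and the use of the Basic password are assumptions.
func extractAPIKey(r *http.Request) string {
	if key := r.Header.Get("x-api-key"); key != "" {
		return key
	}
	auth := r.Header.Get("Authorization")
	if strings.HasPrefix(auth, "Bearer ") {
		return strings.TrimSpace(strings.TrimPrefix(auth, "Bearer "))
	}
	if _, pass, ok := r.BasicAuth(); ok {
		return pass
	}
	return ""
}

// stripAuthHeaders removes credentials so they are not forwarded upstream.
func stripAuthHeaders(r *http.Request) {
	r.Header.Del("Authorization")
	r.Header.Del("x-api-key")
}

// newReq builds a demo request with one optional header set.
func newReq(header, value string) *http.Request {
	r := httptest.NewRequest(http.MethodGet, "/v1/models", nil)
	if header != "" {
		r.Header.Set(header, value)
	}
	return r
}

func main() {
	r := newReq("Authorization", "Bearer sk-demo")
	fmt.Println(extractAPIKey(r)) // sk-demo
	stripAuthHeaders(r)
	fmt.Println(extractAPIKey(r) == "") // true
}
```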

Phase 4e — Upstream passthrough -- Completed.

GET /upstream → /ui/models, and /upstream/<model>/<path> proxies to the resolved model with multi-segment name resolution, canonical-form redirect (301/308), and prefix stripping.
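
Multi-segment name resolution matters because a model ID may itself contain slashes. The sketch below tries progressively longer path prefixes until one matches a known model, then returns the remainder as the upstream path. The shortest-match-first order and the `known` map parameter are assumptions; the real `findModelInPath` may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// findModelInPath splits /upstream/<model>/<path> where <model> may span
// several segments. It returns the model ID and the path to send upstream.
func findModelInPath(path string, known map[string]bool) (model, rest string, ok bool) {
	trimmed := strings.TrimPrefix(path, "/upstream/")
	parts := strings.Split(trimmed, "/")
	for i := 1; i <= len(parts); i++ {
		candidate := strings.Join(parts[:i], "/")
		if known[candidate] {
			// Prefix stripping: everything after the model becomes the upstream path.
			return candidate, "/" + strings.Join(parts[i:], "/"), true
		}
	}
	return "", "", false
}

func main() {
	known := map[string]bool{"llama": true, "org/qwen": true}

	fmt.Println(findModelInPath("/upstream/llama/v1/chat/completions", known))
	// llama /v1/chat/completions true
	fmt.Println(findModelInPath("/upstream/org/qwen/health", known))
	// org/qwen /health true
	fmt.Println(findModelInPath("/upstream/unknown/health", known))
	//   false
}
```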

Phase 5 — Operations endpoints -- Completed.

A new router.LocalRouter interface embeds Router and adds RunningModels() and Unload(timeout, models...), both implemented once on baseRouter so Group and Matrix share them — the legacy matrix/group divergence at proxymanager.go:1167 collapses since baseRouter already unifies process storage. Peer does not implement it; Server.local is typed LocalRouter, Server.peer stays Router.

GET /unload stops every local process; GET /running lists non-stopped processes joined against config for cmd/proxy/ttl/name/description. startPreload fires a background GET / at each Hooks.OnStartup.Preload model and emits shared.ModelPreloadedEvent.

Phase 6 — Metrics, perf, and SSE -- Completed.

perf.Monitor is created and started in cmd/newrouter/main.go (it outlives config reloads via UpdateConfig) and passed into server.New. GET /metrics serves perf.Monitor.MetricsHandler() output, 503 when disabled.

internal/process emits shared.ProcessStateChangeEvent from setState. server.inflightCounter (atomic) + CreateInflightMiddleware track model-dispatched requests and emit InFlightRequestsEvent. metricsMonitor (in metrics.go) parses token usage from upstream responses via CreateMetricsMiddleware.
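
An atomic inflight counter wrapped in middleware might look like the sketch below. The `onChange` callback stands in for emitting `InFlightRequestsEvent`; the method name, callback shape, and `observe` helper are assumptions for illustration.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// inflightCounter tracks the number of requests currently being served.
type inflightCounter struct{ n atomic.Int64 }

// Middleware increments the counter before the handler runs and decrements
// it afterwards, reporting each new value through onChange.
func (c *inflightCounter) Middleware(onChange func(int64)) func(http.Handler) http.Handler {
	return func(next http.Handler) http.Handler {
		return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			onChange(c.n.Add(1))
			// defer keeps the count correct even if the handler panics.
			defer func() { onChange(c.n.Add(-1)) }()
			next.ServeHTTP(w, r)
		})
	}
}

// observe serves one request and returns the counter values that were reported.
func observe() []int64 {
	var c inflightCounter
	var events []int64
	h := c.Middleware(func(n int64) { events = append(events, n) })(
		http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {}))
	h.ServeHTTP(httptest.NewRecorder(),
		httptest.NewRequest(http.MethodPost, "/v1/chat/completions", nil))
	return events
}

func main() {
	fmt.Println(observe()) // [1 0]
}
```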

The /api group (API-key protected) is registered: POST /api/models/unload, POST /api/models/unload/{model...}, GET /api/events (SSE: modelStatus / logData / metrics / inflight), GET /api/metrics, GET /api/performance (?after= RFC3339 filter), GET /api/version. GET /api/captures/{id} returns 501 until 6f.

Phase 6f — Request/response captures -- Completed.

proxy/cache moved to internal/cache. metricsMonitor stores zstd+CBOR ReqRespCapture records in a sized cache.Cache (captureBuffer MB, 0 disables). CreateMetricsMiddleware buffers request body/headers before dispatch; record builds the capture per a captureFieldsByPath table (captures.go) that trims large audio/image payloads, defaulting JSON routes to captureAll. GET /api/captures/{id} decompresses and returns the capture; getMetrics resolves HasCapture against the cache.

Phase 7 — UI serving -- Completed.

internal/server/ui.go embeds ui_dist and serves it. GET /ui/ is brotli/gzip-aware via serveCompressedFile; unknown paths without a file extension fall back to index.html for SPA routing. GET /favicon.ico serves from the same embedded FS. The Makefile ui target copies the vite build into internal/server/ui_dist; a committed placeholder.txt keeps the embed valid before a build runs.
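
The SPA fallback rule can be expressed as a small path-resolution function. This sketch uses an in-memory `fstest.MapFS` in place of the embedded `ui_dist` filesystem; `resolveUIPath` and `demoFS` are hypothetical names, and compression handling is left out.

```go
package main

import (
	"fmt"
	"io/fs"
	"path"
	"strings"
	"testing/fstest"
)

// resolveUIPath maps a request path under /ui/ to a file in fsys.
// Existing files are served as-is, extension-less unknown paths fall back
// to index.html for client-side routing, and anything else yields "" (404).
func resolveUIPath(fsys fs.FS, reqPath string) string {
	name := strings.TrimPrefix(reqPath, "/ui/")
	if name == "" {
		return "index.html"
	}
	if info, err := fs.Stat(fsys, name); err == nil && !info.IsDir() {
		return name
	}
	if path.Ext(name) == "" {
		return "index.html"
	}
	return ""
}

// demoFS is a stand-in for the embedded UI build output.
func demoFS() fs.FS {
	return fstest.MapFS{
		"index.html":    {Data: []byte("<html></html>")},
		"assets/app.js": {Data: []byte("console.log(1)")},
	}
}

func main() {
	ui := demoFS()
	fmt.Println(resolveUIPath(ui, "/ui/assets/app.js")) // assets/app.js
	fmt.Println(resolveUIPath(ui, "/ui/models/llama"))  // index.html
	fmt.Println(resolveUIPath(ui, "/ui/missing.png") == "") // true
}
```

One consequence of an extension-based rule is that a client-side route ending in something that looks like an extension would not fall back to `index.html` in this sketch.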

Phase 8a - Review Part I

All functionality from the proxy package has been migrated in the above phases — with the remaining gaps listed in Phase 8b
Test coverage meets or exceeds the level from the proxy package — internal/server now at 76.6% vs 73.9% (proxy)

Findings

Gap 1 — Request logging middleware missing -- Resolved.

CreateRequestLogMiddleware (log.go) records one access-log line per request to s.proxylog in the legacy format clientIP "METHOD PATH PROTO" status bodySize "UA" duration, skipping /wol-health, /api/performance, and /metrics. A statusRecorder captures the status/body size (forwarding Flush for SSE) and clientIP honours X-Forwarded-For / X-Real-IP. It is wired as the outermost middleware in routes(), wrapping the CORS layer.
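
A sketch of the pieces named above: a `statusRecorder` that forwards `Flush`, a `clientIP` helper, and the access-log middleware with its skip list. The `requestLog` function, its `logf` callback, and the `demo` helper are assumptions for illustration, and the exact field formatting of the real log line may differ.

```go
package main

import (
	"fmt"
	"net"
	"net/http"
	"net/http/httptest"
	"strings"
	"time"
)

// statusRecorder captures the status code and body size of a response.
type statusRecorder struct {
	http.ResponseWriter
	status, size int
}

func (r *statusRecorder) WriteHeader(code int) {
	r.status = code
	r.ResponseWriter.WriteHeader(code)
}

func (r *statusRecorder) Write(b []byte) (int, error) {
	if r.status == 0 {
		r.status = http.StatusOK // implicit 200 on first write
	}
	n, err := r.ResponseWriter.Write(b)
	r.size += n
	return n, err
}

// Flush forwards to the wrapped writer so SSE streaming keeps working.
func (r *statusRecorder) Flush() {
	if f, ok := r.ResponseWriter.(http.Flusher); ok {
		f.Flush()
	}
}

// clientIP prefers X-Forwarded-For, then X-Real-IP, then the socket address.
func clientIP(r *http.Request) string {
	if xff := r.Header.Get("X-Forwarded-For"); xff != "" {
		first, _, _ := strings.Cut(xff, ",")
		return strings.TrimSpace(first)
	}
	if ip := r.Header.Get("X-Real-IP"); ip != "" {
		return ip
	}
	if host, _, err := net.SplitHostPort(r.RemoteAddr); err == nil {
		return host
	}
	return r.RemoteAddr
}

var skipPaths = map[string]bool{"/wol-health": true, "/api/performance": true, "/metrics": true}

// requestLog writes one access-log line per request, except for skipped paths.
func requestLog(logf func(string), next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if skipPaths[r.URL.Path] {
			next.ServeHTTP(w, r)
			return
		}
		start := time.Now()
		rec := &statusRecorder{ResponseWriter: w}
		next.ServeHTTP(rec, r)
		if rec.status == 0 {
			rec.status = http.StatusOK
		}
		logf(fmt.Sprintf("%s \"%s %s %s\" %d %d \"%s\" %v", clientIP(r), r.Method,
			r.URL.Path, r.Proto, rec.status, rec.size, r.UserAgent(), time.Since(start)))
	})
}

// demo sends one request for path through the middleware and returns the log lines.
func demo(path string) []string {
	var lines []string
	h := requestLog(func(s string) { lines = append(lines, s) },
		http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			http.Error(w, "nope", http.StatusNotFound)
		}))
	req := httptest.NewRequest(http.MethodGet, path, nil)
	req.Header.Set("X-Forwarded-For", "203.0.113.7, 10.0.0.1")
	req.Header.Set("User-Agent", "curl/8")
	h.ServeHTTP(httptest.NewRecorder(), req)
	return lines
}

func main() {
	fmt.Println(demo("/v1/models"))
	fmt.Println(len(demo("/metrics"))) // 0
}
```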

Gap 2 — Per-model log streaming not supported -- Resolved.

Server.getLogger (log.go:50) only handles "", "proxy", and "upstream". The legacy ProxyManager.getLogger (proxymanager_loghandlers.go:92) additionally resolves a model ID against the active process groups / matrix and returns that process's logger. Callers of GET /logs/stream/<modelID> will get a 400 instead of the model's live log stream.

Gap 3 — UseModelName not applied to multipart form endpoints -- Resolved.

CreateFormFilterMiddleware (filters.go) parses multipart/form-data requests, rewrites the model field with UseModelName, reconstructs the body via rewriteMultipartModel, and re-attaches it with Content-Type / Content-Length cleanup. It runs in modelChain after the JSON filterMW; each is a no-op for the other's Content-Type. Audio transcription (/v1/audio/transcriptions) and image edit (/v1/images/edits) now honour use_model_name.

Coverage gaps (0 % functions) -- Resolved.

The functions previously at 0 % (handleListModels, handleMetrics, handleRootRedirect, handleUpstreamRedirect, handleUpstream, findModelInPath, handleAPICapture, handleAPIUnloadAll, handleAPIUnloadModel, CreateAuthMiddleware, extractAPIKey, handleLogStream, applyFilters, decompressBody, filterAcceptEncoding, handleUI, handleFavicon) now have tests across auth_test.go, api_test.go, filters_test.go, log_test.go, and extras_test.go.

Phase 8b - Fill gaps discovered in Phase 8a

Add request-log middleware — CreateRequestLogMiddleware (log.go) records clientIP "METHOD PATH PROTO" status bodySize "UA" duration to s.proxylog, skips /wol-health / /api/performance / /metrics, and is wired as the outermost middleware in routes().
Extend getLogger with model-ID resolution — add a default: branch to Server.getLogger (log.go:50) that resolves the ID via s.local (using a new LocalRouter.GetProcess(name) method or equivalent) and returns that process's Logger(). Match the fallback behaviour: return a 400 with "invalid logger. Use 'proxy', 'upstream' or a model's ID" when not found.
UseModelName rewrite for multipart endpoints — CreateFormFilterMiddleware parses multipart/form-data, rewrites the model field according to UseModelName, reconstructs the body, and updates Content-Type / Content-Length. It is wired into modelChain after the JSON filter.
Raise test coverage to ≥ 74 % — internal/server now at 76.1%; tests added for every 0 % function across auth_test.go, api_test.go, filters_test.go, log_test.go, and extras_test.go.

Phase 8c - Review Part II (entrypoint comparison)

A second pass comparing cmd/newrouter/main.go against the legacy llama-swap.go + proxy.New surfaced four more gaps, all in logger setup.

Gap 4 — LogToStdout config ignored -- Resolved.

cmd/newrouter/main.go previously hardcoded proxyLog / upstreamLog to os.Stdout, and the old muxlog() helper built a Monitor that nothing wrote into — so logToStdout had no effect and /logs (combined history) was always empty. server.NewLoggers (log.go) now replicates the legacy switch: proxy / upstream monitors feed muxLog (or io.Discard) per none / both / upstream / proxy, so muxLog accumulates the combined history. server.New takes muxlog as a parameter. The loggers outlive config reloads, so a LogToStdout change requires a restart to take effect.

Gap 5 — LogTimeFormat config ignored -- Resolved.

cmd/newrouter/main.go now maps cfg.LogTimeFormat to a Go time layout via the logTimeFormats table and applies it (alongside log level) to the proxy and upstream monitors in applyLogSettings, re-applied on config reload.

Gap 6 — LogRequests deprecation warning missing.

The legacy proxymanager.go:127 warns when the deprecated logRequests config key is set. cmd/newrouter does not. Low priority — left open.

Gap 7 — PID debug log missing -- Resolved.

cmd/newrouter/main.go now logs PID: %d at debug level after applyLogSettings, matching llama-swap.go:71.

Phase X (tbd) — Cutover

Swap llama-swap.go to delegate to cmd/newrouter (or rename newrouter to be the primary entrypoint)
Update Makefile build targets
Update docs / README references to the legacy binary
Remove proxy/proxymanager*.go and gin-gonic dependency once nothing imports them
Run make test-all and confirm concurrency suite still passes against the new entrypoint

Cross-cutting concerns to keep in mind

Single body read: legacy and newrouter both buffer the request body once. When adding filters (Phase 4c), make sure the buffered bytes flow through Content-Length / transfer-encoding cleanup as in proxymanager.go:872.
Streaming flag in context: legacy stashes streaming and model under proxyCtxKey. The new router uses ModelKey / ModelIDKey — pick one set of keys and use them consistently for metrics + log handlers.
Matrix vs Group divergence: any handler that calls swapProcessGroup or findGroupByModelName in the legacy needs a matrix branch too. The new router's Router interface already abstracts this — preserve that abstraction rather than reintroducing the branch in every handler.
Shutdown ordering: httpServer.Shutdown must drain inflight requests before Server.Shutdown tears down processes, otherwise inflight requests fail with a 502. Current newrouter ordering at main.go:87 is correct — keep it.
