llama-swap

Author	SHA1	Message	Date
Benson Wong	02e015fa49	Introduce new routing backend (#790) This is a huge backend change that essentially started with rewriting the concurrency handling for processes and blew up to a refactor of the entire application. In short these are the improvements: Better state and life cycle management: Life cycle management of processes has always been the trickiest part of the code. Juggling mutex locks between multiple locations to reduce race conditions was complex. Too complex for my feeble brain to build a simple mental model around as llama-swap gained more features. All of that has been refactored. Most of the locks are gone, replaced with a single run() that owns all state changes. There is one place to start from now to understand and extend routing logic. The improved life cycle management makes it easier to implement more complex swap optimization strategies in the future like #727. Collation of requests: llama-swap previously handled requests and swapping in the order they came in. For example requests for models in this order ABCABC would result in 5 swaps. Now those requests are handled in this order AABBCC. The result is less time waiting for swaps under a high churn request queue. This fixes #588 #612. A possible future enhancement is to support a starvation parameter so a swap can be forced when models have been waiting too long. Shared base implementation for groups and swap matrix: During the refactor it became clear that much of the swapping logic was shared between these two implementations. That is not surprising considering the swap matrix was added many moons after groups. Now they share a common base and their specific swap strategies are implemented through the swapPlanner interface. Requests for bespoke or specific swapping scenarios are a common theme in the issues. Now users can implement whatever bespoke and weird swapping strategy they want in their own fork. Just ask your agent of choice to implement swapPlanner. 
I'll still remain more conservative about what actually lands in core llama-swap and will continue to evaluate PRs on whether the changes are good for everyone or just one specific use case. AI / Agentic Disclosure: I paid very close attention to the low level swap concurrency design and implementation. It's important to keep that essential part reliable, boring and free of surprises. Backwards compatibility was also maintained, even the one way non-exclusive group model loading behaviour that people have rightly pointed out to be a weird design decision. With the underlying swap core done the web server, api and UI sitting on top were largely ported over with Claude Code and Opus 4.7 in multiple phases. If you're curious I kept the changes in docs/newrouter-todo.md. I did several passes to make sure things weren't left behind. However, even frontier LLMs at the time of this PR still make small decisions that don't make a lot of sense. They get shit wrong all the time, just in small subtle ways. That said, there are likely to be some new bugs introduced with this massive refactor. I'm fairly confident that there are no major architectural flaws that would cause goal seeking agents to make dumb, ugly code decisions. For a little while the legacy llama-swap will be available under cmd/legacy/llama-swap. The plan is to eventually delete that entry point as well as the proxy package. On a bit of a personal note, this PR is exciting and a bit sad for me. I hand wrote much of the original code and this PR ultimately replaces much of it. While the old code served as a good reference for the agent to implement the new stuff it is still a bit sad to eventually delete it all.	2026-05-28 21:47:01 -07:00
Wim Vander Schelden	e261745c66	proxy: add versionless API endpoint (#733 ) Add versionless endpoints under v/ to support upstream peers that do not use the v1/ prefix. Fixes #728.	2026-05-03 13:47:38 -07:00
Benson Wong	fd3c28ffc5	Refactor Activity Page (#710) - inference handles to store an activity record for all inference endpoints - add path, status code, and content type to Activities page - toggle on/off columns on Activities page - add configurable capture level for inference endpoints so large binary blobs are not stored in memory - store captures in compressed binary format	2026-04-28 20:33:03 -07:00
Benson Wong	f6cf9f5844	proxy: Refactor tests (#660 ) - use YAML for test configurations - remove most uses of simple-responder, opting to use process.testHandler Fixes #655	2026-04-16 22:47:42 -07:00
Benson Wong	15bd55d3a9	proxy, ui-svelte: add /sdapi/v1 endpoint support (#587 ) Add proxy routes for stable-diffusion.cpp's /sdapi/v1/txt2img, /sdapi/v1/img2img, and /sdapi/v1/loras endpoints. POST endpoints use proxyInferenceHandler (model in JSON body), GET /loras uses proxyGETModelHandler (model in query param). Update the image playground with a dual-mode UI supporting both OpenAI and SDAPI backends. In SDAPI mode, loras are fetched first to prime the server-side cache, and all txt2img parameters are exposed (negative prompt, steps, cfg_scale, seed, batch_size, clip_skip, sampler, scheduler, lora selection with multipliers). - Add 3 sdapi route registrations in proxymanager.go - Add sdApi.ts client with generateSdImage and fetchSdLoras - Add SDAPI types (SdApiTxt2ImgRequest, SdApiResponse, etc.) - Add /sdapi to vite dev proxy config - Add backend tests for sdapi routing - Support batch image display in gallery grid https://claude.ai/code/session_0186MGX6NXdHVBTv2KH45fqn --------- Co-authored-by: Claude <noreply@anthropic.com>	2026-03-19 22:08:31 +09:00
Benson Wong	cc77139ff8	proxy,proxy/config: add global TTL feature (#554) Add a new configuration parameter globalTTL that all models will inherit. The default value is 0, which matches the current behaviour of never automatically unloading a model. The model.ttl default has changed to -1, which means use the global TTL value. Any model.ttl >= 0 is now valid, with 0 meaning never unload. This allows a model to override a globalTTL > 0 and be configured to never unload. Fixes #459 Closes #512	2026-03-01 21:02:12 -08:00
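The TTL inheritance rule in #554 can be sketched as a small resolver. `effectiveTTL` is a hypothetical name, and rejecting values below -1 is an assumption; the commit message only states that -1 inherits and that any value >= 0 is valid.

```go
package main

import "fmt"

// effectiveTTL resolves the TTL that applies to a model.
// modelTTL == -1 inherits globalTTL; a result of 0 means never unload.
// Treating values below -1 as an error is an assumption for this sketch.
func effectiveTTL(globalTTL, modelTTL int) (int, error) {
	if modelTTL < -1 {
		return 0, fmt.Errorf("invalid ttl %d: must be -1 or >= 0", modelTTL)
	}
	if modelTTL == -1 {
		return globalTTL, nil
	}
	return modelTTL, nil
}

func main() {
	for _, modelTTL := range []int{-1, 0, 60} {
		ttl, err := effectiveTTL(300, modelTTL)
		fmt.Println(modelTTL, ttl, err)
	}
}
```

With a globalTTL of 300, a model left at -1 resolves to 300, a model set to 0 never unloads, and a model set to 60 keeps its own value.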
Benson Wong	19fb5f35e9	proxy: implement setParamsByID filter (#535 ) Add setParamsByID filter that applies different request parameters based on the requested model ID, enabling per-alias behaviour for a single loaded model. - add SetParamsByID field to Filters struct and SanitizedSetParamsByID method - substitute ${MODEL_ID} and other macros in setParamsByID keys and values - validate no unknown macros remain in keys or values after substitution - apply setParamsByID in proxyInferenceHandler after setParams (can override it) - update config-schema.json with setParamsByID definition - update UI to show aliases and make them selectable in the Playground closes #534	2026-02-19 22:21:10 -08:00
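A rough sketch of the setParamsByID flow from #535: expand macros in the keys, reject leftovers, then apply the per-ID parameters after setParams so they can override it. `expandKeys` and `applyParams` are hypothetical names, only the `${MODEL_ID}` macro is handled, and the plain map layout is an assumption rather than the actual Filters struct.

```go
package main

import (
	"fmt"
	"strings"
)

// expandKeys substitutes the ${MODEL_ID} macro in each key and rejects keys
// that still contain an unresolved "${" afterwards.
func expandKeys(byID map[string]map[string]any, modelID string) (map[string]map[string]any, error) {
	out := make(map[string]map[string]any, len(byID))
	for key, params := range byID {
		expanded := strings.ReplaceAll(key, "${MODEL_ID}", modelID)
		if strings.Contains(expanded, "${") {
			return nil, fmt.Errorf("unknown macro in key %q", key)
		}
		out[expanded] = params
	}
	return out, nil
}

// applyParams writes setParams into the request body first, then the
// parameters registered for requestedID, so the per-ID values win.
func applyParams(body, setParams map[string]any, byID map[string]map[string]any, requestedID string) {
	for k, v := range setParams {
		body[k] = v
	}
	for k, v := range byID[requestedID] {
		body[k] = v
	}
}

func main() {
	byID, err := expandKeys(map[string]map[string]any{
		"${MODEL_ID}":      {"temperature": 0.7},
		"${MODEL_ID}-cold": {"temperature": 0.1},
	}, "qwen")
	if err != nil {
		panic(err)
	}
	body := map[string]any{"model": "qwen-cold", "temperature": 1.0}
	applyParams(body, map[string]any{"temperature": 0.5}, byID, "qwen-cold")
	fmt.Println(body["temperature"]) // 0.1
}
```

The model name `qwen` and the `-cold` alias are made up for the example.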
Benson Wong	20738f3623	proxy,ui-svelte: replace old UI with svelte+playground Replace the legacy React UI with the new Svelte-based one. Introduce a Playground in the UI to quickly test out text, image, text to speech and speech to text models behind llama-swap. Key Changes New Svelte UI (ui-svelte/) - Multi-tab Playground with Chat, Image Generation, Audio Transcription, and Speech interfaces - Chat: message editing/regeneration, markdown rendering with LaTeX math support, image attachments, code syntax highlighting - Image: size selector, download/fullscreen viewing - Audio: transcription with peer support - Speech: voice caching with manual refresh, download button - Responsive mobile layout with collapsible navigation - XSS fixes and accessibility improvements Proxy Improvements - Add gzip/brotli compression for UI static assets (proxy/ui_compress.go) - Add GET /v1/audio/voices?model={model} endpoint for voice listing - Add peer support for /v1/audio/transcriptions	2026-01-31 22:49:13 -08:00
Benson Wong	205efd40a1	proxy: extend /running endpoint with additional process data (#474 ) Extend the /running endpoint to return more details about running processes beyond just model and state. - add cmd field to show the command being executed - add proxy field to show the proxy URL - add ttl (UnloadAfter) for automatic unloading configuration - add name and description for model metadata - update tests to verify new fields are returned correctly fixes #471	2026-01-19 17:37:00 -08:00
Benson Wong	eb5bfff0b0	proxy: unify filtering for local models and peers This unifies the filtering capabilities for models and peers - stripParams: removes params in the request - setParams: sets params in the request fixes #453	2026-01-15 18:59:43 -08:00
Benson Wong	3dc36032fb	proxy: skip very slow tests in -short test mode (#446 ) * proxy: skip very slow tests in -short test mode * CLAUDE.md: update testing instructions	2025-12-31 14:08:56 -08:00
Benson Wong	addb98646f	proxy: add support for basic authorization (#445 ) Fixes #444 where the UI with api keys did not work. The choice to use http basic authorization is for simple, automatic browser support. No changes to the UI were necessary. Just use an API key as the password, no user name is required.	2025-12-31 13:42:35 -08:00
Benson Wong	22e098ac8b	Add Peer Model Support (#438 ) This PR allows a single llama-swap to be the central proxy for models served by other inference servers. The peer servers can be another llama-swap or any API that supports the /v1/* inference endpoint. Updates: #433, #299 Closes: #296	2025-12-27 20:18:06 -08:00
Benson Wong	53b32f3601	proxy: add API key support (#436 ) Add configuration support for api keys that are enforced by llama-swap. Keys are stripped before sending them to upstream servers. Updates: #433, #50 and #251	2025-12-23 23:39:33 -08:00
Benson Wong	e6a9e210ba	proxy: fix path bug in /logs/stream/{model_id} (#431 ) A {model_id} containing a forward slash trips up gin's path param parsing. This updates /logs/stream to work like /upstream where the model_id is built up in parts and searched for in the configuration. Updates #421	2025-12-21 21:47:14 -08:00
Ryan Steed	86e9b93c37	proxy,ui: add version endpoint and display version info in UI (#395 ) - Add /api/version endpoint to ProxyManager that returns build date, commit hash, and version - Implement SetVersion method to configure version info in ProxyManager - Add version info fetching to APIProvider and display in ConnectionStatus component - Include version info in UI context and update dependencies - Add tests for version endpoint functionality	2025-11-17 10:43:47 -08:00
Ryan Steed	554d29e87d	feat: enhance model listing to include aliases (#400 ) introduce includeAliasesInList as a new configuration setting (default false) that includes aliases in v1/models Fixes #399	2025-11-15 14:35:26 -08:00
Benson Wong	e250e71e59	Include metrics from upstream chat requests (#361) * proxy: refactor metrics recording - remove metrics_middleware.go as this wrapper is no longer needed. This also eliminates double body parsing for the modelID - move metrics parsing to be part of MetricsMonitor - refactor how metrics are recorded in ProxyManager - add MetricsMonitor tests - improve mem efficiency of processStreamingResponse - add benchmarks for MetricsMonitor.addMetrics - proxy: refactor MetricsMonitor to handle errors more safely	2025-10-25 17:38:18 -07:00
David Wen Riccardi-Zhu	d58a8b85bf	Refactor to use httputil.ReverseProxy (#342 ) * Refactor to use httputil.ReverseProxy Refactor manual HTTP proxying logic in Process.ProxyRequest to use the standard library's httputil.ReverseProxy. * Refactor TestProcess_ForceStopWithKill test Update to handle behavior with httputil.ReverseProxy. * Fix gin interface conversion panic	2025-10-13 16:47:04 -07:00
Benson Wong	caf9e98b1e	Fix race conditions in proxy.Process (#349 ) - Fix data races found in proxy.Process by go's race detector. - Add data race detection to the CI tests. Fixes #348	2025-10-13 16:42:49 -07:00
Benson Wong	70930e4e91	proxy: add support for user defined metadata in model configs (#333 ) Changes: - add Metadata key to ModelConfig - include metadata in /v1/models under meta.llamaswap key - add recursive macro substitution into Metadata - change macros at global and model level to be any scalar type Note: This is the first mostly AI generated change to llama-swap. See #333 for notes about the workflow and approach to AI going forward.	2025-10-04 19:56:41 -07:00
Benson Wong	216c40b951	proxy/config: create config package and migrate configuration (#329) * proxy/config: create config package and migrate configuration The configuration is becoming more complex as llama-swap adds more advanced features. This commit moves config to its own package so it can be developed independently of the proxy package. Additionally, enforcing a public API for a configuration will allow downstream usage to be more decoupled.	2025-09-28 16:50:06 -07:00
Benson Wong	1a84926505	proxy: add unload of single model (#318 ) This adds a new API endpoint, /api/models/unload/*model, that unloads a single model. In the UI when a model is in a ReadyState it will have a new button to unload it. Fixes #312	2025-09-24 20:53:48 -07:00
Artur Podsiadły	558801db1a	Fix nginx proxy buffering for streaming endpoints (#295 ) * Fix nginx proxy buffering for streaming endpoints - Add X-Accel-Buffering: no header to SSE endpoints (/api/events, /logs/stream) - Add X-Accel-Buffering: no header to proxied text/event-stream responses - Add nginx reverse proxy configuration section to README - Add tests for X-Accel-Buffering header on streaming endpoints Fixes #236 * Fix goroutine cleanup in streaming endpoints test Add context cancellation to TestProxyManager_StreamingEndpointsReturnNoBufferingHeader to ensure the goroutine is properly cleaned up when the test completes.	2025-09-09 16:07:46 -07:00
Yandrik	977f1856bb	add /completion endpoint (#275 ) * feat: add /completion endpoint * chore: reformat using gofmt	2025-08-28 21:41:02 -07:00
Benson Wong	52b329f7bc	Fix #277 race condition in ProcessGroup.ProxyRequest when swap=true	2025-08-28 21:38:40 -07:00
Benson Wong	5dc6b3e6d9	Add barebones but working implementation of model preload (#209 , #235 ) Add barebones but working implementation of model preload * add config test for Preload hook * improve TestProxyManager_StartupHooks * docs for new hook configuration * add a .dev to .gitignore	2025-08-14 10:27:28 -07:00
Benson Wong	10569ed546	Fix model alias usage in upstream path (#230) Model alias values are now properly resolved and work in the upstream/ path. Related to #229.	2025-08-07 20:16:56 -07:00
Ben Greene	5c63e0066c	return models sorted by id in /v1/models (#222 )	2025-08-06 10:04:52 -07:00
Benson Wong	0f583163f7	add /health (#211 )	2025-07-30 10:37:10 -07:00
Benson Wong	01d4838fb3	Fix token metrics parsing (#199 ) Fix #198 - use llama-server's `timings` info if available in response body - send "-1" for token/sec when not able to accurately calculate performance - optimize streaming body search for metrics information	2025-07-22 23:10:14 -07:00
g2mt	87dce5f8f6	Add metrics logging for chat completion requests (#195 ) - Add token and performance metrics for v1/chat/completions - Add Activity Page in UI - Add /api/metrics endpoint Contributed by @g2mt	2025-07-21 22:19:55 -07:00
Benson Wong	c867a6c9a2	Add name and description to v1/models list (#179 ) * Add support for name and description in v1/models list * add configuration example for name and description	2025-06-30 23:02:44 -07:00
Benson Wong	4236cec03a	Add Filters to Model Configuration (#174 ) llama-swap can strip specific keys in JSON requests. This is useful for removing the ability for clients to set sampling parameters like temperature, top_k, top_p, etc.	2025-06-23 10:52:29 -07:00
Benson Wong	e3a0b013c1	add content length test for #131	2025-05-14 19:50:01 -07:00
Benson Wong	7f37bcc6eb	Improve testing around using SIGKILL (#127) * Add test for SIGKILL of process * silence TestProxyManager_RunningEndpoint debug output * Ref #125	2025-05-13 21:21:52 -07:00
Benson Wong	519c3a4d22	Change /unload to not wait for inflight requests (#125 ) Sometimes upstreams can accept HTTP but never respond causing requests to build up waiting for a response. This can block Process.Stop() as that waits for inflight requests to finish. This change refactors the code to not wait when attempting to shutdown the process.	2025-05-13 11:39:19 -07:00
Sam	bc652709a5	Add config hot-reload (#106 ) introduce --watch-config command line option to reload ProxyManager when configuration changes.	2025-05-11 17:37:00 -07:00
Benson Wong	21d7973d11	Improve content-length handling (#115 ) ref: See #114 * Improve content-length handling - Content length was not always being sent - Add tests for content-length	2025-05-05 10:46:26 -07:00
Benson Wong	448ccae959	Introduce Groups Feature (#107) Groups allow more control over swapping behaviour when a model is requested. The new groups feature provides three ways to control swapping: swapping within the group, swapping out other groups, or keeping the models in the group loaded persistently (never swapped out). Closes #96, #99 and #106.	2025-05-02 22:35:38 -07:00
Benson Wong	5fad24c16f	Make checkHealthTimeout Interruptable during startup (#102 ) interrupt and exit Process.start() early if the upstream process exits prematurely or unexpectedly.	2025-04-24 14:39:33 -07:00
Benson Wong	192b2ae621	Remove no longer needed test	2025-04-04 14:46:01 -07:00
Benson Wong	84e2c07a7e	Refactor wildcard out of CORS headers (#81) Changes to CORS functionality: - `Access-Control-Allow-Origin: *` is set for all requests - for pre-flight OPTIONS requests - specify methods: `Access-Control-Allow-Methods: GET, POST, PUT, PATCH, DELETE, OPTIONS` - if the client sent `Access-Control-Request-Headers` then echo back the same value in `Access-Control-Allow-Headers`. If no `Access-Control-Request-Headers` were sent, then send back a default set - set `Access-Control-Max-Age: 86400`, which may improve performance - Add CORS tests to the proxy-manager	2025-03-25 15:24:43 -07:00
Benson Wong	5c97299e7b	Add support for sending a custom model name to upstream (#69 ) (#71 ) * add test for splitRequestedModel() * Add `useModelName` parameter to model configuration * add docs to README	2025-03-14 21:07:52 -07:00
Benson Wong	3201a68a04	Add /v1/audio/transcriptions support (#41 ) * add support for /v1/audio/transcriptions	2025-03-13 13:49:39 -07:00
Florin-Gabriel Dumitru	3ac94ad20e	Adds an endpoint '/running' (#61) * Adds an endpoint '/running' that returns either an empty JSON object if no model has been loaded so far, or the last model loaded (model key) and its current state (state key). Possible state values are: stopped, starting, ready and stopping. * Improves the `/running` endpoint by allowing multiple entries under the `running` key within the JSON response. Refactors the `/running` method name (listRunningProcessesHandler). Removes the unlisted filter implementation. * Adds tests for: - no model loaded - one model loaded - multiple models loaded * Adds simple comments. * Simplified code structure as per 250313 comments on PR #65. --------- Co-authored-by: FGDumitru|B <xelotx@gmail.com>	2025-03-13 13:42:59 -07:00
Benson Wong	b3d331da0d	Properly strip profile name slug from models fixes (#62 ) The profile slug in a model name, `profile:model`, is specific to llama-swap. This strips `profile:` out of the model name request so upstreams that expect just `model` work and do not require knowing about the profile slug.	2025-03-09 12:41:52 -07:00
Benson Wong	082d5d0fc5	Add /unload endpoint (#58 ) to unload all currently running models	2025-03-03 10:33:36 -08:00
Benson Wong	09bdd86b54	Improve shutdown behaviour (#47) (#49) Introduce `Process.Shutdown()` and `ProxyManager.Shutdown()`. These two functions required a lot of internal process state management refactoring. A key benefit is that `Process.start()` is now interruptable. When `Shutdown()` is called it will break the long health check loop. State management within Process is also improved. Added `starting`, `stopping` and `shutdown` states. Additionally, introduced a simple finite state machine to manage transitions.	2025-02-05 17:19:59 -08:00
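A simple finite state machine of the kind described in #47 / #49 might look like the sketch below. The state names come from the log; the `allowed` transition table, the `Process` struct shape, and the `transition` method are assumptions for illustration only.

```go
package main

import (
	"fmt"
	"sync"
)

type State string

const (
	Stopped  State = "stopped"
	Starting State = "starting"
	Ready    State = "ready"
	Stopping State = "stopping"
	Shutdown State = "shutdown"
)

// allowed lists the permitted transitions. This table is an assumption made
// for illustration; the real transition rules may differ.
var allowed = map[State][]State{
	Stopped:  {Starting, Shutdown},
	Starting: {Ready, Stopping},
	Ready:    {Stopping},
	Stopping: {Stopped, Shutdown},
	Shutdown: {}, // terminal state
}

type Process struct {
	mu    sync.Mutex
	state State
}

// transition moves the process to `to` only if it is currently in `from`
// and the table permits the move.
func (p *Process) transition(from, to State) error {
	p.mu.Lock()
	defer p.mu.Unlock()
	if p.state != from {
		return fmt.Errorf("expected state %s, found %s", from, p.state)
	}
	for _, next := range allowed[from] {
		if next == to {
			p.state = to
			return nil
		}
	}
	return fmt.Errorf("invalid transition %s -> %s", from, to)
}

func main() {
	p := &Process{state: Stopped}
	fmt.Println(p.transition(Stopped, Starting)) // <nil>
	fmt.Println(p.transition(Starting, Stopped)) // invalid transition starting -> stopped
}
```

Checking the expected current state inside the lock is what lets a concurrent `Shutdown()` win cleanly: a stale caller gets an error instead of overwriting the newer state.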
Benson Wong	abdc2bfdb3	Fix panic when requesting non-members of profiles A panic occurs when a request for an invalid profile:model pair is made. The edge case is that the profile exists and the model exists but they're not configured as a pair. This adds an additional check to make sure the profile:model pair is valid before attempting to swap the model.	2025-01-16 12:06:38 -08:00


54 Commits