llama-swap

Author	SHA1	Message	Date
Benson Wong	02e015fa49	Introduce new routing backend (#790 ) This is a huge backend change that essentially started with rewriting the concurrency handling for processes and blew up to a refactor of the entire application. In short these are the improvements: Better state and life cycle management: Life cycle management of processes has always been the trickiest part of the code. Juggling mutex locks between multiple locations to reduce race conditions was complex. Too complex for my feeble brain to build a simple mental model around as llama-swap gained more features. All of that has been refactored. Most of the locks are gone, replaced with a single run() that owns all state changes. There is one place to start from now to understand and extend routing logic. The improved life cycle management makes it easier to implement more complex swap optimization strategies in the future like #727. Collation of requests: llama-swap previously handled requests and swapping in the order they came in. For example requests for models in this order ABCABC would result in 5 swaps. Now those requests are handled in this order AABBCC. The result is less time waiting for swap under a high churn request queue. This fixes #588 #612. A possible future enhancement is to support a starvation parameter so swap can be forced when models have been waiting too long. Shared base implementation for groups and swap matrix: During the refactor it became clear that much of the swapping logic was shared between these two implementations. That is not surprising considering the swap matrix was added many moons after groups. Now they share a common base and their specific swap strategies are implemented into the swapPlanner interface. Requests for bespoke or specific swapping scenarios is a common theme in the issues. Now users can implement whatever bespoke and weird swapping strategy they want in their own fork. Just ask your agent of choice to implement swapPlanner. I'll still remaining more conservative on what actually lands in core llama-swap and will continue to evaluate PRs if the changes is good for everyone or just one specific use case. AI / Agentic Disclosure: I paid very close attention to the low level swap concurrency design and implementation. It's important to keep that essential part reliable, boring and no surprises. Backwards compatibility was also maintained, even the one way non-exclusive group model loading behaviour that people have rightly pointed out be a weird design decision. With the underlying swap core done the web server, api and UI sitting on top were largely ported over with Claude Code and Opus 4.7 in multiple phases. If you're curious I kept the changes in docs/newrouter-todo.md. I did several passes to make sure things weren't left behind. However, even frontier LLMs at the time of this PR still make small decisions that don't make a lot of sense. They get shit wrong all the time, just in small subtle way. That said, there's likely to be some new bugs introduced with this massive refactor. I'm fairly confident that there's no major architectural flaws that would cause goal seeking agents to make dumb, ugly code decisions. For a little while the legacy llama-swap will be available under cmd/legacy/llama-swap. The plan is to eventually delete that entry point as well as the proxy package. On a bit of a personal note, this PR is exciting and a bit sad for me. I hand wrote much of the original code and this PR ultimately replaces much of it. While the old code served as a good reference for the agent to implement the new stuff it still a bit sad to eventually delete it all.	2026-05-28 21:47:01 -07:00
Benson Wong	7e3e94a08a	proxy,ui: add performance monitoring with Prometheus metrics (#743 ) Add a comprehensive performance monitoring system that collects CPU, memory, swap, load average, network IO, and GPU stats. Provides both a REST API for the UI and a Prometheus /metrics endpoint. Backend changes: - New internal/perf package with configurable interval-based stats collection - GPU monitoring via LACT (Unix socket) and nvidia-smi fallback on Linux - Ring buffer (internal/ring) for time-series stat storage - Prometheus /metrics endpoint with all system and GPU metrics - Moved LogMonitor to internal/logmon package - New PerformanceConfig for hot-reloadable monitoring settings - REST /api/performance endpoint replacing SSE streaming UI changes: - New Performance page with real-time charts for CPU, memory, GPU, and network - Reusable PerformanceChart component - LLAMA_SWAP_URL environment variable support - Improved capture dialog display Other: - Example Grafana dashboard for Prometheus metrics - monitor-test standalone binary - Config schema and example updates fixes #596	2026-05-09 13:29:22 -07:00
Ron M	a37b4866d8	proxy: add configurable HTTP timeouts for models and peers (#619 ) Add configurable HTTP timeout settings to both models and peers to support installations that requires longer timeouts than the current hardcoded defaults. Closes #618	2026-04-06 19:30:27 +08:00
Benson Wong	cc77139ff8	proxy,proxy/config: add global TTL feature (#554 ) Add a new configuration parameter globalTTL that all models will inherit. The default value is 0 which matches the currently functionality to never automatically unload a model. The model.ttl's default has changed to -1, which means use the global TTL value. Any model.ttl >=0 is now value with 0 meaning never unload. This allows a model to override a globalTTL > 0 and be configured to never unload. Fixes #459 Closes #512	2026-03-01 21:02:12 -08:00
Benson Wong	3dc36032fb	proxy: skip very slow tests in -short test mode (#446 ) * proxy: skip very slow tests in -short test mode * CLAUDE.md: update testing instructions	2025-12-31 14:08:56 -08:00
Benson Wong	f852689104	proxy: add panic recovery to Process.ProxyRequest (#363 ) Switching to use httputil.ReverseProxy in #342 introduced a possible panic if a client disconnects while streaming the body. Since llama-swap does not use http.Server the recover() is not automatically there. - introduce a recover() in Process.ProxyRequest to recover and log the event - add TestProcess_ReverseProxyPanicIsHandled to reproduce and test the fix fixes: #362	2025-10-25 20:40:05 -07:00
David Wen Riccardi-Zhu	d58a8b85bf	Refactor to use httputil.ReverseProxy (#342 ) * Refactor to use httputil.ReverseProxy Refactor manual HTTP proxying logic in Process.ProxyRequest to use the standard library's httputil.ReverseProxy. * Refactor TestProcess_ForceStopWithKill test Update to handle behavior with httputil.ReverseProxy. * Fix gin interface conversion panic	2025-10-13 16:47:04 -07:00
Benson Wong	216c40b951	proxy/config: create config package and migrate configuration (#329 ) * proxy/config: create config package and migrate configuration The configuration is become more complex as llama-swap adds more advanced features. This commit moves config to its own package so it can be developed independently of the proxy package. Additionally, enforcing a public API for a configuration will allow downstream usage to be more decoupled.	2025-09-28 16:50:06 -07:00
Yathi	a906cd459b	Strip comments before macro expansion in config (#193 ) A bug fix that ensures comments don't interfere with macro expansion by removing them first. This prevents unwanted comment text from appearing in the final expanded command. Co-authored-by: Yathiraj Bollimbala G <yathi@yStudio.localdomain>	2025-07-15 10:14:16 -07:00
Benson Wong	49035e2e8e	Append custom env vars instead of replace in Process (#171 ) Append custom env vars instead of replace in Process (#168, #169) PR #162 refactored the default configuration code. This introduced a subtle bug where `env` became `[]string{}` instead of the default of `nil`. In golang, `exec.Cmd.Env == nil` means to use the "current process's environment". By setting it to `[]string{}` as a default the Process's environment was emptied out which caused an array of strange and difficult to troubleshoot behaviour. See issues #168 and #169 This commit changes the behaviour to append model configured environment variables to the default list rather than replace them.	2025-06-18 11:09:13 -07:00
Benson Wong	4fa12a429c	Refactor all default config values into config.go (#162 ) - Move all default values into one place. - Update tests to be more cross platform	2025-06-15 12:32:00 -07:00
Benson Wong	2dc0ca0663	improve llama-swap upstream process recovery and restarts (#155 ) Refactor internal upstream process life cycle management to recover better from unexpected situations. With this change llama-swap should never need to be restarted due to a crashed upstream child process. The `StateFailed` state was removed in favour of always trying to start/restart a process.	2025-06-05 16:24:55 -07:00
Benson Wong	a8b81f2799	Add stopCmd for custom stopping instructions (#136 ) Allow configuration of how a model is stopped before swapping. Setting `cmdStop` in the configuration will override the default behaviour and enables better integration with other process/container managers like docker or podman.	2025-05-16 13:48:42 -07:00
Benson Wong	d7b390df74	Add GH Action for Testing on Windows (#132 ) * Add windows specific test changes * Change the command line parsing library - Possible breaking changes for windows users!	2025-05-14 21:51:53 -07:00
Benson Wong	7f37bcc6eb	Improve testing around using SIGKILL (#127 ) * Add test for SIGKILL of process * silent TestProxyManager_RunningEndpoint debug output * Ref #125	2025-05-13 21:21:52 -07:00
Benson Wong	519c3a4d22	Change /unload to not wait for inflight requests (#125 ) Sometimes upstreams can accept HTTP but never respond causing requests to build up waiting for a response. This can block Process.Stop() as that waits for inflight requests to finish. This change refactors the code to not wait when attempting to shutdown the process.	2025-05-13 11:39:19 -07:00
Benson Wong	9dc4bcb46c	Add a concurrency limit to Process.ProxyRequest (#123 )	2025-05-12 18:12:52 -07:00
Benson Wong	06eda7f591	tag all process logs with its ID (#103 ) Makes identifying Process of log messages easier	2025-04-25 12:58:25 -07:00
Benson Wong	5fad24c16f	Make checkHealthTimeout Interruptable during startup (#102 ) interrupt and exit Process.start() early if the upstream process exits prematurely or unexpectedly.	2025-04-24 14:39:33 -07:00
Benson Wong	b8f888f864	Logging Improvements (#88 ) This change revamps the internal logging architecture to be more flexible and descriptive. Previously all logs from both llama-swap and upstream services were mixed together. This makes it harder to troubleshoot and identify problems. This PR adds these new endpoints: - `/logs/stream/proxy` - just llama-swap's logs - `/logs/stream/upstream` - stdout output from the upstream server	2025-04-04 21:01:33 -07:00
Benson Wong	d625ab8d92	Refactor process state management (#70 ) (#73 ) * add isValidStateTransition helper function * Replace Process.setState() with Process.swapState() * Refactor locking logic in Process	2025-03-15 17:14:03 -07:00
Benson Wong	9b2ed244e2	Improve Continuous integration and fix concurrency bugs (#66 ) - improvements to the continuous GH actions - fix edge case concurrency bugs with Process.start() and state transitions discovered setting up CI.	2025-03-11 10:39:14 -07:00
Benson Wong	09bdd86b54	Improve shutdown behaviour (#47 ) (#49 ) Introduce `Process.Shutdown()` and `ProxyManager.Shutdown()`. These two function required a lot of internal process state management refactoring. A key benefit is that `Process.start()` is now interruptable. When `Shutdown()` is called it will break the long health check loop. State management within Process is also improved. Added `starting`, `stopping` and `shutdown` states. Additionally, introduced a simple finite state machine to manage transitions.	2025-02-05 17:19:59 -08:00
Benson Wong	2833517eef	Improve handling of process that do not handle SIGTERM (#38 ) - Process TTL goroutine did not have a return after .Stop() - Improve logging - Add test TestProcess_LowTTLValue to measure SIGTERM error rate	2025-01-20 14:39:52 -08:00
Benson Wong	5fbd53c616	delay TTL check until after all requests are complete (#25 ) - fixes #25 where requests that last longer than the TTL will cause the process to be unloaded before the next request. - new behavior, TTL waits until all requests are complete before checking timeout	2024-12-09 19:08:03 -08:00
Benson Wong	da46545630	fix profile example in README	2024-12-01 10:13:31 -08:00
Benson Wong	cf82b3c633	Improve Concurrency and Parallel Request Handling (#19 ) Rewrite the swap behaviour so that in-flight requests block process swapping until they are completed. Additionally: - add tests for parallel requests with proxy.ProxyManager and proxy.Process - improve Process startup behaviour and simplified the code - stopping of processes are sent SIGTERM and have 5 seconds to terminate, before they are killed	2024-11-30 15:24:42 -08:00
Ikko Eltociear Ashimine	9a81c53664	chore: update process_test.go (#17 ) nonexistant -> nonexistent	2024-11-26 10:20:16 -08:00
Benson Wong	73ad85ea69	Implement Multi-Process Handling (#7 ) Refactor code to support starting of multiple back end llama.cpp servers. This functionality is exposed as `profiles` to create a simple configuration format. Changes: * refactor proxy tests to get ready for multi-process support * update proxy/ProxyManager to support multiple processes (#7) * Add support for Groups in configuration * improve handling of Model alias configs * implement multi-model swapping * improve code clarity for swapModel * improve docs, rename groups to profiles in config	2024-11-23 19:45:13 -08:00
Benson Wong	533162ce6a	add support for automatically unloading a model (#10 ) (#14 ) * Make starting upstream process on-demand (#10) * Add automatic unload of model after TTL is reached * add `ttl` configuration parameter to models in seconds, default is 0 (never unload)	2024-11-19 16:32:51 -08:00
Benson Wong	a33ac6f8fb	update README	2024-11-18 15:37:50 -08:00
Benson Wong	e5c909ddf7	add tests for proxy.Process	2024-11-17 20:49:14 -08:00

32 Commits