llama-swap

Author	SHA1	Message	Date
Benson Wong	eff9b60434	server: capture failed (non-200) LLM requests (#862 ) Store a request/response capture for non-200 responses so failed requests can be inspected in the activity log's Capture dialog, matching the existing behavior for successful requests. - extract storeCapture/decodeResponseBody helpers to share capture logic between the success and non-200 paths - record non-200 bodies (decompressed) so error details are viewable - the activity UI already gates the View button on has_capture, so it now appears for failed requests with no UI changes - add tests for capturing failed requests and the disabled-captures case closes #766	2026-06-20 11:50:35 -07:00
Wojciech	9bcddad91b	internal/server,ui: add new Activity page column - Drafted (#859 ) Add draft metrics to activity log	2026-06-18 20:55:02 -07:00
Benson Wong	a15e47922c	proxy: meter /upstream requests via metrics middleware (#858 ) Wrap /upstream/{upstreamPath...} in the metrics middleware so activity log entries are recorded for model-dispatched endpoints accessed through the upstream passthrough. - Move findModelInPath to shared.FindModelInPath and reuse it in handleUpstream, the log monitor lookup, and FetchContext. - Extend FetchContext to resolve the model from /upstream/<model>/... paths without consuming the request body. - Add isMetricsRecordPath to limit recording to the model-dispatched endpoints that produce token usage/timings. - Add tests for upstream metrics recording and FetchContext upstream path resolution. Fixes #855	2026-06-17 17:38:52 -07:00
George	0ab214d1c8	perf: add vendor-agnostic GPU monitoring for Windows (experimental) (#779 ) Add GPU monitoring support for AMD and Intel GPUs on Windows using D3DKMT (DirectX) and PDH performance counters. - Add PDH-based GPU utilization via \GPU Engine(*)\Utilization Percentage counter, summing all engine types per adapter (3D, Compute, Copy, Video). - Add D3DKMT bindings for adapter enumeration, memory segments, and adapter perf data. - Use PDH as primary utilization source (works on all vendors), with D3DKMT RunningTime as fallback for systems without PDH counters. - Prefer nvidia-smi when available, fall back to D3DKMT + PDH for AMD/Intel. - Backend priority: nvidia-smi -> D3DKMT + PDH -> ErrNoGpuTool. Verified on AMD 7900XTX GPU with llama.cpp Vulkan & ROCm backend: GPU utilization correctly shows ~99% during inference, ~0-2% when idle. --- LLM disclosure: GLM 5.1 & Kimi K2.6 have been used extensively during exploration and coding to the point that the LLMs wrote over 3/4 of the code, and I have done additional verification myself. As such, it should be considered experimental. Additional verification is needed. I have tested it on my 7900XTX system with Windows 11, and it works correctly, but as I only have this one rig, I cannot verify it everywhere.	2026-06-16 21:49:09 -07:00
Benson Wong	d07b063ab6	internal/server,shared: support request metadata (#850 ) - add support for http handlers in the request chain to append metadata to the request - metrics middleware will include metadata in the activity log - update Activity UI to support metadata, drag sort columns - update Activity UI capture dialog to use more screen space Updates #834	2026-06-16 21:44:55 -07:00
Benson Wong	6cf1317341	schedule,shared: move concurrency 429 limits into scheduler code (#849 ) - make concurrency limiting the scheduler.Scheduler's responsibility - eliminate the separate concurrency limit middleware - move concurrencyLimit logic into scheduler.FIFO to maintain backwards compatibility - add HTTPError from #834 Updates #834	2026-06-15 22:35:12 -07:00
Benson Wong	ed77385d08	ui: improve manual model load and cancel (#847 ) - When a model is manually loaded show a cancel button and a queued status - Implement cancellation in scheduler.Scheduler interface and FIFO scheduler - Add cache bust query parameter to bypass browser cache Fixes #844	2026-06-14 13:38:10 -07:00
Benson Wong	92b90447e8	Model capabilities 734 (#842 ) internal/config,server: implement model capabilities - define the capabilities of a model using a simple config block on the model - v1/models renders out capabilities to be compatible with openrouter, huggingface chat, and mistral formats for broader compatibility - add support for capabilities in UI Fixes #734	2026-06-13 23:23:19 -07:00
Benson Wong	62aea0e83d	internal/router,server,shared: refactor auth, libs (#839 ) - refactor shared http functionality into internal/shared/http.go - remove stripping of Authorization and x-api-key - add Request Context middleware to internal/server - add /ui and /metrics behind auth middleware, fixes #717 Fix #717 Updates: #834	2026-06-13 10:19:04 -07:00
Benson Wong	f6877b8175	main: show message when listening on network (#836 ) fixes: #739	2026-06-11 22:15:14 -07:00
Benson Wong	9b3a33d7b9	Implement new scheduler (#823 ) - introduce internal/router/scheduler to decouple routing, swapping and queuing into interface contracts. - introduce a new `routing` configuration section that supersedes `matrix` and `group` while maintaining backwards compatibility - add FIFO scheduler with prioritized queuing - add internal/router/design.md as developer documentation on implementing new schedulers and routers Fixes #797	2026-06-10 20:34:25 -07:00
Benson Wong	0cfe5a6639	Makefile,internal: fix websocket regression and other small things (#830 ) - fix websocket regression and add test to prevent in the future - fix staticcheck errors - remove proxy package remnants from Makefile fix #829	2026-06-09 21:37:53 -07:00
Benson Wong	44e1501e81	internal/process,server: fix unload regression (#828 ) In v221 the shutdown behaviour was refactored so shutdown behaviour was more reliable in stopping a process group. This exposed an existing bug where the unload API had a timeout of 0 that snuck in during the big refactor. - set a default timeout of 10 seconds for unloads called via the API - add logging around shutdown routine updates: #807, #808 fixes: #827	2026-06-09 20:49:58 -07:00
Benson Wong	29d3d9ba20	perf: add macOS GPU monitoring via mactop and ioreg (#816 ) Implement performance monitoring on OSX for Apple Silicon hardware. The implementation uses a combination of mactop and ioreg. If mactop is installed (`brew install mactop`) it is used in a headless cli mode to stream usage metrics. mactop hooks into unpublished(?) C based APIs in OSX. Rather than introduce a cgo dependency into llama-swap's build chain only for darwin I opted to go the external process route. ioreg, which comes bundled with OSX is used as the fallback. It does not provide temperature and power usage data but is able to show accurate GPU and memory utilization. Updates #771, #814	2026-06-03 21:51:03 -07:00
Benson Wong	9be9a87fa0	internal/process: improve windows shutdown behaviour (#808 ) Add Windows specific shutdown code paths so stopping of child processes is more reliable: - stopping llama-swap won't leave behind any child processes it created - uses Job Objects in Windows so the whole llama-swap tree is closed by the os - add procCtx to baseRouter. It replaces shutdownCtx as a signal for managing lifetime state. - shutdownCtx is only used by the router to stop handling new requests during shutdown - improve debug logging to make it easier to trace source of issues Fixes #804 Updates #807	2026-06-01 00:45:30 -07:00
Benson Wong	6ea551362e	process,router: make model shutdown and load-streaming robust Note: The original proxy/process_unix.go had a noop for setProcAttributes so it also did not stop grandchildren processes. This patch adds that capability and improves reliability. -- Stop() no longer hangs on a shell wrapper that forks the real binary. The upstream is built with exec.CommandContext + cmd.Cancel + cmd.WaitDelay, so cmd.Wait() returns even when a forked grandchild inherits the stdout/stderr pipes. killProcess sends the stop signal directly (not by cancelling the context) so cmd.WaitDelay measures from process exit and never silently caps the caller's graceful timeout. The upstream is also started in its own process group (Setpgid) on Unix, so the graceful SIGTERM — and the SIGKILL escalation after the timeout — are delivered to the whole group via the negative PID. A forked grandchild is reaped with its parent instead of leaking as an orphan. The loading-spinner SSE goroutine can no longer panic when it outlives the request. net/http recycles the response writer via Reset(nil) once ServeHTTP returns; the orphaned goroutine then flushed against a nil-backed writer and crashed with a SIGSEGV. A release() fence on loadingWriter lets any in-flight write finish then short-circuits later writes/flushes, and all three ServeHTTP select branches run a finishLoading helper (cancelLoad, waitForCompletion, release) before the writer is reclaimed. - internal/process: exec.CommandContext + WaitDelay, Setpgid process groups, group-wide SIGTERM/SIGKILL teardown - internal/router: release() fence + finishLoading on loadingWriter fixes #804	2026-05-31 10:11:12 -07:00
Luiszzzor	c790d0ee03	fix: update the concurrency middleware to respond with a JSON payload (#798 ) update the concurrency middleware to respond with a JSON payload instead of plain text when the request limit is reached to be compatible with openai api standard --------- Co-authored-by: Ludwik <l.czarnota@samsung.com>	2026-05-29 23:59:32 -07:00
Benson Wong	4ca9c478a2	Makefile,internal/server: various release tweaks	2026-05-29 15:27:08 -07:00
Benson Wong	02e015fa49	Introduce new routing backend (#790 ) This is a huge backend change that essentially started with rewriting the concurrency handling for processes and blew up to a refactor of the entire application. In short these are the improvements: Better state and life cycle management: Life cycle management of processes has always been the trickiest part of the code. Juggling mutex locks between multiple locations to reduce race conditions was complex. Too complex for my feeble brain to build a simple mental model around as llama-swap gained more features. All of that has been refactored. Most of the locks are gone, replaced with a single run() that owns all state changes. There is one place to start from now to understand and extend routing logic. The improved life cycle management makes it easier to implement more complex swap optimization strategies in the future like #727. Collation of requests: llama-swap previously handled requests and swapping in the order they came in. For example requests for models in this order ABCABC would result in 5 swaps. Now those requests are handled in this order AABBCC. The result is less time waiting for swap under a high churn request queue. This fixes #588 #612. A possible future enhancement is to support a starvation parameter so swap can be forced when models have been waiting too long. Shared base implementation for groups and swap matrix: During the refactor it became clear that much of the swapping logic was shared between these two implementations. That is not surprising considering the swap matrix was added many moons after groups. Now they share a common base and their specific swap strategies are implemented into the swapPlanner interface. Requests for bespoke or specific swapping scenarios are a common theme in the issues. Now users can implement whatever bespoke and weird swapping strategy they want in their own fork. Just ask your agent of choice to implement swapPlanner. I'll still remain more conservative on what actually lands in core llama-swap and will continue to evaluate PRs on whether the changes are good for everyone or just one specific use case. AI / Agentic Disclosure: I paid very close attention to the low level swap concurrency design and implementation. It's important to keep that essential part reliable, boring and no surprises. Backwards compatibility was also maintained, even the one way non-exclusive group model loading behaviour that people have rightly pointed out to be a weird design decision. With the underlying swap core done the web server, api and UI sitting on top were largely ported over with Claude Code and Opus 4.7 in multiple phases. If you're curious I kept the changes in docs/newrouter-todo.md. I did several passes to make sure things weren't left behind. However, even frontier LLMs at the time of this PR still make small decisions that don't make a lot of sense. They get shit wrong all the time, just in small subtle ways. That said, there's likely to be some new bugs introduced with this massive refactor. I'm fairly confident that there are no major architectural flaws that would cause goal seeking agents to make dumb, ugly code decisions. For a little while the legacy llama-swap will be available under cmd/legacy/llama-swap. The plan is to eventually delete that entry point as well as the proxy package. On a bit of a personal note, this PR is exciting and a bit sad for me. I hand wrote much of the original code and this PR ultimately replaces much of it. While the old code served as a good reference for the agent to implement the new stuff it is still a bit sad to eventually delete it all.	2026-05-28 21:47:01 -07:00
Cr4xy	63bc266395	Add new power draw column header for rocm-smi monitoring (#788 ) # Overview This patch fixes https://github.com/mostlygeek/llama-swap/pull/775#issuecomment-4535303706 and removes some unnecessary `break` statements. ## The third variant now also works with power draw: ` device,Device Name,Device ID,Device Rev,Subsystem ID,GUID,Temperature (Sensor edge) (C),Temperature (Sensor junction) (C),Temperature (Sensor memory) (C),Average Graphics Package Power (W),GPU use (%),GPU Memory Allocated (VRAM%),GPU Memory Read/Write Activity (%),Memory Activity,Avg. Memory Bandwidth,VRAM Total Memory (B),VRAM Total Used Memory (B),Card Series,Card Model,Card Vendor,Card SKU,Node ID,GFX Version ` ## Old variants: ` device,Device Name,Device ID,Device Rev,Subsystem ID,GUID,Temperature (Sensor edge) (C),Temperature (Sensor junction) (C),Temperature (Sensor memory) (C),Fan speed (level),Fan speed (%),Fan RPM,Current Socket Graphics Package Power (W),GPU use (%),GPU Memory Allocated (VRAM%),GPU Memory Read/Write Activity (%),Memory Activity,VRAM Total Memory (B),VRAM Total Used Memory (B),Card Series,Card Model,Card Vendor,Card SKU,Node ID,GFX Version ` ` device,Device Name,Device ID,Device Rev,Subsystem ID,GUID,Temperature (Sensor edge) (C),Current Socket Graphics Package Power (W),GPU use (%),GPU Memory Allocated (VRAM%),Memory Activity,VRAM Total Memory (B),VRAM Total Used Memory (B),Card Series,Card Model,Card Vendor,Card SKU,Node ID,GFX Version `	2026-05-25 11:36:16 -07:00
Cr4xy	636b53e70f	Improve rocm-smi performance monitoring (#775 ) Fix hardcoded indices for rocm-smi.	2026-05-20 17:59:49 -07:00
gatkisson	59cd3b690d	Added Windows performance monitoring using nvidia-smi (#773 ) updates: #596, #771	2026-05-18 11:02:03 -07:00
knguyen298	79dc87f881	Add ROCm stats via rocm-smi (#767 ) Add ROCm GPU stats support using `rocm-smi`.	2026-05-17 07:58:26 -07:00
cdwaage	6a9c4efc8f	fix: use --loop instead of -loop for nvidia-smi (driver 540+ compat) (#759 )	2026-05-15 13:20:29 -07:00
Benson Wong	a4b91e08cf	Changes and fixes before the release (docs/small tweaks) (#750 ) - update README.md with new docker instructions - update docs/configuration.md - update .github/workflows to have pinned action versions - gofmt events package - fix small bugs in CI scripts - reduce config options for internal/perf/monitor and config. A ring buffer is used to keep 1hr of entries at max 5s granularity. For long term stats use prometheus monitoring on /metrics Fixes #744	2026-05-13 21:18:19 -07:00
David Soušek	3e3646f9f9	perf: ignore LACT devices reporting zero VRAM (#753 ) Ignore LACT devices that report zero total VRAM. Some virtual GPUs on headless VMs report `MemTotalMB == 0` through LACT, which makes them appear in performance monitoring despite not providing useful memory data. Skip those entries so only usable GPU devices are reported. This makes performance monitoring cleaner on headless VMs with virtual GPUs that report zero VRAM. Co-authored-by: David Soušek <david.sousek@intelogy.co.uk>	2026-05-13 10:03:54 -07:00
Benson Wong	7e3e94a08a	proxy,ui: add performance monitoring with Prometheus metrics (#743 ) Add a comprehensive performance monitoring system that collects CPU, memory, swap, load average, network IO, and GPU stats. Provides both a REST API for the UI and a Prometheus /metrics endpoint. Backend changes: - New internal/perf package with configurable interval-based stats collection - GPU monitoring via LACT (Unix socket) and nvidia-smi fallback on Linux - Ring buffer (internal/ring) for time-series stat storage - Prometheus /metrics endpoint with all system and GPU metrics - Moved LogMonitor to internal/logmon package - New PerformanceConfig for hot-reloadable monitoring settings - REST /api/performance endpoint replacing SSE streaming UI changes: - New Performance page with real-time charts for CPU, memory, GPU, and network - Reusable PerformanceChart component - LLAMA_SWAP_URL environment variable support - Improved capture dialog display Other: - Example Grafana dashboard for Prometheus metrics - monitor-test standalone binary - Config schema and example updates fixes #596	2026-05-09 13:29:22 -07:00

27 Commits