llama-swap

Author	SHA1	Message	Date
Benson Wong	fe71e8a6ea	proxy,ui-svelte: improve support for v1/messages and v1/responses (#758) This improves the support for activity logging from the v1/responses and v1/messages endpoints. - add chat endpoint selection to Playground > Chat > Settings - improve metrics extraction for streaming v1/messages and v1/responses endpoints (tested with llama-server) Fixes #742	2026-05-14 21:53:57 -07:00
Benson Wong	a4b91e08cf	Changes and fixes before the release (docs/small tweaks) (#750) - update README.md with new docker instructions - update docs/configuration.md - update .github/workflows to have pinned action versions - gofmt events package - fix small bugs in CI scripts - reduce config options for internal/perf/monitor and config. A ring buffer is used to keep 1hr of entries at max 5s granularity. For long term stats use prometheus monitoring on /metrics Fixes #744	2026-05-13 21:18:19 -07:00
Abdulazez A.	085b54bc88	proxy: fix data race in /running endpoint and typo in error message (#748) ## Problem The `/running` endpoint in `listRunningProcessesHandler` reads `process.state` directly without holding `stateMutex`. Meanwhile, `swapState()` writes to `process.state` while holding the write lock. This is a data race flagged by the Go race detector. Also fixes a minor typo: "processes was in state" → "process was in state". ## Fix - `proxymanager.go`: Replace `process.state` with `process.CurrentState()` which acquires `stateMutex.RLock()` before reading. - `process.go`: Fix typo in error message. ## Verification - `gofmt -l` — clean - `go test -run "TestProcessGroup_\|TestProxyManager_" ./proxy/` — all pass - `go test ./proxy/config/... ./proxy/cache/... ./proxy/configwatcher/...` — all pass	2026-05-11 12:49:18 -07:00
Benson Wong	7e3e94a08a	proxy,ui: add performance monitoring with Prometheus metrics (#743) Add a comprehensive performance monitoring system that collects CPU, memory, swap, load average, network IO, and GPU stats. Provides both a REST API for the UI and a Prometheus /metrics endpoint. Backend changes: - New internal/perf package with configurable interval-based stats collection - GPU monitoring via LACT (Unix socket) and nvidia-smi fallback on Linux - Ring buffer (internal/ring) for time-series stat storage - Prometheus /metrics endpoint with all system and GPU metrics - Moved LogMonitor to internal/logmon package - New PerformanceConfig for hot-reloadable monitoring settings - REST /api/performance endpoint replacing SSE streaming UI changes: - New Performance page with real-time charts for CPU, memory, GPU, and network - Reusable PerformanceChart component - LLAMA_SWAP_URL environment variable support - Improved capture dialog display Other: - Example Grafana dashboard for Prometheus metrics - monitor-test standalone binary - Config schema and example updates fixes #596	2026-05-09 13:29:22 -07:00
Wim Vander Schelden	e261745c66	proxy: add versionless API endpoint (#733) Add versionless endpoints under v/ to support upstream peers that do not use the v1/ prefix. Fixes #728.	2026-05-03 13:47:38 -07:00
Marcus	c79114d40a	proxy: fix logger not checking matrix for processes Fix matrix not being used to search for a logger causing /logs/stream/model_name to return an error	2026-05-01 16:43:20 -07:00
Benson Wong	430166d5eb	proxy: fix zero duration for non streaming responses (#723) Updates #654	2026-04-30 19:51:28 -07:00
Marcus	5b4beaceef	fix: ?no-history flag and improve /logs monitoring docs (#721) - improve logging documentation - small tweaks for edge case issues in upstream and log requests	2026-04-30 00:50:36 -07:00
Benson Wong	fd3c28ffc5	Refactor Activity Page (#710) - inference handles to store an activity record for all inference endpoints - add path, status code, and content type to Activities page - toggle columns on/off on the Activities page - add configurable capture level for inference endpoints so large binary blobs are not stored in memory - store captures in compressed binary format	2026-04-28 20:33:03 -07:00
Quentin Machu	a846c4f18c	config: remove hard cap on macro length (#718) Remove macro value limit of 1024 characters	2026-04-28 13:32:54 -07:00
Benson Wong	66639e83f7	proxy: replace fsnotify with stat-poll watcher and add SIGHUP reload (#685) The fsnotify-based config watcher does not work reliably when the config file is bind-mounted into a Docker container as an individual file, and mishandles k8s ConfigMap projections (atomically swapped symlinks). Replace it with a small os.Stat-polling watcher and add SIGHUP as an explicit reload signal. - new proxy/configwatcher package: 2s os.Stat poller, follows symlinks, fires on mtime/size change and on missing -> present transitions - SIGHUP triggers reload unconditionally (works without --watch-config) via the same ConfigFileChangedEvent pipeline so the UI sees identical state transitions - watcher goroutine now exits cleanly on shutdown via a context - drop github.com/fsnotify/fsnotify dependency fixes #682	2026-04-21 23:21:48 -07:00
Benson Wong	231e62291c	proxy: fix matrix race and process stop bug (#677) - matrix.go change logic to consider any proxy.Process not in StateStopped or StateShutdown - process.StopImmediately, and Stop() which called it had a subtle bug where it only handled state transitions from StateReady to StateStopping. StateStarting -> StateStopping was ignored completely. fix: #670	2026-04-20 00:21:11 -07:00
Benson Wong	5e3c646829	proxy: compress captures with zstd (#668) The previous captures were saved uncompressed in memory. In agentic workflows there can be many turns with each request containing the previous context in the body with a lot of redundant data. Use zstd to compress the request and response data before keeping a copy in memory. Results: - Average Percentage Saved: 73.19% - Average Compression Factor: ~6.77:1	2026-04-17 23:29:37 -07:00
Benson Wong	c3f0d43e6e	proxy: fix race conditions during swap (#667) I pointed Opus 4.7 (high effort) at proxy.ProcessGroup to identify any race conditions in the swapping code. It found a race condition where there is a small window in the fast path for routing a request to a loaded model. There is a very small window where: - model M1 is loaded and ready for requests - a request, R1, for M1 comes in - a request, R2, for M2 comes in almost immediately after - R1 acquires the lock, sees M1 is loaded (fast path), releases the lock `[race window]` and the request is ready to be forwarded - the race window occurs between the release of the lock and the request being forwarded - the lock is released so requests can be handled concurrently - R2 comes in within the `[race window]`, acquires the lock, triggers a model swap to M2, stopping M1 - R1 is forwarded to a model that is unloaded or in the process of shutting down, creating an error response In deployed systems the race window is very small and doesn't happen often. However with #635 and PR #656 I thought this deserved a bit more attention. It is not concluded that this race is the cause of #635 but the race is likely to happen more often under sustained or high load. AI Note: Opus 4.7 x-high effort took about an hour to write the original patch. With the pattern discovered the fix to matrix.go was very quick. GLM 5.1 using the previously established patterns was able to easily write the fix for ProcessGroup.StopProcesses(). Supersedes: #656 Updates: #277, #635	2026-04-17 21:23:17 -07:00
Benson Wong	f6cf9f5844	proxy: Refactor tests (#660) - use YAML for test configurations - remove most uses of simple-responder, opting to use process.testHandler Fixes #655	2026-04-16 22:47:42 -07:00
Benson Wong	35193f82f1	proxy: add swap matrix with solver-based model swapping (#646) Add a new swap matrix to supersede groups for running concurrent models. The matrix uses a solver that picks the lowest cost evictions to make a requested model available. This simple approach along with a very basic DSL grammar can enable very complex swapping scenarios. - add DSL parser for set expressions with & (AND), | (OR), (), +ref - add MatrixConfig structs, validation, and topological sort for +ref - add MatrixSolver with cost-minimizing swap decisions - add Matrix runtime integrating solver with Process lifecycle - integrate matrix into ProxyManager with if-branches at all endpoints - update config.example.yaml and config-schema.json with matrix schema - config enforces groups XOR matrix (cannot use both) fixes #643	2026-04-14 21:55:30 -07:00
Benson Wong	a9d840ffd7	proxy,proxy/config: restore timeouts to pre PR 619 (#648) Reset the default ResponseHeader timeout to 0 (no timeout) which was set to 60 seconds in PR #619. Fixes #647	2026-04-11 20:42:13 -07:00
Leoy	06bc6a614c	proxy: preserve wall-clock duration in metrics (#629) Keep request duration from being underreported when upstream timings only cover part of the full request lifecycle. - compare wall-clock and upstream timing durations - keep token and throughput values from timings - add regression coverage for underreported timings fixes #602	2026-04-07 01:52:41 -07:00
Ron M	a37b4866d8	proxy: add configurable HTTP timeouts for models and peers (#619) Add configurable HTTP timeout settings to both models and peers to support installations that require longer timeouts than the current hardcoded defaults. Closes #618	2026-04-06 19:30:27 +08:00
Benson Wong	15bd55d3a9	proxy, ui-svelte: add /sdapi/v1 endpoint support (#587) Add proxy routes for stable-diffusion.cpp's /sdapi/v1/txt2img, /sdapi/v1/img2img, and /sdapi/v1/loras endpoints. POST endpoints use proxyInferenceHandler (model in JSON body), GET /loras uses proxyGETModelHandler (model in query param). Update the image playground with a dual-mode UI supporting both OpenAI and SDAPI backends. In SDAPI mode, loras are fetched first to prime the server-side cache, and all txt2img parameters are exposed (negative prompt, steps, cfg_scale, seed, batch_size, clip_skip, sampler, scheduler, lora selection with multipliers). - Add 3 sdapi route registrations in proxymanager.go - Add sdApi.ts client with generateSdImage and fetchSdLoras - Add SDAPI types (SdApiTxt2ImgRequest, SdApiResponse, etc.) - Add /sdapi to vite dev proxy config - Add backend tests for sdapi routing - Support batch image display in gallery grid https://claude.ai/code/session_0186MGX6NXdHVBTv2KH45fqn --------- Co-authored-by: Claude <noreply@anthropic.com>	2026-03-19 22:08:31 +09:00
Benson Wong	c3c258a55d	proxy: fix metrics capture for v1/responses (#586) properly parse anthropic compatible usage data from streaming responses. closes: #577	2026-03-13 16:50:12 -07:00
Benson Wong	24efdb76b1	config: add macro support for name and description fields (#578) Extend macro substitution to the name and description fields of ModelConfig, matching the behavior already present for cmd, proxy, checkEndpoint, and filters. - substitute global/model macros (including MODEL_ID) in name and description - substitute PORT macro in name and description when allocated - validate no unknown macros remain in name and description after substitution - add tests for macro substitution, MODEL_ID, and unknown macro error	2026-03-10 08:27:05 -07:00
Benson Wong	cc77139ff8	proxy,proxy/config: add global TTL feature (#554) Add a new configuration parameter globalTTL that all models will inherit. The default value is 0, which matches the current behavior of never automatically unloading a model. The model.ttl's default has changed to -1, which means use the global TTL value. Any model.ttl >= 0 is now valid, with 0 meaning never unload. This allows a model to override a globalTTL > 0 and be configured to never unload. Fixes #459 Closes #512	2026-03-01 21:02:12 -08:00
Benson Wong	19fb5f35e9	proxy: implement setParamsByID filter (#535) Add setParamsByID filter that applies different request parameters based on the requested model ID, enabling per-alias behaviour for a single loaded model. - add SetParamsByID field to Filters struct and SanitizedSetParamsByID method - substitute ${MODEL_ID} and other macros in setParamsByID keys and values - validate no unknown macros remain in keys or values after substitution - apply setParamsByID in proxyInferenceHandler after setParams (can override it) - update config-schema.json with setParamsByID definition - update UI to show aliases and make them selectable in the Playground closes #534	2026-02-19 22:21:10 -08:00
Brian Mendonca	1688bdd1e9	proxy, ui: add pending requests count to the main dashboard (#516) add a real time counter of pending (inflight) requests to the UI.	2026-02-16 09:41:15 -08:00
Benson Wong	8d6d949ec3	proxy: support timings for /infill from llama-server (#510) fixes: #463	2026-02-07 17:16:27 -08:00
Benson Wong	b5fde8eb6d	proxy,ui-svelte: add request/response capturing (#508) Add saving request and response headers and bodies that go through llama-swap in memory. - captureBuffer added to configuration. Captures are enabled by default. - 5MB of memory is allocated for req/response captures in a ring buffer. Setting captureBuffer to 0 will disable captures. - UI elements to view captured data added to Activity page. Includes some QOL features like json formatting and recombining SSE chat streams - capture saving is done at the byte level and has minimal impact on llama-swap performance Fixes #464 Ref #503	2026-02-07 15:40:01 -08:00
Benson Wong	20738f3623	proxy,ui-svelte: replace old UI with svelte+playground Replace the legacy React UI with the new Svelte-based one. Introduce a Playground in the UI to quickly test out text, image, text to speech and speech to text models behind llama-swap. Key Changes New Svelte UI (ui-svelte/) - Multi-tab Playground with Chat, Image Generation, Audio Transcription, and Speech interfaces - Chat: message editing/regeneration, markdown rendering with LaTeX math support, image attachments, code syntax highlighting - Image: size selector, download/fullscreen viewing - Audio: transcription with peer support - Speech: voice caching with manual refresh, download button - Responsive mobile layout with collapsible navigation - XSS fixes and accessibility improvements Proxy Improvements - Add gzip/brotli compression for UI static assets (proxy/ui_compress.go) - Add GET /v1/audio/voices?model={model} endpoint for voice listing - Add peer support for /v1/audio/transcriptions	2026-01-31 22:49:13 -08:00
Benson Wong	cdea7d16bd	proxy/config: skip env macros in YAML comment lines (#496) Fix a bug where ${env.macro_not_exist} in comments would trigger a non-substituted macro error. fixes #495	2026-01-30 20:10:29 -08:00
Ryan Voots	7493618fdc	Add count_tokens api proxying (#476)	2026-01-20 09:34:42 -08:00
Benson Wong	205efd40a1	proxy: extend /running endpoint with additional process data (#474) Extend the /running endpoint to return more details about running processes beyond just model and state. - add cmd field to show the command being executed - add proxy field to show the proxy URL - add ttl (UnloadAfter) for automatic unloading configuration - add name and description for model metadata - update tests to verify new fields are returned correctly fixes #471	2026-01-19 17:37:00 -08:00
Benson Wong	4e850c2834	config: refactor macro substitution in configuration (#470) This commit simplifies substitution of environment variables into the configuration. There was a lot of repetitive code substituting ${env.VAR_NAME} into different fields after the configuration was parsed into a config.Config. This refactor uses a string substitution of env vars into the YAML config before it is fully parsed. This eliminates a lot of logic while maintaining backwards compatibility.	2026-01-18 21:52:34 -08:00
Benson Wong	75fced579e	config: support macros in peer apiKey and filters (#469) * config: support environment variable macros in peer apiKeys Add ${env.VAR_NAME} substitution for peer apiKey fields, consistent with existing env macro support for model fields and global apiKeys. - Add env macro substitution for peers.{name}.apiKey in LoadConfigFromReader - Add tests for peer apiKey env substitution - Update config.example.yaml to show env macro usage * config: support macros in peer apiKey and filters Extend macro substitution to peer configuration fields: - peers.{name}.apiKey supports both global macros and env macros - peers.{name}.filters.stripParams supports both macro types - peers.{name}.filters.setParams supports both macro types Also renamed validateMetadataForUnknownMacros to validateNestedForUnknownMacros for reuse across model metadata and peer filters validation.	2026-01-16 23:10:50 -08:00
Benson Wong	8f2137c72b	config: support environment variable macros in apiKeys (#467) Add substituteEnvMacros support for apiKeys configuration field, allowing API keys to be loaded from environment variables using the ${env.VAR_NAME} syntax. - Apply env macro substitution before validation - Add tests for env macro substitution in apiKeys	2026-01-16 22:41:14 -08:00
Benson Wong	124007cc98	config: add environment variable macros (#466) * config: add environment variable macros Add support for ${env.VAR_NAME} syntax to pull values from system environment variables during config loading. - env macros processed before regular macros (allows macros to reference env vars) - works in cmd, cmdStop, proxy, checkEndpoint, filters.stripParams, metadata - returns error if env var is not set - add comprehensive tests fixes #462 * docs: add env macro example to config.example.yaml	2026-01-16 22:25:20 -08:00
Benson Wong	eb5bfff0b0	proxy: unify filtering for local models and peers This unifies the filtering capabilities for models and peers - stripParams: removes params in the request - setParams: sets params in the request fixes #453	2026-01-15 18:59:43 -08:00
Benson Wong	4413881b2d	proxy: actually add /v1/responses endpoint (#449) ref: #448	2026-01-01 13:35:45 -08:00
Benson Wong	8df5e8563b	proxy: add /v1/responses and /v1/audio/voices endpoints (#448) Updates #433 Fixes #442 #226	2026-01-01 12:52:12 -08:00
Benson Wong	7931212d3e	proxy: add v1/images/edits API endpoint (#447) Updates #433	2026-01-01 12:43:06 -08:00
Benson Wong	3dc36032fb	proxy: skip very slow tests in -short test mode (#446) * proxy: skip very slow tests in -short test mode * CLAUDE.md: update testing instructions	2025-12-31 14:08:56 -08:00
Benson Wong	addb98646f	proxy: add support for basic authorization (#445) Fixes #444 where the UI with api keys did not work. The choice to use http basic authorization is for simple, automatic browser support. No changes to the UI were necessary. Just use an API key as the password, no user name is required.	2025-12-31 13:42:35 -08:00
Benson Wong	37d74efc2d	proxy: add /v1/images/generations (#443) Add support for the /v1/images/generations endpoint Updates #433 Closes #191	2025-12-30 21:04:58 -08:00
Benson Wong	22e098ac8b	Add Peer Model Support (#438) This PR allows a single llama-swap to be the central proxy for models served by other inference servers. The peer servers can be another llama-swap or any API that supports the /v1/* inference endpoint. Updates: #433, #299 Closes: #296	2025-12-27 20:18:06 -08:00
Benson Wong	53b32f3601	proxy: add API key support (#436) Add configuration support for api keys that are enforced by llama-swap. Keys are stripped before sending them to upstream servers. Updates: #433, #50 and #251	2025-12-23 23:39:33 -08:00
Benson Wong	565c44766d	config,proxy: add new configuration logToStdout (#432) The new logToStdout option controls what is logged to stdout. The default has been changed to just the proxy logs, which contain swap and http request logs. There are four supported settings: none, proxy, upstream, both. The "both" setting is the legacy setting where everything was spewed to stdout.	2025-12-21 22:23:31 -08:00
Benson Wong	e6a9e210ba	proxy: fix path bug in /logs/stream/{model_id} (#431) A {model_id} containing a forward slash trips up gin's path param parsing. This updates /logs/stream to work like /upstream where the model_id is built up in parts and searched for in the configuration. Updates #421	2025-12-21 21:47:14 -08:00
Benson Wong	d3f329f924	proxy: Improve logging performance and allow separate log streaming (#421) Replace container/ring.Ring with a custom circularBuffer that uses a single contiguous []byte slice. This fixes the original implementation which created 10,240 ring elements instead of 10KB of storage. GetHistory is now 139x faster (145μs → 1μs) and uses 117x less memory (1.2MB → 10KB). Allocations reduced from 2 to 1 per write operation. Create a LogMonitor per proxy.Process, replacing the usage of a shared one. The buffer in LogMonitor is lazy allocated on the first call to Write and freed when the Process is stopped. This reduces unnecessary memory usage when a model is not active. The /logs/stream/{model_id} endpoint was added to stream logs from a specific process.	2025-12-18 21:49:25 -08:00
Benson Wong	dea98733c3	proxy: extract metrics for v1/messages (#419)	2025-11-29 23:51:20 -08:00
Benson Wong	c968da1b73	proxy: add support for anthropic v1/messages api (#417) * proxy: add support for anthropic v1/messages api * proxy: restrict loading message to /v1/chat/completions	2025-11-29 22:09:07 -08:00
Nikesh Parajuli	06523d8c1e	feat: add platform-specific process attributes support (#411) Fixes issues on Windows showing new windows for every process llama-swap spawns.	2025-11-24 21:39:56 -08:00
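
Note on commit 085b54bc88 (data race in /running): the message describes reading process.state through process.CurrentState(), which takes stateMutex.RLock(), while swapState() writes under the write lock. The Go sketch below illustrates that pattern only and is not the repository's code. The Process struct layout, the ProcessState type, the string values of the states, and the swapState signature are assumptions made for illustration; only the names CurrentState, swapState, stateMutex, StateReady, and StateStopped appear in the commit messages above.

```go
package main

import (
	"fmt"
	"sync"
)

// ProcessState is a hypothetical state type; the real definition may differ.
type ProcessState string

const (
	StateStopped ProcessState = "stopped" // value assumed for illustration
	StateReady   ProcessState = "ready"   // value assumed for illustration
)

// Process is a simplified stand-in holding only the guarded state.
type Process struct {
	stateMutex sync.RWMutex
	state      ProcessState
}

// CurrentState reads the state under a read lock, so it cannot race
// with a concurrent writer that holds the write lock.
func (p *Process) CurrentState() ProcessState {
	p.stateMutex.RLock()
	defer p.stateMutex.RUnlock()
	return p.state
}

// swapState (assumed signature) moves from expected to next under the
// write lock and reports whether the transition happened.
func (p *Process) swapState(expected, next ProcessState) bool {
	p.stateMutex.Lock()
	defer p.stateMutex.Unlock()
	if p.state != expected {
		return false
	}
	p.state = next
	return true
}

func main() {
	p := &Process{state: StateStopped}

	var wg sync.WaitGroup
	wg.Add(2)
	// One goroutine writes the state, another reads it. Reading through
	// CurrentState() instead of p.state keeps the race detector quiet.
	go func() { defer wg.Done(); p.swapState(StateStopped, StateReady) }()
	go func() { defer wg.Done(); _ = p.CurrentState() }()
	wg.Wait()

	fmt.Println(p.CurrentState())
}
```

Running a sketch like this with `go run -race` is one way to confirm that no unsynchronized access remains; replacing the CurrentState() call in the reader goroutine with a direct read of p.state would be expected to trigger a race report.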

203 Commits