llama-swap

Author	SHA1	Message	Date
Benson Wong	f6cf9f5844	proxy: Refactor tests (#660) - use YAML for test configurations - remove most uses of simple-responder, opting to use process.testHandler Fixes #655	2026-04-16 22:47:42 -07:00
Benson Wong	121fd93ad8	Makefile: restore linux arm64 targets Fix #641	2026-04-14 22:05:39 -07:00
Benson Wong	17233e9278	docs: update configuration.md for matrix v202	2026-04-14 22:01:03 -07:00
Benson Wong	4866d16c3e	README.md: update to use matrix instead of groups	2026-04-14 21:57:49 -07:00
Benson Wong	35193f82f1	proxy: add swap matrix with solver-based model swapping (#646) Add a new swap matrix to supersede groups for running concurrent models. The matrix uses a solver that picks the lowest cost evictions to make a requested model available. This simple approach along with a very basic DSL grammar can enable very complex swapping scenarios. - add DSL parser for set expressions with & (AND), | (OR), (), +ref - add MatrixConfig structs, validation, and topological sort for +ref - add MatrixSolver with cost-minimizing swap decisions - add Matrix runtime integrating solver with Process lifecycle - integrate matrix into ProxyManager with if-branches at all endpoints - update config.example.yaml and config-schema.json with matrix schema - config enforces groups XOR matrix (cannot use both) fixes #643	2026-04-14 21:55:30 -07:00
Benson Wong	40e39f7a86	ui-svelte: fix security issues (#649)	2026-04-12 16:21:31 -07:00
Benson Wong	a9d840ffd7	proxy,proxy/config: restore timeouts to pre PR 619 (#648) Reset the default ResponseHeader timeout to 0 (no timeout) which was set to 60 seconds in PR #619. Fixes #647 v201	2026-04-11 20:42:13 -07:00
Benson Wong	7b2b82777f	docker/unified: derive rootless image from root container (#644) Build the root image once, then derive the rootless variant from it using a small inline Dockerfile that adds the non-root user and chowns the writable directories. This halves the number of CI jobs (4 → 2) and eliminates the redundant full CUDA compilation for the rootless variant. - remove RUN_UID build arg from build-image.sh - derive rootless image inline after root build completes - collapse variant matrix out of unified-docker.yml - push both root and rootless tags in a single CI job Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-10 22:59:54 -07:00
Benson Wong	d87f0ce2c5	docker/unified: publish rootless image variant (#630) v200	2026-04-07 03:05:53 -07:00
Leoy	06bc6a614c	proxy: preserve wall-clock duration in metrics (#629) Keep request duration from being underreported when upstream timings only cover part of the full request lifecycle. - compare wall-clock and upstream timing durations - keep token and throughput values from timings - add regression coverage for underreported timings fixes #602	2026-04-07 01:52:41 -07:00
Ron M	a37b4866d8	proxy: add configurable HTTP timeouts for models and peers (#619) Add configurable HTTP timeout settings to both models and peers to support installations that require longer timeouts than the current hardcoded defaults. Closes #618	2026-04-06 19:30:27 +08:00
Benson Wong	981910d734	ci: validate config.example.yaml against config-schema.json (#627) Extend the existing config-schema workflow to also validate config.example.yaml against config-schema.json using check-jsonschema. - add config.example.yaml to PR and push path triggers - install check-jsonschema via pip - run validation of config.example.yaml against schema https://claude.ai/code/session_01Y1oqwE6mwNs9UTJgZRgXtG --------- Co-authored-by: Claude <noreply@anthropic.com>	2026-04-05 15:17:57 +08:00
Benson Wong	a185efe37e	docker: make CMAKE_CUDA_ARCHITECTURES configurable via build arg (#625) Expose CMAKE_CUDA_ARCHITECTURES as a Docker build ARG so users can customize CUDA architectures via --build-arg without editing the Dockerfile. - convert hardcoded ENV to ARG with default, feeding into ENV - replace silent fallback defaults (:-) in scripts with :? guards to fail fast if the env var is missing - add usage example to Dockerfile header Follow up to: #624 https://claude.ai/code/session_01EWiUe7jNABX7Uz95dUGJqK Co-authored-by: Claude <noreply@anthropic.com>	2026-04-04 08:49:59 +08:00
Benson Wong	1dd1aadf93	docker/unified: add ik_llama.cpp to CUDA container (#620)	2026-04-03 15:16:30 +08:00
Benson Wong	955900972a	add /sdapi to list of supported endpoints	2026-04-01 12:01:38 +08:00
Benson Wong	c2c8cfaf81	docker/unified: build llama.cpp with static libraries (#616)	2026-04-01 03:38:07 +08:00
Benson Wong	1e440770ea	ci: fix matrix exclude for scheduled docker workflow (#610)	2026-03-29 20:04:28 +09:00
Benson Wong	c794273c83	docker/unified,.github: fix unified build (#606)	2026-03-27 10:31:12 +09:00
dependabot[bot]	6574a52cbb	build(deps): bump picomatch from 4.0.3 to 4.0.4 in /ui-svelte (#605)	2026-03-26 22:28:24 +09:00
Benson Wong	8fabc75634	docker/unified: vulkan build fixes (#600) multiple fixes to vulkan build: - use ubuntu 26.04 to be compatible with AMD 395+ (Strix halo) hardware - add home directory in container - fix stable-diffusion install to actually enable vulkan --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> v199	2026-03-25 23:26:13 +09:00
Benson Wong	e5e7391b6d	.github,docker/unified: include vulkan build (#599) Update docker/unified scripts to support building both cuda and vulkan unified images.	2026-03-25 06:58:28 +09:00
Benson Wong	2c282dccad	.github,docker/unified: improve caching and fix bugs (#598) - set up a GHA scheduled job to build the container nightly - enable pushing llama-swap:unified and llama-swap:unified-Y-M-D images to ghcr.io - tidy up Dockerfile to use a non-root user and llama-swap as an entry point	2026-03-23 22:24:40 +09:00
Benson Wong	916d13f5bd	.github/workflows,docker/unified: add cuda based unified container (#597) Add Docker build scripts for a unified cuda docker container with llama-server, stable-diffusion.cpp, whisper.cpp.	2026-03-22 21:11:54 +09:00
Benson Wong	a3725e7d09	Update go.mod to 1.26.1 (#593)	2026-03-20 16:09:58 +09:00
Benson Wong	15bd55d3a9	proxy, ui-svelte: add /sdapi/v1 endpoint support (#587) Add proxy routes for stable-diffusion.cpp's /sdapi/v1/txt2img, /sdapi/v1/img2img, and /sdapi/v1/loras endpoints. POST endpoints use proxyInferenceHandler (model in JSON body), GET /loras uses proxyGETModelHandler (model in query param). Update the image playground with a dual-mode UI supporting both OpenAI and SDAPI backends. In SDAPI mode, loras are fetched first to prime the server-side cache, and all txt2img parameters are exposed (negative prompt, steps, cfg_scale, seed, batch_size, clip_skip, sampler, scheduler, lora selection with multipliers). - Add 3 sdapi route registrations in proxymanager.go - Add sdApi.ts client with generateSdImage and fetchSdLoras - Add SDAPI types (SdApiTxt2ImgRequest, SdApiResponse, etc.) - Add /sdapi to vite dev proxy config - Add backend tests for sdapi routing - Support batch image display in gallery grid https://claude.ai/code/session_0186MGX6NXdHVBTv2KH45fqn --------- Co-authored-by: Claude <noreply@anthropic.com>	2026-03-19 22:08:31 +09:00
Benson Wong	c3c258a55d	proxy: fix metrics capture for v1/responses (#586) properly parse anthropic compatible usage data from streaming responses. closes: #577 v198	2026-03-13 16:50:12 -07:00
Benson Wong	29a38fde0d	ui-svelte: upgrade to vite 8 (#585) Upgrade vite and related dependencies to take advantage of Vite 8's improved build times via Rolldown and Oxc. - vite: ^6.3.5 → ^8.0.0 - @sveltejs/vite-plugin-svelte: ^5.0.3 → ^7.0.0 - svelte: ^5.19.0 → ^5.46.4 - vite-plugin-compression2: ^2.4.0 → ^2.5.1 - vitest: ^4.0.18 → ^4.1.0 --------- Co-authored-by: Claude <noreply@anthropic.com>	2026-03-13 08:45:59 -07:00
tesuri	d569681daa	Change model sorting to natural order (#582) Use natural sorting for model names. Previously the model list was sorted lexicographically, which resulted in unintuitive ordering when numbers were included in the name. Example: Before: qwen3.5:2B, qwen3.5:35B-3AB, qwen3.5:9B. After: qwen3.5:2B, qwen3.5:9B, qwen3.5:35B-3AB. This change sorts models using natural order so numeric parts are compared numerically.	2026-03-12 07:49:34 -07:00
Benson Wong	24efdb76b1	config: add macro support for name and description fields (#578) Extend macro substitution to the name and description fields of ModelConfig, matching the behavior already present for cmd, proxy, checkEndpoint, and filters. - substitute global/model macros (including MODEL_ID) in name and description - substitute PORT macro in name and description when allocated - validate no unknown macros remain in name and description after substitution - add tests for macro substitution, MODEL_ID, and unknown macro error	2026-03-10 08:27:05 -07:00
Benson Wong	cc77139ff8	proxy,proxy/config: add global TTL feature (#554) Add a new configuration parameter globalTTL that all models will inherit. The default value is 0, which matches the current functionality of never automatically unloading a model. The model.ttl default has changed to -1, which means use the global TTL value. Any model.ttl >= 0 is now valid, with 0 meaning never unload. This allows a model to override a globalTTL > 0 and be configured to never unload. Fixes #459 Closes #512 v197	2026-03-01 21:02:12 -08:00
Benson Wong	390a35bf93	ui-svelte: add copy button to markdown code blocks (#537) Add a copy-to-clipboard button that appears on hover for each code block rendered in the chat interface assistant messages. - Svelte action `codeBlockCopy` injects a button into every `<pre>` element - MutationObserver reattaches buttons as streaming content arrives - Button shows a check icon for 2 seconds after a successful copy - Uses clipboard API with execCommand fallback for non-secure contexts - CSS hides button by default and reveals it on pre:hover https://claude.ai/code/session_01PTA5ao5YQuFAS6a9juLeZW --------- Co-authored-by: Claude <noreply@anthropic.com>	2026-03-01 09:48:56 -08:00
pdscomp	181f71ca11	.github,docker: add cuda13 architecture support (#551) Add `cuda13` as a supported build architecture, targeting the `ghcr.io/ggml-org/llama.cpp:server-cuda13` upstream base image. The `server-cuda13` image ships with CUDA 13 libraries, providing improved performance on recent NVIDIA hardware compared to the existing `server-cuda` (CUDA 12) image. Users with newer GPUs (e.g., RTX 50-series) benefit from reduced model load latency and higher token throughput. - Add `cuda13` to the allowed architectures list in `docker/build-container.sh` - Add `cuda13` to the CI matrix in `.github/workflows/containers.yml` so the container is built and pushed automatically	2026-03-01 09:37:08 -08:00
Benson Wong	49546e2cf2	ui: fix text size svg v196	2026-02-27 23:47:52 -08:00
Benson Wong	2c078964f4	Update README with additional images Added new images for model loading and real-time log streaming sections.	2026-02-27 23:45:40 -08:00
Benson Wong	175bb36fb1	Revise README description for clarity and detail Updated description to clarify compatibility and usage.	2026-02-27 23:42:40 -08:00
Benson Wong	aedb640471	Enhance web UI section in README Updated README to enhance the description of the web interface and added details about features like token metrics, request inspection, model management, and real-time log streaming.	2026-02-27 23:40:31 -08:00
Benson Wong	2f377f6dc6	ui: add OGG audio format support to transcription playground (#544)	2026-02-26 19:48:19 -08:00
Benson Wong	64e4c79fc3	ui: add Rerank tab to playground (#536) Add a new Rerank tab to the playground that lets users test /v1/rerank endpoints. Supports a visual table editor and a JSON editor mode that stay in sync when toggling between them. - add rerankApi.ts with typed wrapper for /v1/rerank - add RerankInterface.svelte with query input, sortable document table, color-coded scores, auto-add row, cancel/clear, and token usage - add rerankLoading store to playgroundActivity derived store - register Rerank tab in Playground.svelte Updates #481 v195	2026-02-21 21:59:14 -08:00
Benson Wong	19fb5f35e9	proxy: implement setParamsByID filter (#535) Add setParamsByID filter that applies different request parameters based on the requested model ID, enabling per-alias behaviour for a single loaded model. - add SetParamsByID field to Filters struct and SanitizedSetParamsByID method - substitute ${MODEL_ID} and other macros in setParamsByID keys and values - validate no unknown macros remain in keys or values after substitution - apply setParamsByID in proxyInferenceHandler after setParams (can override it) - update config-schema.json with setParamsByID definition - update UI to show aliases and make them selectable in the Playground closes #534 v194	2026-02-19 22:21:10 -08:00
Benson Wong	b45102bde8	ui: smart auto-scroll in LogPanel (#530) Pause auto-scroll when the user scrolls up to review logs, and resume when they scroll back to the bottom. - add `userScrolledUp` state variable - add `handleScroll` to detect scroll position with 40px threshold - guard the auto-scroll effect with `!userScrolledUp` closes #529 v193	2026-02-18 19:47:37 -08:00
Brian Mendonca	1688bdd1e9	proxy, ui: add pending requests count to the main dashboard (#516) add a real time counter of pending (inflight) requests to the UI. v192	2026-02-16 09:41:15 -08:00
Benson Wong	d33d51fa75	.coderabbit.yaml,AGENTS.md: small tweaks v191	2026-02-15 21:31:30 -08:00
Benson Wong	e3bf065574	ui: persist playground state across route navigation (#525) - Keep Playground component mounted when navigating away, preserving streaming/generating state - Add animated gradient effect on Playground nav link when activity is in progress	2026-02-15 21:30:52 -08:00
Benson Wong	3e52144058	ui-svelte: incremental rendering of chat messages in the Playground (#520) add incremental rendering to Playground > Chat	2026-02-15 11:00:44 -08:00
Benson Wong	d5e52d7d00	build: disable provenance attestations in container builds (#523) ## Summary - Add `--provenance=false` to docker build commands in `build-container.sh` - BuildKit attestation manifests are stored as untagged images in GHCR, and the `delete-untagged-containers` cleanup job deletes them, breaking the manifest list and causing `manifest unknown` errors on pull - ref: https://github.com/actions/delete-package-versions/issues/162	2026-02-14 10:23:08 -08:00
Benson Wong	17e5263a76	.github/workflows: fix expired token in publishing images (#522) Fixes: #517	2026-02-14 10:06:05 -08:00
Benson Wong	8d6d949ec3	proxy: support timings for /infill from llama-server (#510) fixes: #463	2026-02-07 17:16:27 -08:00
Benson Wong	b5fde8eb6d	proxy,ui-svelte: add request/response capturing (#508) Add saving request and response headers and bodies that go through llama-swap in memory. - captureBuffer added to configuration. Captures are enabled by default. - 5MB of memory is allocated for req/response captures in a ring buffer. Setting captureBuffer to 0 will disable captures. - UI elements to view captured data added to Activity page. Includes some QOL features like json formatting and recombining SSE chat streams - capture saving is done at the byte level and has minimal impact on llama-swap performance Fixes #464 Ref #503	2026-02-07 15:40:01 -08:00
Nuno	7eef5defb8	docs: add stable-diffusion.cpp references (#506) Signed-off-by: rare-magma <rare-magma@posteo.eu>	2026-02-04 20:20:39 -08:00
Benson Wong	bc01e6f539	build: add stable-diffusion server to musa and vulkan container images (#504) Add sd-server from stable-diffusion.cpp docker image for vulkan and musa containers. closes #450 v189	2026-02-01 16:17:26 -08:00
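
The natural-order sorting described in commit d569681daa (#582) can be illustrated with a minimal Go sketch. This is not the code from that commit: the function names naturalLess and sortModels are hypothetical, and the comparison below is just one straightforward way to compare digit runs numerically, assuming ASCII model names.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// isDigit reports whether b is an ASCII digit.
func isDigit(b byte) bool { return b >= '0' && b <= '9' }

// naturalLess (hypothetical helper) reports whether a sorts before b,
// comparing runs of digits by numeric value and everything else byte-wise.
func naturalLess(a, b string) bool {
	i, j := 0, 0
	for i < len(a) && j < len(b) {
		if isDigit(a[i]) && isDigit(b[j]) {
			// Consume the full digit run in each string.
			si, sj := i, j
			for i < len(a) && isDigit(a[i]) {
				i++
			}
			for j < len(b) && isDigit(b[j]) {
				j++
			}
			// Ignore leading zeros; a longer run is then a larger number.
			na := strings.TrimLeft(a[si:i], "0")
			nb := strings.TrimLeft(b[sj:j], "0")
			if len(na) != len(nb) {
				return len(na) < len(nb)
			}
			if na != nb {
				return na < nb
			}
			continue
		}
		if a[i] != b[j] {
			return a[i] < b[j]
		}
		i++
		j++
	}
	// The string that ran out first sorts first.
	return len(a)-i < len(b)-j
}

// sortModels (hypothetical helper) returns a naturally sorted copy of names.
func sortModels(names []string) []string {
	out := append([]string(nil), names...)
	sort.SliceStable(out, func(x, y int) bool { return naturalLess(out[x], out[y]) })
	return out
}

func main() {
	// Model names taken from the example in the commit message.
	fmt.Println(sortModels([]string{"qwen3.5:2B", "qwen3.5:35B-3AB", "qwen3.5:9B"}))
	// prints: [qwen3.5:2B qwen3.5:9B qwen3.5:35B-3AB]
}
```

Comparing digit runs by length after trimming leading zeros avoids integer overflow on very long numbers.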

428 Commits
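
The TTL inheritance rule stated in commit cc77139ff8 (#554) can be sketched in Go as follows. This is an illustration of the rule as written in the commit message, not the project's implementation: the function name effectiveTTL is hypothetical, and treating every negative model value (not only -1) as "inherit" is an assumption.

```go
package main

import "fmt"

// effectiveTTL (hypothetical helper) resolves the TTL that applies to a model.
// Per the commit message: model.ttl = -1 means "use globalTTL", and any
// model.ttl >= 0 is used as-is, with 0 meaning "never unload".
// Assumption: all negative model values are treated like -1.
func effectiveTTL(globalTTL, modelTTL int) int {
	if modelTTL < 0 {
		return globalTTL
	}
	return modelTTL
}

func main() {
	fmt.Println(effectiveTTL(300, -1)) // model inherits the global value
	fmt.Println(effectiveTTL(300, 0))  // model opts out: never unload
	fmt.Println(effectiveTTL(0, 60))   // model sets its own TTL
}
```

The second call shows the case the commit message highlights: a model overriding a globalTTL > 0 so that it is never unloaded.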