Install uv after the cpp tool binaries are copied and before the
llama-swap binary, enabling `uv run` usage for Python-based inference
backends like vLLM.
- add python3-pip to runtime apt installs
- add `pip install uv --break-system-packages` after cpp installs
fixes#628
Co-authored-by: Claude <noreply@anthropic.com>
- matrix.go change logic to consider any proxy.Process not in
StateStopped or StateShutdown
- process.StopImmediately, and Stop() which called it had a subtle bug
where it only handled state transitions from StateReady to
StateStopping. StateStarting -> StateStopping was ignored completely.
fix: #670
The previous captures were saved uncompressed in memory. In agentic
workflows there can be many turns with each request containing the
previous context in the body with a lot of redundant data. Use zstd to
compress the request and response data before keeping a copy of memory.
Results:
- Average Percentage Saved: 73.19%
- Average Compression Factor: ~6.77:1
I pointed Opus 4.7 (high effort) at proxy.ProcessGroup to identify any
race conditions in the swapping code. It found a race condition where
there is a small window in the fast path for routing a request to a
loaded model. There is a very small window where:
- model M1 is loaded and ready for requests
- a request, R1, for M1 comes in
- a request, R2, for M2 comes in almost immediately after
- R1 acquires the lock, sees M1 is loaded (fast path), releases the lock
`[race window]` and the request is ready to be forwarded
- the race window occurs between the release of the lock and the request
being forwarded
- the lock is released so requests can be handled concurrently
- R2 comes in within the `[race window]`, acquires the lock, triggers a
model swap to M2. stopping M1
- R1 is forwarded to a model that is unloaded or in the process of
shutting down creating an error response
In deployed systems the race window is very small and doesn't happen
often. However with #635 and PR #656 I though this deserved a bit more
attention. It is not concluded that this race is the cause of #635 but
the race is likely to happen more often under sustained or high load.
AI Note: Opus 4.7 x-high effort took about an hour to write the original
patch. With the pattern discovered the fix to matrix.go was very quick.
GLM 5.1 using the previous established patterns was able to easily write
the fix for ProcessGroup.StopProcesses().
Supersedes: #656
Updates: #277, #635
Add a new swap matrix to supersede groups for running concurrent models.
The matrix uses a solver that picks the lowest cost evictions to make a
requested model available. This simple approach along with a very basic
DSL grammar can enable very complex swapping scenarios.
- add DSL parser for set expressions with & (AND), | (OR), (), +ref
- add MatrixConfig structs, validation, and topological sort for +ref
- add MatrixSolver with cost-minimizing swap decisions
- add Matrix runtime integrating solver with Process lifecycle
- integrate matrix into ProxyManager with if-branches at all endpoints
- update config.example.yaml and config-schema.json with matrix schema
- config enforces groups XOR matrix (cannot use both)
fixes#643
Build the root image once, then derive the rootless variant from it
using a small inline Dockerfile that adds the non-root user and chowns
the writable directories. This halves the number of CI jobs (4 → 2) and
eliminates the redundant full CUDA compilation for the rootless variant.
- remove RUN_UID build arg from build-image.sh
- derive rootless image inline after root build completes
- collapse variant matrix out of unified-docker.yml
- push both root and rootless tags in a single CI job
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Keep request duration from being underreported when upstream timings
only cover part of the full request lifecycle.
- compare wall-clock and upstream timing durations
- keep token and throughput values from timings
- add regression coverage for underreported timings
fixes#602
Add configurable HTTP timeout settings to both models and peers to support installations that requires longer timeouts than the current hardcoded defaults.
Closes#618
Extend the existing config-schema workflow to also validate
config.example.yaml against config-schema.json using check-jsonschema.
- add config.example.yaml to PR and push path triggers
- install check-jsonschema via pip
- run validation of config.example.yaml against schema
https://claude.ai/code/session_01Y1oqwE6mwNs9UTJgZRgXtG
---------
Co-authored-by: Claude <noreply@anthropic.com>
Expose CMAKE_CUDA_ARCHITECTURES as a Docker build ARG so users can
customize CUDA architectures via --build-arg without editing the
Dockerfile.
- convert hardcoded ENV to ARG with default, feeding into ENV
- replace silent fallback defaults (:-) in scripts with :? guards
to fail fast if the env var is missing
- add usage example to Dockerfile header
Follow up to: #624https://claude.ai/code/session_01EWiUe7jNABX7Uz95dUGJqK
Co-authored-by: Claude <noreply@anthropic.com>
multiple fixes to vulkan build:
- use ubuntu 26.04 to be compatible with AMD 395+ (Strix halo) hardware
- add home directory in container
- fix stable-diffusion install to actually enable vulkan
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
- set up a GHA scheduled job to build the container nightly
- enabling pushing a llama-swap:unified and a llama-swap:unified-Y-M-D
image to ghcr.io
- tidy up Dockerfile to use a non-root user and llama-swap as an entry
point
Add proxy routes for stable-diffusion.cpp's /sdapi/v1/txt2img,
/sdapi/v1/img2img, and /sdapi/v1/loras endpoints. POST endpoints
use proxyInferenceHandler (model in JSON body), GET /loras uses
proxyGETModelHandler (model in query param).
Update the image playground with a dual-mode UI supporting both
OpenAI and SDAPI backends. In SDAPI mode, loras are fetched first
to prime the server-side cache, and all txt2img parameters are
exposed (negative prompt, steps, cfg_scale, seed, batch_size,
clip_skip, sampler, scheduler, lora selection with multipliers).
- Add 3 sdapi route registrations in proxymanager.go
- Add sdApi.ts client with generateSdImage and fetchSdLoras
- Add SDAPI types (SdApiTxt2ImgRequest, SdApiResponse, etc.)
- Add /sdapi to vite dev proxy config
- Add backend tests for sdapi routing
- Support batch image display in gallery grid
https://claude.ai/code/session_0186MGX6NXdHVBTv2KH45fqn
---------
Co-authored-by: Claude <noreply@anthropic.com>
Upgrade vite and related dependencies to take advantage of Vite 8's
improved build times via Rolldown and Oxc.
- vite: ^6.3.5 → ^8.0.0
- @sveltejs/vite-plugin-svelte: ^5.0.3 → ^7.0.0
- svelte: ^5.19.0 → ^5.46.4
- vite-plugin-compression2: ^2.4.0 → ^2.5.1
- vitest: ^4.0.18 → ^4.1.0
---------
Co-authored-by: Claude <noreply@anthropic.com>
Use natural sorting for model names.
Previously the model list was sorted lexicographically, which resulted
in unintuitive ordering when numbers were included in the name.
Example:
Before
qwen3.5:2B
qwen3.5:35B-3AB
qwen3.5:9B
After
qwen3.5:2B
qwen3.5:9B
qwen3.5:35B-3AB
This change sorts models using natural order so numeric parts are
compared numerically.
Extend macro substitution to the name and description fields of
ModelConfig, matching the behavior already present for cmd, proxy,
checkEndpoint, and filters.
- substitute global/model macros (including MODEL_ID) in name and
description
- substitute PORT macro in name and description when allocated
- validate no unknown macros remain in name and description after
substitution
- add tests for macro substitution, MODEL_ID, and unknown macro error
Add a new configuration parameter globalTTL that all models will
inherit. The default value is 0 which matches the currently
functionality to never automatically unload a model.
The model.ttl's default has changed to -1, which means use the global
TTL value. Any model.ttl >=0 is now value with 0 meaning never unload.
This allows a model to override a globalTTL > 0 and be configured to
never unload.
Fixes#459Closes#512
Add a copy-to-clipboard button that appears on hover for each code block
rendered in the chat interface assistant messages.
- Svelte action `codeBlockCopy` injects a button into every `<pre>`
element
- MutationObserver reattaches buttons as streaming content arrives
- Button shows a check icon for 2 seconds after a successful copy
- Uses clipboard API with execCommand fallback for non-secure contexts
- CSS hides button by default and reveals it on pre:hover
https://claude.ai/code/session_01PTA5ao5YQuFAS6a9juLeZW
---------
Co-authored-by: Claude <noreply@anthropic.com>
Add `cuda13` as a supported build architecture, targeting the
`ghcr.io/ggml-org/llama.cpp:server-cuda13` upstream base image.
The `server-cuda13` image ships with CUDA 13 libraries, providing
improved performance on recent NVIDIA hardware compared to the existing
`server-cuda` (CUDA 12) image. Users with newer GPUs (e.g., RTX
50-series) benefit from reduced model load latency and higher token
throughput.
- Add `cuda13` to the allowed architectures list in
`docker/build-container.sh`
- Add `cuda13` to the CI matrix in `.github/workflows/containers.yml` so
the container is built and pushed automatically
Updated README to enhance the description of the web interface and added details about features like token metrics, request inspection, model management, and real-time log streaming.
Add a new Rerank tab to the playground that lets users test /v1/rerank
endpoints. Supports a visual table editor and a JSON editor mode that
stay in sync when toggling between them.
- add rerankApi.ts with typed wrapper for /v1/rerank
- add RerankInterface.svelte with query input, sortable document table,
color-coded scores, auto-add row, cancel/clear, and token usage
- add rerankLoading store to playgroundActivity derived store
- register Rerank tab in Playground.svelte
Updates #481
Add setParamsByID filter that applies different request parameters based
on the requested model ID, enabling per-alias behaviour for a single
loaded model.
- add SetParamsByID field to Filters struct and SanitizedSetParamsByID
method
- substitute ${MODEL_ID} and other macros in setParamsByID keys and
values
- validate no unknown macros remain in keys or values after substitution
- apply setParamsByID in proxyInferenceHandler after setParams (can
override it)
- update config-schema.json with setParamsByID definition
- update UI to show aliases and make them selectable in the Playground
closes#534
Pause auto-scroll when the user scrolls up to review logs, and resume
when they scroll back to the bottom.
- add `userScrolledUp` state variable
- add `handleScroll` to detect scroll position with 40px threshold
- guard the auto-scroll effect with `!userScrolledUp`
closes#529
- Keep Playground component mounted when navigating away, preserving
streaming/generating state
- Add animated gradient effect on Playground nav link when activity is
in progress