Benson Wong 4384315b44 ui-svelte: add Svelte port of React UI (#487)
Trying out svelte for the UI. The port was done by Claude Code on the iOS app w/ Opus 4.5. 

---

* ui: add Svelte port of React UI

Port the React-based UI to Svelte 5 with the following changes:

- Create new ui-svelte directory with complete Svelte 5 implementation
- Use Svelte stores instead of React contexts for state management
- Implement custom ResizablePanels component to replace react-resizable-panels
- Port all pages: LogViewer, Models, Activity
- Port all components: Header, ConnectionStatus, LogPanel, ModelsPanel, etc.
- Use svelte-spa-router for client-side routing
- Same build output directory (proxy/ui_dist) and base path (/ui/)
- Tailwind CSS 4 with same theme configuration

https://claude.ai/code/session_01F3xXLYsd62gePVSFv7aboP

* ui-svelte: simplify state management

- Remove redundant state syncing pattern in LogPanel and ModelsPanel
- Use store values directly with $ syntax instead of manual subscriptions
- Consolidate duplicate title sync logic in App.svelte
- Use existing syncTitleToDocument() from theme.ts

https://claude.ai/code/session_01F3xXLYsd62gePVSFv7aboP

* ui-svelte: use idiomatic Svelte 5 patterns

- Use $effect for document side effects (theme, title) instead of
  store subscriptions
- Use class: directive for active nav links in Header
- Remove SSR guards (unnecessary for client-only SPA)
- Remove leaked subscription in syncThemeToDocument
- Simplify theme.ts by removing sync functions

https://claude.ai/code/session_01F3xXLYsd62gePVSFv7aboP

* ui-svelte: fix build warnings and improve accessibility

Fix Svelte build warnings and add proper accessibility support
to interactive components.

- add aria-labels to buttons for screen readers
- implement keyboard navigation for resizable separator
- suppress intentional state initialization warnings
- update Makefile to use ui-svelte build directory
- add peer:true to package-lock.json dependencies

* ui-svelte: reorganize navigation and add log view toggle

Make Models the default landing page and add view mode toggle
to the Logs page with persistent state.

- set Models as default route at /
- move Logs to /logs route
- reorder navigation: Models, Activity, Logs
- add view toggle with three modes: Panels, Proxy only, Upstream only
- fix horizontal overflow with width constraints
2026-01-28 21:37:29 -08:00
2025-11-08 14:16:12 -08:00
2025-12-27 20:18:06 -08:00
2025-07-15 18:04:30 -07:00
2024-10-03 20:20:01 -07:00
2025-05-23 09:39:55 -07:00
2026-01-22 08:43:36 -08:00
.
2024-10-04 09:31:08 -07:00
2026-01-20 09:34:42 -08:00

llama-swap header image GitHub Downloads (all assets, all releases) GitHub Actions Workflow Status GitHub Repo stars

llama-swap

Run multiple LLM models on your machine and hot-swap between them as needed. llama-swap works with any OpenAI API-compatible server, giving you the flexibility to switch models without restarting your applications.

Built in Go for performance and simplicity, llama-swap has zero dependencies and is incredibly easy to set up. Get started in minutes - just one binary and one configuration file.

Features:

  • Easy to deploy and configure: one binary, one configuration file. no external dependencies
  • On-demand model switching
  • Use any local OpenAI compatible server (llama.cpp, vllm, tabbyAPI, etc.)
    • future proof, upgrade your inference servers at any time.
  • OpenAI API supported endpoints:
    • v1/completions
    • v1/chat/completions
    • v1/responses
    • v1/embeddings
    • v1/audio/speech (#36)
    • v1/audio/transcriptions (docs)
    • v1/audio/voices
    • v1/images/generations
    • v1/images/edits
  • Anthropic API supported endpoints:
    • v1/messages
    • v1/messages/count_tokens
  • llama-server (llama.cpp) supported endpoints
    • v1/rerank, v1/reranking, /rerank
    • /infill - for code infilling
    • /completion - for completion endpoint
  • llama-swap API
    • /ui - web UI
    • /upstream/:model_id - direct access to upstream server (demo)
    • /models/unload - manually unload running models (#58)
    • /running - list currently running models (#61)
    • /log - remote log monitoring
    • /health - just returns "OK"
  • API Key support - define keys to restrict access to API endpoints
  • Customizable
    • Run multiple models at once with Groups (#107)
    • Automatic unloading of models after timeout by setting a ttl
    • Reliable Docker and Podman support using cmd and cmdStop together
    • Preload models on startup with hooks (#235)

Web UI

llama-swap includes a real time web interface for monitoring logs and controlling models:

image

The Activity Page shows recent requests:

image

Installation

llama-swap can be installed in multiple ways

  1. Docker
  2. Homebrew (OSX and Linux)
  3. WinGet
  4. From release binaries
  5. From source

Docker Install (download images)

Nightly container images with llama-swap and llama-server are built for multiple platforms (cuda, vulkan, intel, etc.) including non-root variants with improved security.

$ docker pull ghcr.io/mostlygeek/llama-swap:cuda

# run with a custom configuration and models directory
$ docker run -it --rm --runtime nvidia -p 9292:8080 \
 -v /path/to/models:/models \
 -v /path/to/custom/config.yaml:/app/config.yaml \
 ghcr.io/mostlygeek/llama-swap:cuda

# configuration hot reload supported with a
# directory volume mount
$ docker run -it --rm --runtime nvidia -p 9292:8080 \
 -v /path/to/models:/models \
 -v /path/to/custom/config.yaml:/app/config.yaml \
 -v /path/to/config:/config \
 ghcr.io/mostlygeek/llama-swap:cuda -config /config/config.yaml -watch-config
more examples
# pull latest images per platform
docker pull ghcr.io/mostlygeek/llama-swap:cpu
docker pull ghcr.io/mostlygeek/llama-swap:cuda
docker pull ghcr.io/mostlygeek/llama-swap:vulkan
docker pull ghcr.io/mostlygeek/llama-swap:intel
docker pull ghcr.io/mostlygeek/llama-swap:musa

# tagged llama-swap, platform and llama-server version images
docker pull ghcr.io/mostlygeek/llama-swap:v166-cuda-b6795

# non-root cuda
docker pull ghcr.io/mostlygeek/llama-swap:cuda-non-root

Homebrew Install (macOS/Linux)

brew tap mostlygeek/llama-swap
brew install llama-swap
llama-swap --config path/to/config.yaml --listen localhost:8080

WinGet Install (Windows)

Note

WinGet is maintained by community contributor Dvd-Znf (#327). It is not an official part of llama-swap.

# install
C:\> winget install llama-swap

# upgrade
C:\> winget upgrade llama-swap

Pre-built Binaries

Binaries are available on the release page for Linux, Mac, Windows and FreeBSD.

Building from source

  1. Building requires Go and Node.js (for UI).
  2. git clone https://github.com/mostlygeek/llama-swap.git
  3. make clean all
  4. look in the build/ subdirectory for the llama-swap binary

Configuration

# minimum viable config.yaml

models:
  model1:
    cmd: llama-server --port ${PORT} --model /path/to/model.gguf

That's all you need to get started:

  1. models - holds all model configurations
  2. model1 - the ID used in API calls
  3. cmd - the command to run to start the server.
  4. ${PORT} - an automatically assigned port number

Almost all configuration settings are optional and can be added one step at a time:

  • Advanced features
    • groups to run multiple models at once
    • hooks to run things on startup
    • macros reusable snippets
  • Model customization
    • ttl to automatically unload models
    • aliases to use familiar model names (e.g., "gpt-4o-mini")
    • env to pass custom environment variables to inference servers
    • cmdStop gracefully stop Docker/Podman containers
    • useModelName to override model names sent to upstream servers
    • ${PORT} automatic port variables for dynamic port assignment
    • filters rewrite parts of requests before sending to the upstream server

See the configuration documentation for all options.

How does llama-swap work?

When a request is made to an OpenAI compatible endpoint, llama-swap will extract the model value and load the appropriate server configuration to serve it. If the wrong upstream server is running, it will be replaced with the correct one. This is where the "swap" part comes in. The upstream server is automatically swapped to handle the request correctly.

In the most basic configuration llama-swap handles one model at a time. For more advanced use cases, the groups feature allows multiple models to be loaded at the same time. You have complete control over how your system resources are used.

Reverse Proxy Configuration (nginx)

If you deploy llama-swap behind nginx, disable response buffering for streaming endpoints. By default, nginx buffers responses which breaks ServerSent Events (SSE) and streaming chat completion. (#236)

Recommended nginx configuration snippets:

# SSE for UI events/logs
location /api/events {
    proxy_pass http://your-llama-swap-backend;
    proxy_buffering off;
    proxy_cache off;
}

# Streaming chat completions (stream=true)
location /v1/chat/completions {
    proxy_pass http://your-llama-swap-backend;
    proxy_buffering off;
    proxy_cache off;
}

As a safeguard, llama-swap also sets X-Accel-Buffering: no on SSE responses. However, explicitly disabling proxy_buffering at your reverse proxy is still recommended for reliable streaming behavior.

Monitoring Logs on the CLI

# sends up to the last 10KB of logs
$ curl http://host/logs

# streams combined logs
curl -Ns http://host/logs/stream

# stream llama-swap's proxy status logs
curl -Ns http://host/logs/stream/proxy

# stream logs from upstream processes that llama-swap loads
curl -Ns http://host/logs/stream/upstream

# stream logs only from a specific model
curl -Ns http://host/logs/stream/{model_id}

# stream and filter logs with linux pipes
curl -Ns http://host/logs/stream | grep 'eval time'

# appending ?no-history will disable sending buffered history first
curl -Ns 'http://host/logs/stream?no-history'

Do I need to use llama.cpp's server (llama-server)?

Any OpenAI compatible server would work. llama-swap was originally designed for llama-server and it is the best supported.

For Python based inference servers like vllm or tabbyAPI it is recommended to run them via podman or docker. This provides clean environment isolation as well as responding correctly to SIGTERM signals for proper shutdown.

Star History

Note

Star this project to help others discover it!

Star History Chart

S
Description
Reliable model swapping for any local OpenAI/Anthropic compatible server - llama.cpp, vllm, etc
Readme 78 MiB
Languages
Go 68.4%
Svelte 19.5%
TypeScript 5.9%
Shell 4.2%
Dockerfile 1.1%
Other 0.9%