Compare commits

..

8 Commits

Author SHA1 Message Date
Benson Wong cc77139ff8 proxy,proxy/config: add global TTL feature (#554)
Add a new configuration parameter globalTTL that all models will
inherit. The default value is 0 which matches the currently
functionality to never automatically unload a model.

The model.ttl's default has changed to -1, which means use the global
TTL value. Any model.ttl >=0 is now value with 0 meaning never unload.
This allows a model to override a globalTTL > 0 and be configured to
never unload.

Fixes #459
Closes #512
2026-03-01 21:02:12 -08:00
Benson Wong 390a35bf93 ui-svelte: add copy button to markdown code blocks (#537)
Add a copy-to-clipboard button that appears on hover for each code block
rendered in the chat interface assistant messages.

- Svelte action `codeBlockCopy` injects a button into every `<pre>`
element
- MutationObserver reattaches buttons as streaming content arrives
- Button shows a check icon for 2 seconds after a successful copy
- Uses clipboard API with execCommand fallback for non-secure contexts
- CSS hides button by default and reveals it on pre:hover

https://claude.ai/code/session_01PTA5ao5YQuFAS6a9juLeZW

---------

Co-authored-by: Claude <noreply@anthropic.com>
2026-03-01 09:48:56 -08:00
pdscomp 181f71ca11 .github,docker: add cuda13 architecture support (#551)
Add `cuda13` as a supported build architecture, targeting the
`ghcr.io/ggml-org/llama.cpp:server-cuda13` upstream base image.

The `server-cuda13` image ships with CUDA 13 libraries, providing
improved performance on recent NVIDIA hardware compared to the existing
`server-cuda` (CUDA 12) image. Users with newer GPUs (e.g., RTX
50-series) benefit from reduced model load latency and higher token
throughput.

- Add `cuda13` to the allowed architectures list in
`docker/build-container.sh`
- Add `cuda13` to the CI matrix in `.github/workflows/containers.yml` so
the container is built and pushed automatically
2026-03-01 09:37:08 -08:00
Benson Wong 49546e2cf2 ui: fix text size svg 2026-02-27 23:47:52 -08:00
Benson Wong 2c078964f4 Update README with additional images
Added new images for model loading and real-time log streaming sections.
2026-02-27 23:45:40 -08:00
Benson Wong 175bb36fb1 Revise README description for clarity and detail
Updated description to clarify compatibility and usage.
2026-02-27 23:42:40 -08:00
Benson Wong aedb640471 Enhance web UI section in README
Updated README to enhance the description of the web interface and added details about features like token metrics, request inspection, model management, and real-time log streaming.
2026-02-27 23:40:31 -08:00
Benson Wong 2f377f6dc6 ui: add OGG audio format support to transcription playground (#544) 2026-02-26 19:48:19 -08:00
14 changed files with 211 additions and 38 deletions
+1 -1
View File
@@ -29,7 +29,7 @@ jobs:
runs-on: ubuntu-latest
strategy:
matrix:
platform: [intel, cuda, vulkan, cpu, musa, rocm]
platform: [intel, cuda, cuda13, vulkan, cpu, musa, rocm]
fail-fast: false
steps:
- name: Checkout code
+19 -5
View File
@@ -5,7 +5,7 @@
# llama-swap
Run multiple LLM models on your machine and hot-swap between them as needed. llama-swap works with any OpenAI API-compatible server, giving you the flexibility to switch models without restarting your applications.
Run multiple generative AI models on your machine and hot-swap between them on demand. llama-swap works with any OpenAI and Anthropic API compatible server and is used by thousands of people to power their local AI workflows.
Built in Go for performance and simplicity, llama-swap has zero dependencies and is incredibly easy to set up. Get started in minutes - just one binary and one configuration file.
@@ -48,13 +48,27 @@ Built in Go for performance and simplicity, llama-swap has zero dependencies and
### Web UI
llama-swap includes a real time web interface for monitoring logs and controlling models:
llama-swap includes a real time web interface with a playground for testing out all sorts of local models:
<img width="1164" height="745" alt="image" src="https://github.com/user-attachments/assets/bacf3f9d-819f-430b-9ed2-1bfaa8d54579" />
<img width="1125" height="876" alt="image" src="https://github.com/user-attachments/assets/8ee41947-97af-463d-b0f0-8e9c478fac07" />
The Activity Page shows recent requests:
View detailed token metrics:
<img width="1111" height="515" alt="image" src="https://github.com/user-attachments/assets/64bfb280-d7a3-4126-971a-a128fd40410c" />
Inspect request and responses:
<img width="1111" height="720" alt="image" src="https://github.com/user-attachments/assets/24fe4aca-1448-4d7c-b9e8-a967589bda6c" />
Manually load and unload models:
<img width="1109" height="719" alt="image" src="https://github.com/user-attachments/assets/02b1e1f2-abd0-4050-84ae-facd66ff01c4" />
Real time log streaming:
<img width="1107" height="559" alt="image" src="https://github.com/user-attachments/assets/39669a10-cff2-409e-836a-5bad8bd0140c" />
<img width="1360" height="963" alt="image" src="https://github.com/user-attachments/assets/5f3edee6-d03a-4ae5-ae06-b20ac1f135bd" />
## Installation
+10 -4
View File
@@ -48,6 +48,12 @@
"default": 120,
"description": "Number of seconds to wait for a model to be ready to serve requests."
},
"globalTTL": {
"type": "integer",
"minimum": 0,
"default": 0,
"description": "Default TTL for all models in seconds, 0 means no TTL and models will never be automatically unloaded"
},
"logLevel": {
"type": "string",
"enum": [
@@ -177,9 +183,9 @@
},
"ttl": {
"type": "integer",
"minimum": 0,
"default": 0,
"description": "Automatically unload the model after ttl seconds. 0 disables unloading. Must be >0 to enable."
"minimum": -1,
"default": -1,
"description": "Automatically unload the model after ttl seconds. -1 uses the global TTL value, 0 disables unloading. Must be >0 to enable."
},
"useModelName": {
"type": "string",
@@ -368,4 +374,4 @@
"description": "A dictionary of remote peers and models they provide. Peers can be another llama-swap or any server that provides the /v1/ generative API endpoints supported by llama-swap."
}
}
}
}
+9 -2
View File
@@ -75,6 +75,11 @@ sendLoadingState: true
# all fields except for Id so chat UIs can use the alias equivalent to the original.
includeAliasesInList: false
# globalTTL: the default TTL in seconds before unloading a model
# - optional, default: 0 (never automatically unload)
# - must be >= 0
globalTTL: 0
# macros: a dictionary of string substitutions
# - optional, default: empty dictionary
# - macros are reusable snippets
@@ -180,8 +185,10 @@ models:
checkEndpoint: /custom-endpoint
# ttl: automatically unload the model after ttl seconds
# - optional, default: 0
# - ttl values must be a value greater than 0
# - optional, default: -1 (use global default)
# - ttl values must be a value greater than or equal to 0
# - a ttl of -1 will use the global TTL value as the default
# - a ttl of 0 will mean never unload
# - a value of 0 disables automatic unloading of the model
ttl: 60
+1 -1
View File
@@ -27,7 +27,7 @@ ARCH=$1
PUSH_IMAGES=${2:-false}
# List of allowed architectures
ALLOWED_ARCHS=("intel" "vulkan" "musa" "cuda" "cpu" "rocm")
ALLOWED_ARCHS=("intel" "vulkan" "musa" "cuda" "cuda13" "cpu" "rocm")
# Check if ARCH is in the allowed list
if [[ ! " ${ALLOWED_ARCHS[@]} " =~ " ${ARCH} " ]]; then
+15
View File
@@ -124,6 +124,7 @@ type Config struct {
LogToStdout string `yaml:"logToStdout"`
MetricsMaxInMemory int `yaml:"metricsMaxInMemory"`
CaptureBuffer int `yaml:"captureBuffer"`
GlobalTTL int `yaml:"globalTTL"`
Models map[string]ModelConfig `yaml:"models"` /* key is model ID */
Profiles map[string][]string `yaml:"profiles"`
Groups map[string]GroupConfig `yaml:"groups"` /* key is group ID */
@@ -203,6 +204,7 @@ func LoadConfigFromReader(r io.Reader) (Config, error) {
LogToStdout: LogToStdoutProxy,
MetricsMaxInMemory: 1000,
CaptureBuffer: 5,
GlobalTTL: 0,
}
if err = yaml.Unmarshal([]byte(yamlStr), &config); err != nil {
return Config{}, err
@@ -216,6 +218,10 @@ func LoadConfigFromReader(r io.Reader) (Config, error) {
return Config{}, fmt.Errorf("startPort must be greater than 1")
}
if config.GlobalTTL < 0 {
return Config{}, fmt.Errorf("globalTTL must be >= 0")
}
switch config.LogToStdout {
case LogToStdoutProxy, LogToStdoutUpstream, LogToStdoutBoth, LogToStdoutNone:
default:
@@ -255,6 +261,15 @@ func LoadConfigFromReader(r io.Reader) (Config, error) {
modelConfig.Cmd = StripComments(modelConfig.Cmd)
modelConfig.CmdStop = StripComments(modelConfig.CmdStop)
// set model TTL to globalTTL it is the default value
if modelConfig.UnloadAfter == MODEL_CONFIG_DEFAULT_TTL {
modelConfig.UnloadAfter = config.GlobalTTL
}
if modelConfig.UnloadAfter < 0 {
return Config{}, fmt.Errorf("model %s: invalid TTL value %d", modelId, modelConfig.UnloadAfter)
}
// Validate model macros
for _, macro := range modelConfig.Macros {
if err = validateMacro(macro.Name, macro.Value); err != nil {
+65
View File
@@ -848,6 +848,71 @@ func TestConfig_APIKeys_EnvMacros(t *testing.T) {
})
}
func TestConfig_GlobalTTL(t *testing.T) {
t.Run("globalTTL sets default for models", func(t *testing.T) {
content := `
globalTTL: 300
models:
model1:
cmd: server --port ${PORT}
`
config, err := LoadConfigFromReader(strings.NewReader(content))
assert.NoError(t, err)
assert.Equal(t, 300, config.GlobalTTL)
assert.Equal(t, 300, config.Models["model1"].UnloadAfter)
})
t.Run("model ttl=0 overrides globalTTL", func(t *testing.T) {
content := `
globalTTL: 300
models:
model1:
cmd: server --port ${PORT}
ttl: 0
`
config, err := LoadConfigFromReader(strings.NewReader(content))
assert.NoError(t, err)
assert.Equal(t, 0, config.Models["model1"].UnloadAfter)
})
t.Run("model explicit ttl overrides globalTTL", func(t *testing.T) {
content := `
globalTTL: 300
models:
model1:
cmd: server --port ${PORT}
ttl: 600
`
config, err := LoadConfigFromReader(strings.NewReader(content))
assert.NoError(t, err)
assert.Equal(t, 600, config.Models["model1"].UnloadAfter)
})
t.Run("globalTTL defaults to 0", func(t *testing.T) {
content := `
models:
model1:
cmd: server --port ${PORT}
`
config, err := LoadConfigFromReader(strings.NewReader(content))
assert.NoError(t, err)
assert.Equal(t, 0, config.GlobalTTL)
assert.Equal(t, 0, config.Models["model1"].UnloadAfter)
})
t.Run("negative globalTTL rejected", func(t *testing.T) {
content := `
globalTTL: -1
models:
model1:
cmd: server --port ${PORT}
`
_, err := LoadConfigFromReader(strings.NewReader(content))
assert.Error(t, err)
assert.Contains(t, err.Error(), "globalTTL must be >= 0")
})
}
func TestConfig_EnvMacros(t *testing.T) {
t.Run("basic env substitution in cmd", func(t *testing.T) {
t.Setenv("TEST_MODEL_PATH", "/opt/models")
+5 -1
View File
@@ -5,6 +5,10 @@ import (
"runtime"
)
const (
MODEL_CONFIG_DEFAULT_TTL = -1
)
type ModelConfig struct {
Cmd string `yaml:"cmd"`
CmdStop string `yaml:"cmdStop"`
@@ -47,7 +51,7 @@ func (m *ModelConfig) UnmarshalYAML(unmarshal func(interface{}) error) error {
Aliases: []string{},
Env: []string{},
CheckEndpoint: "/health",
UnloadAfter: 0,
UnloadAfter: MODEL_CONFIG_DEFAULT_TTL, // use GlobalTTL
Unlisted: false,
UseModelName: "",
ConcurrencyLimit: 0,
+10 -10
View File
@@ -117,12 +117,12 @@ func TestProcess_UnloadAfterTTL(t *testing.T) {
}
expectedMessage := "I_sense_imminent_danger"
config := getTestSimpleResponderConfig(expectedMessage)
assert.Equal(t, 0, config.UnloadAfter)
config.UnloadAfter = 3 // seconds
assert.Equal(t, 3, config.UnloadAfter)
conf := getTestSimpleResponderConfig(expectedMessage)
assert.Equal(t, config.MODEL_CONFIG_DEFAULT_TTL, conf.UnloadAfter)
conf.UnloadAfter = 3 // seconds
assert.Equal(t, 3, conf.UnloadAfter)
process := NewProcess("ttl_test", 2, config, debugLogger, debugLogger)
process := NewProcess("ttl_test", 2, conf, debugLogger, debugLogger)
defer process.Stop()
// this should take 4 seconds
@@ -159,12 +159,12 @@ func TestProcess_LowTTLValue(t *testing.T) {
t.Skip("skipping test, edit process_test.go to run it ")
}
config := getTestSimpleResponderConfig("fast_ttl")
assert.Equal(t, 0, config.UnloadAfter)
config.UnloadAfter = 1 // second
assert.Equal(t, 1, config.UnloadAfter)
conf := getTestSimpleResponderConfig("fast_ttl")
assert.Equal(t, config.MODEL_CONFIG_DEFAULT_TTL, conf.UnloadAfter)
conf.UnloadAfter = 1 // second
assert.Equal(t, 1, conf.UnloadAfter)
process := NewProcess("ttl", 2, config, debugLogger, debugLogger)
process := NewProcess("ttl", 2, conf, debugLogger, debugLogger)
defer process.Stop()
for i := 0; i < 100; i++ {
+1 -1
View File
@@ -730,7 +730,7 @@ func TestProxyManager_RunningEndpoint(t *testing.T) {
// Verify extended fields are present
assert.NotEmpty(t, response.Running[0].Cmd, "cmd should be populated")
assert.NotEmpty(t, response.Running[0].Proxy, "proxy should be populated")
assert.Equal(t, 0, response.Running[0].TTL, "ttl should default to 0")
assert.Equal(t, -1, response.Running[0].TTL, "ttl should default to -1 (use globalTTL)")
})
}
-7
View File
@@ -925,7 +925,6 @@
"integrity": "sha512-Y1Cs7hhTc+a5E9Va/xwKlAJoariQyHY+5zBgCZg4PFWNYQ1nMN9sjK1zhw1gK69DuqVP++sht/1GZg1aRwmAXQ==",
"dev": true,
"license": "MIT",
"peer": true,
"dependencies": {
"@sveltejs/vite-plugin-svelte-inspector": "^4.0.1",
"debug": "^4.4.1",
@@ -1308,7 +1307,6 @@
"integrity": "sha512-t7frlewr6+cbx+9Ohpl0NOTKXZNV9xHRmNOvql47BFJKcEG1CxtxlPEEe+gR9uhVWM4DwhnvTF110mIL4yP9RA==",
"dev": true,
"license": "MIT",
"peer": true,
"dependencies": {
"undici-types": "~7.16.0"
}
@@ -1441,7 +1439,6 @@
"resolved": "https://registry.npmjs.org/acorn/-/acorn-8.15.0.tgz",
"integrity": "sha512-NZyJarBfL7nWwIq+FDL6Zp/yHEhePMNnnJ0y3qfieCrmNvYct8uvtiV41UvlSe6apAfk0fY1FbWx+NwfmpvtTg==",
"license": "MIT",
"peer": true,
"bin": {
"acorn": "bin/acorn"
},
@@ -3452,7 +3449,6 @@
"integrity": "sha512-e5lPJi/aui4TO1LpAXIRLySmwXSE8k3b9zoGfd42p67wzxog4WHjiZF3M2uheQih4DGyc25QEV4yRBbpueNiUA==",
"dev": true,
"license": "MIT",
"peer": true,
"dependencies": {
"@types/estree": "1.0.8"
},
@@ -3565,7 +3561,6 @@
"resolved": "https://registry.npmjs.org/svelte/-/svelte-5.48.5.tgz",
"integrity": "sha512-NB3o70OxfmnE5UPyLr8uH3IV02Q43qJVAuWigYmsSOYsS0s/rHxP0TF81blG0onF/xkhNvZw4G8NfzIX+By5ZQ==",
"license": "MIT",
"peer": true,
"dependencies": {
"@jridgewell/remapping": "^2.3.4",
"@jridgewell/sourcemap-codec": "^1.5.0",
@@ -3721,7 +3716,6 @@
"integrity": "sha512-p1diW6TqL9L07nNxvRMM7hMMw4c5XOo/1ibL4aAIGmSAt9slTE1Xgw5KWuof2uTOvCg9BY7ZRi+GaF+7sfgPeQ==",
"dev": true,
"license": "Apache-2.0",
"peer": true,
"bin": {
"tsc": "bin/tsc",
"tsserver": "bin/tsserver"
@@ -3900,7 +3894,6 @@
"integrity": "sha512-+Oxm7q9hDoLMyJOYfUYBuHQo+dkAloi33apOPP56pzj+vsdJDzr+j1NISE5pyaAuKL4A3UD34qd0lx5+kfKp2g==",
"dev": true,
"license": "MIT",
"peer": true,
"dependencies": {
"esbuild": "^0.25.0",
"fdir": "^6.4.4",
+1 -1
View File
@@ -89,7 +89,7 @@
<div class="flex gap-2 items-center">
<button class="btn border-0" onclick={toggleFontSize} title="Change font size">
<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" fill="currentColor" class="w-4 h-4">
<path fill-rule="evenodd" d="M10.5 3.75a6 6 0 0 0-5.98 6.496A5.25 5.25 0 0 0 6.75 20.25H18a4.5 4.5 0 0 0 2.206-8.423 3.75 3.75 0 0 0-4.133-4.303A6.001 6.001 0 0 0 10.5 3.75Zm2.25 6a.75.75 0 0 0-1.5 0v4.94l-1.72-1.72a.75.75 0 0 0-1.06 1.06l3 3a.75.75 0 0 0 1.06 0l3-3a.75.75 0 1 0-1.06-1.06l-1.72 1.72V9.75Z" clip-rule="evenodd" />
<path d="M2 4v3h5v12h3V7h5V4H2zm19 5h-9v3h3v7h3v-7h3V9z"/>
</svg>
</button>
<button class="btn border-0" onclick={toggleWrapText} title="Toggle text wrap">
@@ -16,7 +16,7 @@
let fileInput = $state<HTMLInputElement | null>(null);
let copied = $state(false);
const ACCEPTED_FORMATS = ['.mp3', '.wav'];
const ACCEPTED_FORMATS = ['.mp3', '.wav', '.ogg'];
const MAX_FILE_SIZE = 25 * 1024 * 1024; // 25MB
let hasModels = $derived($models.some((m) => !m.unlisted));
@@ -31,7 +31,7 @@
const ext = '.' + file.name.split('.').pop()?.toLowerCase();
if (!ACCEPTED_FORMATS.includes(ext)) {
return { valid: false, error: 'Invalid file type. Accepted: MP3, WAV' };
return { valid: false, error: 'Invalid file type. Accepted: MP3, WAV, OGG' };
}
if (file.size > MAX_FILE_SIZE) {
@@ -208,7 +208,7 @@
<div>
<p class="mb-2">Drag and drop an audio file here</p>
<p class="text-sm">or use the Browse button below</p>
<p class="text-xs mt-4">Accepted formats: MP3, WAV (max 25MB)</p>
<p class="text-xs mt-4">Accepted formats: MP3, WAV, OGG (max 25MB)</p>
</div>
</div>
{/if}
@@ -218,7 +218,7 @@
<div class="shrink-0 flex gap-2">
<input
type="file"
accept=".mp3,.wav"
accept=".mp3,.wav,.ogg"
class="hidden"
onchange={handleFileSelect}
bind:this={fileInput}
@@ -116,6 +116,47 @@
cancelEdit();
}
}
const COPY_SVG = `<svg xmlns="http://www.w3.org/2000/svg" width="14" height="14" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round"><rect width="14" height="14" x="8" y="8" rx="2" ry="2"/><path d="M4 16c-1.1 0-2-.9-2-2V4c0-1.1.9-2 2-2h10c1.1 0 2 .9 2 2"/></svg>`;
const CHECK_SVG = `<svg xmlns="http://www.w3.org/2000/svg" width="14" height="14" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round"><path d="M20 6 9 17l-5-5"/></svg>`;
function codeBlockCopy(node: HTMLElement) {
function attachButtons() {
node.querySelectorAll<HTMLPreElement>('pre:not([data-copy-btn])').forEach(pre => {
pre.setAttribute('data-copy-btn', 'true');
const btn = document.createElement('button');
btn.className = 'code-copy-btn';
btn.title = 'Copy code';
btn.innerHTML = COPY_SVG;
btn.addEventListener('click', async () => {
const text = pre.querySelector('code')?.textContent ?? pre.textContent ?? '';
try {
if (navigator.clipboard && window.isSecureContext) {
await navigator.clipboard.writeText(text);
} else {
const ta = document.createElement('textarea');
ta.value = text;
ta.style.cssText = 'position:fixed;left:-9999px';
document.body.appendChild(ta);
ta.select();
document.execCommand('copy');
document.body.removeChild(ta);
}
btn.innerHTML = CHECK_SVG;
btn.classList.add('copied');
setTimeout(() => { btn.innerHTML = COPY_SVG; btn.classList.remove('copied'); }, 2000);
} catch (e) {
console.error('copy failed', e);
}
});
pre.appendChild(btn);
});
}
attachButtons();
const mo = new MutationObserver(attachButtons);
mo.observe(node, { childList: true, subtree: true });
return { destroy: () => mo.disconnect() };
}
</script>
<div class="flex {role === 'user' ? 'justify-end' : 'justify-start'} mb-4">
@@ -174,7 +215,7 @@
{#if showRaw}
<div class="whitespace-pre-wrap font-mono text-sm">{textContent}</div>
{:else}
<div class="prose prose-sm dark:prose-invert max-w-none">
<div class="prose prose-sm dark:prose-invert max-w-none" use:codeBlockCopy>
{#each renderedParts.blocks as block (block.id)}
{@html block.html}
{/each}
@@ -299,14 +340,42 @@
<style>
.prose :global(pre) {
position: relative;
background-color: var(--color-surface);
border: 1px solid var(--color-border, rgba(128, 128, 128, 0.2));
border-radius: 0.375rem;
padding: 0.75rem;
padding-right: 2.5rem;
overflow-x: auto;
margin: 0.5rem 0;
}
.prose :global(.code-copy-btn) {
position: absolute;
top: 0.375rem;
right: 0.375rem;
display: flex;
align-items: center;
justify-content: center;
padding: 0.25rem;
border-radius: 0.25rem;
border: 1px solid var(--color-border);
background: var(--color-surface);
color: var(--color-txtsecondary);
cursor: pointer;
transition: background-color 0.15s;
line-height: 0;
}
.prose :global(.code-copy-btn:hover) {
background: var(--color-secondary);
}
.prose :global(.code-copy-btn.copied) {
color: var(--color-success);
opacity: 1;
}
.prose :global(code) {
font-family: ui-monospace, SFMono-Regular, Menlo, Monaco, Consolas, monospace;
font-size: 0.875em;