Compare commits

10 Commits

| SHA1 |
|---|
| 756193d0dd |
| a6b2e930d8 |
| 9e02c22ff8 |
| 0bdbf2fdc1 |
| 49035e2e8e |
| 9963ae18bf |
| 2ae48c713b |
| 54c519e365 |
| 3fce9ee0e9 |
| 5899ae7966 |

+5 -3
@@ -17,14 +17,16 @@ builds:
       - goos: windows
         goarch: arm64
 
-# use zip format for windows
 archives:
   - id: default
-    format: tar.gz
+    formats:
+      - tar.gz
     name_template: "{{ .ProjectName }}_{{ .Version }}_{{ .Os }}_{{ .Arch }}"
     builds_info:
       group: root
       owner: root
     format_overrides:
+      # use zip format for windows
       - goos: windows
-        format: zip
+        formats:
+          - zip
@@ -29,9 +29,13 @@ test: proxy/ui_dist/placeholder.txt
 test-all: proxy/ui_dist/placeholder.txt
 	go test -v -count=1 ./proxy
 
+ui/node_modules:
+	cd ui && npm install
+
 # build react UI
-ui:
+ui: ui/node_modules
 	cd ui && npm run build
+
 # Build OSX binary
 mac: ui
 	@echo "Building Mac binary..."
@@ -22,6 +22,7 @@ Written in golang, it is very easy to install (single binary with no dependencie
   - `v1/audio/speech` ([#36](https://github.com/mostlygeek/llama-swap/issues/36))
   - `v1/audio/transcriptions` ([docs](https://github.com/mostlygeek/llama-swap/issues/41#issuecomment-2722637867))
 - ✅ llama-swap custom API endpoints
+  - `/ui` - web UI
   - `/log` - remote log monitoring
   - `/upstream/:model_id` - direct access to upstream HTTP server ([demo](https://github.com/mostlygeek/llama-swap/pull/31))
   - `/unload` - manually unload running models ([#58](https://github.com/mostlygeek/llama-swap/issues/58))
@@ -40,36 +41,40 @@ In the most basic configuration llama-swap handles one model at a time. For more
 
 ## config.yaml
 
-llama-swap's configuration is purposefully simple:
+llama-swap is managed entirely through a yaml configuration file.
 
+It can be very minimal to start:
+
 ```yaml
 models:
   "qwen2.5":
     cmd: |
-      /app/llama-server
+      /path/to/llama-server
       -hf bartowski/Qwen2.5-0.5B-Instruct-GGUF:Q4_K_M
       --port ${PORT}
-
-  "smollm2":
-    cmd: |
-      /app/llama-server
-      -hf bartowski/SmolLM2-135M-Instruct-GGUF:Q4_K_M
-      --port ${PORT}
 ```
 
-.. but also supports many advanced features:
+However, there are many more capabilities that llama-swap supports:
 
 - `groups` to run multiple models at once
-- `macros` for reusable snippets
 - `ttl` to automatically unload models
+- `macros` for reusable snippets
 - `aliases` to use familiar model names (e.g., "gpt-4o-mini")
-- `env` variables to pass custom environment to inference servers
+- `env` to pass custom environment variables to inference servers
+- `cmdStop` for to gracefully stop Docker/Podman containers
 - `useModelName` to override model names sent to upstream servers
 - `healthCheckTimeout` to control model startup wait times
 - `${PORT}` automatic port variables for dynamic port assignment
-- `cmdStop` for to gracefully stop Docker/Podman containers
 
-Check the [configuration documentation](https://github.com/mostlygeek/llama-swap/wiki/Configuration) in the wiki for all options.
+See the [configuration documentation](https://github.com/mostlygeek/llama-swap/wiki/Configuration) in the wiki all options and examples.
+
+## Web UI
+
+llama-swap ships with a web based interface to make it easier to monitor logs and check the status of models.
+
+<img width="1854" alt="image" src="https://github.com/user-attachments/assets/ee0025f0-f031-4158-9b5d-cd98b2b9fe4d" />
+
+
 
 ## Docker Install ([download images](https://github.com/mostlygeek/llama-swap/pkgs/container/llama-swap))
 
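The `${PORT}` placeholder in the `cmd` strings above is listed as an automatic port variable. As a rough illustration only, the Go sketch below expands `${PORT}` into a distinct port for each model. The `assignPorts` helper, the starting port 5800 and the sorted-by-ID allocation order are assumptions made for this example. They are not llama-swap's documented behavior.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// assignPorts is a hypothetical helper: it gives every model a distinct port,
// counting up from startPort in sorted model-ID order, and replaces the
// ${PORT} placeholder in each model's command with that port.
func assignPorts(cmds map[string]string, startPort int) map[string]string {
	ids := make([]string, 0, len(cmds))
	for id := range cmds {
		ids = append(ids, id)
	}
	sort.Strings(ids) // deterministic order; Go map iteration order is random

	expanded := make(map[string]string, len(cmds))
	for i, id := range ids {
		port := strconv.Itoa(startPort + i)
		expanded[id] = strings.ReplaceAll(cmds[id], "${PORT}", port)
	}
	return expanded
}

func main() {
	cmds := map[string]string{
		"qwen2.5": "/path/to/llama-server -hf bartowski/Qwen2.5-0.5B-Instruct-GGUF:Q4_K_M --port ${PORT}",
		"smollm2": "/path/to/llama-server -hf bartowski/SmolLM2-135M-Instruct-GGUF:Q4_K_M --port ${PORT}",
	}
	expanded := assignPorts(cmds, 5800) // 5800 is an assumed starting port
	fmt.Println(expanded["qwen2.5"])    // ... --port 5800
	fmt.Println(expanded["smollm2"])    // ... --port 5801
}
```

Sorting the IDs first keeps the assignment stable from run to run. Without the sort, the same config could map models to different ports on each start.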
@@ -120,11 +125,11 @@ $ docker run -it --rm --runtime nvidia -p 9292:8080 \
 
 ## Bare metal Install ([download](https://github.com/mostlygeek/llama-swap/releases))
 
-Pre-built binaries are available for Linux, FreeBSD and Darwin (OSX). These are automatically published and are likely a few hours ahead of the docker releases. The baremetal install works with any OpenAI compatible server, not just llama-server.
+Pre-built binaries are available for Linux, Mac, Windows and FreeBSD. These are automatically published and are likely a few hours ahead of the docker releases. The baremetal install works with any OpenAI compatible server, not just llama-server.
 
-1. Create a configuration file, see [config.example.yaml](config.example.yaml)
 1. Download a [release](https://github.com/mostlygeek/llama-swap/releases) appropriate for your OS and architecture.
-1. Run the binary with `llama-swap --config path/to/config.yaml`.
+1. Create a configuration file, see the [configuration documentation](https://github.com/mostlygeek/llama-swap/wiki/Configuration).
+1. Run the binary with `llama-swap --config path/to/config.yaml --listen localhost:8080`.
 Available flags:
 - `--config`: Path to the configuration file (default: `config.yaml`).
 - `--listen`: Address and port to listen on (default: `:8080`).
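The two flags documented in this hunk can be modeled with Go's standard `flag` package, which treats `-config` and `--config` the same way. This is a hypothetical sketch of the documented names and defaults only. `parseFlags` is not taken from llama-swap's source, and the real binary may accept other flags.

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// parseFlags is a hypothetical helper that declares the two documented flags
// with their documented defaults; it is not llama-swap's actual code.
func parseFlags(args []string) (configPath, listenAddr string, err error) {
	fs := flag.NewFlagSet("llama-swap", flag.ContinueOnError)
	cfg := fs.String("config", "config.yaml", "Path to the configuration file")
	listen := fs.String("listen", ":8080", "Address and port to listen on")
	if err = fs.Parse(args); err != nil {
		return "", "", err
	}
	return *cfg, *listen, nil
}

func main() {
	configPath, listenAddr, err := parseFlags(os.Args[1:])
	if err != nil {
		os.Exit(2) // the FlagSet has already printed the error and usage to stderr
	}
	fmt.Printf("config=%s listen=%s\n", configPath, listenAddr)
}
```

Using a dedicated `FlagSet` with `ContinueOnError` keeps parsing testable, because errors are returned instead of exiting the process.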
@@ -133,16 +138,16 @@ Pre-built binaries are available for Linux, FreeBSD and Darwin (OSX). These are
 
 ### Building from source
 
-1. Install golang for your system
+1. Build requires golang and nodejs for the user interface.
 1. `git clone git@github.com:mostlygeek/llama-swap.git`
 1. `make clean all`
 1. Binaries will be in `build/` subdirectory
 
 ## Monitoring Logs
 
-Open the `http://<host>/logs` with your browser to get a web interface with streaming logs.
+Open the `http://<host>:<port>/` with your browser to get a web interface with streaming logs.
 
-Of course, CLI access is also supported:
+CLI access is also supported:
 
 ```shell
 # sends up to the last 10KB of logs
+5 -3

@@ -189,17 +189,19 @@ func (p *Process) start() error {
 	p.waitStarting.Add(1)
 	defer p.waitStarting.Done()
 	cmdContext, ctxCancelUpstream := context.WithCancel(context.Background())
+
 	p.cmd = exec.CommandContext(cmdContext, args[0], args[1:]...)
 	p.cmd.Stdout = p.processLogger
 	p.cmd.Stderr = p.processLogger
-	p.cmd.Env = p.config.Env
-
+	p.cmd.Env = append(p.cmd.Environ(), p.config.Env...)
 	p.cmd.Cancel = p.cmdStopUpstreamProcess
 	p.cmd.WaitDelay = p.gracefulStopTimeout
 	p.cancelUpstream = ctxCancelUpstream
 	p.cmdWaitChan = make(chan struct{})
 
 	p.failedStartCount++ // this will be reset to zero when the process has successfully started
+
+	p.proxyLogger.Debugf("<%s> Executing start command: %s, env: %s", p.ID, strings.Join(args, " "), strings.Join(p.config.Env, ", "))
 	err = p.cmd.Start()
 
 	// Set process state to failed
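The `p.cmd.Env` change is the core of this commit. In Go's `os/exec`, a nil `Cmd.Env` means the child inherits the parent's environment. Any non-nil slice, even one holding only a couple of entries, replaces that environment entirely, so the old assignment left upstream servers with only the configured variables whenever `env` was set. `Cmd.Environ()` (Go 1.19+) returns the environment the command would run with, so appending to it keeps the inherited variables. The sketch below contrasts the two assignments. `oldStyle`, `newStyle` and the `upstream-server` command name are illustrative, and the command is never started.

```go
package main

import (
	"fmt"
	"os/exec"
)

// oldStyle mimics the removed line `p.cmd.Env = p.config.Env`: a non-nil
// Env slice replaces the child's environment with exactly these entries.
func oldStyle(extra []string) *exec.Cmd {
	cmd := exec.Command("upstream-server") // hypothetical name; never started
	cmd.Env = extra
	return cmd
}

// newStyle mimics `append(p.cmd.Environ(), p.config.Env...)`: start from the
// environment the child would inherit, then append the configured entries.
func newStyle(extra []string) *exec.Cmd {
	cmd := exec.Command("upstream-server") // hypothetical name; never started
	cmd.Env = append(cmd.Environ(), extra...)
	return cmd
}

// hasEntry reports whether env contains the exact KEY=value string.
func hasEntry(env []string, entry string) bool {
	for _, e := range env {
		if e == entry {
			return true
		}
	}
	return false
}

func main() {
	extra := []string{"TEST_ENV1=1", "TEST_ENV2=2"}
	before := oldStyle(extra).Environ()
	after := newStyle(extra).Environ()
	fmt.Printf("old style: %d variables\n", len(before))
	fmt.Printf("new style: %d variables, TEST_ENV1 kept: %v\n", len(after), hasEntry(after, "TEST_ENV1=1"))
}
```

On Unix-like systems the old style yields exactly the configured entries (Windows additionally injects `SYSTEMROOT`). Note that the new `Debugf` line logs only `p.config.Env`, the configured additions, not the full merged environment.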
@@ -530,7 +532,7 @@ func (p *Process) cmdStopUpstreamProcess() error {
 	stopCmd := exec.Command(stopArgs[0], stopArgs[1:]...)
 	stopCmd.Stdout = p.processLogger
 	stopCmd.Stderr = p.processLogger
-	stopCmd.Env = p.config.Env
+	stopCmd.Env = p.cmd.Env
 
 	if err := stopCmd.Run(); err != nil {
 		p.proxyLogger.Errorf("<%s> Failed to exec stop command: %v", p.ID, err)
+27 -1

@@ -394,6 +394,9 @@ func TestProcess_StopImmediately(t *testing.T) {
 // Test that SIGKILL is sent when gracefulStopTimeout is reached and properly terminates
 // the upstream command
 func TestProcess_ForceStopWithKill(t *testing.T) {
+	if runtime.GOOS == "windows" {
+		t.Skip("skipping SIGTERM test on Windows ")
+	}
 
 	expectedMessage := "test_sigkill"
 	binaryPath := getSimpleResponderPath()
@@ -405,7 +408,6 @@ func TestProcess_ForceStopWithKill(t *testing.T) {
 		Cmd: fmt.Sprintf("%s --port %d --respond %s --silent --ignore-sig-term", binaryPath, port, expectedMessage),
 		Proxy: fmt.Sprintf("http://127.0.0.1:%d", port),
 		CheckEndpoint: "/health",
-		CmdStop: "taskkill /f /t /pid ${PID}",
 	}
 
 	process := NewProcess("stop_immediate", 2, config, debugLogger, debugLogger)
@@ -465,3 +467,27 @@ func TestProcess_StopCmd(t *testing.T) {
 	process.StopImmediately()
 	assert.Equal(t, process.CurrentState(), StateStopped)
 }
+
+func TestProcess_EnvironmentSetCorrectly(t *testing.T) {
+	expectedMessage := "test_env_not_emptied"
+	config := getTestSimpleResponderConfig(expectedMessage)
+
+	// ensure that the the default config does not blank out the inherited environment
+	configWEnv := config
+
+	// ensure the additiona variables are appended to the process' environment
+	configWEnv.Env = append(configWEnv.Env, "TEST_ENV1=1", "TEST_ENV2=2")
+
+	process1 := NewProcess("env_test", 2, config, debugLogger, debugLogger)
+	process2 := NewProcess("env_test", 2, configWEnv, debugLogger, debugLogger)
+
+	process1.start()
+	defer process1.Stop()
+	process2.start()
+	defer process2.Stop()
+
+	assert.NotZero(t, len(process1.cmd.Environ()))
+	assert.NotZero(t, len(process2.cmd.Environ()))
+	assert.Equal(t, len(process1.cmd.Environ())+2, len(process2.cmd.Environ()), "process2 should have 2 more environment variables than process1")
+
+}
@@ -12,6 +12,7 @@ interface APIProviderType {
   models: Model[];
   listModels: () => Promise<Model[]>;
   unloadAllModels: () => Promise<void>;
+  loadModel: (model: string) => Promise<void>;
   enableProxyLogs: (enabled: boolean) => void;
   enableUpstreamLogs: (enabled: boolean) => void;
   enableModelUpdates: (enabled: boolean) => void;
@@ -139,11 +140,26 @@ export function APIProvider({ children }: APIProviderProps) {
     }
   }, []);
 
+  const loadModel = useCallback(async (model: string) => {
+    try {
+      const response = await fetch(`/upstream/${model}/`, {
+        method: "GET",
+      });
+      if (!response.ok) {
+        throw new Error(`Failed to load model: ${response.status}`);
+      }
+    } catch (error) {
+      console.error("Failed to load model:", error);
+      throw error; // Re-throw to let calling code handle it
+    }
+  }, []);
+
   const value = useMemo(
     () => ({
       models,
       listModels,
       unloadAllModels,
+      loadModel,
       enableProxyLogs,
       enableUpstreamLogs,
       enableModelUpdates,
@@ -154,6 +170,7 @@ export function APIProvider({ children }: APIProviderProps) {
       models,
       listModels,
       unloadAllModels,
+      loadModel,
       enableProxyLogs,
       enableUpstreamLogs,
       enableModelUpdates,
@@ -143,6 +143,10 @@
   @apply bg-surface p-2 px-4 text-sm rounded-full border border-2 transition-colors duration-200 border-btn-border;
 }
 
+.btn:hover {
+  cursor: pointer;
+}
+
 .btn--sm {
   @apply px-2 py-0.5 text-xs;
 }
@@ -3,7 +3,7 @@ import { useAPI } from "../contexts/APIProvider";
 import { LogPanel } from "./LogViewer";
 
 export default function ModelsPage() {
-  const { models, enableModelUpdates, unloadAllModels, upstreamLogs, enableUpstreamLogs } = useAPI();
+  const { models, enableModelUpdates, unloadAllModels, loadModel, upstreamLogs, enableUpstreamLogs } = useAPI();
   const [isUnloading, setIsUnloading] = useState(false);
 
   useEffect(() => {
@@ -43,6 +43,7 @@ export default function ModelsPage() {
             <thead>
               <tr className="border-b border-primary">
                 <th className="text-left p-2">Name</th>
+                <th className="text-left p-2"></th>
                 <th className="text-left p-2">State</th>
               </tr>
             </thead>
@@ -50,10 +51,13 @@ export default function ModelsPage() {
               {models.map((model) => (
                 <tr key={model.id} className="border-b hover:bg-secondary-hover border-border">
                   <td className="p-2">
-                    <a href={`/upstream/${model.id}/`} className="underline" target="top">
+                    <a href={`/upstream/${model.id}/`} className="underline" target="_blank">
                       {model.id}
                     </a>
                   </td>
+                  <td className="p-2">
+                    <button className="btn btn--sm" disabled={model.state !== "stopped"} onClick={() => loadModel(model.id)}>Load</button>
+                  </td>
                   <td className="p-2">
                     <span className={`status status--${model.state}`}>{model.state}</span>
                   </td>