Compare commits


19 Commits

Author SHA1 Message Date
Benson Wong 831a90d3b0 Add different timeout scenarios to Process.checkHealthEndpoint #276 (#278)
- add a TCP connection timeout of 500ms
- increase HTTP client timeout to 5000ms

In this new behaviour the upstream has 500ms to accept a tcp connection
and 5000ms to respond to the HTTP request.
2025-08-28 22:03:14 -07:00
Yandrik 977f1856bb add /completion endpoint (#275)
* feat: add /completion endpoint
* chore: reformat using gofmt
2025-08-28 21:41:02 -07:00
Benson Wong 52b329f7bc Fix #277 race condition in ProcessGroup.ProxyRequest when swap=true 2025-08-28 21:38:40 -07:00
Benson Wong 57803fd3aa Support llama-server's /infill endpoint (#272)
Add support for llama-server's /infill endpoint and metrics gathering on the Activities page.
2025-08-27 08:36:05 -07:00
Benson Wong c55d0cc842 Add docs for model.concurrencyLimit #263 [skip ci] 2025-08-22 16:08:37 -07:00
Benson Wong 7acbaf4712 Add connection status indicator in UI (#260)
* show connection status as icon in UI title
* make connection status event driven
2025-08-20 13:58:24 -07:00
Benson Wong fcc5ad135a UI: Allow editing of title (#246)
- make <h1> title contentEditable
- title setting persists across reloads in localStorage
2025-08-17 09:42:06 -07:00
Benson Wong 305e5a0031 improve example config [skip ci] 2025-08-17 09:19:04 -07:00
Benson Wong 04fc67354a Improve Activity event handling in the UI (#254)
Improve Activity event handling in the UI

- fixes #252, where the Activity page showed activity inconsistent
  with /api/metrics
- Change data structure for event metrics to array.
- Add event stream connection status indicator
2025-08-15 21:44:08 -07:00
Benson Wong 4662cf7699 add 'unconfirmed bug' as default label in bug-report.md 2025-08-15 15:38:12 -07:00
Benson Wong 5dc6b3e6d9 Add barebones but working implementation of model preload (#209, #235)
Add barebones but working implementation of model preload

* add config test for Preload hook
* improve TestProxyManager_StartupHooks
* docs for new hook configuration
* add a .dev to .gitignore
2025-08-14 10:27:28 -07:00
Benson Wong 74c69f39ef Add prompt processing metrics (#250)
- capture prompt processing metrics
- display prompt processing metrics on UI Activity page
2025-08-14 10:02:16 -07:00
Benson Wong a186318892 Update Readme, Add screenshot for Activities page [skip ci] 2025-08-08 13:39:46 -07:00
Benson Wong c4e4d5e1e9 Update Readme UI Screenshot [skip ci] 2025-08-08 13:33:47 -07:00
Benson Wong 7985e94ba4 add tokens processed to ui models page 2025-08-08 13:28:39 -07:00
Benson Wong 74556c3a36 Update bug-report.md [skip ci] 2025-08-08 09:52:05 -07:00
Benson Wong 5c381e4b30 Add gofmt linting to ci 2025-08-07 20:29:18 -07:00
Benson Wong 10569ed546 Fix model alias usage in upstream path (#230)
Model alias values were not properly resolved and did not work in the upstream/ path.

Related to #229.
2025-08-07 20:16:56 -07:00
Benson Wong 5b10b3c23f UI Tweaks (#228)
* sort model names in UI

* add toggle to show model id/name on UI model page
2025-08-07 11:07:03 -07:00
29 changed files with 829 additions and 230 deletions
+4 -2
@@ -1,11 +1,13 @@
 ---
 name: Bug Report
-about: Something is not working as expected...
+about: I found a defect
 title: ''
-labels: bug
+labels: 'unconfirmed bug'
 assignees: ''
 ---
+> [!IMPORTANT]
+> If you have questions about llama-swap please post in the Q&A in Discussions. Use bug reports when you've found a defect and wish to discuss a fix.
 **Describe the bug**
 A clear and concise description of what the bug is.
+7
@@ -22,6 +22,13 @@ jobs:
         with:
           go-version: '1.23'
+      # Only run in this linux based runner
+      - name: Check Formatting
+        run: |
+          if [ "$(gofmt -l . | grep -v 'event/.*_test.go' | wc -l)" -gt 0 ]; then
+            gofmt -l . | grep -v 'event/.*_test.go'
+            exit 1
+          fi
       # cache simple-responder to save the build time
       - name: Restore Simple Responder
         id: restore-simple-responder
+1
@@ -4,3 +4,4 @@ build/
 dist/
 .vscode
 .DS_Store
+.dev/
+18 -10
@@ -18,9 +18,12 @@ Written in golang, it is very easy to install (single binary with no dependencie
   - `v1/completions`
   - `v1/chat/completions`
   - `v1/embeddings`
-  - `v1/rerank`, `v1/reranking`, `rerank`
   - `v1/audio/speech` ([#36](https://github.com/mostlygeek/llama-swap/issues/36))
   - `v1/audio/transcriptions` ([docs](https://github.com/mostlygeek/llama-swap/issues/41#issuecomment-2722637867))
+- ✅ llama-server (llama.cpp) supported endpoints:
+  - `v1/rerank`, `v1/reranking`, `/rerank`
+  - `/infill` - for code infilling
+  - `/completion` - for completion endpoint
 - ✅ llama-swap custom API endpoints
   - `/ui` - web UI
   - `/log` - remote log monitoring
@@ -31,8 +34,9 @@ Written in golang, it is very easy to install (single binary with no dependencie
 - ✅ Run multiple models at once with `Groups` ([#107](https://github.com/mostlygeek/llama-swap/issues/107))
 - ✅ Automatic unloading of models after timeout by setting a `ttl`
 - ✅ Use any local OpenAI compatible server (llama.cpp, vllm, tabbyAPI, etc)
-- ✅ Docker and Podman support
+  - Reliable Docker and Podman support with `cmdStart` and `cmdStop`
 - ✅ Full control over server settings per model
+- ✅ Preload models on startup with `hooks` ([#235](https://github.com/mostlygeek/llama-swap/pull/235))
 ## How does llama-swap work?
@@ -42,9 +46,9 @@ In the most basic configuration llama-swap handles one model at a time. For more
 ## config.yaml
 llama-swap is managed entirely through a yaml configuration file.
 It can be very minimal to start:
 ```yaml
 models:
@@ -55,7 +59,7 @@ models:
       --port ${PORT}
 ```
 However, there are many more capabilities that llama-swap supports:
 - `groups` to run multiple models at once
 - `ttl` to automatically unload models
@@ -71,9 +75,13 @@ See the [configuration documentation](https://github.com/mostlygeek/llama-swap/w
 ## Web UI
-llama-swap ships with a real time web interface to monitor logs and status of models:
+llama-swap includes a real time web interface for monitoring logs and models:
-<img width="1786" height="1334" alt="image" src="https://github.com/user-attachments/assets/d6258cb9-1dad-40db-828f-2be860aec8fe" />
+<img width="1360" height="963" alt="image" src="https://github.com/user-attachments/assets/adef4a8e-de0b-49db-885a-8f6dedae6799" />
+The Activity Page shows recent requests:
+<img width="1360" height="963" alt="image" src="https://github.com/user-attachments/assets/5f3edee6-d03a-4ae5-ae06-b20ac1f135bd" />
 ## Installation
@@ -86,7 +94,7 @@ llama-swap can be installed in multiple ways
 ### Docker Install ([download images](https://github.com/mostlygeek/llama-swap/pkgs/container/llama-swap))
 Docker images with llama-swap and llama-server are built nightly.
 ```shell
 # use CPU inference comes with the example config above
@@ -133,10 +141,10 @@ $ docker run -it --rm --runtime nvidia -p 9292:8080 \
 ### Homebrew Install (macOS/Linux)
 The latest release of `llama-swap` can be installed via [Homebrew](https://brew.sh).
 ```shell
 # Set up tap and install formula
 brew tap mostlygeek/llama-swap
 brew install llama-swap
 # Run llama-swap
+63 -23
@@ -1,9 +1,17 @@
 # llama-swap YAML configuration example
 # -------------------------------------
 #
+# 💡 Tip - Use an LLM with this file!
+# ====================================
+# This example configuration is written to be LLM friendly. Try
+# copying this file into an LLM and asking it to explain or generate
+# sections for you.
+# ====================================
+# Usage notes:
 # - Below are all the available configuration options for llama-swap.
-# - Settings with a default value, or noted as optional can be omitted.
-# - Settings that are marked required must be in your configuration file
+# - Settings noted as "required" must be in your configuration file
+# - Settings noted as "optional" can be omitted
 # healthCheckTimeout: number of seconds to wait for a model to be ready to serve requests
 # - optional, default: 120
@@ -27,9 +35,9 @@ metricsMaxInMemory: 1000
 # - it is automatically incremented for every model that uses it
 startPort: 10001
-# macros: sets a dictionary of string:string pairs
+# macros: a dictionary of string substitutions
 # - optional, default: empty dictionary
-# - these are reusable snippets
+# - macros are reusable snippets
 # - used in a model's cmd, cmdStop, proxy and checkEndpoint
 # - useful for reducing common configuration settings
 macros:
@@ -92,44 +100,55 @@ models:
     # checkEndpoint: URL path to check if the server is ready
     # - optional, default: /health
-    # - use "none" to skip endpoint ready checking
     # - endpoint is expected to return an HTTP 200 response
-    # - all requests wait until the endpoint is ready (or fails)
+    # - all requests wait until the endpoint is ready or fails
+    # - use "none" to skip endpoint health checking
     checkEndpoint: /custom-endpoint
-    # ttl: automatically unload the model after this many seconds
+    # ttl: automatically unload the model after ttl seconds
     # - optional, default: 0
     # - ttl values must be a value greater than 0
     # - a value of 0 disables automatic unloading of the model
     ttl: 60
-    # useModelName: overrides the model name that is sent to upstream server
+    # useModelName: override the model name that is sent to upstream server
     # - optional, default: ""
-    # - useful when the upstream server expects a specific model name or format
+    # - useful for when the upstream server expects a specific model name that
+    #   is different from the model's ID
     useModelName: "qwen:qwq"
     # filters: a dictionary of filter settings
     # - optional, default: empty dictionary
+    # - only strip_params is currently supported
     filters:
       # strip_params: a comma separated list of parameters to remove from the request
       # - optional, default: ""
-      # - useful for preventing overriding of default server params by requests
-      # - `model` parameter is never removed
+      # - useful for server side enforcement of sampling parameters
+      # - the `model` parameter can never be removed
       # - can be any JSON key in the request body
       # - recommended to stick to sampling parameters
       strip_params: "temperature, top_p, top_k"
+    # concurrencyLimit: overrides the allowed number of active parallel requests to a model
+    # - optional, default: 0
+    # - useful for limiting the number of active parallel requests a model can process
+    # - must be set per model
+    # - any number greater than 0 will override the internal default value of 10
+    # - any requests that exceeds the limit will receive an HTTP 429 Too Many Requests response
+    # - recommended to be omitted and the default used
+    concurrencyLimit: 0
   # Unlisted model example:
   "qwen-unlisted":
-    # unlisted: true or false
+    # unlisted: boolean, true or false
     # - optional, default: false
-    # - unlisted models do not show up in /v1/models or /upstream lists
+    # - unlisted models do not show up in /v1/models api requests
     # - can be requested as normal through all apis
     unlisted: true
     cmd: llama-server --port ${PORT} -m Llama-3.2-1B-Instruct-Q4_K_M.gguf -ngl 0
   # Docker example:
-  # container run times like Docker and Podman can also be used with a
+  # container run times like Docker and Podman can be used reliably with a
   # a combination of cmd and cmdStop.
   "docker-llama":
     proxy: "http://127.0.0.1:${PORT}"
@@ -142,24 +161,26 @@ models:
     # cmdStop: command to run to stop the model gracefully
     # - optional, default: ""
     # - useful for stopping commands managed by another system
-    # - on POSIX systems: a SIGTERM is sent for graceful shutdown
-    # - on Windows, taskkill is used
-    # - processes are given 5 seconds to shutdown until they are forcefully killed
     # - the upstream's process id is available in the ${PID} macro
+    #
+    # When empty, llama-swap has this default behaviour:
+    # - on POSIX systems: a SIGTERM signal is sent
+    # - on Windows, calls taskkill to stop the process
+    # - processes have 5 seconds to shutdown until forceful termination is attempted
     cmdStop: docker stop dockertest
 # groups: a dictionary of group settings
 # - optional, default: empty dictionary
-# - provide advanced controls over model swapping behaviour.
-# - Using groups some models can be kept loaded indefinitely, while others are swapped out.
-# - model ids must be defined in the Models section
+# - provides advanced controls over model swapping behaviour
+# - using groups some models can be kept loaded indefinitely, while others are swapped out
+# - model IDs must be defined in the Models section
 # - a model can only be a member of one group
 # - group behaviour is controlled via the `swap`, `exclusive` and `persistent` fields
 # - see issue #109 for details
 #
 # NOTE: the example below uses model names that are not defined above for demonstration purposes
 groups:
-  # group1 is same as the default behaviour of llama-swap where only one model is allowed
+  # group1 works the same as the default behaviour of llama-swap where only one model is allowed
   # to run a time across the whole llama-swap instance
   "group1":
     # swap: controls the model swapping behaviour in within the group
@@ -181,10 +202,13 @@ groups:
       - "qwen-unlisted"
   # Example:
-  # - in this group all the models can run at the same time
-  # - when a different group loads all running models in this group are unloaded
+  # - in group2 all models can run at the same time
+  # - when a different group is loaded it causes all running models in this group to unload
   "group2":
     swap: false
+    # exclusive: false does not unload other groups when a model in group2 is requested
+    # - the models in group2 will be loaded but will not unload any other groups
     exclusive: false
     members:
       - "docker-llama"
@@ -207,3 +231,19 @@ groups:
       - "forever-modelA"
       - "forever-modelB"
       - "forever-modelc"
+# hooks: a dictionary of event triggers and actions
+# - optional, default: empty dictionary
+# - the only supported hook is on_startup
+hooks:
+  # on_startup: a dictionary of actions to perform on startup
+  # - optional, default: empty dictionary
+  # - the only supported action is preload
+  on_startup:
+    # preload: a list of model ids to load on startup
+    # - optional, default: empty list
+    # - model names must match keys in the models sections
+    # - when preloading multiple models at once, define a group
+    #   otherwise models will be loaded and swapped out
+    preload:
+      - "llama"
+1 -1
@@ -133,7 +133,7 @@ func main() {
 			ReloadingState: proxy.ReloadingStateStart,
 		})
 	} else if changeEvent.Name == filepath.Join(configDir, "..data") && changeEvent.Has(fsnotify.Create) {
 		// the change for k8s configmap
 		event.Emit(proxy.ConfigFileChangedEvent{
 			ReloadingState: proxy.ReloadingStateStart,
 		})
+159
@@ -0,0 +1,159 @@
package main

// created for issue: #252 https://github.com/mostlygeek/llama-swap/issues/252
// this simple benchmark tool sends a lot of small chat completion requests to llama-swap
// to make sure all the requests are accounted for.
//
// requests can be sent in parallel, and the tool will report the results.
// usage: go run main.go -baseurl http://localhost:8080/v1 -model llama3 -requests 1000 -par 5

import (
	"bytes"
	"flag"
	"fmt"
	"io"
	"log"
	"net/http"
	"os"
	"sync"
	"time"
)

func main() {
	// ----- CLI arguments ----------------------------------------------------
	var (
		baseurl         string
		modelName       string
		totalRequests   int
		parallelization int
	)

	flag.StringVar(&baseurl, "baseurl", "http://localhost:8080/v1", "Base URL of the API (e.g., https://api.example.com)")
	flag.StringVar(&modelName, "model", "", "Model name to use")
	flag.IntVar(&totalRequests, "requests", 1, "Total number of requests to send")
	flag.IntVar(&parallelization, "par", 1, "Maximum number of concurrent requests")
	flag.Parse()

	if baseurl == "" || modelName == "" {
		fmt.Println("Error: both -baseurl and -model are required.")
		flag.Usage()
		os.Exit(1)
	}

	if totalRequests <= 0 {
		fmt.Println("Error: -requests must be greater than 0.")
		os.Exit(1)
	}

	if parallelization <= 0 {
		fmt.Println("Error: -par must be greater than 0.")
		os.Exit(1)
	}

	// ----- HTTP client -------------------------------------------------------
	client := &http.Client{
		Timeout: 30 * time.Second,
	}

	// ----- Tracking response codes -------------------------------------------
	statusCounts := make(map[int]int) // map[statusCode]count
	var mu sync.Mutex                 // protects statusCounts

	// ----- Request queue (buffered channel) ----------------------------------
	requests := make(chan int, 10) // Buffered channel with capacity 10

	// Goroutine to fill the request queue
	go func() {
		for i := 0; i < totalRequests; i++ {
			requests <- i + 1
		}
		close(requests)
	}()

	// ----- Worker pool -------------------------------------------------------
	var wg sync.WaitGroup
	for i := 0; i < parallelization; i++ {
		wg.Add(1)
		go func(workerID int) {
			defer wg.Done()
			for reqID := range requests {
				// Build request payload as a single line JSON string
				payload := `{"model":"` + modelName + `","max_tokens":100,"stream":false,"messages":[{"role":"user","content":"write a snake game in python"}]}`

				// Send POST request
				req, err := http.NewRequest(http.MethodPost,
					fmt.Sprintf("%s/chat/completions", baseurl),
					bytes.NewReader([]byte(payload)))
				if err != nil {
					log.Printf("[worker %d][req %d] request creation error: %v", workerID, reqID, err)
					mu.Lock()
					statusCounts[-1]++
					mu.Unlock()
					continue
				}
				req.Header.Set("Content-Type", "application/json")

				resp, err := client.Do(req)
				if err != nil {
					log.Printf("[worker %d][req %d] HTTP request error: %v", workerID, reqID, err)
					mu.Lock()
					statusCounts[-1]++
					mu.Unlock()
					continue
				}
				io.Copy(io.Discard, resp.Body)
				resp.Body.Close()

				// Record status code
				mu.Lock()
				statusCounts[resp.StatusCode]++
				mu.Unlock()
			}
		}(i + 1)
	}

	// ----- Status ticker (prints every second) -------------------------------
	done := make(chan struct{})
	tickerDone := make(chan struct{})
	go func() {
		ticker := time.NewTicker(1 * time.Second)
		startTime := time.Now()
		for {
			select {
			case <-ticker.C:
				mu.Lock()
				// Compute how many requests have completed so far
				completed := 0
				for _, cnt := range statusCounts {
					completed += cnt
				}
				// Calculate duration and progress
				duration := time.Since(startTime)
				progress := completed * 100 / totalRequests
				fmt.Printf("Duration: %v, Completed: %d%% requests\n", duration, progress)
				mu.Unlock()
			case <-done:
				duration := time.Since(startTime)
				fmt.Printf("Duration: %v, Completed: %d%% requests\n", duration, 100)
				close(tickerDone)
				return
			}
		}
	}()

	// Wait for all workers to finish
	wg.Wait()
	close(done)  // stops the status-update goroutine
	<-tickerDone // give ticker time to finish / print

	// ----- Summary ------------------------------------------------------------
	fmt.Println("\n\n=== HTTP response code summary ===")
	mu.Lock()
	for code, cnt := range statusCounts {
		if code == -1 {
			fmt.Printf("Client-side errors (no HTTP response): %d\n", cnt)
		} else {
			fmt.Printf("%d : %d\n", code, cnt)
		}
	}
	mu.Unlock()
}
+13
@@ -153,6 +153,19 @@ func main() {
 	})
+	// llama-server compatibility: /completion
+	r.POST("/completion", func(c *gin.Context) {
+		c.Header("Content-Type", "application/json")
+		c.JSON(http.StatusOK, gin.H{
+			"responseMessage": *responseMessage,
+			"usage": gin.H{
+				"completion_tokens": 10,
+				"prompt_tokens": 25,
+				"total_tokens": 35,
+			},
+		})
+	})
 	// issue #41
 	r.POST("/v1/audio/transcriptions", func(c *gin.Context) {
 		// Parse the multipart form
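The new `/completion` route returns a fixed JSON body, so a client only needs to decode `responseMessage` and the three `usage` counters. The sketch below is a minimal illustration under two assumptions. First, it re-creates the handler with plain `net/http` rather than gin, so it has no third-party dependencies. Second, its names are invented for the example: `completionResponse`, `newFakeResponder` and `postCompletion` are all hypothetical. The request body `{"prompt":"hello"}` is arbitrary, because the route shown above never reads the body.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// completionResponse matches the body returned by the /completion route above.
type completionResponse struct {
	ResponseMessage string `json:"responseMessage"`
	Usage           struct {
		CompletionTokens int `json:"completion_tokens"`
		PromptTokens     int `json:"prompt_tokens"`
		TotalTokens      int `json:"total_tokens"`
	} `json:"usage"`
}

// newFakeResponder re-creates the route with net/http instead of gin.
func newFakeResponder(message string) *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.Method != http.MethodPost || r.URL.Path != "/completion" {
			http.NotFound(w, r)
			return
		}
		w.Header().Set("Content-Type", "application/json")
		_ = json.NewEncoder(w).Encode(map[string]any{
			"responseMessage": message,
			"usage": map[string]int{
				"completion_tokens": 10,
				"prompt_tokens":     25,
				"total_tokens":      35,
			},
		})
	}))
}

// postCompletion sends a minimal request and decodes the reply.
func postCompletion(baseURL string) (completionResponse, error) {
	var out completionResponse
	resp, err := http.Post(baseURL+"/completion", "application/json", strings.NewReader(`{"prompt":"hello"}`))
	if err != nil {
		return out, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return out, fmt.Errorf("unexpected status %d", resp.StatusCode)
	}
	err = json.NewDecoder(resp.Body).Decode(&out)
	return out, err
}

func main() {
	srv := newFakeResponder("ok")
	defer srv.Close()
	got, err := postCompletion(srv.URL)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(got.ResponseMessage, got.Usage.TotalTokens) // ok 35
}
```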
+27
@@ -138,6 +138,14 @@ func (c *GroupConfig) UnmarshalYAML(unmarshal func(interface{}) error) error {
 	return nil
 }
+type HooksConfig struct {
+	OnStartup HookOnStartup `yaml:"on_startup"`
+}
+type HookOnStartup struct {
+	Preload []string `yaml:"preload"`
+}
 type Config struct {
 	HealthCheckTimeout int `yaml:"healthCheckTimeout"`
 	LogRequests bool `yaml:"logRequests"`
@@ -155,6 +163,9 @@ type Config struct {
 	// automatic port assignments
 	StartPort int `yaml:"startPort"`
+	// hooks, see: #209
+	Hooks HooksConfig `yaml:"hooks"`
 }
 func (c *Config) RealModelName(search string) (string, bool) {
@@ -330,6 +341,22 @@ func LoadConfigFromReader(r io.Reader) (Config, error) {
 		}
 	}
+	// clean up hooks preload
+	if len(config.Hooks.OnStartup.Preload) > 0 {
+		var toPreload []string
+		for _, modelID := range config.Hooks.OnStartup.Preload {
+			modelID = strings.TrimSpace(modelID)
+			if modelID == "" {
+				continue
+			}
+			if real, found := config.RealModelName(modelID); found {
+				toPreload = append(toPreload, real)
+			}
+		}
+		config.Hooks.OnStartup.Preload = toPreload
+	}
 	return config, nil
 }
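The preload clean-up step does four things to each entry:

- it trims surrounding whitespace,
- it skips blank entries,
- it maps the name through `RealModelName`,
- it silently drops anything that does not resolve.

Below is a standalone sketch of the same filtering. It assumes `RealModelName` accepts either a model ID or an alias and returns the model ID. The lookup function and the alias `my-alias` are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// resolvePreload mirrors the "clean up hooks preload" loop: trim, skip blanks,
// resolve via the supplied lookup, and drop names that do not resolve.
func resolvePreload(ids []string, realModelName func(string) (string, bool)) []string {
	var out []string
	for _, id := range ids {
		id = strings.TrimSpace(id)
		if id == "" {
			continue
		}
		if resolved, ok := realModelName(id); ok {
			out = append(out, resolved)
		}
	}
	return out
}

func main() {
	models := map[string]bool{"llama": true, "qwen": true}
	aliases := map[string]string{"my-alias": "qwen"} // hypothetical alias
	lookup := func(name string) (string, bool) {
		if models[name] {
			return name, true
		}
		if id, ok := aliases[name]; ok {
			return id, true
		}
		return "", false
	}
	fmt.Println(resolvePreload([]string{" llama ", "", "my-alias", "missing"}, lookup)) // [llama qwen]
}
```

Because unresolved names are dropped without an error, a misspelled `preload` entry would simply not be loaded. This diff does not show whether llama-swap logs such entries anywhere else.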
+8
@@ -100,6 +100,9 @@ func TestConfig_LoadPosix(t *testing.T) {
 	content := `
 macros:
   svr-path: "path/to/server"
+hooks:
+  on_startup:
+    preload: ["model1", "model2"]
 models:
   model1:
     cmd: path/to/cmd --arg1 one
@@ -163,6 +166,11 @@ groups:
 		Macros: map[string]string{
 			"svr-path": "path/to/server",
 		},
+		Hooks: HooksConfig{
+			OnStartup: HookOnStartup{
+				Preload: []string{"model1", "model2"},
+			},
+		},
 		Models: map[string]ModelConfig{
 			"model1": {
 				Cmd: "path/to/cmd --arg1 one",
+27
@@ -0,0 +1,27 @@
package proxy

import "net/http"

// Custom discard writer that implements http.ResponseWriter but just discards everything
type DiscardWriter struct {
	header http.Header
	status int
}

func (w *DiscardWriter) Header() http.Header {
	if w.header == nil {
		w.header = make(http.Header)
	}
	return w.header
}

func (w *DiscardWriter) Write(data []byte) (int, error) {
	return len(data), nil
}

func (w *DiscardWriter) WriteHeader(code int) {
	w.status = code
}

// Satisfy the http.Flusher interface for streaming responses
func (w *DiscardWriter) Flush() {}
+10
@@ -7,6 +7,7 @@ const ChatCompletionStatsEventID = 0x02
 const ConfigFileChangedEventID = 0x03
 const LogDataEventID = 0x04
 const TokenMetricsEventID = 0x05
+const ModelPreloadedEventID = 0x06
 type ProcessStateChangeEvent struct {
 	ProcessName string
@@ -48,3 +49,12 @@ type LogDataEvent struct {
 func (e LogDataEvent) Type() uint32 {
 	return LogDataEventID
 }
+type ModelPreloadedEvent struct {
+	ModelName string
+	Success   bool
+}
+func (e ModelPreloadedEvent) Type() uint32 {
+	return ModelPreloadedEventID
+}
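Every event reports a numeric ID through `Type() uint32`, presumably so that the event bus can route it to subscribers. A subscriber interested only in the new preload outcome could use a type assertion, as in this sketch. Only `ModelPreloadedEvent` and the two ID values come from the diff. `typedEvent`, `describePreload` and `otherEvent` are hypothetical, and `otherEvent` is a stand-in for any unrelated event.

```go
package main

import "fmt"

// ID values copied from the hunk above.
const (
	TokenMetricsEventID   = 0x05
	ModelPreloadedEventID = 0x06
)

type ModelPreloadedEvent struct {
	ModelName string
	Success   bool
}

func (e ModelPreloadedEvent) Type() uint32 { return ModelPreloadedEventID }

// otherEvent is a hypothetical stand-in for an unrelated event type.
type otherEvent struct{}

func (e otherEvent) Type() uint32 { return TokenMetricsEventID }

// typedEvent is the minimal contract the events above satisfy.
type typedEvent interface{ Type() uint32 }

// describePreload turns a preload event into a log line and ignores other events.
func describePreload(e typedEvent) (string, bool) {
	p, ok := e.(ModelPreloadedEvent)
	if !ok {
		return "", false
	}
	if p.Success {
		return fmt.Sprintf("preloaded %s", p.ModelName), true
	}
	return fmt.Sprintf("failed to preload %s", p.ModelName), true
}

func main() {
	for _, e := range []typedEvent{
		ModelPreloadedEvent{ModelName: "llama", Success: true},
		otherEvent{},
		ModelPreloadedEvent{ModelName: "qwen", Success: false},
	} {
		if msg, ok := describePreload(e); ok {
			fmt.Printf("event 0x%02x: %s\n", e.Type(), msg)
		}
	}
}
```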
+5 -6
@@ -13,9 +13,10 @@ import (
 )
 var (
-	nextTestPort int = 12000
-	portMutex    sync.Mutex
-	testLogger   = NewLogMonitorWriter(os.Stdout)
+	nextTestPort        int = 12000
+	portMutex           sync.Mutex
+	testLogger          = NewLogMonitorWriter(os.Stdout)
+	simpleResponderPath = getSimpleResponderPath()
 )
 // Check if the binary exists
@@ -69,13 +70,11 @@ func getTestSimpleResponderConfig(expectedMessage string) ModelConfig {
 }
 func getTestSimpleResponderConfigPort(expectedMessage string, port int) ModelConfig {
-	binaryPath := getSimpleResponderPath()
 	// Create a YAML string with just the values we want to set
 	yamlStr := fmt.Sprintf(`
 cmd: '%s --port %d --silent --respond %s'
 proxy: "http://127.0.0.1:%d"
-`, binaryPath, port, expectedMessage, port)
+`, simpleResponderPath, port, expectedMessage, port)
 	var cfg ModelConfig
 	if err := yaml.Unmarshal([]byte(yamlStr), &cfg); err != nil {
+31 -22
@@ -5,12 +5,20 @@ import (
"fmt" "fmt"
"io" "io"
"net/http" "net/http"
"strings"
"time" "time"
"github.com/gin-gonic/gin" "github.com/gin-gonic/gin"
"github.com/tidwall/gjson" "github.com/tidwall/gjson"
) )
type MetricsRecorder struct {
metricsMonitor *MetricsMonitor
realModelName string
// isStreaming bool
startTime time.Time
}
// MetricsMiddleware sets up the MetricsResponseWriter for capturing upstream requests // MetricsMiddleware sets up the MetricsResponseWriter for capturing upstream requests
func MetricsMiddleware(pm *ProxyManager) gin.HandlerFunc { func MetricsMiddleware(pm *ProxyManager) gin.HandlerFunc {
return func(c *gin.Context) { return func(c *gin.Context) {
@@ -41,48 +49,48 @@ func MetricsMiddleware(pm *ProxyManager) gin.HandlerFunc {
metricsRecorder: &MetricsRecorder{ metricsRecorder: &MetricsRecorder{
metricsMonitor: pm.metricsMonitor, metricsMonitor: pm.metricsMonitor,
realModelName: realModelName, realModelName: realModelName,
isStreaming: gjson.GetBytes(bodyBytes, "stream").Bool(),
startTime: time.Now(), startTime: time.Now(),
}, },
} }
c.Writer = writer c.Writer = writer
c.Next() c.Next()
rec := writer.metricsRecorder // check for streaming response
rec.processBody(writer.body) if strings.Contains(c.Writer.Header().Get("Content-Type"), "text/event-stream") {
} writer.metricsRecorder.processStreamingResponse(writer.body)
} } else {
writer.metricsRecorder.processNonStreamingResponse(writer.body)
}
type MetricsRecorder struct {
metricsMonitor *MetricsMonitor
realModelName string
isStreaming bool
startTime time.Time
}
// processBody handles response processing after request completes
func (rec *MetricsRecorder) processBody(body []byte) {
if rec.isStreaming {
rec.processStreamingResponse(body)
} else {
rec.processNonStreamingResponse(body)
} }
} }
func (rec *MetricsRecorder) parseAndRecordMetrics(jsonData gjson.Result) bool { func (rec *MetricsRecorder) parseAndRecordMetrics(jsonData gjson.Result) bool {
usage := jsonData.Get("usage") usage := jsonData.Get("usage")
if !usage.Exists() { timings := jsonData.Get("timings")
if !usage.Exists() && !timings.Exists() {
return false return false
} }
// default values // default values
outputTokens := int(jsonData.Get("usage.completion_tokens").Int()) outputTokens := 0
inputTokens := int(jsonData.Get("usage.prompt_tokens").Int()) inputTokens := 0
// timings data
tokensPerSecond := -1.0 tokensPerSecond := -1.0
promptPerSecond := -1.0
durationMs := int(time.Since(rec.startTime).Milliseconds()) durationMs := int(time.Since(rec.startTime).Milliseconds())
if usage.Exists() {
outputTokens = int(jsonData.Get("usage.completion_tokens").Int())
inputTokens = int(jsonData.Get("usage.prompt_tokens").Int())
}
// use llama-server's timing data for tok/sec and duration as it is more accurate // use llama-server's timing data for tok/sec and duration as it is more accurate
if timings := jsonData.Get("timings"); timings.Exists() { if timings.Exists() {
inputTokens = int(jsonData.Get("timings.prompt_n").Int())
outputTokens = int(jsonData.Get("timings.predicted_n").Int())
promptPerSecond = jsonData.Get("timings.prompt_per_second").Float()
tokensPerSecond = jsonData.Get("timings.predicted_per_second").Float() tokensPerSecond = jsonData.Get("timings.predicted_per_second").Float()
durationMs = int(jsonData.Get("timings.prompt_ms").Float() + jsonData.Get("timings.predicted_ms").Float()) durationMs = int(jsonData.Get("timings.prompt_ms").Float() + jsonData.Get("timings.predicted_ms").Float())
} }
@@ -92,6 +100,7 @@ func (rec *MetricsRecorder) parseAndRecordMetrics(jsonData gjson.Result) bool {
 		Model: rec.realModelName,
 		InputTokens: inputTokens,
 		OutputTokens: outputTokens,
+		PromptPerSecond: promptPerSecond,
 		TokensPerSecond: tokensPerSecond,
 		DurationMs: durationMs,
 	})
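The revised parseAndRecordMetrics accepts a response that carries an OpenAI-style `usage` object, llama-server's `timings` object, or both. When `timings` is present it overrides the token counts, both speeds, and the duration. The sketch below restates that precedence using encoding/json in place of gjson. The types and the parseMetrics helper are hypothetical; only the JSON keys come from the hunk. One difference from the hunk: gjson's Exists() is true for an explicit `null`, whereas this sketch treats `null` as absent.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// usageData mirrors the OpenAI-style "usage" keys read in the hunk.
type usageData struct {
	PromptTokens     int `json:"prompt_tokens"`
	CompletionTokens int `json:"completion_tokens"`
}

// timingsData mirrors the llama-server "timings" keys read in the hunk.
type timingsData struct {
	PromptN            int     `json:"prompt_n"`
	PredictedN         int     `json:"predicted_n"`
	PromptPerSecond    float64 `json:"prompt_per_second"`
	PredictedPerSecond float64 `json:"predicted_per_second"`
	PromptMs           float64 `json:"prompt_ms"`
	PredictedMs        float64 `json:"predicted_ms"`
}

// parsedMetrics is a hypothetical holder for the values later stored in TokenMetrics.
type parsedMetrics struct {
	InputTokens, OutputTokens        int
	PromptPerSecond, TokensPerSecond float64
	DurationMs                       int
}

// parseMetrics returns ok=false when neither object is present, like the
// early return in the hunk. wallClockMs is the fallback duration.
func parseMetrics(body []byte, wallClockMs int) (parsedMetrics, bool) {
	var payload struct {
		Usage   *usageData   `json:"usage"`
		Timings *timingsData `json:"timings"`
	}
	if err := json.Unmarshal(body, &payload); err != nil {
		return parsedMetrics{}, false
	}
	if payload.Usage == nil && payload.Timings == nil {
		return parsedMetrics{}, false
	}
	// -1 marks "speed unknown", matching the defaults in the hunk
	m := parsedMetrics{PromptPerSecond: -1, TokensPerSecond: -1, DurationMs: wallClockMs}
	if u := payload.Usage; u != nil {
		m.InputTokens = u.PromptTokens
		m.OutputTokens = u.CompletionTokens
	}
	// timings, when present, override everything set above
	if t := payload.Timings; t != nil {
		m.InputTokens = t.PromptN
		m.OutputTokens = t.PredictedN
		m.PromptPerSecond = t.PromptPerSecond
		m.TokensPerSecond = t.PredictedPerSecond
		m.DurationMs = int(t.PromptMs + t.PredictedMs)
	}
	return m, true
}

func main() {
	body := []byte(`{"usage":{"prompt_tokens":10,"completion_tokens":5},
		"timings":{"prompt_n":12,"predicted_n":6,"prompt_per_second":100.5,
		"predicted_per_second":20.25,"prompt_ms":120.4,"predicted_ms":300.3}}`)
	m, ok := parseMetrics(body, 999)
	fmt.Println(ok, m.InputTokens, m.OutputTokens, m.PromptPerSecond, m.DurationMs)
}
```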
+1
@@ -15,6 +15,7 @@ type TokenMetrics struct {
 	Model string `json:"model"`
 	InputTokens int `json:"input_tokens"`
 	OutputTokens int `json:"output_tokens"`
+	PromptPerSecond float64 `json:"prompt_per_second"`
 	TokensPerSecond float64 `json:"tokens_per_second"`
 	DurationMs int `json:"duration_ms"`
 }
+12 -1
@@ -5,6 +5,7 @@ import (
 	"errors"
 	"fmt"
 	"io"
+	"net"
 	"net/http"
 	"net/url"
 	"os/exec"
@@ -363,8 +364,18 @@ func (p *Process) stopCommand() {
 }
 func (p *Process) checkHealthEndpoint(healthURL string) error {
 	client := &http.Client{
-		Timeout: 500 * time.Millisecond,
+		// wait a short time for a tcp connection to be established
+		Transport: &http.Transport{
+			DialContext: (&net.Dialer{
+				Timeout: 500 * time.Millisecond,
+			}).DialContext,
+		},
+		// give a long time to respond to the health check endpoint
+		// after the connection is established. See issue: 276
+		Timeout: 5000 * time.Millisecond,
 	}
 	req, err := http.NewRequest("GET", healthURL, nil)
+10
@@ -60,10 +60,20 @@ func (pg *ProcessGroup) ProxyRequest(modelID string, writer http.ResponseWriter,
 	if pg.swap {
 		pg.Lock()
 		if pg.lastUsedProcess != modelID {
+			// is there something already running?
 			if pg.lastUsedProcess != "" {
 				pg.processes[pg.lastUsedProcess].Stop()
 			}
+			// wait for the request to the new model to be fully handled
+			// and prevent race conditions see issue #277
+			pg.processes[modelID].ProxyRequest(writer, request)
 			pg.lastUsedProcess = modelID
+			// short circuit and exit
+			pg.Unlock()
+			return nil
 		}
 		pg.Unlock()
 	}
+34 -16
@@ -4,6 +4,7 @@ import (
 	"bytes"
 	"net/http"
 	"net/http/httptest"
+	"sync"
 	"testing"
 	"github.com/stretchr/testify/assert"
@@ -44,32 +45,49 @@ func TestProcessGroup_HasMember(t *testing.T) {
 	assert.False(t, pg.HasMember("model3"))
 }
-func TestProcessGroup_ProxyRequestSwapIsTrue(t *testing.T) {
+// TestProcessGroup_ProxyRequestSwapIsTrueParallel tests that when swap is true
+// and multiple requests are made in parallel, only one process is running at a time.
+func TestProcessGroup_ProxyRequestSwapIsTrueParallel(t *testing.T) {
+	var processGroupTestConfig = AddDefaultGroupToConfig(Config{
+		HealthCheckTimeout: 15,
+		Models: map[string]ModelConfig{
+			// use the same listening so if a model is already running, it will fail
+			// this is a way to test that swap isolation is working
+			// properly when there are parallel requests made at the
+			// same time.
+			"model1": getTestSimpleResponderConfigPort("model1", 9832),
+			"model2": getTestSimpleResponderConfigPort("model2", 9832),
+			"model3": getTestSimpleResponderConfigPort("model3", 9832),
+			"model4": getTestSimpleResponderConfigPort("model4", 9832),
+			"model5": getTestSimpleResponderConfigPort("model5", 9832),
+		},
+		Groups: map[string]GroupConfig{
+			"G1": {
+				Swap: true,
+				Members: []string{"model1", "model2", "model3", "model4", "model5"},
+			},
+		},
+	})
 	pg := NewProcessGroup("G1", processGroupTestConfig, testLogger, testLogger)
 	defer pg.StopProcesses(StopWaitForInflightRequest)
-	tests := []string{"model1", "model2"}
+	tests := []string{"model1", "model2", "model3", "model4", "model5"}
+	var wg sync.WaitGroup
+	wg.Add(len(tests))
 	for _, modelName := range tests {
-		t.Run(modelName, func(t *testing.T) {
-			reqBody := `{"x", "y"}`
-			req := httptest.NewRequest("POST", "/v1/chat/completions", bytes.NewBufferString(reqBody))
+		go func(modelName string) {
+			defer wg.Done()
+			req := httptest.NewRequest("POST", "/v1/chat/completions", nil)
 			w := httptest.NewRecorder()
 			assert.NoError(t, pg.ProxyRequest(modelName, w, req))
 			assert.Equal(t, http.StatusOK, w.Code)
 			assert.Contains(t, w.Body.String(), modelName)
-			// make sure only one process is in the running state
-			count := 0
-			for _, process := range pg.processes {
-				if process.CurrentState() == StateReady {
-					count++
-				}
-			}
-			assert.Equal(t, 1, count)
-		})
+		}(modelName)
 	}
+	wg.Wait()
 }
 func TestProcessGroup_ProxyRequestSwapIsFalse(t *testing.T) {
+43 -4
@@ -15,6 +15,7 @@ import (
 	"time"
 	"github.com/gin-gonic/gin"
+	"github.com/mostlygeek/llama-swap/event"
 	"github.com/tidwall/gjson"
 	"github.com/tidwall/sjson"
 )
@@ -96,6 +97,35 @@ func New(config Config) *ProxyManager {
 	}
 	pm.setupGinEngine()
+	// run any startup hooks
+	if len(config.Hooks.OnStartup.Preload) > 0 {
+		// do it in the background, don't block startup -- not sure if good idea yet
+		go func() {
+			discardWriter := &DiscardWriter{}
+			for _, realModelName := range config.Hooks.OnStartup.Preload {
+				proxyLogger.Infof("Preloading model: %s", realModelName)
+				processGroup, _, err := pm.swapProcessGroup(realModelName)
+				if err != nil {
+					event.Emit(ModelPreloadedEvent{
+						ModelName: realModelName,
+						Success: false,
+					})
+					proxyLogger.Errorf("Failed to preload model %s: %v", realModelName, err)
+					continue
+				} else {
+					req, _ := http.NewRequest("GET", "/", nil)
+					processGroup.ProxyRequest(realModelName, discardWriter, req)
+					event.Emit(ModelPreloadedEvent{
+						ModelName: realModelName,
+						Success: true,
+					})
+				}
+			}
+		}()
+	}
 	return pm
 }
@@ -161,11 +191,20 @@ func (pm *ProxyManager) setupGinEngine() {
 	// Support legacy /v1/completions api, see issue #12
 	pm.ginEngine.POST("/v1/completions", mm, pm.proxyOAIHandler)
-	// Support embeddings
+	// Support embeddings and reranking
 	pm.ginEngine.POST("/v1/embeddings", mm, pm.proxyOAIHandler)
+	// llama-server's /reranking endpoint + aliases
+	pm.ginEngine.POST("/reranking", mm, pm.proxyOAIHandler)
+	pm.ginEngine.POST("/rerank", mm, pm.proxyOAIHandler)
 	pm.ginEngine.POST("/v1/rerank", mm, pm.proxyOAIHandler)
 	pm.ginEngine.POST("/v1/reranking", mm, pm.proxyOAIHandler)
-	pm.ginEngine.POST("/rerank", mm, pm.proxyOAIHandler)
+	// llama-server's /infill endpoint for code infilling
+	pm.ginEngine.POST("/infill", mm, pm.proxyOAIHandler)
+	// llama-server's /completion endpoint
+	pm.ginEngine.POST("/completion", mm, pm.proxyOAIHandler)
 	// Support audio/speech endpoint
 	pm.ginEngine.POST("/v1/audio/speech", pm.proxyOAIHandler)
@@ -361,7 +400,7 @@ func (pm *ProxyManager) proxyToUpstream(c *gin.Context) {
 		return
 	}
-	processGroup, _, err := pm.swapProcessGroup(requestedModel)
+	processGroup, realModelName, err := pm.swapProcessGroup(requestedModel)
 	if err != nil {
 		pm.sendErrorResponse(c, http.StatusInternalServerError, fmt.Sprintf("error swapping process group: %s", err.Error()))
 		return
@@ -369,7 +408,7 @@ func (pm *ProxyManager) proxyToUpstream(c *gin.Context) {
 	// rewrite the path
 	c.Request.URL.Path = c.Param("upstreamPath")
-	processGroup.ProxyRequest(requestedModel, c.Writer, c.Request)
+	processGroup.ProxyRequest(realModelName, c.Writer, c.Request)
 }
 func (pm *ProxyManager) proxyOAIHandler(c *gin.Context) {
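proxyToUpstream previously passed the requested name straight to ProxyRequest. Because the group's processes are keyed by real model names, a request addressed to an alias matched no process. The fix uses the realModelName returned by swapProcessGroup, and the updated TestProxyManager_Upstream covers both `model1` and `model-alias`. Below is a hypothetical sketch of the lookup order: the alias is resolved first, then the process map is consulted. aliasTable and its methods are illustrative only.

```go
package main

import "fmt"

// aliasTable is a hypothetical stand-in for the config lookup: aliases map to
// a real model name, and processes exist only under real names.
type aliasTable struct {
	aliases map[string]string // alias -> real model name
	members map[string]bool   // real model names with a process in the group
}

// realName returns the real model name, or requested unchanged if it is not an alias.
func (a aliasTable) realName(requested string) string {
	if name, ok := a.aliases[requested]; ok {
		return name
	}
	return requested
}

// processKey resolves the name before checking membership; checking the
// requested name directly (the old behaviour) rejects every alias.
func (a aliasTable) processKey(requested string) (string, error) {
	name := a.realName(requested)
	if !a.members[name] {
		return "", fmt.Errorf("no process for model %q", requested)
	}
	return name, nil
}

func main() {
	tbl := aliasTable{
		aliases: map[string]string{"model-alias": "model1"},
		members: map[string]bool{"model1": true},
	}
	for _, requested := range []string{"model1", "model-alias", "missing"} {
		key, err := tbl.processKey(requested)
		fmt.Println(requested, key, err)
	}
}
```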
+3 -5
@@ -132,7 +132,7 @@ func (pm *ProxyManager) apiSendEvents(c *gin.Context) {
 		}
 	}
-	sendMetrics := func(metrics TokenMetrics) {
+	sendMetrics := func(metrics []TokenMetrics) {
 		jsonData, err := json.Marshal(metrics)
 		if err == nil {
 			select {
@@ -168,16 +168,14 @@ func (pm *ProxyManager) apiSendEvents(c *gin.Context) {
 	 * Send Metrics data
 	 */
 	defer event.On(func(e TokenMetricsEvent) {
-		sendMetrics(e.Metrics)
+		sendMetrics([]TokenMetrics{e.Metrics})
 	})()
 	// send initial batch of data
 	sendLogData("proxy", pm.proxyLogger.GetHistory())
 	sendLogData("upstream", pm.upstreamLogger.GetHistory())
 	sendModels()
-	for _, metrics := range pm.metricsMonitor.GetMetrics() {
-		sendMetrics(metrics)
-	}
+	sendMetrics(pm.metricsMonitor.GetMetrics())
 	for {
 		select {
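sendMetrics now takes a `[]TokenMetrics`. The stored history goes out as a single event on connect, and each live TokenMetricsEvent is wrapped in a one-element slice. The UI correspondingly parses every metrics message as an array and prepends it. Below is a sketch of that encoding with a cut-down, hypothetical metric type. One detail worth guarding: encoding/json renders a nil slice as `null` rather than `[]`. Spreading `null` in the UI's `[...newMetrics, ...prevMetrics]` would throw, so the sketch normalizes nil to an empty slice. Whether GetMetrics can actually return nil is not visible in this diff.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// metric is a hypothetical, cut-down TokenMetrics.
type metric struct {
	ID    int    `json:"id"`
	Model string `json:"model"`
}

// encodeMetrics sketches the sendMetrics payload: always a JSON array,
// whether it carries the stored history or a single new event.
func encodeMetrics(ms []metric) (string, error) {
	if ms == nil {
		ms = []metric{} // encoding/json writes a nil slice as null, not []
	}
	b, err := json.Marshal(ms)
	return string(b), err
}

func main() {
	history := []metric{{ID: 0, Model: "model1"}, {ID: 1, Model: "model2"}}
	initial, _ := encodeMetrics(history)
	live, _ := encodeMetrics([]metric{{ID: 2, Model: "model1"}}) // one TokenMetricsEvent
	empty, _ := encodeMetrics(nil)
	fmt.Println(initial)
	fmt.Println(live)
	fmt.Println(empty)
}
```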
+145 -50
@@ -9,10 +9,12 @@ import (
 	"net/http"
 	"net/http/httptest"
 	"strconv"
+	"strings"
 	"sync"
 	"testing"
 	"time"
+	"github.com/mostlygeek/llama-swap/event"
 	"github.com/stretchr/testify/assert"
 	"github.com/tidwall/gjson"
 )
@@ -40,7 +42,6 @@ func TestProxyManager_SwapProcessCorrectly(t *testing.T) {
 		assert.Contains(t, w.Body.String(), modelName)
 	}
 }
-
 func TestProxyManager_SwapMultiProcess(t *testing.T) {
 	config := AddDefaultGroupToConfig(Config{
 		HealthCheckTimeout: 15,
@@ -280,48 +281,48 @@ func TestProxyManager_ListModelsHandler(t *testing.T) {
 }
 func TestProxyManager_ListModelsHandler_SortedByID(t *testing.T) {
-	// Intentionally add models in non-sorted order and with an unlisted model
-	config := Config{
-		HealthCheckTimeout: 15,
-		Models: map[string]ModelConfig{
-			"zeta": getTestSimpleResponderConfig("zeta"),
-			"alpha": getTestSimpleResponderConfig("alpha"),
-			"beta": getTestSimpleResponderConfig("beta"),
-			"hidden": func() ModelConfig {
-				mc := getTestSimpleResponderConfig("hidden")
-				mc.Unlisted = true
-				return mc
-			}(),
-		},
-		LogLevel: "error",
-	}
-	proxy := New(config)
-	// Request models list
-	req := httptest.NewRequest("GET", "/v1/models", nil)
-	w := httptest.NewRecorder()
-	proxy.ServeHTTP(w, req)
-	assert.Equal(t, http.StatusOK, w.Code)
-	var response struct {
-		Data []map[string]interface{} `json:"data"`
-	}
-	if err := json.Unmarshal(w.Body.Bytes(), &response); err != nil {
-		t.Fatalf("Failed to parse JSON response: %v", err)
-	}
-	// We expect only the listed models in sorted order by id
-	expectedOrder := []string{"alpha", "beta", "zeta"}
-	if assert.Len(t, response.Data, len(expectedOrder), "unexpected number of listed models") {
-		got := make([]string, 0, len(response.Data))
-		for _, m := range response.Data {
-			id, _ := m["id"].(string)
-			got = append(got, id)
-		}
-		assert.Equal(t, expectedOrder, got, "models should be sorted by id ascending")
-	}
+	// Intentionally add models in non-sorted order and with an unlisted model
+	config := Config{
+		HealthCheckTimeout: 15,
+		Models: map[string]ModelConfig{
+			"zeta": getTestSimpleResponderConfig("zeta"),
+			"alpha": getTestSimpleResponderConfig("alpha"),
+			"beta": getTestSimpleResponderConfig("beta"),
+			"hidden": func() ModelConfig {
+				mc := getTestSimpleResponderConfig("hidden")
+				mc.Unlisted = true
+				return mc
+			}(),
+		},
+		LogLevel: "error",
+	}
+	proxy := New(config)
+	// Request models list
+	req := httptest.NewRequest("GET", "/v1/models", nil)
+	w := httptest.NewRecorder()
+	proxy.ServeHTTP(w, req)
+	assert.Equal(t, http.StatusOK, w.Code)
+	var response struct {
+		Data []map[string]interface{} `json:"data"`
+	}
+	if err := json.Unmarshal(w.Body.Bytes(), &response); err != nil {
+		t.Fatalf("Failed to parse JSON response: %v", err)
+	}
+	// We expect only the listed models in sorted order by id
+	expectedOrder := []string{"alpha", "beta", "zeta"}
+	if assert.Len(t, response.Data, len(expectedOrder), "unexpected number of listed models") {
+		got := make([]string, 0, len(response.Data))
+		for _, m := range response.Data {
+			id, _ := m["id"].(string)
+			got = append(got, id)
+		}
+		assert.Equal(t, expectedOrder, got, "models should be sorted by id ascending")
+	}
 }
 func TestProxyManager_Shutdown(t *testing.T) {
@@ -656,21 +657,34 @@ func TestProxyManager_CORSOptionsHandler(t *testing.T) {
 }
 func TestProxyManager_Upstream(t *testing.T) {
-	config := AddDefaultGroupToConfig(Config{
-		HealthCheckTimeout: 15,
-		Models: map[string]ModelConfig{
-			"model1": getTestSimpleResponderConfig("model1"),
-		},
-		LogLevel: "error",
-	})
+	configStr := fmt.Sprintf(`
+logLevel: error
+models:
+  model1:
+    cmd: %s -port ${PORT} -silent -respond model1
+    aliases: [model-alias]
+`, getSimpleResponderPath())
+	config, err := LoadConfigFromReader(strings.NewReader(configStr))
+	assert.NoError(t, err)
 	proxy := New(config)
 	defer proxy.StopProcesses(StopWaitForInflightRequest)
-	req := httptest.NewRequest("GET", "/upstream/model1/test", nil)
-	rec := httptest.NewRecorder()
-	proxy.ServeHTTP(rec, req)
-	assert.Equal(t, http.StatusOK, rec.Code)
-	assert.Equal(t, "model1", rec.Body.String())
+	t.Run("main model name", func(t *testing.T) {
+		req := httptest.NewRequest("GET", "/upstream/model1/test", nil)
+		rec := httptest.NewRecorder()
+		proxy.ServeHTTP(rec, req)
+		assert.Equal(t, http.StatusOK, rec.Code)
+		assert.Equal(t, "model1", rec.Body.String())
+	})
+	t.Run("model alias", func(t *testing.T) {
+		req := httptest.NewRequest("GET", "/upstream/model-alias/test", nil)
+		rec := httptest.NewRecorder()
+		proxy.ServeHTTP(rec, req)
+		assert.Equal(t, http.StatusOK, rec.Code)
+		assert.Equal(t, "model1", rec.Body.String())
+	})
 }
 func TestProxyManager_ChatContentLength(t *testing.T) {
@@ -818,3 +832,84 @@ func TestProxyManager_HealthEndpoint(t *testing.T) {
 	assert.Equal(t, http.StatusOK, rec.Code)
 	assert.Equal(t, "OK", rec.Body.String())
 }
+// Ensure the custom llama-server /completion endpoint proxies correctly
+func TestProxyManager_CompletionEndpoint(t *testing.T) {
+	config := AddDefaultGroupToConfig(Config{
+		HealthCheckTimeout: 15,
+		Models: map[string]ModelConfig{
+			"model1": getTestSimpleResponderConfig("model1"),
+		},
+		LogLevel: "error",
+	})
+	proxy := New(config)
+	defer proxy.StopProcesses(StopWaitForInflightRequest)
+	reqBody := `{"model":"model1"}`
+	req := httptest.NewRequest("POST", "/completion", bytes.NewBufferString(reqBody))
+	w := httptest.NewRecorder()
+	proxy.ServeHTTP(w, req)
+	assert.Equal(t, http.StatusOK, w.Code)
+	assert.Contains(t, w.Body.String(), "model1")
+}
+func TestProxyManager_StartupHooks(t *testing.T) {
+	// using real YAML as the configuration has gotten more complex
+	// is the right approach as LoadConfigFromReader() does a lot more
+	// than parse YAML now. Eventually migrate all tests to use this approach
+	configStr := strings.Replace(`
+logLevel: error
+hooks:
+  on_startup:
+    preload:
+      - model1
+      - model2
+groups:
+  preloadTestGroup:
+    swap: false
+    members:
+      - model1
+      - model2
+models:
+  model1:
+    cmd: ${simpleresponderpath} --port ${PORT} --silent --respond model1
+  model2:
+    cmd: ${simpleresponderpath} --port ${PORT} --silent --respond model2
+`, "${simpleresponderpath}", simpleResponderPath, -1)
+	// Create a test model configuration
+	config, err := LoadConfigFromReader(strings.NewReader(configStr))
+	if !assert.NoError(t, err, "Invalid configuration") {
+		return
+	}
+	preloadChan := make(chan ModelPreloadedEvent, 2) // buffer for 2 expected events
+	unsub := event.On(func(e ModelPreloadedEvent) {
+		preloadChan <- e
+	})
+	defer unsub()
+	// Create the proxy which should trigger preloading
+	proxy := New(config)
+	defer proxy.StopProcesses(StopWaitForInflightRequest)
+	for i := 0; i < 2; i++ {
+		select {
+		case <-preloadChan:
+		case <-time.After(5 * time.Second):
+			t.Fatal("timed out waiting for models to preload")
+		}
+	}
+	// make sure they are both loaded
+	_, foundGroup := proxy.processGroups["preloadTestGroup"]
+	if !assert.True(t, foundGroup, "preloadTestGroup should exist") {
+		return
+	}
+	assert.Equal(t, StateReady, proxy.processGroups["preloadTestGroup"].processes["model1"].CurrentState())
+	assert.Equal(t, StateReady, proxy.processGroups["preloadTestGroup"].processes["model2"].CurrentState())
+}
+62 -34
@@ -1,50 +1,78 @@
+import { useEffect, useCallback } from "react";
 import { BrowserRouter as Router, Routes, Route, Navigate, NavLink } from "react-router-dom";
 import { useTheme } from "./contexts/ThemeProvider";
-import { APIProvider } from "./contexts/APIProvider";
+import { useAPI } from "./contexts/APIProvider";
 import LogViewerPage from "./pages/LogViewer";
 import ModelPage from "./pages/Models";
 import ActivityPage from "./pages/Activity";
+import ConnectionStatusIcon from "./components/ConnectionStatus";
 import { RiSunFill, RiMoonFill } from "react-icons/ri";
 function App() {
-  const { isNarrow, toggleTheme, isDarkMode } = useTheme();
+  const { isNarrow, toggleTheme, isDarkMode, appTitle, setAppTitle, setConnectionState } = useTheme();
+  const handleTitleChange = useCallback(
+    (newTitle: string) => {
+      setAppTitle(newTitle.replace(/\n/g, "").trim().substring(0, 64) || "llama-swap");
+    },
+    [setAppTitle]
+  );
+  const { connectionStatus } = useAPI();
+  // Synchronize the window.title connections state with the actual connection state
+  useEffect(() => {
+    setConnectionState(connectionStatus);
+  }, [connectionStatus]);
   return (
     <Router basename="/ui/">
-      <APIProvider>
-        <div className="flex flex-col h-screen">
-          <nav className="bg-surface border-b border-border p-2 h-[75px]">
-            <div className="flex items-center justify-between mx-auto px-4 h-full">
-              {!isNarrow && <h1 className="flex items-center p-0">llama-swap</h1>}
-              <div className="flex items-center space-x-4">
-                <NavLink to="/" className={({ isActive }) => (isActive ? "navlink active" : "navlink")}>
-                  Logs
-                </NavLink>
-                <NavLink to="/models" className={({ isActive }) => (isActive ? "navlink active" : "navlink")}>
-                  Models
-                </NavLink>
-                <NavLink to="/activity" className={({ isActive }) => (isActive ? "navlink active" : "navlink")}>
-                  Activity
-                </NavLink>
-                <button className="" onClick={toggleTheme}>
-                  {isDarkMode ? <RiMoonFill /> : <RiSunFill />}
-                </button>
-              </div>
-            </div>
-          </nav>
-          <main className="flex-1 overflow-auto p-4">
-            <Routes>
-              <Route path="/" element={<LogViewerPage />} />
-              <Route path="/models" element={<ModelPage />} />
-              <Route path="/activity" element={<ActivityPage />} />
-              <Route path="*" element={<Navigate to="/" replace />} />
-            </Routes>
-          </main>
-        </div>
-      </APIProvider>
+      <div className="flex flex-col h-screen">
+        <nav className="bg-surface border-b border-border p-2 h-[75px]">
+          <div className="flex items-center justify-between mx-auto px-4 h-full">
+            {!isNarrow && (
+              <h1
+                contentEditable
+                suppressContentEditableWarning
+                className="flex items-center p-0 outline-none hover:bg-gray-100 dark:hover:bg-gray-700 rounded px-1"
+                onBlur={(e) => handleTitleChange(e.currentTarget.textContent || "(set title)")}
+                onKeyDown={(e) => {
+                  if (e.key === "Enter") {
+                    e.preventDefault();
+                    handleTitleChange(e.currentTarget.textContent || "(set title)");
+                    e.currentTarget.blur();
+                  }
+                }}
+              >
+                {appTitle}
+              </h1>
+            )}
+            <div className="flex items-center space-x-4">
+              <NavLink to="/" className={({ isActive }) => (isActive ? "navlink active" : "navlink")}>
+                Logs
+              </NavLink>
+              <NavLink to="/models" className={({ isActive }) => (isActive ? "navlink active" : "navlink")}>
+                Models
+              </NavLink>
+              <NavLink to="/activity" className={({ isActive }) => (isActive ? "navlink active" : "navlink")}>
+                Activity
+              </NavLink>
+              <button className="" onClick={toggleTheme}>
+                {isDarkMode ? <RiMoonFill /> : <RiSunFill />}
+              </button>
+              <ConnectionStatusIcon />
+            </div>
+          </div>
+        </nav>
+        <main className="flex-1 overflow-auto p-4">
+          <Routes>
+            <Route path="/" element={<LogViewerPage />} />
+            <Route path="/models" element={<ModelPage />} />
+            <Route path="/activity" element={<ActivityPage />} />
+            <Route path="*" element={<Navigate to="/" replace />} />
+          </Routes>
+        </main>
+      </div>
     </Router>
   );
 }
+26
@@ -0,0 +1,26 @@
+import { useAPI } from "../contexts/APIProvider";
+import { useMemo } from "react";
+const ConnectionStatusIcon = () => {
+  const { connectionStatus } = useAPI();
+  const eventStatusColor = useMemo(() => {
+    switch (connectionStatus) {
+      case "connected":
+        return "bg-green-500";
+      case "connecting":
+        return "bg-yellow-500";
+      case "disconnected":
+      default:
+        return "bg-red-500";
+    }
+  }, [connectionStatus]);
+  return (
+    <div className="flex items-center" title={`event stream: ${connectionStatus}`}>
+      <span className={`inline-block w-3 h-3 rounded-full ${eventStatusColor} mr-2`}></span>
+    </div>
+  );
+};
+export default ConnectionStatusIcon;
+28 -4
@@ -1,4 +1,5 @@
 import { useRef, createContext, useState, useContext, useEffect, useCallback, useMemo, type ReactNode } from "react";
+import type { ConnectionState } from "../lib/types";
 type ModelStatus = "ready" | "starting" | "stopping" | "stopped" | "shutdown" | "unknown";
 const LOG_LENGTH_LIMIT = 1024 * 100; /* 100KB of log data */
@@ -20,6 +21,7 @@ interface APIProviderType {
   proxyLogs: string;
   upstreamLogs: string;
   metrics: Metrics[];
+  connectionStatus: ConnectionState;
 }
 interface Metrics {
@@ -28,6 +30,7 @@ interface Metrics {
   model: string;
   input_tokens: number;
   output_tokens: number;
+  prompt_per_second: number;
   tokens_per_second: number;
   duration_ms: number;
 }
@@ -51,6 +54,7 @@ export function APIProvider({ children, autoStartAPIEvents = true }: APIProvider
   const [proxyLogs, setProxyLogs] = useState("");
   const [upstreamLogs, setUpstreamLogs] = useState("");
   const [metrics, setMetrics] = useState<Metrics[]>([]);
+  const [connectionStatus, setConnectionState] = useState<ConnectionState>("disconnected");
   const apiEventSource = useRef<EventSource | null>(null);
   const [models, setModels] = useState<Model[]>([]);
@@ -74,7 +78,20 @@ export function APIProvider({ children, autoStartAPIEvents = true }: APIProvider
     const initialDelay = 1000; // 1 second
     const connect = () => {
+      apiEventSource.current = null;
       const eventSource = new EventSource("/api/events");
+      setConnectionState("connecting");
+      eventSource.onopen = () => {
+        // clear everything out on connect to keep things in sync
+        setProxyLogs("");
+        setUpstreamLogs("");
+        setMetrics([]); // clear metrics on reconnect
+        setModels([]); // clear models on reconnect
+        apiEventSource.current = eventSource;
+        retryCount = 0;
+        setConnectionState("connected");
+      };
       eventSource.onmessage = (e: MessageEvent) => {
         try {
@@ -83,6 +100,12 @@ export function APIProvider({ children, autoStartAPIEvents = true }: APIProvider
             case "modelStatus":
               {
                 const models = JSON.parse(message.data) as Model[];
+                // sort models by name and id
+                models.sort((a, b) => {
+                  return (a.name + a.id).localeCompare(b.name + b.id);
+                });
                 setModels(models);
               }
               break;
@@ -101,9 +124,9 @@ export function APIProvider({ children, autoStartAPIEvents = true }: APIProvider
             case "metrics":
               {
-                const newMetric = JSON.parse(message.data) as Metrics;
+                const newMetrics = JSON.parse(message.data) as Metrics[];
                 setMetrics((prevMetrics) => {
-                  return [newMetric, ...prevMetrics];
+                  return [...newMetrics, ...prevMetrics];
                 });
               }
               break;
@@ -112,14 +135,14 @@ export function APIProvider({ children, autoStartAPIEvents = true }: APIProvider
           console.error(e.data, err);
         }
       };
       eventSource.onerror = () => {
         eventSource.close();
         retryCount++;
         const delay = Math.min(initialDelay * Math.pow(2, retryCount - 1), 5000);
+        setConnectionState("disconnected");
         setTimeout(connect, delay);
       };
-      apiEventSource.current = eventSource;
     };
     connect();
@@ -187,6 +210,7 @@ export function APIProvider({ children, autoStartAPIEvents = true }: APIProvider
       proxyLogs,
       upstreamLogs,
       metrics,
+      connectionStatus,
     }),
     [models, listModels, unloadAllModels, loadModel, enableAPIEvents, proxyLogs, upstreamLogs, metrics]
   );
+30 -1
@@ -1,5 +1,6 @@
 import { createContext, useContext, useEffect, type ReactNode, useMemo, useState } from "react";
 import { usePersistentState } from "../hooks/usePersistentState";
+import type { ConnectionState } from "../lib/types";
 type ScreenWidth = "xs" | "sm" | "md" | "lg" | "xl" | "2xl";
 type ThemeContextType = {
@@ -7,6 +8,11 @@ type ThemeContextType = {
   screenWidth: ScreenWidth;
   isNarrow: boolean;
   toggleTheme: () => void;
+  // for managing the window title and connection state information
+  appTitle: string;
+  setAppTitle: (title: string) => void;
+  setConnectionState: (state: ConnectionState) => void;
 };
 const ThemeContext = createContext<ThemeContextType | undefined>(undefined);
@@ -16,6 +22,17 @@ type ThemeProviderProps = {
 };
 export function ThemeProvider({ children }: ThemeProviderProps) {
+  const [appTitle, setAppTitle] = usePersistentState("app-title", "llama-swap");
+  const [connectionState, setConnectionState] = useState<ConnectionState>("disconnected");
+  /**
+   * Set the document.title with informative information
+   */
+  useEffect(() => {
+    const connectionIcon = connectionState === "connecting" ? "🟡" : connectionState === "connected" ? "🟢" : "🔴";
+    document.title = connectionIcon + " " + appTitle; // Set initial title
+  }, [appTitle, connectionState]);
   const [isDarkMode, setIsDarkMode] = usePersistentState<boolean>("theme", false);
   const [screenWidth, setScreenWidth] = useState<ScreenWidth>("md"); // Default to md
@@ -55,7 +72,19 @@ export function ThemeProvider({ children }: ThemeProviderProps) {
   }, [screenWidth]);
   return (
-    <ThemeContext.Provider value={{ isDarkMode, toggleTheme, screenWidth, isNarrow }}>{children}</ThemeContext.Provider>
+    <ThemeContext.Provider
+      value={{
+        isDarkMode,
+        toggleTheme,
+        screenWidth,
+        isNarrow,
+        appTitle,
+        setAppTitle,
+        setConnectionState,
+      }}
+    >
+      {children}
+    </ThemeContext.Provider>
   );
 }
+1
@@ -0,0 +1 @@
+export type ConnectionState = "connected" | "connecting" | "disconnected";
+4 -1
@@ -3,11 +3,14 @@ import { createRoot } from "react-dom/client";
 import "./index.css";
 import App from "./App.tsx";
 import { ThemeProvider } from "./contexts/ThemeProvider";
+import { APIProvider } from "./contexts/APIProvider";
 createRoot(document.getElementById("root")!).render(
   <StrictMode>
     <ThemeProvider>
-      <App />
+      <APIProvider>
+        <App />
+      </APIProvider>
     </ThemeProvider>
   </StrictMode>
 );
+9 -20
@@ -1,4 +1,4 @@
-import { useState, useEffect } from "react";
+import { useMemo } from "react";
 import { useAPI } from "../contexts/APIProvider";
 const formatTimestamp = (timestamp: string): string => {
@@ -15,25 +15,10 @@ const formatDuration = (ms: number): string => {
 const ActivityPage = () => {
   const { metrics } = useAPI();
-  const [error, setError] = useState<string | null>(null);
-  useEffect(() => {
-    if (metrics.length > 0) {
-      setError(null);
-    }
+  const sortedMetrics = useMemo(() => {
+    return [...metrics].sort((a, b) => b.id - a.id);
   }, [metrics]);
-  if (error) {
-    return (
-      <div className="p-6">
-        <h1 className="text-2xl font-bold mb-4">Activity</h1>
-        <div className="bg-red-50 border border-red-200 rounded-md p-4">
-          <p className="text-red-800">{error}</p>
-        </div>
-      </div>
-    );
-  }
   return (
     <div className="p-6">
       <h1 className="text-2xl font-bold mb-4">Activity</h1>
@@ -47,21 +32,25 @@ const ActivityPage = () => {
 <table className="min-w-full divide-y">
 <thead>
 <tr>
+<th className="px-4 py-3 text-left text-xs font-medium uppercase tracking-wider">Id</th>
 <th className="px-6 py-3 text-left text-xs font-medium uppercase tracking-wider">Timestamp</th>
 <th className="px-6 py-3 text-left text-xs font-medium uppercase tracking-wider">Model</th>
 <th className="px-6 py-3 text-left text-xs font-medium uppercase tracking-wider">Input Tokens</th>
 <th className="px-6 py-3 text-left text-xs font-medium uppercase tracking-wider">Output Tokens</th>
+<th className="px-6 py-3 text-left text-xs font-medium uppercase tracking-wider">Prompt Processing</th>
 <th className="px-6 py-3 text-left text-xs font-medium uppercase tracking-wider">Generation Speed</th>
 <th className="px-6 py-3 text-left text-xs font-medium uppercase tracking-wider">Duration</th>
 </tr>
 </thead>
 <tbody className="divide-y">
-{metrics.map((metric, index) => (
-<tr key={`${metric.id}-${index}`}>
+{sortedMetrics.map((metric) => (
+<tr key={`metric_${metric.id}`}>
+<td className="px-4 py-4 whitespace-nowrap text-sm">{metric.id + 1 /* un-zero index */}</td>
 <td className="px-6 py-4 whitespace-nowrap text-sm">{formatTimestamp(metric.timestamp)}</td>
 <td className="px-6 py-4 whitespace-nowrap text-sm">{metric.model}</td>
 <td className="px-6 py-4 whitespace-nowrap text-sm">{metric.input_tokens.toLocaleString()}</td>
 <td className="px-6 py-4 whitespace-nowrap text-sm">{metric.output_tokens.toLocaleString()}</td>
+<td className="px-6 py-4 whitespace-nowrap text-sm">{formatSpeed(metric.prompt_per_second)}</td>
 <td className="px-6 py-4 whitespace-nowrap text-sm">{formatSpeed(metric.tokens_per_second)}</td>
 <td className="px-6 py-4 whitespace-nowrap text-sm">{formatDuration(metric.duration_ms)}</td>
 </tr>
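The table body now iterates over `sortedMetrics`, whose definition lies outside the visible hunks. A minimal sketch of what it could look like, assuming newest-first ordering by the numeric `id` field; the `Metric` interface is a hypothetical shape limited to the fields the table reads. Copying before sorting avoids mutating the array held in API state, and keying rows by `metric.id` rather than by array index keeps React row identity stable when new events change the order.

```typescript
// Hypothetical shape of one Activity row (only the fields the table reads).
interface Metric {
  id: number; // assumed zero-based, assigned in arrival order
  timestamp: string;
  model: string;
  input_tokens: number;
  output_tokens: number;
  prompt_per_second: number;
  tokens_per_second: number;
  duration_ms: number;
}

// Returns a new array sorted newest-first by id; the input is left untouched.
function sortMetricsNewestFirst(metrics: readonly Metric[]): Metric[] {
  return [...metrics].sort((a, b) => b.id - a.id);
}

// Mirrors the `metric.id + 1` cell: display ids starting at 1 instead of 0.
function displayId(metric: Metric): number {
  return metric.id + 1;
}

// Sample rows with made-up values, arriving out of order.
const sample: Metric[] = [0, 2, 1].map((id) => ({
  id,
  timestamp: new Date(0).toISOString(),
  model: "example-model",
  input_tokens: 10,
  output_tokens: 5,
  prompt_per_second: 100,
  tokens_per_second: 20,
  duration_ms: 250,
}));

console.log(sortMetricsNewestFirst(sample).map(displayId)); // → [3, 2, 1]
```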
+47 -30
@@ -4,7 +4,7 @@ import { LogPanel } from "./LogViewer";
 import { usePersistentState } from "../hooks/usePersistentState";
 import { Panel, PanelGroup, PanelResizeHandle } from "react-resizable-panels";
 import { useTheme } from "../contexts/ThemeProvider";
-import { RiEyeFill, RiEyeOffFill, RiStopCircleLine } from "react-icons/ri";
+import { RiEyeFill, RiEyeOffFill, RiStopCircleLine, RiSwapBoxFill } from "react-icons/ri";
 export default function ModelsPage() {
 const { isNarrow } = useTheme();
@@ -40,6 +40,7 @@ function ModelsPanel() {
 const { models, loadModel, unloadAllModels } = useAPI();
 const [isUnloading, setIsUnloading] = useState(false);
 const [showUnlisted, setShowUnlisted] = usePersistentState("showUnlisted", true);
+const [showIdorName, setShowIdorName] = usePersistentState<"id" | "name">("showIdorName", "id"); // true = show ID, false = show name
 const filteredModels = useMemo(() => {
 return models.filter((model) => showUnlisted || !model.unlisted);
@@ -58,18 +59,28 @@ function ModelsPanel() {
 }
 }, [unloadAllModels]);
+const toggleIdorName = useCallback(() => {
+setShowIdorName((prev) => (prev === "name" ? "id" : "name"));
+}, [showIdorName]);
 return (
 <div className="card h-full flex flex-col">
 <div className="shrink-0">
 <h2>Models</h2>
 <div className="flex justify-between">
-<button
-className="btn flex items-center gap-2"
-onClick={() => setShowUnlisted(!showUnlisted)}
-style={{ lineHeight: "1.2" }}
->
-{showUnlisted ? <RiEyeFill /> : <RiEyeOffFill />} unlisted
-</button>
+<div className="flex gap-2">
+<button className="btn flex items-center gap-2" onClick={toggleIdorName} style={{ lineHeight: "1.2" }}>
+<RiSwapBoxFill /> {showIdorName === "id" ? "ID" : "Name"}
+</button>
+<button
+className="btn flex items-center gap-2"
+onClick={() => setShowUnlisted(!showUnlisted)}
+style={{ lineHeight: "1.2" }}
+>
+{showUnlisted ? <RiEyeFill /> : <RiEyeOffFill />} unlisted
+</button>
+</div>
 <button className="btn flex items-center gap-2" onClick={handleUnloadAllModels} disabled={isUnloading}>
 <RiStopCircleLine size="24" /> {isUnloading ? "Unloading..." : "Unload"}
 </button>
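`toggleIdorName` flips a two-valued persisted setting through a functional updater. The same logic as plain functions, for illustration only (`DisplayMode`, `nextDisplayMode` and `toggleCaption` are hypothetical names, not from the codebase):

```typescript
// The two values stored under the "showIdorName" key.
type DisplayMode = "id" | "name";

// Same logic as the updater passed to setShowIdorName: flip to the other mode.
function nextDisplayMode(prev: DisplayMode): DisplayMode {
  return prev === "name" ? "id" : "name";
}

// Caption rendered next to the swap icon.
function toggleCaption(mode: DisplayMode): string {
  return mode === "id" ? "ID" : "Name";
}

let mode: DisplayMode = "id";
mode = nextDisplayMode(mode);
console.log(mode, toggleCaption(mode)); // → name Name
```

Because the updater reads `prev` instead of the closed-over `showIdorName`, the `[showIdorName]` dependency list only makes `useCallback` return a new function after every toggle. Assuming the setter returned by `usePersistentState` is stable (as React's own `useState` setters are), `[setShowIdorName]` would be sufficient.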
@@ -80,7 +91,7 @@ function ModelsPanel() {
 <table className="w-full">
 <thead className="sticky top-0 bg-card z-10">
 <tr className="border-b border-primary bg-surface">
-<th className="text-left p-2">Name</th>
+<th className="text-left p-2">{showIdorName === "id" ? "Model ID" : "Name"}</th>
 <th className="text-left p-2"></th>
 <th className="text-left p-2">State</th>
 </tr>
@@ -90,7 +101,7 @@ function ModelsPanel() {
 <tr key={model.id} className="border-b hover:bg-secondary-hover border-border">
 <td className={`p-2 ${model.unlisted ? "text-txtsecondary" : ""}`}>
 <a href={`/upstream/${model.id}/`} className={`underline`} target="_blank">
-{model.name !== "" ? model.name : model.id}
+{showIdorName === "id" ? model.id : model.name !== "" ? model.name : model.id}
 </a>
 {model.description !== "" && (
 <p className={model.unlisted ? "text-opacity-70" : ""}>
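The name cell now decides between id and name with a nested ternary. Pulled out as a helper it reads more plainly; this is a sketch, and `ModelEntry`, `modelLabel`, `columnHeader` and the sample model ids are hypothetical.

```typescript
type DisplayMode = "id" | "name";

// Hypothetical shape: only the two fields the label logic reads.
interface ModelEntry {
  id: string;
  name: string; // "" when no display name is set (assumed from the !== "" check)
}

// Same decision as the table cell: "id" mode always shows the id; "name" mode
// shows the name and falls back to the id when the name is empty.
function modelLabel(model: ModelEntry, mode: DisplayMode): string {
  if (mode === "id") return model.id;
  return model.name !== "" ? model.name : model.id;
}

// Header text for the first column, matching the <th> in the diff.
function columnHeader(mode: DisplayMode): string {
  return mode === "id" ? "Model ID" : "Name";
}

const named: ModelEntry = { id: "qwen-coder", name: "Qwen Coder" };
const unnamed: ModelEntry = { id: "llama-8b", name: "" };
console.log(modelLabel(named, "id"), modelLabel(named, "name"), modelLabel(unnamed, "name"));
// → qwen-coder Qwen Coder llama-8b
```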
@@ -122,35 +133,41 @@ function ModelsPanel() {
 function StatsPanel() {
 const { metrics } = useAPI();
-const [totalRequests, totalTokens, avgTokensPerSecond] = useMemo(() => {
+const [totalRequests, totalInputTokens, totalOutputTokens, avgTokensPerSecond] = useMemo(() => {
 const totalRequests = metrics.length;
 if (totalRequests === 0) {
 return [0, 0, 0];
 }
-const totalTokens = metrics.reduce((sum, m) => sum + m.output_tokens, 0);
+const totalInputTokens = metrics.reduce((sum, m) => sum + m.input_tokens, 0);
+const totalOutputTokens = metrics.reduce((sum, m) => sum + m.output_tokens, 0);
 const avgTokensPerSecond = (metrics.reduce((sum, m) => sum + m.tokens_per_second, 0) / totalRequests).toFixed(2);
-return [totalRequests, totalTokens, avgTokensPerSecond];
+return [totalRequests, totalInputTokens, totalOutputTokens, avgTokensPerSecond];
 }, [metrics]);
 return (
 <div className="card">
-<h2>Chat Activity</h2>
-<table className="w-full border border-gray-200">
-<tbody>
-<tr className="border-b border-gray-200">
-<td className="py-2 px-4 font-medium border-r border-gray-200">Requests</td>
-<td className="py-2 px-4 text-right">{totalRequests}</td>
-</tr>
-<tr className="border-b border-gray-200">
-<td className="py-2 px-4 font-medium border-r border-gray-200">Total Tokens Generated</td>
-<td className="py-2 px-4 text-right">{totalTokens}</td>
-</tr>
-<tr>
-<td className="py-2 px-4 font-medium border-r border-gray-200">Average Tokens/Second</td>
-<td className="py-2 px-4 text-right">{avgTokensPerSecond}</td>
-</tr>
-</tbody>
-</table>
+<div className="rounded-lg overflow-hidden border border-gray-200">
+<table className="w-full">
+<tbody>
+<tr>
+<th className="p-2 font-medium border-b border-gray-200 text-right">Requests</th>
+<th className="p-2 font-medium border-l border-b border-gray-200 text-right">Processed</th>
+<th className="p-2 font-medium border-l border-b border-gray-200 text-right">Generated</th>
+<th className="p-2 font-medium border-l border-b border-gray-200 text-right">Tokens/Sec</th>
+</tr>
+<tr>
+<td className="p-2 text-right border-r border-gray-200">{totalRequests}</td>
+<td className="p-2 text-right border-r border-gray-200">
+{new Intl.NumberFormat().format(totalInputTokens)}
+</td>
+<td className="p-2 text-right border-r border-gray-200">
+{new Intl.NumberFormat().format(totalOutputTokens)}
+</td>
+<td className="p-2 text-right">{avgTokensPerSecond}</td>
+</tr>
+</tbody>
+</table>
+</div>
 </div>
 );
 }
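`StatsPanel` now destructures four values from `useMemo`, but the empty-metrics branch in this diff still returns `[0, 0, 0]`, so `avgTokensPerSecond` is `undefined` until the first request arrives. A sketch of the same aggregation as a plain function that always fills all four slots; `TokenMetric`, `Stats` and `summarize` are illustrative names, and the fixed `"en-US"` locale only makes the output deterministic (the component uses the browser default via `new Intl.NumberFormat()`).

```typescript
// Hypothetical shape: only the fields the stats panel aggregates.
interface TokenMetric {
  input_tokens: number;
  output_tokens: number;
  tokens_per_second: number;
}

// [requests, processed (input) tokens, generated (output) tokens, avg tokens/sec]
type Stats = [number, number, number, string];

function summarize(metrics: readonly TokenMetric[]): Stats {
  const totalRequests = metrics.length;
  if (totalRequests === 0) {
    // Fill every slot so the Tokens/Sec cell never receives undefined.
    return [0, 0, 0, "0.00"];
  }
  const totalInputTokens = metrics.reduce((sum, m) => sum + m.input_tokens, 0);
  const totalOutputTokens = metrics.reduce((sum, m) => sum + m.output_tokens, 0);
  // Unweighted mean of per-request speeds, as in the diff.
  const avg = metrics.reduce((sum, m) => sum + m.tokens_per_second, 0) / totalRequests;
  return [totalRequests, totalInputTokens, totalOutputTokens, avg.toFixed(2)];
}

const fmt = new Intl.NumberFormat("en-US");
const [requests, processed, generated, tps] = summarize([
  { input_tokens: 1200, output_tokens: 300, tokens_per_second: 25 },
  { input_tokens: 800, output_tokens: 100, tokens_per_second: 35 },
]);
console.log(requests, fmt.format(processed), fmt.format(generated), tps);
// → 2 2,000 400 30.00
```

Since the average is taken over per-request rates, a short request counts as much as a long one toward Tokens/Sec.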