Add different timeout scenarios to Process.checkHealthEndpoint #276 (#278)

- add a TCP connection timeout of 500ms
- increase HTTP client timeout to 5000ms

In this new behaviour the upstream has 500ms to accept a TCP connection and 5000ms to respond to the HTTP request.
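The two timeouts described in this commit message can be sketched in isolation as follows. This is a minimal illustration, not the project's code: `newHealthClient`, `checkHealth` and `probeLocal` are hypothetical names, and the local test server only exists to exercise the client. Note that in Go `http.Client.Timeout` bounds the entire request, including connection time, so the 5000ms budget is an upper limit on the whole exchange.

```go
package main

import (
	"fmt"
	"net"
	"net/http"
	"net/http/httptest"
	"time"
)

// newHealthClient (hypothetical helper) builds a client with two limits:
// dialTimeout bounds only the TCP connect, total bounds the whole request.
func newHealthClient(dialTimeout, total time.Duration) *http.Client {
	return &http.Client{
		Transport: &http.Transport{
			DialContext: (&net.Dialer{Timeout: dialTimeout}).DialContext,
		},
		Timeout: total,
	}
}

// checkHealth returns nil only when the endpoint answers with HTTP 200.
func checkHealth(client *http.Client, url string) error {
	resp, err := client.Get(url)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("unexpected status code: %d", resp.StatusCode)
	}
	return nil
}

// probeLocal (demo helper, an assumption of this sketch) starts a throwaway
// local server whose handler sleeps for delayMs, then health-checks it with
// a 500ms dial timeout and an overall timeout of totalMs.
func probeLocal(delayMs, totalMs int) error {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		time.Sleep(time.Duration(delayMs) * time.Millisecond)
		w.WriteHeader(http.StatusOK)
	}))
	defer srv.Close()

	client := newHealthClient(500*time.Millisecond, time.Duration(totalMs)*time.Millisecond)
	defer client.CloseIdleConnections()
	return checkHealth(client, srv.URL)
}

func main() {
	// The server accepts the connection quickly but takes a while to answer.
	fmt.Println("slow response, generous budget:", probeLocal(200, 5000))
	// With a budget shorter than the handler delay the check should fail.
	fmt.Println("slow response, tight budget failed:", probeLocal(300, 50) != nil)
}
```

Keeping the dial timeout short means a port that nobody is listening on fails fast, while an upstream that has accepted the connection but is still busy gets more time to answer.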
add /completion endpoint (#275)
2025-08-28 22:03:14 -07:00 · 2025-08-28 21:41:02 -07:00 · 2025-08-28 21:38:40 -07:00 · 2025-08-27 08:36:05 -07:00 · 2025-08-22 16:08:37 -07:00 · 2025-08-20 13:58:24 -07:00
29 changed files with 829 additions and 230 deletions
@@ -1,11 +1,13 @@
 ---
 name: Bug Report
-about: Something is not working as expected...
+about: I found a defect
 title: ''
-labels: bug
+labels: 'unconfirmed bug'
 assignees: ''
 ---
 > [!IMPORTANT]
 > If you have questions about llama-swap please post in the Q&A in Discussions. Use bug reports when you've found a defect and wish to discuss a fix.
 **Describe the bug**
 A clear and concise description of what the bug is.
@@ -22,6 +22,13 @@ jobs:
      with:
        go-version: '1.23'
    # Only run in this linux based runner
    - name: Check Formatting
      run: |
        if [ "$(gofmt -l . | grep -v 'event/.*_test.go' | wc -l)" -gt 0 ]; then
          gofmt -l . | grep -v 'event/.*_test.go'
          exit 1
        fi
    # cache simple-responder to save the build time
    - name: Restore Simple Responder
      id: restore-simple-responder
@@ -4,3 +4,4 @@ build/
 dist/
 .vscode
 .DS_Store
 .dev/
@@ -18,9 +18,12 @@ Written in golang, it is very easy to install (single binary with no dependencie
  - `v1/completions`
  - `v1/chat/completions`
  - `v1/embeddings`
  - `v1/rerank`, `v1/reranking`, `rerank`
  - `v1/audio/speech` ([#36](https://github.com/mostlygeek/llama-swap/issues/36))
  - `v1/audio/transcriptions` ([docs](https://github.com/mostlygeek/llama-swap/issues/41#issuecomment-2722637867))
 - ✅ llama-server (llama.cpp) supported endpoints:
  - `v1/rerank`, `v1/reranking`, `/rerank`
  - `/infill` - for code infilling
  - `/completion` - for completion endpoint
 - ✅ llama-swap custom API endpoints
  - `/ui` - web UI
  - `/log` - remote log monitoring
@@ -31,8 +34,9 @@ Written in golang, it is very easy to install (single binary with no dependencie
 - ✅ Run multiple models at once with `Groups` ([#107](https://github.com/mostlygeek/llama-swap/issues/107))
 - ✅ Automatic unloading of models after timeout by setting a `ttl`
 - ✅ Use any local OpenAI compatible server (llama.cpp, vllm, tabbyAPI, etc)
- ✅ Docker and Podman support
+- ✅ Reliable Docker and Podman support with `cmdStart` and `cmdStop`
 - ✅ Full control over server settings per model
 - ✅ Preload models on startup with `hooks` ([#235](https://github.com/mostlygeek/llama-swap/pull/235))
 ## How does llama-swap work?
@@ -42,9 +46,9 @@ In the most basic configuration llama-swap handles one model at a time. For more
 ## config.yaml
-llama-swap is managed entirely through a yaml configuration file. 
+llama-swap is managed entirely through a yaml configuration file.
-It can be very minimal to start: 
+It can be very minimal to start:
 ```yaml
 models:
@@ -55,7 +59,7 @@ models:
      --port ${PORT}
 ```
-However, there are many more capabilities that llama-swap supports: 
+However, there are many more capabilities that llama-swap supports:
 - `groups` to run multiple models at once
 - `ttl` to automatically unload models
@@ -71,9 +75,13 @@ See the [configuration documentation](https://github.com/mostlygeek/llama-swap/w
 ## Web UI
-llama-swap ships with a real time web interface to monitor logs and status of models:
+llama-swap includes a real time web interface for monitoring logs and models:
-<img width="1786" height="1334" alt="image" src="https://github.com/user-attachments/assets/d6258cb9-1dad-40db-828f-2be860aec8fe" />
+<img width="1360" height="963" alt="image" src="https://github.com/user-attachments/assets/adef4a8e-de0b-49db-885a-8f6dedae6799" />
 The Activity Page shows recent requests:
 <img width="1360" height="963" alt="image" src="https://github.com/user-attachments/assets/5f3edee6-d03a-4ae5-ae06-b20ac1f135bd" />
 ## Installation
@@ -86,7 +94,7 @@ llama-swap can be installed in multiple ways
 ### Docker Install ([download images](https://github.com/mostlygeek/llama-swap/pkgs/container/llama-swap))
-Docker images with llama-swap and llama-server are built nightly. 
+Docker images with llama-swap and llama-server are built nightly.
 ```shell
 # use CPU inference comes with the example config above
@@ -133,10 +141,10 @@ $ docker run -it --rm --runtime nvidia -p 9292:8080 \
 ### Homebrew Install (macOS/Linux)
-The latest release of `llama-swap` can be installed via [Homebrew](https://brew.sh). 
+The latest release of `llama-swap` can be installed via [Homebrew](https://brew.sh).
 ```shell
-# Set up tap and install formula 
+# Set up tap and install formula
 brew tap mostlygeek/llama-swap
 brew install llama-swap
 # Run llama-swap
@@ -1,9 +1,17 @@
 # llama-swap YAML configuration example
 # -------------------------------------
 #
 # 💡 Tip - Use an LLM with this file!
 # ====================================
 #  This example configuration is written to be LLM friendly. Try
 #  copying this file into an LLM and asking it to explain or generate
 #  sections for you.
 # ====================================
 # Usage notes:
 # - Below are all the available configuration options for llama-swap.
-# - Settings with a default value, or noted as optional can be omitted.
+# - Settings noted as "required" must be in your configuration file
-# - Settings that are marked required must be in your configuration file
+# - Settings noted as "optional" can be omitted
 # healthCheckTimeout: number of seconds to wait for a model to be ready to serve requests
 # - optional, default: 120
@@ -27,9 +35,9 @@ metricsMaxInMemory: 1000
 # - it is automatically incremented for every model that uses it
 startPort: 10001
-# macros: sets a dictionary of string:string pairs
+# macros: a dictionary of string substitutions
 # - optional, default: empty dictionary
-# - these are reusable snippets
+# - macros are reusable snippets
 # - used in a model's cmd, cmdStop, proxy and checkEndpoint
 # - useful for reducing common configuration settings
 macros:
@@ -92,44 +100,55 @@ models:
    # checkEndpoint: URL path to check if the server is ready
    # - optional, default: /health
    # - use "none" to skip endpoint ready checking
    # - endpoint is expected to return an HTTP 200 response
-    # - all requests wait until the endpoint is ready (or fails)
+    # - all requests wait until the endpoint is ready or fails
    # - use "none" to skip endpoint health checking
    checkEndpoint: /custom-endpoint
-    # ttl: automatically unload the model after this many seconds
+    # ttl: automatically unload the model after ttl seconds
    # - optional, default: 0
    # - ttl values must be a value greater than 0
    # - a value of 0 disables automatic unloading of the model
    ttl: 60
-    # useModelName: overrides the model name that is sent to upstream server
+    # useModelName: override the model name that is sent to upstream server
    # - optional, default: ""
-    # - useful when the upstream server expects a specific model name or format
+    # - useful for when the upstream server expects a specific model name that
    #   is different from the model's ID
    useModelName: "qwen:qwq"
    # filters: a dictionary of filter settings
    # - optional, default: empty dictionary
    # - only strip_params is currently supported
    filters:
      # strip_params: a comma separated list of parameters to remove from the request
      # - optional, default: ""
-      # - useful for preventing overriding of default server params by requests
+      # - useful for server side enforcement of sampling parameters
-      # - `model` parameter is never removed
+      # - the `model` parameter can never be removed
      # - can be any JSON key in the request body
      # - recommended to stick to sampling parameters
      strip_params: "temperature, top_p, top_k"
    # concurrencyLimit: overrides the allowed number of active parallel requests to a model
    # - optional, default: 0
    # - useful for limiting the number of active parallel requests a model can process
    # - must be set per model
    # - any number greater than 0 will override the internal default value of 10
    # - any requests that exceeds the limit will receive an HTTP 429 Too Many Requests response
    # - recommended to be omitted and the default used
    concurrencyLimit: 0
  # Unlisted model example:
  "qwen-unlisted":
-    # unlisted: true or false
+    # unlisted: boolean, true or false
    # - optional, default: false
-    # - unlisted models do not show up in /v1/models or /upstream lists
+    # - unlisted models do not show up in /v1/models api requests
    # - can be requested as normal through all apis
    unlisted: true
    cmd: llama-server --port ${PORT} -m Llama-3.2-1B-Instruct-Q4_K_M.gguf -ngl 0
  # Docker example:
-  # container run times like Docker and Podman can also be used with a
+  # container run times like Docker and Podman can be used reliably with a
  # combination of cmd and cmdStop.
  "docker-llama":
    proxy: "http://127.0.0.1:${PORT}"
@@ -142,24 +161,26 @@ models:
    # cmdStop: command to run to stop the model gracefully
    # - optional, default: ""
    # - useful for stopping commands managed by another system
    # - on POSIX systems: a SIGTERM is sent for graceful shutdown
    # - on Windows, taskkill is used
    # - processes are given 5 seconds to shutdown until they are forcefully killed
    # - the upstream's process id is available in the ${PID} macro
    #
    # When empty, llama-swap has this default behaviour:
    # - on POSIX systems: a SIGTERM signal is sent
    # - on Windows, calls taskkill to stop the process
    # - processes have 5 seconds to shutdown until forceful termination is attempted
    cmdStop: docker stop dockertest
 # groups: a dictionary of group settings
 # - optional, default: empty dictionary
-# - provide advanced controls over model swapping behaviour.
+# - provides advanced controls over model swapping behaviour
-# - Using groups some models can be kept loaded indefinitely, while others are swapped out.
+# - using groups some models can be kept loaded indefinitely, while others are swapped out
-# - model ids must be defined in the Models section
+# - model IDs must be defined in the Models section
 # - a model can only be a member of one group
 # - group behaviour is controlled via the `swap`, `exclusive` and `persistent` fields
 # - see issue #109 for details
 #
 # NOTE: the example below uses model names that are not defined above for demonstration purposes
 groups:
-  # group1 is same as the default behaviour of llama-swap where only one model is allowed
+  # group1 works the same as the default behaviour of llama-swap where only one model is allowed
  # to run at a time across the whole llama-swap instance
  "group1":
    # swap: controls the model swapping behaviour within the group
@@ -181,10 +202,13 @@ groups:
      - "qwen-unlisted"
  # Example:
-  # - in this group all the models can run at the same time
+  # - in group2 all models can run at the same time
-  # - when a different group loads all running models in this group are unloaded
+  # - when a different group is loaded it causes all running models in this group to unload
  "group2":
    swap: false
    # exclusive: false does not unload other groups when a model in group2 is requested
    # - the models in group2 will be loaded but will not unload any other groups
    exclusive: false
    members:
      - "docker-llama"
@@ -207,3 +231,19 @@ groups:
      - "forever-modelA"
      - "forever-modelB"
      - "forever-modelc"
 # hooks: a dictionary of event triggers and actions
 # - optional, default: empty dictionary
 # - the only supported hook is on_startup
 hooks:
  # on_startup: a dictionary of actions to perform on startup
  # - optional, default: empty dictionary
  # - the only supported action is preload
  on_startup:
        # preload: a list of model ids to load on startup
        # - optional, default: empty list
        # - model names must match keys in the models sections
        # - when preloading multiple models at once, define a group
        #   otherwise models will be loaded and swapped out
    preload:
      - "llama"
@@ -133,7 +133,7 @@ func main() {
 							ReloadingState: proxy.ReloadingStateStart,
 						})
 					} else if changeEvent.Name == filepath.Join(configDir, "..data") && changeEvent.Has(fsnotify.Create) {
-						// the change for k8s configmap 
+						// the change for k8s configmap
 						event.Emit(proxy.ConfigFileChangedEvent{
 							ReloadingState: proxy.ReloadingStateStart,
 						})
@@ -0,0 +1,159 @@
 package main
 // created for issue: #252 https://github.com/mostlygeek/llama-swap/issues/252
 // this simple benchmark tool sends a lot of small chat completion requests to llama-swap
 // to make sure all the requests are accounted for.
 //
 // requests can be sent in parallel, and the tool will report the results.
 // usage: go run main.go -baseurl http://localhost:8080/v1 -model llama3 -requests 1000 -par 5
 import (
 	"bytes"
 	"flag"
 	"fmt"
 	"io"
 	"log"
 	"net/http"
 	"os"
 	"sync"
 	"time"
 )
 func main() {
 	// ----- CLI arguments ----------------------------------------------------
 	var (
 		baseurl         string
 		modelName       string
 		totalRequests   int
 		parallelization int
 	)
 	flag.StringVar(&baseurl, "baseurl", "http://localhost:8080/v1", "Base URL of the API (e.g., https://api.example.com)")
 	flag.StringVar(&modelName, "model", "", "Model name to use")
 	flag.IntVar(&totalRequests, "requests", 1, "Total number of requests to send")
 	flag.IntVar(&parallelization, "par", 1, "Maximum number of concurrent requests")
 	flag.Parse()
 	if baseurl == "" || modelName == "" {
 		fmt.Println("Error: both -baseurl and -model are required.")
 		flag.Usage()
 		os.Exit(1)
 	}
 	if totalRequests <= 0 {
 		fmt.Println("Error: -requests must be greater than 0.")
 		os.Exit(1)
 	}
 	if parallelization <= 0 {
 		fmt.Println("Error: -par must be greater than 0.")
 		os.Exit(1)
 	}
 	// ----- HTTP client -------------------------------------------------------
 	client := &http.Client{
 		Timeout: 30 * time.Second,
 	}
 	// ----- Tracking response codes -------------------------------------------
 	statusCounts := make(map[int]int) // map[statusCode]count
 	var mu sync.Mutex                 // protects statusCounts
 	// ----- Request queue (buffered channel) ----------------------------------
 	requests := make(chan int, 10) // Buffered channel with capacity 10
 	// Goroutine to fill the request queue
 	go func() {
 		for i := 0; i < totalRequests; i++ {
 			requests <- i + 1
 		}
 		close(requests)
 	}()
 	// ----- Worker pool -------------------------------------------------------
 	var wg sync.WaitGroup
 	for i := 0; i < parallelization; i++ {
 		wg.Add(1)
 		go func(workerID int) {
 			defer wg.Done()
 			for reqID := range requests {
 				// Build request payload as a single line JSON string
 				payload := `{"model":"` + modelName + `","max_tokens":100,"stream":false,"messages":[{"role":"user","content":"write a snake game in python"}]}`
 				// Send POST request
 				req, err := http.NewRequest(http.MethodPost,
 					fmt.Sprintf("%s/chat/completions", baseurl),
 					bytes.NewReader([]byte(payload)))
 				if err != nil {
 					log.Printf("[worker %d][req %d] request creation error: %v", workerID, reqID, err)
 					mu.Lock()
 					statusCounts[-1]++
 					mu.Unlock()
 					continue
 				}
 				req.Header.Set("Content-Type", "application/json")
 				resp, err := client.Do(req)
 				if err != nil {
 					log.Printf("[worker %d][req %d] HTTP request error: %v", workerID, reqID, err)
 					mu.Lock()
 					statusCounts[-1]++
 					mu.Unlock()
 					continue
 				}
 				io.Copy(io.Discard, resp.Body)
 				resp.Body.Close()
 				// Record status code
 				mu.Lock()
 				statusCounts[resp.StatusCode]++
 				mu.Unlock()
 			}
 		}(i + 1)
 	}
 	// ----- Status ticker (prints every second) -------------------------------
 	done := make(chan struct{})
 	tickerDone := make(chan struct{})
 	go func() {
 		ticker := time.NewTicker(1 * time.Second)
 		startTime := time.Now()
 		for {
 			select {
 			case <-ticker.C:
 				mu.Lock()
 				// Compute how many requests have completed so far
 				completed := 0
 				for _, cnt := range statusCounts {
 					completed += cnt
 				}
 				// Calculate duration and progress
 				duration := time.Since(startTime)
 				progress := completed * 100 / totalRequests
 				fmt.Printf("Duration: %v, Completed: %d%% requests\n", duration, progress)
 				mu.Unlock()
 			case <-done:
 				duration := time.Since(startTime)
 				fmt.Printf("Duration: %v, Completed: %d%% requests\n", duration, 100)
 				close(tickerDone)
 				return
 			}
 		}
 	}()
 	// Wait for all workers to finish
 	wg.Wait()
 	close(done)  // stops the status-update goroutine
 	<-tickerDone // give ticker time to finish / print
 	// ----- Summary ------------------------------------------------------------
 	fmt.Println("\n\n=== HTTP response code summary ===")
 	mu.Lock()
 	for code, cnt := range statusCounts {
 		if code == -1 {
 			fmt.Printf("Client-side errors (no HTTP response): %d\n", cnt)
 		} else {
 			fmt.Printf("%d : %d\n", code, cnt)
 		}
 	}
 	mu.Unlock()
 }
@@ -153,6 +153,19 @@ func main() {
 	})
 	// llama-server compatibility: /completion
 	r.POST("/completion", func(c *gin.Context) {
 		c.Header("Content-Type", "application/json")
 		c.JSON(http.StatusOK, gin.H{
 			"responseMessage": *responseMessage,
 			"usage": gin.H{
 				"completion_tokens": 10,
 				"prompt_tokens":     25,
 				"total_tokens":      35,
 			},
 		})
 	})
 	// issue #41
 	r.POST("/v1/audio/transcriptions", func(c *gin.Context) {
 		// Parse the multipart form
@@ -138,6 +138,14 @@ func (c *GroupConfig) UnmarshalYAML(unmarshal func(interface{}) error) error {
 	return nil
 }
 type HooksConfig struct {
 	OnStartup HookOnStartup `yaml:"on_startup"`
 }
 type HookOnStartup struct {
 	Preload []string `yaml:"preload"`
 }
 type Config struct {
 	HealthCheckTimeout int                    `yaml:"healthCheckTimeout"`
 	LogRequests        bool                   `yaml:"logRequests"`
@@ -155,6 +163,9 @@ type Config struct {
 	// automatic port assignments
 	StartPort int `yaml:"startPort"`
 	// hooks, see: #209
 	Hooks HooksConfig `yaml:"hooks"`
 }
 func (c *Config) RealModelName(search string) (string, bool) {
@@ -330,6 +341,22 @@ func LoadConfigFromReader(r io.Reader) (Config, error) {
 		}
 	}
 	// clean up hooks preload
 	if len(config.Hooks.OnStartup.Preload) > 0 {
 		var toPreload []string
 		for _, modelID := range config.Hooks.OnStartup.Preload {
 			modelID = strings.TrimSpace(modelID)
 			if modelID == "" {
 				continue
 			}
 			if real, found := config.RealModelName(modelID); found {
 				toPreload = append(toPreload, real)
 			}
 		}
 		config.Hooks.OnStartup.Preload = toPreload
 	}
 	return config, nil
 }
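The preload clean-up in the hunk above can be sketched as a standalone function. This is an illustration under assumptions: `cleanPreload`, `demoModels` and `demoResolve` are hypothetical names, and `demoResolve` merely stands in for `Config.RealModelName`; whether the real lookup resolves aliases is assumed here, since the diff does not show its body.

```go
package main

import (
	"fmt"
	"strings"
)

// demoModels is a made-up lookup table: "m1" is treated as an alias of "model1".
var demoModels = map[string]string{
	"model1": "model1",
	"m1":     "model1",
	"model2": "model2",
}

// demoResolve stands in for Config.RealModelName in this sketch.
func demoResolve(name string) (string, bool) {
	real, ok := demoModels[name]
	return real, ok
}

// cleanPreload trims each entry, skips blanks, and keeps only names that
// resolve to a known model. Unknown names are dropped without an error and
// duplicates are kept, mirroring the loop in the hunk above.
func cleanPreload(ids []string, resolve func(string) (string, bool)) []string {
	var out []string
	for _, id := range ids {
		id = strings.TrimSpace(id)
		if id == "" {
			continue
		}
		if real, ok := resolve(id); ok {
			out = append(out, real)
		}
	}
	return out
}

func main() {
	in := []string{" model1 ", "", "m1", "missing", "model2"}
	fmt.Println(cleanPreload(in, demoResolve))
}
```

Because unknown names are dropped silently, a typo in `preload` would not surface as a configuration error in this scheme; the model would simply not be loaded at startup.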
@@ -100,6 +100,9 @@ func TestConfig_LoadPosix(t *testing.T) {
 	content := `
 macros:
  svr-path: "path/to/server"
 hooks:
  on_startup:
    preload: ["model1", "model2"]
 models:
  model1:
    cmd: path/to/cmd --arg1 one
@@ -163,6 +166,11 @@ groups:
 		Macros: map[string]string{
 			"svr-path": "path/to/server",
 		},
 		Hooks: HooksConfig{
 			OnStartup: HookOnStartup{
 				Preload: []string{"model1", "model2"},
 			},
 		},
 		Models: map[string]ModelConfig{
 			"model1": {
 				Cmd:           "path/to/cmd --arg1 one",
@@ -0,0 +1,27 @@
 package proxy
 import "net/http"
 // Custom discard writer that implements http.ResponseWriter but just discards everything
 type DiscardWriter struct {
 	header http.Header
 	status int
 }
 func (w *DiscardWriter) Header() http.Header {
 	if w.header == nil {
 		w.header = make(http.Header)
 	}
 	return w.header
 }
 func (w *DiscardWriter) Write(data []byte) (int, error) {
 	return len(data), nil
 }
 func (w *DiscardWriter) WriteHeader(code int) {
 	w.status = code
 }
 // Satisfy the http.Flusher interface for streaming responses
 func (w *DiscardWriter) Flush() {}
@@ -7,6 +7,7 @@ const ChatCompletionStatsEventID = 0x02
 const ConfigFileChangedEventID = 0x03
 const LogDataEventID = 0x04
 const TokenMetricsEventID = 0x05
 const ModelPreloadedEventID = 0x06
 type ProcessStateChangeEvent struct {
 	ProcessName string
@@ -48,3 +49,12 @@ type LogDataEvent struct {
 func (e LogDataEvent) Type() uint32 {
 	return LogDataEventID
 }
 type ModelPreloadedEvent struct {
 	ModelName string
 	Success   bool
 }
 func (e ModelPreloadedEvent) Type() uint32 {
 	return ModelPreloadedEventID
 }
@@ -13,9 +13,10 @@ import (
 )
 var (
-	nextTestPort int = 12000
+	nextTestPort        int = 12000
-	portMutex    sync.Mutex
+	portMutex           sync.Mutex
-	testLogger   = NewLogMonitorWriter(os.Stdout)
+	testLogger          = NewLogMonitorWriter(os.Stdout)
 	simpleResponderPath = getSimpleResponderPath()
 )
 // Check if the binary exists
@@ -69,13 +70,11 @@ func getTestSimpleResponderConfig(expectedMessage string) ModelConfig {
 }
 func getTestSimpleResponderConfigPort(expectedMessage string, port int) ModelConfig {
 	binaryPath := getSimpleResponderPath()
 	// Create a YAML string with just the values we want to set
 	yamlStr := fmt.Sprintf(`
 cmd: '%s --port %d --silent --respond %s'
 proxy: "http://127.0.0.1:%d"
-`, binaryPath, port, expectedMessage, port)
+`, simpleResponderPath, port, expectedMessage, port)
 	var cfg ModelConfig
 	if err := yaml.Unmarshal([]byte(yamlStr), &cfg); err != nil {
@@ -5,12 +5,20 @@ import (
 	"fmt"
 	"io"
 	"net/http"
 	"strings"
 	"time"
 	"github.com/gin-gonic/gin"
 	"github.com/tidwall/gjson"
 )
 type MetricsRecorder struct {
 	metricsMonitor *MetricsMonitor
 	realModelName  string
 	//	isStreaming    bool
 	startTime time.Time
 }
 // MetricsMiddleware sets up the MetricsResponseWriter for capturing upstream requests
 func MetricsMiddleware(pm *ProxyManager) gin.HandlerFunc {
 	return func(c *gin.Context) {
@@ -41,48 +49,48 @@ func MetricsMiddleware(pm *ProxyManager) gin.HandlerFunc {
 			metricsRecorder: &MetricsRecorder{
 				metricsMonitor: pm.metricsMonitor,
 				realModelName:  realModelName,
 				isStreaming:    gjson.GetBytes(bodyBytes, "stream").Bool(),
 				startTime:      time.Now(),
 			},
 		}
 		c.Writer = writer
 		c.Next()
-		rec := writer.metricsRecorder
-		rec.processBody(writer.body)
+		// check for streaming response
+		if strings.Contains(c.Writer.Header().Get("Content-Type"), "text/event-stream") {
+			writer.metricsRecorder.processStreamingResponse(writer.body)
+		} else {
+			writer.metricsRecorder.processNonStreamingResponse(writer.body)
+		}
 	}
 }
-type MetricsRecorder struct {
-	metricsMonitor *MetricsMonitor
-	realModelName  string
-	isStreaming    bool
-	startTime      time.Time
-}
-// processBody handles response processing after request completes
-func (rec *MetricsRecorder) processBody(body []byte) {
-	if rec.isStreaming {
-		rec.processStreamingResponse(body)
-	} else {
-		rec.processNonStreamingResponse(body)
-	}
-}
 func (rec *MetricsRecorder) parseAndRecordMetrics(jsonData gjson.Result) bool {
 	usage := jsonData.Get("usage")
-	if !usage.Exists() {
+	timings := jsonData.Get("timings")
+	if !usage.Exists() && !timings.Exists() {
 		return false
 	}
 	// default values
-	outputTokens := int(jsonData.Get("usage.completion_tokens").Int())
+	outputTokens := 0
-	inputTokens := int(jsonData.Get("usage.prompt_tokens").Int())
+	inputTokens := 0
 	// timings data
 	tokensPerSecond := -1.0
 	promptPerSecond := -1.0
 	durationMs := int(time.Since(rec.startTime).Milliseconds())
 	if usage.Exists() {
 		outputTokens = int(jsonData.Get("usage.completion_tokens").Int())
 		inputTokens = int(jsonData.Get("usage.prompt_tokens").Int())
 	}
 	// use llama-server's timing data for tok/sec and duration as it is more accurate
-	if timings := jsonData.Get("timings"); timings.Exists() {
+	if timings.Exists() {
 		inputTokens = int(jsonData.Get("timings.prompt_n").Int())
 		outputTokens = int(jsonData.Get("timings.predicted_n").Int())
 		promptPerSecond = jsonData.Get("timings.prompt_per_second").Float()
 		tokensPerSecond = jsonData.Get("timings.predicted_per_second").Float()
 		durationMs = int(jsonData.Get("timings.prompt_ms").Float() + jsonData.Get("timings.predicted_ms").Float())
 	}
@@ -92,6 +100,7 @@ func (rec *MetricsRecorder) parseAndRecordMetrics(jsonData gjson.Result) bool {
 		Model:           rec.realModelName,
 		InputTokens:     inputTokens,
 		OutputTokens:    outputTokens,
 		PromptPerSecond: promptPerSecond,
 		TokensPerSecond: tokensPerSecond,
 		DurationMs:      durationMs,
 	})
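The fallback order in `parseAndRecordMetrics` (start from zero, fill from `usage` when present, then let llama-server's `timings` override) can be sketched with only the standard library. This is an approximation rather than the project's code: it uses `encoding/json` in place of gjson, the type and function names are hypothetical, and a JSON `null` is treated as absent here, which may differ from gjson's `Exists()`.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// usageBlock mirrors the OpenAI-style "usage" object.
type usageBlock struct {
	PromptTokens     int `json:"prompt_tokens"`
	CompletionTokens int `json:"completion_tokens"`
}

// timingsBlock mirrors the llama-server "timings" fields read in the hunk above.
type timingsBlock struct {
	PromptN            int     `json:"prompt_n"`
	PredictedN         int     `json:"predicted_n"`
	PromptPerSecond    float64 `json:"prompt_per_second"`
	PredictedPerSecond float64 `json:"predicted_per_second"`
}

type responseBody struct {
	Usage   *usageBlock   `json:"usage"`
	Timings *timingsBlock `json:"timings"`
}

// tokenCounts is a hypothetical result type for this sketch.
type tokenCounts struct {
	Input, Output                    int
	PromptPerSecond, TokensPerSecond float64
}

// extract returns false when the body has neither "usage" nor "timings".
func extract(body []byte) (tokenCounts, bool) {
	var r responseBody
	if err := json.Unmarshal(body, &r); err != nil {
		return tokenCounts{}, false
	}
	if r.Usage == nil && r.Timings == nil {
		return tokenCounts{}, false
	}
	// -1 marks rates that the upstream did not report
	tc := tokenCounts{PromptPerSecond: -1, TokensPerSecond: -1}
	if r.Usage != nil {
		tc.Input = r.Usage.PromptTokens
		tc.Output = r.Usage.CompletionTokens
	}
	// timings, when present, take precedence over usage
	if r.Timings != nil {
		tc.Input = r.Timings.PromptN
		tc.Output = r.Timings.PredictedN
		tc.PromptPerSecond = r.Timings.PromptPerSecond
		tc.TokensPerSecond = r.Timings.PredictedPerSecond
	}
	return tc, true
}

func main() {
	usageOnly := []byte(`{"usage":{"prompt_tokens":25,"completion_tokens":10}}`)
	timingsOnly := []byte(`{"timings":{"prompt_n":7,"predicted_n":3,"prompt_per_second":100.5,"predicted_per_second":20.25}}`)
	for _, body := range [][]byte{usageOnly, timingsOnly} {
		tc, ok := extract(body)
		fmt.Printf("%+v %v\n", tc, ok)
	}
}
```

Accepting a body that carries only `timings` is what lets responses from llama-server's native `/completion` endpoint, which may omit `usage`, still be recorded.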
@@ -15,6 +15,7 @@ type TokenMetrics struct {
 	Model           string    `json:"model"`
 	InputTokens     int       `json:"input_tokens"`
 	OutputTokens    int       `json:"output_tokens"`
 	PromptPerSecond float64   `json:"prompt_per_second"`
 	TokensPerSecond float64   `json:"tokens_per_second"`
 	DurationMs      int       `json:"duration_ms"`
 }
@@ -5,6 +5,7 @@ import (
 	"errors"
 	"fmt"
 	"io"
 	"net"
 	"net/http"
 	"net/url"
 	"os/exec"
@@ -363,8 +364,18 @@ func (p *Process) stopCommand() {
 }
 func (p *Process) checkHealthEndpoint(healthURL string) error {
 	client := &http.Client{
-		Timeout: 500 * time.Millisecond,
+		// wait a short time for a tcp connection to be established
 		Transport: &http.Transport{
 			DialContext: (&net.Dialer{
 				Timeout: 500 * time.Millisecond,
 			}).DialContext,
 		},
 		// give a long time to respond to the health check endpoint
 		// after the connection is established. See issue: 276
 		Timeout: 5000 * time.Millisecond,
 	}
 	req, err := http.NewRequest("GET", healthURL, nil)
@@ -60,10 +60,20 @@ func (pg *ProcessGroup) ProxyRequest(modelID string, writer http.ResponseWriter,
 	if pg.swap {
 		pg.Lock()
 		if pg.lastUsedProcess != modelID {
 			// is there something already running?
 			if pg.lastUsedProcess != "" {
 				pg.processes[pg.lastUsedProcess].Stop()
 			}
 			// wait for the request to the new model to be fully handled
 			// and prevent race conditions see issue #277
 			pg.processes[modelID].ProxyRequest(writer, request)
 			pg.lastUsedProcess = modelID
 			// short circuit and exit
 			pg.Unlock()
 			return nil
 		}
 		pg.Unlock()
 	}
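The locking change in the hunk above keeps the group lock held until the first request to a newly swapped-in model has been handled. The sketch below isolates that idea with a toy type; `swapGroup`, `handle` and `runDemo` are hypothetical names and the sleep stands in for proxying a request, so it shows the pattern only and not the project's process management.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// swapGroup is a simplified stand-in for a process group with swap enabled.
type swapGroup struct {
	mu       sync.Mutex
	lastUsed string
	swaps    int // number of swaps performed (guarded by mu)
}

// handle runs work for modelID. When the request needs a swap, the lock is
// held until work returns, so no other request can trigger another swap
// while the first request to the new model is still in flight.
func (g *swapGroup) handle(modelID string, work func()) {
	g.mu.Lock()
	if g.lastUsed != modelID {
		g.lastUsed = modelID
		g.swaps++
		work()
		g.mu.Unlock()
		return
	}
	g.mu.Unlock()
	work() // same model as last time: no swap, requests may overlap
}

// runDemo sends one request per model in parallel and reports the highest
// number of requests seen running at once, plus the swap count.
func runDemo(models []string) (int32, int) {
	g := &swapGroup{}
	var running, peak int32
	var wg sync.WaitGroup
	for _, m := range models {
		wg.Add(1)
		go func(m string) {
			defer wg.Done()
			g.handle(m, func() {
				n := atomic.AddInt32(&running, 1)
				// record the peak concurrency with a compare-and-swap loop
				for {
					p := atomic.LoadInt32(&peak)
					if n <= p || atomic.CompareAndSwapInt32(&peak, p, n) {
						break
					}
				}
				time.Sleep(10 * time.Millisecond) // pretend to proxy a request
				atomic.AddInt32(&running, -1)
			})
		}(m)
	}
	wg.Wait()
	return atomic.LoadInt32(&peak), g.swaps
}

func main() {
	peak, swaps := runDemo([]string{"model1", "model2", "model3", "model4", "model5"})
	fmt.Println("peak concurrent requests:", peak, "swaps:", swaps)
}
```

The trade-off is that requests for different models are fully serialized while a swap is in progress, which is the behaviour the parallel test in this change set appears to rely on.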
@@ -4,6 +4,7 @@ import (
 	"bytes"
 	"net/http"
 	"net/http/httptest"
 	"sync"
 	"testing"
 	"github.com/stretchr/testify/assert"
@@ -44,32 +45,49 @@ func TestProcessGroup_HasMember(t *testing.T) {
 	assert.False(t, pg.HasMember("model3"))
 }
-func TestProcessGroup_ProxyRequestSwapIsTrue(t *testing.T) {
+// TestProcessGroup_ProxyRequestSwapIsTrueParallel tests that when swap is true
 // and multiple requests are made in parallel, only one process is running at a time.
 func TestProcessGroup_ProxyRequestSwapIsTrueParallel(t *testing.T) {
 	var processGroupTestConfig = AddDefaultGroupToConfig(Config{
 		HealthCheckTimeout: 15,
 		Models: map[string]ModelConfig{
 			// use the same listening port so if a model is already running, it will fail
 			// this is a way to test that swap isolation is working
 			// properly when there are parallel requests made at the
 			// same time.
 			"model1": getTestSimpleResponderConfigPort("model1", 9832),
 			"model2": getTestSimpleResponderConfigPort("model2", 9832),
 			"model3": getTestSimpleResponderConfigPort("model3", 9832),
 			"model4": getTestSimpleResponderConfigPort("model4", 9832),
 			"model5": getTestSimpleResponderConfigPort("model5", 9832),
 		},
 		Groups: map[string]GroupConfig{
 			"G1": {
 				Swap:    true,
 				Members: []string{"model1", "model2", "model3", "model4", "model5"},
 			},
 		},
 	})
 	pg := NewProcessGroup("G1", processGroupTestConfig, testLogger, testLogger)
 	defer pg.StopProcesses(StopWaitForInflightRequest)
-	tests := []string{"model1", "model2"}
+	tests := []string{"model1", "model2", "model3", "model4", "model5"}
 	var wg sync.WaitGroup
 	wg.Add(len(tests))
 	for _, modelName := range tests {
-		t.Run(modelName, func(t *testing.T) {
+		go func(modelName string) {
-			reqBody := `{"x", "y"}`
+			defer wg.Done()
-			req := httptest.NewRequest("POST", "/v1/chat/completions", bytes.NewBufferString(reqBody))
+			req := httptest.NewRequest("POST", "/v1/chat/completions", nil)
 			w := httptest.NewRecorder()
 			assert.NoError(t, pg.ProxyRequest(modelName, w, req))
 			assert.Equal(t, http.StatusOK, w.Code)
 			assert.Contains(t, w.Body.String(), modelName)
-
+		}(modelName)
 			// make sure only one process is in the running state
 			count := 0
 			for _, process := range pg.processes {
 				if process.CurrentState() == StateReady {
 					count++
 				}
 			}
 			assert.Equal(t, 1, count)
 		})
 	}
 	wg.Wait()
 }
 func TestProcessGroup_ProxyRequestSwapIsFalse(t *testing.T) {
@@ -15,6 +15,7 @@ import (
 	"time"
 	"github.com/gin-gonic/gin"
 	"github.com/mostlygeek/llama-swap/event"
 	"github.com/tidwall/gjson"
 	"github.com/tidwall/sjson"
 )
@@ -96,6 +97,35 @@ func New(config Config) *ProxyManager {
 	}
 	pm.setupGinEngine()
 	// run any startup hooks
 	if len(config.Hooks.OnStartup.Preload) > 0 {
 		// do it in the background, don't block startup -- not sure if good idea yet
 		go func() {
 			discardWriter := &DiscardWriter{}
 			for _, realModelName := range config.Hooks.OnStartup.Preload {
 				proxyLogger.Infof("Preloading model: %s", realModelName)
 				processGroup, _, err := pm.swapProcessGroup(realModelName)
 				if err != nil {
 					event.Emit(ModelPreloadedEvent{
 						ModelName: realModelName,
 						Success:   false,
 					})
 					proxyLogger.Errorf("Failed to preload model %s: %v", realModelName, err)
 					continue
 				} else {
 					req, _ := http.NewRequest("GET", "/", nil)
 					processGroup.ProxyRequest(realModelName, discardWriter, req)
 					event.Emit(ModelPreloadedEvent{
 						ModelName: realModelName,
 						Success:   true,
 					})
 				}
 			}
 		}()
 	}
 	return pm
 }
@@ -161,11 +191,20 @@ func (pm *ProxyManager) setupGinEngine() {
 	// Support legacy /v1/completions api, see issue #12
 	pm.ginEngine.POST("/v1/completions", mm, pm.proxyOAIHandler)
-	// Support embeddings
+	// Support embeddings and reranking
 	pm.ginEngine.POST("/v1/embeddings", mm, pm.proxyOAIHandler)
 	// llama-server's /reranking endpoint + aliases
 	pm.ginEngine.POST("/reranking", mm, pm.proxyOAIHandler)
 	pm.ginEngine.POST("/rerank", mm, pm.proxyOAIHandler)
 	pm.ginEngine.POST("/v1/rerank", mm, pm.proxyOAIHandler)
 	pm.ginEngine.POST("/v1/reranking", mm, pm.proxyOAIHandler)
-	pm.ginEngine.POST("/rerank", mm, pm.proxyOAIHandler)
+
 	// llama-server's /infill endpoint for code infilling
 	pm.ginEngine.POST("/infill", mm, pm.proxyOAIHandler)
 	// llama-server's /completion endpoint
 	pm.ginEngine.POST("/completion", mm, pm.proxyOAIHandler)
 	// Support audio/speech endpoint
 	pm.ginEngine.POST("/v1/audio/speech", pm.proxyOAIHandler)
@@ -361,7 +400,7 @@ func (pm *ProxyManager) proxyToUpstream(c *gin.Context) {
 		return
 	}
-	processGroup, _, err := pm.swapProcessGroup(requestedModel)
+	processGroup, realModelName, err := pm.swapProcessGroup(requestedModel)
 	if err != nil {
 		pm.sendErrorResponse(c, http.StatusInternalServerError, fmt.Sprintf("error swapping process group: %s", err.Error()))
 		return
@@ -369,7 +408,7 @@ func (pm *ProxyManager) proxyToUpstream(c *gin.Context) {
 	// rewrite the path
 	c.Request.URL.Path = c.Param("upstreamPath")
-	processGroup.ProxyRequest(requestedModel, c.Writer, c.Request)
+	processGroup.ProxyRequest(realModelName, c.Writer, c.Request)
 }
 func (pm *ProxyManager) proxyOAIHandler(c *gin.Context) {
@@ -132,7 +132,7 @@ func (pm *ProxyManager) apiSendEvents(c *gin.Context) {
 		}
 	}
-	sendMetrics := func(metrics TokenMetrics) {
+	sendMetrics := func(metrics []TokenMetrics) {
 		jsonData, err := json.Marshal(metrics)
 		if err == nil {
 			select {
@@ -168,16 +168,14 @@ func (pm *ProxyManager) apiSendEvents(c *gin.Context) {
 	 * Send Metrics data
 	 */
 	defer event.On(func(e TokenMetricsEvent) {
-		sendMetrics(e.Metrics)
+		sendMetrics([]TokenMetrics{e.Metrics})
 	})()
 	// send initial batch of data
 	sendLogData("proxy", pm.proxyLogger.GetHistory())
 	sendLogData("upstream", pm.upstreamLogger.GetHistory())
 	sendModels()
-	for _, metrics := range pm.metricsMonitor.GetMetrics() {
-		sendMetrics(metrics)
-	}
+	sendMetrics(pm.metricsMonitor.GetMetrics())
 	for {
 		select {
@@ -9,10 +9,12 @@ import (
 	"net/http"
 	"net/http/httptest"
 	"strconv"
 	"strings"
 	"sync"
 	"testing"
 	"time"
 	"github.com/mostlygeek/llama-swap/event"
 	"github.com/stretchr/testify/assert"
 	"github.com/tidwall/gjson"
 )
@@ -40,7 +42,6 @@ func TestProxyManager_SwapProcessCorrectly(t *testing.T) {
 		assert.Contains(t, w.Body.String(), modelName)
 	}
 }
 func TestProxyManager_SwapMultiProcess(t *testing.T) {
 	config := AddDefaultGroupToConfig(Config{
 		HealthCheckTimeout: 15,
@@ -280,48 +281,48 @@ func TestProxyManager_ListModelsHandler(t *testing.T) {
 }
 func TestProxyManager_ListModelsHandler_SortedByID(t *testing.T) {
-    // Intentionally add models in non-sorted order and with an unlisted model
+	// Intentionally add models in non-sorted order and with an unlisted model
-    config := Config{
+	config := Config{
-        HealthCheckTimeout: 15,
+		HealthCheckTimeout: 15,
-        Models: map[string]ModelConfig{
+		Models: map[string]ModelConfig{
-            "zeta":   getTestSimpleResponderConfig("zeta"),
+			"zeta":  getTestSimpleResponderConfig("zeta"),
-            "alpha":  getTestSimpleResponderConfig("alpha"),
+			"alpha": getTestSimpleResponderConfig("alpha"),
-            "beta":   getTestSimpleResponderConfig("beta"),
+			"beta":  getTestSimpleResponderConfig("beta"),
-            "hidden": func() ModelConfig {
+			"hidden": func() ModelConfig {
-                mc := getTestSimpleResponderConfig("hidden")
+				mc := getTestSimpleResponderConfig("hidden")
-                mc.Unlisted = true
+				mc.Unlisted = true
-                return mc
+				return mc
-            }(),
+			}(),
-        },
+		},
-        LogLevel: "error",
+		LogLevel: "error",
-    }
+	}
-    proxy := New(config)
+	proxy := New(config)
-    // Request models list
+	// Request models list
-    req := httptest.NewRequest("GET", "/v1/models", nil)
+	req := httptest.NewRequest("GET", "/v1/models", nil)
-    w := httptest.NewRecorder()
+	w := httptest.NewRecorder()
-    proxy.ServeHTTP(w, req)
+	proxy.ServeHTTP(w, req)
-    assert.Equal(t, http.StatusOK, w.Code)
+	assert.Equal(t, http.StatusOK, w.Code)
-    var response struct {
+	var response struct {
-        Data []map[string]interface{} `json:"data"`
+		Data []map[string]interface{} `json:"data"`
-    }
+	}
-    if err := json.Unmarshal(w.Body.Bytes(), &response); err != nil {
+	if err := json.Unmarshal(w.Body.Bytes(), &response); err != nil {
-        t.Fatalf("Failed to parse JSON response: %v", err)
+		t.Fatalf("Failed to parse JSON response: %v", err)
-    }
+	}
-    // We expect only the listed models in sorted order by id
+	// We expect only the listed models in sorted order by id
-    expectedOrder := []string{"alpha", "beta", "zeta"}
+	expectedOrder := []string{"alpha", "beta", "zeta"}
-    if assert.Len(t, response.Data, len(expectedOrder), "unexpected number of listed models") {
+	if assert.Len(t, response.Data, len(expectedOrder), "unexpected number of listed models") {
-        got := make([]string, 0, len(response.Data))
+		got := make([]string, 0, len(response.Data))
-        for _, m := range response.Data {
+		for _, m := range response.Data {
-            id, _ := m["id"].(string)
+			id, _ := m["id"].(string)
-            got = append(got, id)
+			got = append(got, id)
-        }
+		}
-        assert.Equal(t, expectedOrder, got, "models should be sorted by id ascending")
+		assert.Equal(t, expectedOrder, got, "models should be sorted by id ascending")
-    }
+	}
 }
 func TestProxyManager_Shutdown(t *testing.T) {
@@ -656,21 +657,34 @@ func TestProxyManager_CORSOptionsHandler(t *testing.T) {
 }
 func TestProxyManager_Upstream(t *testing.T) {
-	config := AddDefaultGroupToConfig(Config{
-		HealthCheckTimeout: 15,
-		Models: map[string]ModelConfig{
-			"model1": getTestSimpleResponderConfig("model1"),
-		},
-		LogLevel: "error",
-	})
+	configStr := fmt.Sprintf(`
+logLevel: error
+models:
+  model1:
+    cmd: %s -port ${PORT} -silent -respond model1
+    aliases: [model-alias]
+`, getSimpleResponderPath())
+	config, err := LoadConfigFromReader(strings.NewReader(configStr))
+	assert.NoError(t, err)
 	proxy := New(config)
 	defer proxy.StopProcesses(StopWaitForInflightRequest)
-	req := httptest.NewRequest("GET", "/upstream/model1/test", nil)
-	rec := httptest.NewRecorder()
-	proxy.ServeHTTP(rec, req)
-	assert.Equal(t, http.StatusOK, rec.Code)
-	assert.Equal(t, "model1", rec.Body.String())
+	t.Run("main model name", func(t *testing.T) {
+		req := httptest.NewRequest("GET", "/upstream/model1/test", nil)
+		rec := httptest.NewRecorder()
+		proxy.ServeHTTP(rec, req)
+		assert.Equal(t, http.StatusOK, rec.Code)
+		assert.Equal(t, "model1", rec.Body.String())
+	})
+	t.Run("model alias", func(t *testing.T) {
+		req := httptest.NewRequest("GET", "/upstream/model-alias/test", nil)
+		rec := httptest.NewRecorder()
+		proxy.ServeHTTP(rec, req)
+		assert.Equal(t, http.StatusOK, rec.Code)
+		assert.Equal(t, "model1", rec.Body.String())
+	})
 }
 func TestProxyManager_ChatContentLength(t *testing.T) {
@@ -818,3 +832,84 @@ func TestProxyManager_HealthEndpoint(t *testing.T) {
 	assert.Equal(t, http.StatusOK, rec.Code)
 	assert.Equal(t, "OK", rec.Body.String())
 }
 // Ensure the custom llama-server /completion endpoint proxies correctly
 func TestProxyManager_CompletionEndpoint(t *testing.T) {
 	config := AddDefaultGroupToConfig(Config{
 		HealthCheckTimeout: 15,
 		Models: map[string]ModelConfig{
 			"model1": getTestSimpleResponderConfig("model1"),
 		},
 		LogLevel: "error",
 	})
 	proxy := New(config)
 	defer proxy.StopProcesses(StopWaitForInflightRequest)
 	reqBody := `{"model":"model1"}`
 	req := httptest.NewRequest("POST", "/completion", bytes.NewBufferString(reqBody))
 	w := httptest.NewRecorder()
 	proxy.ServeHTTP(w, req)
 	assert.Equal(t, http.StatusOK, w.Code)
 	assert.Contains(t, w.Body.String(), "model1")
 }
 func TestProxyManager_StartupHooks(t *testing.T) {
 	// using real YAML as the configuration has gotten more complex
 	// is the right approach as LoadConfigFromReader() does a lot more
 	// than parse YAML now. Eventually migrate all tests to use this approach
 	configStr := strings.Replace(`
 logLevel: error
 hooks:
  on_startup:
    preload:
      - model1
      - model2
 groups:
  preloadTestGroup:
    swap: false
    members:
       - model1
       - model2
 models:
  model1:
    cmd: ${simpleresponderpath} --port ${PORT} --silent --respond model1
  model2:
      cmd: ${simpleresponderpath} --port ${PORT} --silent --respond model2
 `, "${simpleresponderpath}", simpleResponderPath, -1)
 	// Create a test model configuration
 	config, err := LoadConfigFromReader(strings.NewReader(configStr))
 	if !assert.NoError(t, err, "Invalid configuration") {
 		return
 	}
 	preloadChan := make(chan ModelPreloadedEvent, 2) // buffer for 2 expected events
 	unsub := event.On(func(e ModelPreloadedEvent) {
 		preloadChan <- e
 	})
 	defer unsub()
 	// Create the proxy which should trigger preloading
 	proxy := New(config)
 	defer proxy.StopProcesses(StopWaitForInflightRequest)
 	for i := 0; i < 2; i++ {
 		select {
 		case <-preloadChan:
 		case <-time.After(5 * time.Second):
 			t.Fatal("timed out waiting for models to preload")
 		}
 	}
 	// make sure they are both loaded
 	_, foundGroup := proxy.processGroups["preloadTestGroup"]
 	if !assert.True(t, foundGroup, "preloadTestGroup should exist") {
 		return
 	}
 	assert.Equal(t, StateReady, proxy.processGroups["preloadTestGroup"].processes["model1"].CurrentState())
 	assert.Equal(t, StateReady, proxy.processGroups["preloadTestGroup"].processes["model2"].CurrentState())
 }
@@ -1,50 +1,78 @@
 import { useEffect, useCallback } from "react";
 import { BrowserRouter as Router, Routes, Route, Navigate, NavLink } from "react-router-dom";
 import { useTheme } from "./contexts/ThemeProvider";
-import { APIProvider } from "./contexts/APIProvider";
+import { useAPI } from "./contexts/APIProvider";
 import LogViewerPage from "./pages/LogViewer";
 import ModelPage from "./pages/Models";
 import ActivityPage from "./pages/Activity";
 import ConnectionStatusIcon from "./components/ConnectionStatus";
 import { RiSunFill, RiMoonFill } from "react-icons/ri";
 function App() {
-  const { isNarrow, toggleTheme, isDarkMode } = useTheme();
+  const { isNarrow, toggleTheme, isDarkMode, appTitle, setAppTitle, setConnectionState } = useTheme();
  const handleTitleChange = useCallback(
    (newTitle: string) => {
      setAppTitle(newTitle.replace(/\n/g, "").trim().substring(0, 64) || "llama-swap");
    },
    [setAppTitle]
  );
  const { connectionStatus } = useAPI();
  // Synchronize the window.title connections state with the actual connection state
  useEffect(() => {
    setConnectionState(connectionStatus);
  }, [connectionStatus]);
  return (
    <Router basename="/ui/">
-      <APIProvider>
+      <div className="flex flex-col h-screen">
-        <div className="flex flex-col h-screen">
+        <nav className="bg-surface border-b border-border p-2 h-[75px]">
-          <nav className="bg-surface border-b border-border p-2 h-[75px]">
+          <div className="flex items-center justify-between mx-auto px-4 h-full">
-            <div className="flex items-center justify-between mx-auto px-4 h-full">
+            {!isNarrow && (
-              {!isNarrow && <h1 className="flex items-center p-0">llama-swap</h1>}
+              <h1
-              <div className="flex items-center space-x-4">
+                contentEditable
-                <NavLink to="/" className={({ isActive }) => (isActive ? "navlink active" : "navlink")}>
+                suppressContentEditableWarning
-                  Logs
+                className="flex items-center p-0 outline-none hover:bg-gray-100 dark:hover:bg-gray-700 rounded px-1"
-                </NavLink>
+                onBlur={(e) => handleTitleChange(e.currentTarget.textContent || "(set title)")}
-
+                onKeyDown={(e) => {
-                <NavLink to="/models" className={({ isActive }) => (isActive ? "navlink active" : "navlink")}>
+                  if (e.key === "Enter") {
-                  Models
+                    e.preventDefault();
-                </NavLink>
+                    handleTitleChange(e.currentTarget.textContent || "(set title)");
-
+                    e.currentTarget.blur();
-                <NavLink to="/activity" className={({ isActive }) => (isActive ? "navlink active" : "navlink")}>
+                  }
-                  Activity
+                }}
-                </NavLink>
+              >
-                <button className="" onClick={toggleTheme}>
+                {appTitle}
-                  {isDarkMode ? <RiMoonFill /> : <RiSunFill />}
+              </h1>
-                </button>
+            )}
-              </div>
+            <div className="flex items-center space-x-4">
+              <NavLink to="/" className={({ isActive }) => (isActive ? "navlink active" : "navlink")}>
+                Logs
+              </NavLink>
+              <NavLink to="/models" className={({ isActive }) => (isActive ? "navlink active" : "navlink")}>
+                Models
+              </NavLink>
+              <NavLink to="/activity" className={({ isActive }) => (isActive ? "navlink active" : "navlink")}>
+                Activity
+              </NavLink>
+              <button className="" onClick={toggleTheme}>
+                {isDarkMode ? <RiMoonFill /> : <RiSunFill />}
+              </button>
+              <ConnectionStatusIcon />
            </div>
-          </nav>
+          </div>
+        </nav>
-          <main className="flex-1 overflow-auto p-4">
+        <main className="flex-1 overflow-auto p-4">
-            <Routes>
+          <Routes>
-              <Route path="/" element={<LogViewerPage />} />
+            <Route path="/" element={<LogViewerPage />} />
-              <Route path="/models" element={<ModelPage />} />
+            <Route path="/models" element={<ModelPage />} />
-              <Route path="/activity" element={<ActivityPage />} />
+            <Route path="/activity" element={<ActivityPage />} />
-              <Route path="*" element={<Navigate to="/" replace />} />
+            <Route path="*" element={<Navigate to="/" replace />} />
-            </Routes>
+          </Routes>
-          </main>
+        </main>
-        </div>
+      </div>
-      </APIProvider>
    </Router>
  );
 }
@@ -0,0 +1,26 @@
 import { useAPI } from "../contexts/APIProvider";
 import { useMemo } from "react";
 const ConnectionStatusIcon = () => {
  const { connectionStatus } = useAPI();
  const eventStatusColor = useMemo(() => {
    switch (connectionStatus) {
      case "connected":
        return "bg-green-500";
      case "connecting":
        return "bg-yellow-500";
      case "disconnected":
      default:
        return "bg-red-500";
    }
  }, [connectionStatus]);
  return (
    <div className="flex items-center" title={`event stream: ${connectionStatus}`}>
      <span className={`inline-block w-3 h-3 rounded-full ${eventStatusColor} mr-2`}></span>
    </div>
  );
 };
 export default ConnectionStatusIcon;
@@ -1,4 +1,5 @@
 import { useRef, createContext, useState, useContext, useEffect, useCallback, useMemo, type ReactNode } from "react";
 import type { ConnectionState } from "../lib/types";
 type ModelStatus = "ready" | "starting" | "stopping" | "stopped" | "shutdown" | "unknown";
 const LOG_LENGTH_LIMIT = 1024 * 100; /* 100KB of log data */
@@ -20,6 +21,7 @@ interface APIProviderType {
  proxyLogs: string;
  upstreamLogs: string;
  metrics: Metrics[];
  connectionStatus: ConnectionState;
 }
 interface Metrics {
@@ -28,6 +30,7 @@ interface Metrics {
  model: string;
  input_tokens: number;
  output_tokens: number;
  prompt_per_second: number;
  tokens_per_second: number;
  duration_ms: number;
 }
@@ -51,6 +54,7 @@ export function APIProvider({ children, autoStartAPIEvents = true }: APIProvider
  const [proxyLogs, setProxyLogs] = useState("");
  const [upstreamLogs, setUpstreamLogs] = useState("");
  const [metrics, setMetrics] = useState<Metrics[]>([]);
  const [connectionStatus, setConnectionState] = useState<ConnectionState>("disconnected");
  const apiEventSource = useRef<EventSource | null>(null);
  const [models, setModels] = useState<Model[]>([]);
@@ -74,7 +78,20 @@ export function APIProvider({ children, autoStartAPIEvents = true }: APIProvider
    const initialDelay = 1000; // 1 second
    const connect = () => {
      apiEventSource.current = null;
      const eventSource = new EventSource("/api/events");
      setConnectionState("connecting");
      eventSource.onopen = () => {
        // clear everything out on connect to keep things in sync
        setProxyLogs("");
        setUpstreamLogs("");
        setMetrics([]); // clear metrics on reconnect
        setModels([]); // clear models on reconnect
        apiEventSource.current = eventSource;
        retryCount = 0;
        setConnectionState("connected");
      };
      eventSource.onmessage = (e: MessageEvent) => {
        try {
@@ -83,6 +100,12 @@ export function APIProvider({ children, autoStartAPIEvents = true }: APIProvider
            case "modelStatus":
              {
                const models = JSON.parse(message.data) as Model[];
                // sort models by name and id
                models.sort((a, b) => {
                  return (a.name + a.id).localeCompare(b.name + b.id);
                });
                setModels(models);
              }
              break;
@@ -101,9 +124,9 @@ export function APIProvider({ children, autoStartAPIEvents = true }: APIProvider
            case "metrics":
              {
-                const newMetric = JSON.parse(message.data) as Metrics;
+                const newMetrics = JSON.parse(message.data) as Metrics[];
                setMetrics((prevMetrics) => {
-                  return [newMetric, ...prevMetrics];
+                  return [...newMetrics, ...prevMetrics];
                });
              }
              break;
@@ -112,14 +135,14 @@ export function APIProvider({ children, autoStartAPIEvents = true }: APIProvider
          console.error(e.data, err);
        }
      };
      eventSource.onerror = () => {
        eventSource.close();
        retryCount++;
        const delay = Math.min(initialDelay * Math.pow(2, retryCount - 1), 5000);
        setConnectionState("disconnected");
        setTimeout(connect, delay);
      };
      apiEventSource.current = eventSource;
    };
    connect();
@@ -187,6 +210,7 @@ export function APIProvider({ children, autoStartAPIEvents = true }: APIProvider
      proxyLogs,
      upstreamLogs,
      metrics,
      connectionStatus,
    }),
    [models, listModels, unloadAllModels, loadModel, enableAPIEvents, proxyLogs, upstreamLogs, metrics]
  );
@@ -1,5 +1,6 @@
 import { createContext, useContext, useEffect, type ReactNode, useMemo, useState } from "react";
 import { usePersistentState } from "../hooks/usePersistentState";
 import type { ConnectionState } from "../lib/types";
 type ScreenWidth = "xs" | "sm" | "md" | "lg" | "xl" | "2xl";
 type ThemeContextType = {
@@ -7,6 +8,11 @@ type ThemeContextType = {
  screenWidth: ScreenWidth;
  isNarrow: boolean;
  toggleTheme: () => void;
  // for managing the window title and connection state information
  appTitle: string;
  setAppTitle: (title: string) => void;
  setConnectionState: (state: ConnectionState) => void;
 };
 const ThemeContext = createContext<ThemeContextType | undefined>(undefined);
@@ -16,6 +22,17 @@ type ThemeProviderProps = {
 };
 export function ThemeProvider({ children }: ThemeProviderProps) {
  const [appTitle, setAppTitle] = usePersistentState("app-title", "llama-swap");
  const [connectionState, setConnectionState] = useState<ConnectionState>("disconnected");
  /**
   * Set the document.title with informative information
   */
  useEffect(() => {
    const connectionIcon = connectionState === "connecting" ? "🟡" : connectionState === "connected" ? "🟢" : "🔴";
    document.title = connectionIcon + " " + appTitle; // Set initial title
  }, [appTitle, connectionState]);
  const [isDarkMode, setIsDarkMode] = usePersistentState<boolean>("theme", false);
  const [screenWidth, setScreenWidth] = useState<ScreenWidth>("md"); // Default to md
@@ -55,7 +72,19 @@ export function ThemeProvider({ children }: ThemeProviderProps) {
  }, [screenWidth]);
  return (
-    <ThemeContext.Provider value={{ isDarkMode, toggleTheme, screenWidth, isNarrow }}>{children}</ThemeContext.Provider>
+    <ThemeContext.Provider
+      value={{
+        isDarkMode,
+        toggleTheme,
+        screenWidth,
+        isNarrow,
+        appTitle,
+        setAppTitle,
+        setConnectionState,
+      }}
+    >
+      {children}
+    </ThemeContext.Provider>
  );
 }
@@ -0,0 +1 @@
 export type ConnectionState = "connected" | "connecting" | "disconnected";
@@ -3,11 +3,14 @@ import { createRoot } from "react-dom/client";
 import "./index.css";
 import App from "./App.tsx";
 import { ThemeProvider } from "./contexts/ThemeProvider";
 import { APIProvider } from "./contexts/APIProvider";
 createRoot(document.getElementById("root")!).render(
  <StrictMode>
    <ThemeProvider>
-      <App />
+      <APIProvider>
+        <App />
+      </APIProvider>
    </ThemeProvider>
  </StrictMode>
 );
@@ -1,4 +1,4 @@
-import { useState, useEffect } from "react";
+import { useMemo } from "react";
 import { useAPI } from "../contexts/APIProvider";
 const formatTimestamp = (timestamp: string): string => {
@@ -15,25 +15,10 @@ const formatDuration = (ms: number): string => {
 const ActivityPage = () => {
  const { metrics } = useAPI();
-  const [error, setError] = useState<string | null>(null);
+  const sortedMetrics = useMemo(() => {
-
+    return [...metrics].sort((a, b) => b.id - a.id);
-  useEffect(() => {
-    if (metrics.length > 0) {
-      setError(null);
-    }
   }, [metrics]);
-  if (error) {
-    return (
-      <div className="p-6">
-        <h1 className="text-2xl font-bold mb-4">Activity</h1>
-        <div className="bg-red-50 border border-red-200 rounded-md p-4">
-          <p className="text-red-800">{error}</p>
-        </div>
-      </div>
-    );
-  }
  return (
    <div className="p-6">
      <h1 className="text-2xl font-bold mb-4">Activity</h1>
@@ -47,21 +32,25 @@ const ActivityPage = () => {
          <table className="min-w-full divide-y">
            <thead>
              <tr>
                <th className="px-4 py-3 text-left text-xs font-medium uppercase tracking-wider">Id</th>
                <th className="px-6 py-3 text-left text-xs font-medium uppercase tracking-wider">Timestamp</th>
                <th className="px-6 py-3 text-left text-xs font-medium uppercase tracking-wider">Model</th>
                <th className="px-6 py-3 text-left text-xs font-medium uppercase tracking-wider">Input Tokens</th>
                <th className="px-6 py-3 text-left text-xs font-medium uppercase tracking-wider">Output Tokens</th>
                <th className="px-6 py-3 text-left text-xs font-medium uppercase tracking-wider">Prompt Processing</th>
                <th className="px-6 py-3 text-left text-xs font-medium uppercase tracking-wider">Generation Speed</th>
                <th className="px-6 py-3 text-left text-xs font-medium uppercase tracking-wider">Duration</th>
              </tr>
            </thead>
            <tbody className="divide-y">
-              {metrics.map((metric, index) => (
+              {sortedMetrics.map((metric) => (
-                <tr key={`${metric.id}-${index}`}>
+                <tr key={`metric_${metric.id}`}>
                  <td className="px-4 py-4 whitespace-nowrap text-sm">{metric.id + 1 /* un-zero index */}</td>
                  <td className="px-6 py-4 whitespace-nowrap text-sm">{formatTimestamp(metric.timestamp)}</td>
                  <td className="px-6 py-4 whitespace-nowrap text-sm">{metric.model}</td>
                  <td className="px-6 py-4 whitespace-nowrap text-sm">{metric.input_tokens.toLocaleString()}</td>
                  <td className="px-6 py-4 whitespace-nowrap text-sm">{metric.output_tokens.toLocaleString()}</td>
                  <td className="px-6 py-4 whitespace-nowrap text-sm">{formatSpeed(metric.prompt_per_second)}</td>
                  <td className="px-6 py-4 whitespace-nowrap text-sm">{formatSpeed(metric.tokens_per_second)}</td>
                  <td className="px-6 py-4 whitespace-nowrap text-sm">{formatDuration(metric.duration_ms)}</td>
                </tr>
@@ -4,7 +4,7 @@ import { LogPanel } from "./LogViewer";
 import { usePersistentState } from "../hooks/usePersistentState";
 import { Panel, PanelGroup, PanelResizeHandle } from "react-resizable-panels";
 import { useTheme } from "../contexts/ThemeProvider";
-import { RiEyeFill, RiEyeOffFill, RiStopCircleLine } from "react-icons/ri";
+import { RiEyeFill, RiEyeOffFill, RiStopCircleLine, RiSwapBoxFill } from "react-icons/ri";
 export default function ModelsPage() {
  const { isNarrow } = useTheme();
@@ -40,6 +40,7 @@ function ModelsPanel() {
  const { models, loadModel, unloadAllModels } = useAPI();
  const [isUnloading, setIsUnloading] = useState(false);
  const [showUnlisted, setShowUnlisted] = usePersistentState("showUnlisted", true);
  const [showIdorName, setShowIdorName] = usePersistentState<"id" | "name">("showIdorName", "id"); // true = show ID, false = show name
  const filteredModels = useMemo(() => {
    return models.filter((model) => showUnlisted || !model.unlisted);
@@ -58,18 +59,28 @@ function ModelsPanel() {
    }
  }, [unloadAllModels]);
  const toggleIdorName = useCallback(() => {
    setShowIdorName((prev) => (prev === "name" ? "id" : "name"));
  }, [showIdorName]);
  return (
    <div className="card h-full flex flex-col">
      <div className="shrink-0">
        <h2>Models</h2>
        <div className="flex justify-between">
-          <button
+          <div className="flex gap-2">
-            className="btn flex items-center gap-2"
+            <button className="btn flex items-center gap-2" onClick={toggleIdorName} style={{ lineHeight: "1.2" }}>
-            onClick={() => setShowUnlisted(!showUnlisted)}
+              <RiSwapBoxFill /> {showIdorName === "id" ? "ID" : "Name"}
-            style={{ lineHeight: "1.2" }}
+            </button>
-          >
+
-            {showUnlisted ? <RiEyeFill /> : <RiEyeOffFill />} unlisted
+            <button
-          </button>
+              className="btn flex items-center gap-2"
+              onClick={() => setShowUnlisted(!showUnlisted)}
+              style={{ lineHeight: "1.2" }}
+            >
+              {showUnlisted ? <RiEyeFill /> : <RiEyeOffFill />} unlisted
+            </button>
+          </div>
          <button className="btn flex items-center gap-2" onClick={handleUnloadAllModels} disabled={isUnloading}>
            <RiStopCircleLine size="24" /> {isUnloading ? "Unloading..." : "Unload"}
          </button>
@@ -80,7 +91,7 @@ function ModelsPanel() {
        <table className="w-full">
          <thead className="sticky top-0 bg-card z-10">
            <tr className="border-b border-primary bg-surface">
-              <th className="text-left p-2">Name</th>
+              <th className="text-left p-2">{showIdorName === "id" ? "Model ID" : "Name"}</th>
              <th className="text-left p-2"></th>
              <th className="text-left p-2">State</th>
            </tr>
@@ -90,7 +101,7 @@ function ModelsPanel() {
              <tr key={model.id} className="border-b hover:bg-secondary-hover border-border">
                <td className={`p-2 ${model.unlisted ? "text-txtsecondary" : ""}`}>
                  <a href={`/upstream/${model.id}/`} className={`underline`} target="_blank">
-                    {model.name !== "" ? model.name : model.id}
+                    {showIdorName === "id" ? model.id : model.name !== "" ? model.name : model.id}
                  </a>
                  {model.description !== "" && (
                    <p className={model.unlisted ? "text-opacity-70" : ""}>
@@ -122,35 +133,41 @@ function ModelsPanel() {
 function StatsPanel() {
  const { metrics } = useAPI();
-  const [totalRequests, totalTokens, avgTokensPerSecond] = useMemo(() => {
+  const [totalRequests, totalInputTokens, totalOutputTokens, avgTokensPerSecond] = useMemo(() => {
    const totalRequests = metrics.length;
    if (totalRequests === 0) {
      return [0, 0, 0];
    }
-    const totalTokens = metrics.reduce((sum, m) => sum + m.output_tokens, 0);
+    const totalInputTokens = metrics.reduce((sum, m) => sum + m.input_tokens, 0);
+    const totalOutputTokens = metrics.reduce((sum, m) => sum + m.output_tokens, 0);
    const avgTokensPerSecond = (metrics.reduce((sum, m) => sum + m.tokens_per_second, 0) / totalRequests).toFixed(2);
-    return [totalRequests, totalTokens, avgTokensPerSecond];
+    return [totalRequests, totalInputTokens, totalOutputTokens, avgTokensPerSecond];
  }, [metrics]);
  return (
    <div className="card">
-      <h2>Chat Activity</h2>
+      <div className="rounded-lg overflow-hidden border border-gray-200">
-      <table className="w-full border border-gray-200">
+        <table className="w-full">
-        <tbody>
+          <tbody>
-          <tr className="border-b border-gray-200">
+            <tr>
-            <td className="py-2 px-4 font-medium border-r border-gray-200">Requests</td>
+              <th className="p-2 font-medium border-b border-gray-200 text-right">Requests</th>
-            <td className="py-2 px-4 text-right">{totalRequests}</td>
+              <th className="p-2 font-medium border-l border-b border-gray-200 text-right">Processed</th>
-          </tr>
+              <th className="p-2 font-medium border-l border-b border-gray-200 text-right">Generated</th>
-          <tr className="border-b border-gray-200">
+              <th className="p-2 font-medium border-l border-b border-gray-200 text-right">Tokens/Sec</th>
-            <td className="py-2 px-4 font-medium border-r border-gray-200">Total Tokens Generated</td>
+            </tr>
-            <td className="py-2 px-4 text-right">{totalTokens}</td>
+            <tr>
-          </tr>
+              <td className="p-2 text-right border-r border-gray-200">{totalRequests}</td>
-          <tr>
+              <td className="p-2 text-right border-r border-gray-200">
-            <td className="py-2 px-4 font-medium border-r border-gray-200">Average Tokens/Second</td>
+                {new Intl.NumberFormat().format(totalInputTokens)}
-            <td className="py-2 px-4 text-right">{avgTokensPerSecond}</td>
+              </td>
-          </tr>
+              <td className="p-2 text-right border-r border-gray-200">
-        </tbody>
+                {new Intl.NumberFormat().format(totalOutputTokens)}
-      </table>
+              </td>
+              <td className="p-2 text-right">{avgTokensPerSecond}</td>
+            </tr>
+          </tbody>
+        </table>
+      </div>
    </div>
  );
 }
Author	SHA1	Message	Date
Benson Wong	831a90d3b0	Add different timeout scenarios to Process.checkHealthEndpoint #276 (#278 ) - add a TCP connection timeout of 500ms - increase HTTP client timeout to 5000ms In this new behaviour the upstream has 500ms to accept a tcp connection and 5000ms to respond to the HTTP request.	2025-08-28 22:03:14 -07:00
Yandrik	977f1856bb	add /completion endpoint (#275 ) * feat: add /completion endpoint * chore: reformat using gofmt	2025-08-28 21:41:02 -07:00
Benson Wong	52b329f7bc	Fix #277 race condition in ProcessGroup.ProxyRequest when swap=true	2025-08-28 21:38:40 -07:00
Benson Wong	57803fd3aa	Support llama-server's /infill endpoint (#272 ) Add support for llama-server's /infill endpoint and metrics gathering on the Activities page.	2025-08-27 08:36:05 -07:00
Benson Wong	c55d0cc842	Add docs for model.concurrencyLimit #263 [skip ci]	2025-08-22 16:08:37 -07:00
Benson Wong	7acbaf4712	Add connection status indicator in UI (#260 ) * show connection status as icon in UI title * make connection status event driven	2025-08-20 13:58:24 -07:00
Benson Wong	fcc5ad135a	UI: Allow editing of title (#246 ) - make <h1> title contentEditable - title setting persists across reloads in localStorage	2025-08-17 09:42:06 -07:00
Benson Wong	305e5a0031	improve example config [skip ci]	2025-08-17 09:19:04 -07:00
Benson Wong	04fc67354a	Improve Activity event handling in the UI (#254 ) Improve Activity event handling in the UI - fixes #252 found that the Activity page showed activity inconsistent with /api/metrics - Change data structure for event metrics to array. - Add Event stream connections status indicator	2025-08-15 21:44:08 -07:00
Benson Wong	4662cf7699	add 'unconfirmed bug' as default label in bug-report.md	2025-08-15 15:38:12 -07:00
Benson Wong	5dc6b3e6d9	Add barebones but working implementation of model preload (#209 , #235 ) Add barebones but working implementation of model preload * add config test for Preload hook * improve TestProxyManager_StartupHooks * docs for new hook configuration * add a .dev to .gitignore	2025-08-14 10:27:28 -07:00
Benson Wong	74c69f39ef	Add prompt processing metrics (#250 ) - capture prompt processing metrics - display prompt processing metrics on UI Activity page	2025-08-14 10:02:16 -07:00
Benson Wong	a186318892	Update Readme, Add screenshot for Activities page [skip ci]	2025-08-08 13:39:46 -07:00
Benson Wong	c4e4d5e1e9	Update Readme UI Screenshot [skip ci]	2025-08-08 13:33:47 -07:00
Benson Wong	7985e94ba4	add tokens processed to ui models page	2025-08-08 13:28:39 -07:00
Benson Wong	74556c3a36	Update bug-report.md [skip ci]	2025-08-08 09:52:05 -07:00
Benson Wong	5c381e4b30	Add gofmt linting to ci	2025-08-07 20:29:18 -07:00
Benson Wong	10569ed546	Fix model alias usage in upstream path (#230 ) Model alias values are not properly resolved and work in upstream/ path. Related to #229.	2025-08-07 20:16:56 -07:00
Benson Wong	5b10b3c23f	UI Tweaks (#228 ) * sort model names in UI * add toggle to show model id/name on UI model page	2025-08-07 11:07:03 -07:00