small tweak to example config

Add ${MODEL_ID} macro (#226)
The automatic ${MODEL_ID} macro includes the name of the model and can be used in Cmd and CmdStop.
2025-09-01 21:26:58 -07:00 · 2025-09-01 21:21:37 -07:00 · 2025-08-28 23:44:37 -07:00 · 2025-08-28 22:47:28 -07:00 · 2025-08-28 22:03:14 -07:00 · 2025-08-28 21:41:02 -07:00
21 changed files with 549 additions and 143 deletions
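
The ${MODEL_ID} macro described in the commit message above can be sketched as a plain string substitution followed by a check for leftover placeholders. This is a minimal sketch under stated assumptions, not the project's implementation: `expandModelID` and `unknownMacros` are hypothetical helper names, and only the regular expression is taken from the diff.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// macroPattern matches ${NAME} placeholders. The pattern mirrors the one
// that appears in the config loader in this diff.
var macroPattern = regexp.MustCompile(`\$\{([a-zA-Z0-9_-]+)\}`)

// expandModelID is a hypothetical helper: it replaces every ${MODEL_ID}
// occurrence in a command string with the model's ID.
func expandModelID(cmd, modelID string) string {
	return strings.ReplaceAll(cmd, "${MODEL_ID}", modelID)
}

// unknownMacros is a hypothetical helper: it returns the names of any
// ${...} placeholders that are still present in s.
func unknownMacros(s string) []string {
	var names []string
	for _, m := range macroPattern.FindAllStringSubmatch(s, -1) {
		names = append(names, m[1])
	}
	return names
}

func main() {
	cmd := expandModelID("docker run --name ${MODEL_ID} -p ${PORT}:8080 docker_img", "model2")
	fmt.Println(cmd)                // docker run --name model2 -p ${PORT}:8080 docker_img
	fmt.Println(unknownMacros(cmd)) // [PORT]
}
```

In this sketch ${PORT} is deliberately left unexpanded to show how a leftover placeholder would be detected.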
@@ -2,7 +2,7 @@
 name: Bug Report
 about: I found a defect
 title: ''
-labels: bug
+labels: 'unconfirmed bug'
 assignees: ''

 ---
@@ -7,7 +7,7 @@

 llama-swap is a lightweight, transparent proxy server that provides automatic model swapping to llama.cpp's server.

-Written in golang, it is very easy to install (single binary with no dependencies) and configure (single yaml file). To get started, download a pre-built binary or use the provided docker images.
+Written in golang, it is very easy to install (single binary with no dependencies) and configure (single yaml file). To get started, download a pre-built binary, use one of the provided docker images, or install with Homebrew.

 ## Features:

@@ -18,9 +18,12 @@ Written in golang, it is very easy to install (single binary with no dependencie
  - `v1/completions`
  - `v1/chat/completions`
  - `v1/embeddings`
-  - `v1/rerank`, `v1/reranking`, `rerank`
  - `v1/audio/speech` ([#36](https://github.com/mostlygeek/llama-swap/issues/36))
  - `v1/audio/transcriptions` ([docs](https://github.com/mostlygeek/llama-swap/issues/41#issuecomment-2722637867))
+- ✅ llama-server (llama.cpp) supported endpoints:
+  - `v1/rerank`, `v1/reranking`, `/rerank`, `/reranking`
+  - `/infill` - for code infilling
+  - `/completion` - llama-server's native completion endpoint
 - ✅ llama-swap custom API endpoints
  - `/ui` - web UI
  - `/log` - remote log monitoring
@@ -204,4 +207,7 @@ For Python based inference servers like vllm or tabbyAPI it is recommended to ru

 ## Star History

+> [!NOTE]
+> ⭐️ Star this project to help others discover it! 
+
 [![Star History Chart](https://api.star-history.com/svg?repos=mostlygeek/llama-swap&type=Date)](https://www.star-history.com/#mostlygeek/llama-swap&Date)
@@ -3,14 +3,15 @@
 #
 # 💡 Tip - Use an LLM with this file!
 # ====================================
-#  This example configuration is written to be LLM friendly! Try
+#  This example configuration is written to be LLM friendly. Try
 #  copying this file into an LLM and asking it to explain or generate
 #  sections for you.
 # ====================================
-#
+
+# Usage notes:
 # - Below are all the available configuration options for llama-swap.
-# - Settings with a default value, or noted as optional can be omitted.
-# - Settings that are marked required must be in your configuration file
+# - Settings noted as "required" must be in your configuration file
+# - Settings noted as "optional" can be omitted

 # healthCheckTimeout: number of seconds to wait for a model to be ready to serve requests
 # - optional, default: 120
@@ -34,9 +35,9 @@ metricsMaxInMemory: 1000
 # - it is automatically incremented for every model that uses it
 startPort: 10001

-# macros: sets a dictionary of string:string pairs
+# macros: a dictionary of string substitutions
 # - optional, default: empty dictionary
-# - these are reusable snippets
+# - macros are reusable snippets
 # - used in a model's cmd, cmdStop, proxy and checkEndpoint
 # - useful for reducing common configuration settings
 macros:
@@ -48,8 +49,8 @@ macros:
 # - required
 # - each key is the model's ID, used in API requests
 # - model settings have default values that are used if they are not defined here
-# - below are examples of the various settings a model can have:
-# - available model settings: env, cmd, cmdStop, proxy, aliases, checkEndpoint, ttl, unlisted
+# - the model's ID is available in the ${MODEL_ID} macro, which can also be used inside the macros defined above
+# - below are examples of all the settings a model can have
 models:

  # keys are the model names used in API requests
@@ -99,49 +100,60 @@ models:

    # checkEndpoint: URL path to check if the server is ready
    # - optional, default: /health
-    # - use "none" to skip endpoint ready checking
    # - endpoint is expected to return an HTTP 200 response
-    # - all requests wait until the endpoint is ready (or fails)
+    # - all requests wait until the endpoint is ready or fails
+    # - use "none" to skip endpoint health checking
    checkEndpoint: /custom-endpoint

-    # ttl: automatically unload the model after this many seconds
+    # ttl: automatically unload the model after ttl seconds
    # - optional, default: 0
    # - ttl values must be a value greater than 0
    # - a value of 0 disables automatic unloading of the model
    ttl: 60

-    # useModelName: overrides the model name that is sent to upstream server
+    # useModelName: override the model name that is sent to upstream server
    # - optional, default: ""
-    # - useful when the upstream server expects a specific model name or format
+    # - useful for when the upstream server expects a specific model name that
+    #   is different from the model's ID
    useModelName: "qwen:qwq"

    # filters: a dictionary of filter settings
    # - optional, default: empty dictionary
+    # - only strip_params is currently supported
    filters:
      # strip_params: a comma separated list of parameters to remove from the request
      # - optional, default: ""
-      # - useful for preventing overriding of default server params by requests
-      # - `model` parameter is never removed
+      # - useful for server side enforcement of sampling parameters
+      # - the `model` parameter can never be removed
      # - can be any JSON key in the request body
      # - recommended to stick to sampling parameters
      strip_params: "temperature, top_p, top_k"

+    # concurrencyLimit: overrides the allowed number of active parallel requests to a model
+    # - optional, default: 0
+    # - useful for limiting the number of active parallel requests a model can process
+    # - must be set per model
+    # - any number greater than 0 will override the internal default value of 10
+    # - any request that exceeds the limit will receive an HTTP 429 Too Many Requests response
+    # - recommended to be omitted and the default used
+    concurrencyLimit: 0
+
  # Unlisted model example:
  "qwen-unlisted":
-    # unlisted: true or false
+    # unlisted: boolean, true or false
    # - optional, default: false
-    # - unlisted models do not show up in /v1/models or /upstream lists
+    # - unlisted models do not show up in /v1/models api requests
    # - can be requested as normal through all apis
    unlisted: true
    cmd: llama-server --port ${PORT} -m Llama-3.2-1B-Instruct-Q4_K_M.gguf -ngl 0

  # Docker example:
-  # container run times like Docker and Podman can also be used with a
-  # a combination of cmd and cmdStop.
+  # container runtimes like Docker and Podman can be used reliably with
+  # a combination of cmd, cmdStop, and ${MODEL_ID}
  "docker-llama":
    proxy: "http://127.0.0.1:${PORT}"
    cmd: |
-      docker run --name dockertest
+      docker run --name ${MODEL_ID}
      --init --rm -p ${PORT}:8080 -v /mnt/nvme/models:/models
      ghcr.io/ggml-org/llama.cpp:server
      --model '/models/Qwen2.5-Coder-0.5B-Instruct-Q4_K_M.gguf'
@@ -149,24 +161,26 @@ models:
    # cmdStop: command to run to stop the model gracefully
    # - optional, default: ""
    # - useful for stopping commands managed by another system
-    # - on POSIX systems: a SIGTERM is sent for graceful shutdown
-    # - on Windows, taskkill is used
-    # - processes are given 5 seconds to shutdown until they are forcefully killed
    # - the upstream's process id is available in the ${PID} macro
-    cmdStop: docker stop dockertest
+    #
+    # When empty, llama-swap has this default behaviour:
+    # - on POSIX systems: a SIGTERM signal is sent
+    # - on Windows, calls taskkill to stop the process
+    # - processes have 5 seconds to shut down before forceful termination is attempted
+    cmdStop: docker stop ${MODEL_ID}

 # groups: a dictionary of group settings
 # - optional, default: empty dictionary
-# - provide advanced controls over model swapping behaviour.
-# - Using groups some models can be kept loaded indefinitely, while others are swapped out.
-# - model ids must be defined in the Models section
+# - provides advanced controls over model swapping behaviour
+# - using groups some models can be kept loaded indefinitely, while others are swapped out
+# - model IDs must be defined in the Models section
 # - a model can only be a member of one group
 # - group behaviour is controlled via the `swap`, `exclusive` and `persistent` fields
 # - see issue #109 for details
 #
 # NOTE: the example below uses model names that are not defined above for demonstration purposes
 groups:
-  # group1 is same as the default behaviour of llama-swap where only one model is allowed
+  # group1 works the same as the default behaviour of llama-swap where only one model is allowed
   # to run at a time across the whole llama-swap instance
  "group1":
     # swap: controls the model swapping behaviour within the group
@@ -188,10 +202,13 @@ groups:
      - "qwen-unlisted"

  # Example:
-  # - in this group all the models can run at the same time
-  # - when a different group loads all running models in this group are unloaded
+  # - in group2 all models can run at the same time
+  # - when a different group is loaded it causes all running models in this group to unload
  "group2":
    swap: false
+
+    # exclusive: false does not unload other groups when a model in group2 is requested
+    # - the models in group2 will be loaded but will not unload any other groups
    exclusive: false
    members:
      - "docker-llama"
@@ -220,7 +237,7 @@ groups:
 # - the only supported hook is on_startup
 hooks:
  # on_startup: a dictionary of actions to perform on startup
-  # - optional, default: empty dictionar
+  # - optional, default: empty dictionary
  # - the only supported action is preload
  on_startup:
        # preload: a list of model ids to load on startup
@@ -229,4 +246,4 @@ hooks:
        # - when preloading multiple models at once, define a group
        #   otherwise models will be loaded and swapped out
    preload:
-      - "llama"
+      - "llama"
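
The concurrencyLimit setting documented in the example configuration above rejects requests over the limit with HTTP 429. One common way to get that behaviour is a counting semaphore built on a buffered channel. The sketch below is an illustration only and assumes nothing about how llama-swap implements the limit; `limiter`, `newLimiter`, `tryAcquire`, `release`, and `wrap` are hypothetical names.

```go
package main

import (
	"fmt"
	"net/http"
)

// limiter is a hypothetical counting semaphore backed by a buffered channel.
// Each value in the channel represents one request currently in flight.
type limiter struct {
	slots chan struct{}
}

// newLimiter creates a limiter that allows at most limit concurrent requests.
// limit must be greater than 0, otherwise every request would be rejected.
func newLimiter(limit int) *limiter {
	return &limiter{slots: make(chan struct{}, limit)}
}

// tryAcquire claims a slot without blocking. It returns false when the
// limit has already been reached.
func (l *limiter) tryAcquire() bool {
	select {
	case l.slots <- struct{}{}:
		return true
	default:
		return false
	}
}

// release frees a slot previously claimed with tryAcquire.
func (l *limiter) release() { <-l.slots }

// wrap returns a handler that answers 429 Too Many Requests when no slot is
// free, and otherwise forwards the request to next.
func (l *limiter) wrap(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if !l.tryAcquire() {
			http.Error(w, "Too Many Requests", http.StatusTooManyRequests)
			return
		}
		defer l.release()
		next.ServeHTTP(w, r)
	})
}

func main() {
	l := newLimiter(2)
	fmt.Println(l.tryAcquire(), l.tryAcquire(), l.tryAcquire()) // true true false
	l.release()
	fmt.Println(l.tryAcquire()) // true
}
```

Rejecting immediately instead of queueing keeps the proxy from accumulating waiting requests, which matches the 429 behaviour the comment describes.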
@@ -0,0 +1,159 @@
+package main
+
+// created for issue: #252 https://github.com/mostlygeek/llama-swap/issues/252
+// this simple benchmark tool sends a lot of small chat completion requests to llama-swap
+// to make sure all the requests are accounted for.
+//
+// requests can be sent in parallel, and the tool will report the results.
+// usage: go run main.go -baseurl http://localhost:8080/v1 -model llama3 -requests 1000 -par 5
+
+import (
+	"bytes"
+	"flag"
+	"fmt"
+	"io"
+	"log"
+	"net/http"
+	"os"
+	"sync"
+	"time"
+)
+
+func main() {
+	// ----- CLI arguments ----------------------------------------------------
+	var (
+		baseurl         string
+		modelName       string
+		totalRequests   int
+		parallelization int
+	)
+
+	flag.StringVar(&baseurl, "baseurl", "http://localhost:8080/v1", "Base URL of the API, including the /v1 prefix (e.g., https://api.example.com/v1)")
+	flag.StringVar(&modelName, "model", "", "Model name to use")
+	flag.IntVar(&totalRequests, "requests", 1, "Total number of requests to send")
+	flag.IntVar(&parallelization, "par", 1, "Maximum number of concurrent requests")
+	flag.Parse()
+
+	if baseurl == "" || modelName == "" {
+		fmt.Println("Error: both -baseurl and -model are required.")
+		flag.Usage()
+		os.Exit(1)
+	}
+	if totalRequests <= 0 {
+		fmt.Println("Error: -requests must be greater than 0.")
+		os.Exit(1)
+	}
+	if parallelization <= 0 {
+		fmt.Println("Error: -par must be greater than 0.")
+		os.Exit(1)
+	}
+
+	// ----- HTTP client -------------------------------------------------------
+	client := &http.Client{
+		Timeout: 30 * time.Second,
+	}
+
+	// ----- Tracking response codes -------------------------------------------
+	statusCounts := make(map[int]int) // map[statusCode]count
+	var mu sync.Mutex                 // protects statusCounts
+
+	// ----- Request queue (buffered channel) ----------------------------------
+	requests := make(chan int, 10) // Buffered channel with capacity 10
+
+	// Goroutine to fill the request queue
+	go func() {
+		for i := 0; i < totalRequests; i++ {
+			requests <- i + 1
+		}
+		close(requests)
+	}()
+
+	// ----- Worker pool -------------------------------------------------------
+	var wg sync.WaitGroup
+	for i := 0; i < parallelization; i++ {
+		wg.Add(1)
+		go func(workerID int) {
+			defer wg.Done()
+
+			for reqID := range requests {
+				// Build request payload as a single line JSON string
+				payload := `{"model":"` + modelName + `","max_tokens":100,"stream":false,"messages":[{"role":"user","content":"write a snake game in python"}]}`
+
+				// Send POST request
+				req, err := http.NewRequest(http.MethodPost,
+					fmt.Sprintf("%s/chat/completions", baseurl),
+					bytes.NewReader([]byte(payload)))
+				if err != nil {
+					log.Printf("[worker %d][req %d] request creation error: %v", workerID, reqID, err)
+					mu.Lock()
+					statusCounts[-1]++
+					mu.Unlock()
+					continue
+				}
+				req.Header.Set("Content-Type", "application/json")
+
+				resp, err := client.Do(req)
+				if err != nil {
+					log.Printf("[worker %d][req %d] HTTP request error: %v", workerID, reqID, err)
+					mu.Lock()
+					statusCounts[-1]++
+					mu.Unlock()
+					continue
+				}
+				io.Copy(io.Discard, resp.Body)
+				resp.Body.Close()
+
+				// Record status code
+				mu.Lock()
+				statusCounts[resp.StatusCode]++
+				mu.Unlock()
+			}
+		}(i + 1)
+	}
+
+	// ----- Status ticker (prints every second) -------------------------------
+	done := make(chan struct{})
+	tickerDone := make(chan struct{})
+	go func() {
+		ticker := time.NewTicker(1 * time.Second)
+		startTime := time.Now()
+		for {
+			select {
+			case <-ticker.C:
+				mu.Lock()
+				// Compute how many requests have completed so far
+				completed := 0
+				for _, cnt := range statusCounts {
+					completed += cnt
+				}
+				// Calculate duration and progress
+				duration := time.Since(startTime)
+				progress := completed * 100 / totalRequests
+				fmt.Printf("Duration: %v, Completed: %d%% requests\n", duration, progress)
+				mu.Unlock()
+			case <-done:
+				duration := time.Since(startTime)
+				fmt.Printf("Duration: %v, Completed: %d%% requests\n", duration, 100)
+				close(tickerDone)
+				return
+			}
+		}
+	}()
+
+	// Wait for all workers to finish
+	wg.Wait()
+	close(done)  // stops the status-update goroutine
+	<-tickerDone // give ticker time to finish / print
+
+	// ----- Summary ------------------------------------------------------------
+	fmt.Println("\n\n=== HTTP response code summary ===")
+	mu.Lock()
+	for code, cnt := range statusCounts {
+		if code == -1 {
+			fmt.Printf("Client-side errors (no HTTP response): %d\n", cnt)
+		} else {
+			fmt.Printf("%d : %d\n", code, cnt)
+		}
+	}
+	mu.Unlock()
+}
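
The benchmark tool above builds its request body by concatenating the model name into a JSON string, which would produce invalid JSON if the name contained a quote or backslash. A possible alternative, sketched here with hypothetical type and function names (`chatMessage`, `chatRequest`, `buildPayload`), is to let `encoding/json` do the escaping.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// chatMessage and chatRequest are hypothetical types that model only the
// fields the benchmark tool sends.
type chatMessage struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}

type chatRequest struct {
	Model     string        `json:"model"`
	MaxTokens int           `json:"max_tokens"`
	Stream    bool          `json:"stream"`
	Messages  []chatMessage `json:"messages"`
}

// buildPayload marshals a single-message, non-streaming chat completion
// request. json.Marshal escapes any special characters in model or prompt.
func buildPayload(model, prompt string) ([]byte, error) {
	return json.Marshal(chatRequest{
		Model:     model,
		MaxTokens: 100,
		Stream:    false,
		Messages:  []chatMessage{{Role: "user", Content: prompt}},
	})
}

func main() {
	payload, err := buildPayload(`my "quoted" model`, "write a snake game in python")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(payload))
}
```

The resulting bytes can be passed to `bytes.NewReader` exactly as the tool already does with its string payload.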
@@ -153,6 +153,19 @@ func main() {

 	})

+	// llama-server compatibility: /completion
+	r.POST("/completion", func(c *gin.Context) {
+		c.Header("Content-Type", "application/json")
+		c.JSON(http.StatusOK, gin.H{
+			"responseMessage": *responseMessage,
+			"usage": gin.H{
+				"completion_tokens": 10,
+				"prompt_tokens":     25,
+				"total_tokens":      35,
+			},
+		})
+	})
+
 	// issue #41
 	r.POST("/v1/audio/transcriptions", func(c *gin.Context) {
 		// Parse the multipart form
@@ -237,7 +237,7 @@ func LoadConfigFromReader(r io.Reader) (Config, error) {

 	- name must fit the regex ^[a-zA-Z0-9_-]+$
 	- names must be less than 64 characters (no reason, just cause)
-	- name can not be any reserved macros: PORT
+	- name can not be any reserved macros: PORT, MODEL_ID
 	- macro values must be less than 1024 characters
 	*/
 	macroNameRegex := regexp.MustCompile(`^[a-zA-Z0-9_-]+$`)
@@ -253,6 +253,7 @@ func LoadConfigFromReader(r io.Reader) (Config, error) {
 		}
 		switch macroName {
-		case "PORT":
+		// Go switch cases do not fall through, so the reserved names must share one case
+		case "PORT", "MODEL_ID":
 			return Config{}, fmt.Errorf("macro name '%s' is reserved and cannot be used", macroName)
 		}
 	}
@@ -296,6 +297,11 @@ func LoadConfigFromReader(r io.Reader) (Config, error) {
 			nextPort++
 		}

+		if strings.Contains(modelConfig.Cmd, "${MODEL_ID}") || strings.Contains(modelConfig.CmdStop, "${MODEL_ID}") {
+			modelConfig.Cmd = strings.ReplaceAll(modelConfig.Cmd, "${MODEL_ID}", modelId)
+			modelConfig.CmdStop = strings.ReplaceAll(modelConfig.CmdStop, "${MODEL_ID}", modelId)
+		}
+
 		// make sure there are no unknown macros that have not been replaced
 		macroPattern := regexp.MustCompile(`\$\{([a-zA-Z0-9_-]+)\}`)
 		fieldMap := map[string]string{
@@ -440,3 +440,44 @@ models:
 	expectedCmd := "/user/llama.cpp/build/bin/llama-server --port 9990 --model /path/to/model.gguf -ngl 99"
 	assert.Equal(t, expectedCmd, cmdStr, "Final command does not match expected structure")
 }
+
+func TestConfig_MacroModelId(t *testing.T) {
+	content := `
+startPort: 9000
+macros:
+  "docker-llama": docker run --name ${MODEL_ID} -p ${PORT}:8080 docker_img
+  "docker-stop": docker stop ${MODEL_ID}
+
+models:
+  model1:
+    cmd: /path/to/server -p ${PORT} -hf ${MODEL_ID}
+
+  model2:
+    cmd: ${docker-llama}
+    cmdStop: ${docker-stop}
+
+  author/model:F16:
+    cmd: /path/to/server -p ${PORT} -hf ${MODEL_ID}
+    cmdStop: stop
+`
+
+	config, err := LoadConfigFromReader(strings.NewReader(content))
+	assert.NoError(t, err)
+	sanitizedCmd, err := SanitizeCommand(config.Models["model1"].Cmd)
+	assert.NoError(t, err)
+	assert.Equal(t, "/path/to/server -p 9001 -hf model1", strings.Join(sanitizedCmd, " "))
+
+	assert.Equal(t, "docker stop ${MODEL_ID}", config.Macros["docker-stop"])
+
+	sanitizedCmd2, err := SanitizeCommand(config.Models["model2"].Cmd)
+	assert.NoError(t, err)
+	assert.Equal(t, "docker run --name model2 -p 9002:8080 docker_img", strings.Join(sanitizedCmd2, " "))
+
+	sanitizedCmdStop, err := SanitizeCommand(config.Models["model2"].CmdStop)
+	assert.NoError(t, err)
+	assert.Equal(t, "docker stop model2", strings.Join(sanitizedCmdStop, " "))
+
+	sanitizedCmd3, err := SanitizeCommand(config.Models["author/model:F16"].Cmd)
+	assert.NoError(t, err)
+	assert.Equal(t, "/path/to/server -p 9000 -hf author/model:F16", strings.Join(sanitizedCmd3, " "))
+}
@@ -5,12 +5,20 @@ import (
 	"fmt"
 	"io"
 	"net/http"
+	"strings"
 	"time"

 	"github.com/gin-gonic/gin"
 	"github.com/tidwall/gjson"
 )

+type MetricsRecorder struct {
+	metricsMonitor *MetricsMonitor
+	realModelName  string
+	//	isStreaming    bool
+	startTime time.Time
+}
+
 // MetricsMiddleware sets up the MetricsResponseWriter for capturing upstream requests
 func MetricsMiddleware(pm *ProxyManager) gin.HandlerFunc {
 	return func(c *gin.Context) {
@@ -41,49 +49,47 @@ func MetricsMiddleware(pm *ProxyManager) gin.HandlerFunc {
 			metricsRecorder: &MetricsRecorder{
 				metricsMonitor: pm.metricsMonitor,
 				realModelName:  realModelName,
-				isStreaming:    gjson.GetBytes(bodyBytes, "stream").Bool(),
 				startTime:      time.Now(),
 			},
 		}
 		c.Writer = writer
 		c.Next()

-		rec := writer.metricsRecorder
-		rec.processBody(writer.body)
-	}
-}
+		// check for streaming response
+		if strings.Contains(c.Writer.Header().Get("Content-Type"), "text/event-stream") {
+			writer.metricsRecorder.processStreamingResponse(writer.body)
+		} else {
+			writer.metricsRecorder.processNonStreamingResponse(writer.body)
+		}

-type MetricsRecorder struct {
-	metricsMonitor *MetricsMonitor
-	realModelName  string
-	isStreaming    bool
-	startTime      time.Time
-}
-
-// processBody handles response processing after request completes
-func (rec *MetricsRecorder) processBody(body []byte) {
-	if rec.isStreaming {
-		rec.processStreamingResponse(body)
-	} else {
-		rec.processNonStreamingResponse(body)
 	}
 }

 func (rec *MetricsRecorder) parseAndRecordMetrics(jsonData gjson.Result) bool {
 	usage := jsonData.Get("usage")
-	if !usage.Exists() {
+	timings := jsonData.Get("timings")
+	if !usage.Exists() && !timings.Exists() {
 		return false
 	}

 	// default values
-	outputTokens := int(jsonData.Get("usage.completion_tokens").Int())
-	inputTokens := int(jsonData.Get("usage.prompt_tokens").Int())
+	outputTokens := 0
+	inputTokens := 0
+
+	// timings data
 	tokensPerSecond := -1.0
 	promptPerSecond := -1.0
 	durationMs := int(time.Since(rec.startTime).Milliseconds())

+	if usage.Exists() {
+		outputTokens = int(jsonData.Get("usage.completion_tokens").Int())
+		inputTokens = int(jsonData.Get("usage.prompt_tokens").Int())
+	}
+
 	// use llama-server's timing data for tok/sec and duration as it is more accurate
-	if timings := jsonData.Get("timings"); timings.Exists() {
+	if timings.Exists() {
+		inputTokens = int(jsonData.Get("timings.prompt_n").Int())
+		outputTokens = int(jsonData.Get("timings.predicted_n").Int())
 		promptPerSecond = jsonData.Get("timings.prompt_per_second").Float()
 		tokensPerSecond = jsonData.Get("timings.predicted_per_second").Float()
 		durationMs = int(jsonData.Get("timings.prompt_ms").Float() + jsonData.Get("timings.predicted_ms").Float())
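
The selection logic in the hunk above (use `usage` when present, then let llama-server's `timings` override it) can be sketched without gjson using only the standard library. This is a simplified illustration rather than the middleware's code: `responseStats` and `tokenCounts` are hypothetical names, and only the token counts are modelled.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// responseStats is a hypothetical type covering just the fields read in the
// hunk above. Pointers distinguish "object absent" from "object with zeros".
type responseStats struct {
	Usage *struct {
		PromptTokens     int `json:"prompt_tokens"`
		CompletionTokens int `json:"completion_tokens"`
	} `json:"usage"`
	Timings *struct {
		PromptN    int `json:"prompt_n"`
		PredictedN int `json:"predicted_n"`
	} `json:"timings"`
}

// tokenCounts returns input and output token counts from a response body.
// ok is false when the body is not valid JSON or has neither usage nor timings.
func tokenCounts(body []byte) (input, output int, ok bool) {
	var stats responseStats
	if err := json.Unmarshal(body, &stats); err != nil {
		return 0, 0, false
	}
	if stats.Usage == nil && stats.Timings == nil {
		return 0, 0, false
	}
	if stats.Usage != nil {
		input = stats.Usage.PromptTokens
		output = stats.Usage.CompletionTokens
	}
	// timings take precedence when both objects are present
	if stats.Timings != nil {
		input = stats.Timings.PromptN
		output = stats.Timings.PredictedN
	}
	return input, output, true
}

func main() {
	body := []byte(`{"usage":{"prompt_tokens":25,"completion_tokens":10},"timings":{"prompt_n":7,"predicted_n":3}}`)
	fmt.Println(tokenCounts(body)) // 7 3 true
}
```

Accepting a body that has only `timings` is what allows responses without an OpenAI-style `usage` object, such as those from `/completion`, to be counted.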
@@ -5,6 +5,7 @@ import (
 	"errors"
 	"fmt"
 	"io"
+	"net"
 	"net/http"
 	"net/url"
 	"os/exec"
@@ -363,8 +364,18 @@ func (p *Process) stopCommand() {
 }

 func (p *Process) checkHealthEndpoint(healthURL string) error {
+
 	client := &http.Client{
-		Timeout: 500 * time.Millisecond,
+		// wait a short time for a tcp connection to be established
+		Transport: &http.Transport{
+			DialContext: (&net.Dialer{
+				Timeout: 500 * time.Millisecond,
+			}).DialContext,
+		},
+
+		// give a long time to respond to the health check endpoint
+		// after the connection is established. See issue: 276
+		Timeout: 5000 * time.Millisecond,
 	}

 	req, err := http.NewRequest("GET", healthURL, nil)
@@ -60,10 +60,20 @@ func (pg *ProcessGroup) ProxyRequest(modelID string, writer http.ResponseWriter,
 	if pg.swap {
 		pg.Lock()
 		if pg.lastUsedProcess != modelID {
+
+			// is there something already running?
 			if pg.lastUsedProcess != "" {
 				pg.processes[pg.lastUsedProcess].Stop()
 			}
+
+			// wait for the request to the new model to be fully handled
+			// and prevent race conditions see issue #277
+			pg.processes[modelID].ProxyRequest(writer, request)
 			pg.lastUsedProcess = modelID
+
+			// short circuit and exit
+			pg.Unlock()
+			return nil
 		}
 		pg.Unlock()
 	}
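
The change above serves the request to a newly swapped-in model while the group lock is still held, so that parallel requests for different models are handled one at a time. The toy model below illustrates that idea only. It is a sketch under stated assumptions: `swapGroup`, `serve`, `proxyRequest`, and `peakForDistinctModels` are hypothetical names, the "model" is a short sleep, and stopping the previous process is omitted.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// swapGroup is a hypothetical, simplified stand-in for a group with swap enabled.
type swapGroup struct {
	mu       sync.Mutex // serializes swaps
	lastUsed string

	statsMu sync.Mutex // guards active and peak
	active  int        // requests being served right now
	peak    int        // highest number of requests served at once
}

// serve runs handle and records how many requests were in flight at once.
func (g *swapGroup) serve(handle func()) {
	g.statsMu.Lock()
	g.active++
	if g.active > g.peak {
		g.peak = g.active
	}
	g.statsMu.Unlock()

	handle()

	g.statsMu.Lock()
	g.active--
	g.statsMu.Unlock()
}

// proxyRequest mimics the shape of the patched logic: when the requested
// model differs from the last one used, the request is served while the
// lock is held, so no other swap can start until it has finished.
func (g *swapGroup) proxyRequest(modelID string, handle func()) {
	g.mu.Lock()
	if g.lastUsed != modelID {
		// the previously used model would be stopped here
		g.serve(handle)
		g.lastUsed = modelID
		g.mu.Unlock()
		return
	}
	g.mu.Unlock()
	g.serve(handle) // model already loaded, no swap needed
}

// peakForDistinctModels sends one request per model ID in parallel and
// returns the highest number of requests that were served at the same time.
func peakForDistinctModels(ids []string) int {
	g := &swapGroup{}
	var wg sync.WaitGroup
	for _, id := range ids {
		wg.Add(1)
		go func(id string) {
			defer wg.Done()
			g.proxyRequest(id, func() { time.Sleep(time.Millisecond) })
		}(id)
	}
	wg.Wait()
	return g.peak
}

func main() {
	ids := []string{"model1", "model2", "model3", "model4", "model5"}
	fmt.Println(peakForDistinctModels(ids)) // 1
}
```

Because every ID in the example is distinct, each request takes the swap path and the peak stays at 1. The trade-off of this approach is that a slow request to a newly loaded model blocks other swaps in the group until it completes.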
@@ -4,6 +4,7 @@ import (
 	"bytes"
 	"net/http"
 	"net/http/httptest"
+	"sync"
 	"testing"

 	"github.com/stretchr/testify/assert"
@@ -44,32 +45,49 @@ func TestProcessGroup_HasMember(t *testing.T) {
 	assert.False(t, pg.HasMember("model3"))
 }

-func TestProcessGroup_ProxyRequestSwapIsTrue(t *testing.T) {
+// TestProcessGroup_ProxyRequestSwapIsTrueParallel tests that when swap is true
+// and multiple requests are made in parallel, only one process is running at a time.
+func TestProcessGroup_ProxyRequestSwapIsTrueParallel(t *testing.T) {
+	var processGroupTestConfig = AddDefaultGroupToConfig(Config{
+		HealthCheckTimeout: 15,
+		Models: map[string]ModelConfig{
+			// use the same listening port for every model so that starting a model
+			// while another one is still running fails. This tests that swap
+			// isolation works properly when parallel requests are made at the
+			// same time.
+			"model1": getTestSimpleResponderConfigPort("model1", 9832),
+			"model2": getTestSimpleResponderConfigPort("model2", 9832),
+			"model3": getTestSimpleResponderConfigPort("model3", 9832),
+			"model4": getTestSimpleResponderConfigPort("model4", 9832),
+			"model5": getTestSimpleResponderConfigPort("model5", 9832),
+		},
+		Groups: map[string]GroupConfig{
+			"G1": {
+				Swap:    true,
+				Members: []string{"model1", "model2", "model3", "model4", "model5"},
+			},
+		},
+	})
+
 	pg := NewProcessGroup("G1", processGroupTestConfig, testLogger, testLogger)
 	defer pg.StopProcesses(StopWaitForInflightRequest)

-	tests := []string{"model1", "model2"}
+	tests := []string{"model1", "model2", "model3", "model4", "model5"}

+	var wg sync.WaitGroup
+
+	wg.Add(len(tests))
 	for _, modelName := range tests {
-		t.Run(modelName, func(t *testing.T) {
-			reqBody := `{"x", "y"}`
-			req := httptest.NewRequest("POST", "/v1/chat/completions", bytes.NewBufferString(reqBody))
+		go func(modelName string) {
+			defer wg.Done()
+			req := httptest.NewRequest("POST", "/v1/chat/completions", nil)
 			w := httptest.NewRecorder()
-
 			assert.NoError(t, pg.ProxyRequest(modelName, w, req))
 			assert.Equal(t, http.StatusOK, w.Code)
 			assert.Contains(t, w.Body.String(), modelName)
-
-			// make sure only one process is in the running state
-			count := 0
-			for _, process := range pg.processes {
-				if process.CurrentState() == StateReady {
-					count++
-				}
-			}
-			assert.Equal(t, 1, count)
-		})
+		}(modelName)
 	}
+	wg.Wait()
 }

 func TestProcessGroup_ProxyRequestSwapIsFalse(t *testing.T) {
@@ -191,11 +191,20 @@ func (pm *ProxyManager) setupGinEngine() {
 	// Support legacy /v1/completions api, see issue #12
 	pm.ginEngine.POST("/v1/completions", mm, pm.proxyOAIHandler)

-	// Support embeddings
+	// Support embeddings and reranking
 	pm.ginEngine.POST("/v1/embeddings", mm, pm.proxyOAIHandler)
+
+	// llama-server's /reranking endpoint + aliases
+	pm.ginEngine.POST("/reranking", mm, pm.proxyOAIHandler)
+	pm.ginEngine.POST("/rerank", mm, pm.proxyOAIHandler)
 	pm.ginEngine.POST("/v1/rerank", mm, pm.proxyOAIHandler)
 	pm.ginEngine.POST("/v1/reranking", mm, pm.proxyOAIHandler)
-	pm.ginEngine.POST("/rerank", mm, pm.proxyOAIHandler)
+
+	// llama-server's /infill endpoint for code infilling
+	pm.ginEngine.POST("/infill", mm, pm.proxyOAIHandler)
+
+	// llama-server's /completion endpoint
+	pm.ginEngine.POST("/completion", mm, pm.proxyOAIHandler)

 	// Support audio/speech endpoint
 	pm.ginEngine.POST("/v1/audio/speech", pm.proxyOAIHandler)
@@ -132,7 +132,7 @@ func (pm *ProxyManager) apiSendEvents(c *gin.Context) {
 		}
 	}

-	sendMetrics := func(metrics TokenMetrics) {
+	sendMetrics := func(metrics []TokenMetrics) {
 		jsonData, err := json.Marshal(metrics)
 		if err == nil {
 			select {
@@ -168,16 +168,14 @@ func (pm *ProxyManager) apiSendEvents(c *gin.Context) {
 	 * Send Metrics data
 	 */
 	defer event.On(func(e TokenMetricsEvent) {
-		sendMetrics(e.Metrics)
+		sendMetrics([]TokenMetrics{e.Metrics})
 	})()

 	// send initial batch of data
 	sendLogData("proxy", pm.proxyLogger.GetHistory())
 	sendLogData("upstream", pm.upstreamLogger.GetHistory())
 	sendModels()
-	for _, metrics := range pm.metricsMonitor.GetMetrics() {
-		sendMetrics(metrics)
-	}
+	sendMetrics(pm.metricsMonitor.GetMetrics())

 	for {
 		select {
@@ -42,7 +42,6 @@ func TestProxyManager_SwapProcessCorrectly(t *testing.T) {
 		assert.Contains(t, w.Body.String(), modelName)
 	}
 }
-
 func TestProxyManager_SwapMultiProcess(t *testing.T) {
 	config := AddDefaultGroupToConfig(Config{
 		HealthCheckTimeout: 15,
@@ -834,6 +833,28 @@ func TestProxyManager_HealthEndpoint(t *testing.T) {
 	assert.Equal(t, "OK", rec.Body.String())
 }

+// Ensure the custom llama-server /completion endpoint proxies correctly
+func TestProxyManager_CompletionEndpoint(t *testing.T) {
+	config := AddDefaultGroupToConfig(Config{
+		HealthCheckTimeout: 15,
+		Models: map[string]ModelConfig{
+			"model1": getTestSimpleResponderConfig("model1"),
+		},
+		LogLevel: "error",
+	})
+
+	proxy := New(config)
+	defer proxy.StopProcesses(StopWaitForInflightRequest)
+
+	reqBody := `{"model":"model1"}`
+	req := httptest.NewRequest("POST", "/completion", bytes.NewBufferString(reqBody))
+	w := httptest.NewRecorder()
+
+	proxy.ServeHTTP(w, req)
+	assert.Equal(t, http.StatusOK, w.Code)
+	assert.Contains(t, w.Body.String(), "model1")
+}
+
 func TestProxyManager_StartupHooks(t *testing.T) {

 	// using real YAML as the configuration has gotten more complex
@@ -1,50 +1,78 @@
+import { useEffect, useCallback } from "react";
 import { BrowserRouter as Router, Routes, Route, Navigate, NavLink } from "react-router-dom";
 import { useTheme } from "./contexts/ThemeProvider";
-import { APIProvider } from "./contexts/APIProvider";
+import { useAPI } from "./contexts/APIProvider";
 import LogViewerPage from "./pages/LogViewer";
 import ModelPage from "./pages/Models";
 import ActivityPage from "./pages/Activity";
+import ConnectionStatusIcon from "./components/ConnectionStatus";
 import { RiSunFill, RiMoonFill } from "react-icons/ri";

 function App() {
-  const { isNarrow, toggleTheme, isDarkMode } = useTheme();
+  const { isNarrow, toggleTheme, isDarkMode, appTitle, setAppTitle, setConnectionState } = useTheme();
+  const handleTitleChange = useCallback(
+    (newTitle: string) => {
+      setAppTitle(newTitle.replace(/\n/g, "").trim().substring(0, 64) || "llama-swap");
+    },
+    [setAppTitle]
+  );
+
+  const { connectionStatus } = useAPI();
+
+  // Keep the connection state shown in the window title in sync with the actual connection state
+  useEffect(() => {
+    setConnectionState(connectionStatus);
+  }, [connectionStatus]);

  return (
    <Router basename="/ui/">
-      <APIProvider>
-        <div className="flex flex-col h-screen">
-          <nav className="bg-surface border-b border-border p-2 h-[75px]">
-            <div className="flex items-center justify-between mx-auto px-4 h-full">
-              {!isNarrow && <h1 className="flex items-center p-0">llama-swap</h1>}
-              <div className="flex items-center space-x-4">
-                <NavLink to="/" className={({ isActive }) => (isActive ? "navlink active" : "navlink")}>
-                  Logs
-                </NavLink>
-
-                <NavLink to="/models" className={({ isActive }) => (isActive ? "navlink active" : "navlink")}>
-                  Models
-                </NavLink>
-
-                <NavLink to="/activity" className={({ isActive }) => (isActive ? "navlink active" : "navlink")}>
-                  Activity
-                </NavLink>
-                <button className="" onClick={toggleTheme}>
-                  {isDarkMode ? <RiMoonFill /> : <RiSunFill />}
-                </button>
-              </div>
+      <div className="flex flex-col h-screen">
+        <nav className="bg-surface border-b border-border p-2 h-[75px]">
+          <div className="flex items-center justify-between mx-auto px-4 h-full">
+            {!isNarrow && (
+              <h1
+                contentEditable
+                suppressContentEditableWarning
+                className="flex items-center p-0 outline-none hover:bg-gray-100 dark:hover:bg-gray-700 rounded px-1"
+                onBlur={(e) => handleTitleChange(e.currentTarget.textContent || "(set title)")}
+                onKeyDown={(e) => {
+                  if (e.key === "Enter") {
+                    e.preventDefault();
+                    handleTitleChange(e.currentTarget.textContent || "(set title)");
+                    e.currentTarget.blur();
+                  }
+                }}
+              >
+                {appTitle}
+              </h1>
+            )}
+            <div className="flex items-center space-x-4">
+              <NavLink to="/" className={({ isActive }) => (isActive ? "navlink active" : "navlink")}>
+                Logs
+              </NavLink>
+              <NavLink to="/models" className={({ isActive }) => (isActive ? "navlink active" : "navlink")}>
+                Models
+              </NavLink>
+              <NavLink to="/activity" className={({ isActive }) => (isActive ? "navlink active" : "navlink")}>
+                Activity
+              </NavLink>
+              <button className="" onClick={toggleTheme}>
+                {isDarkMode ? <RiMoonFill /> : <RiSunFill />}
+              </button>
+              <ConnectionStatusIcon />
            </div>
-          </nav>
+          </div>
+        </nav>

-          <main className="flex-1 overflow-auto p-4">
-            <Routes>
-              <Route path="/" element={<LogViewerPage />} />
-              <Route path="/models" element={<ModelPage />} />
-              <Route path="/activity" element={<ActivityPage />} />
-              <Route path="*" element={<Navigate to="/" replace />} />
-            </Routes>
-          </main>
-        </div>
-      </APIProvider>
+        <main className="flex-1 overflow-auto p-4">
+          <Routes>
+            <Route path="/" element={<LogViewerPage />} />
+            <Route path="/models" element={<ModelPage />} />
+            <Route path="/activity" element={<ActivityPage />} />
+            <Route path="*" element={<Navigate to="/" replace />} />
+          </Routes>
+        </main>
+      </div>
    </Router>
  );
 }
@@ -0,0 +1,26 @@
+import { useAPI } from "../contexts/APIProvider";
+import { useMemo } from "react";
+
+const ConnectionStatusIcon = () => {
+  const { connectionStatus } = useAPI();
+
+  const eventStatusColor = useMemo(() => {
+    switch (connectionStatus) {
+      case "connected":
+        return "bg-green-500";
+      case "connecting":
+        return "bg-yellow-500";
+      case "disconnected":
+      default:
+        return "bg-red-500";
+    }
+  }, [connectionStatus]);
+
+  return (
+    <div className="flex items-center" title={`event stream: ${connectionStatus}`}>
+      <span className={`inline-block w-3 h-3 rounded-full ${eventStatusColor} mr-2`}></span>
+    </div>
+  );
+};
+
+export default ConnectionStatusIcon;
@@ -1,4 +1,5 @@
 import { useRef, createContext, useState, useContext, useEffect, useCallback, useMemo, type ReactNode } from "react";
+import type { ConnectionState } from "../lib/types";

 type ModelStatus = "ready" | "starting" | "stopping" | "stopped" | "shutdown" | "unknown";
 const LOG_LENGTH_LIMIT = 1024 * 100; /* 100KB of log data */
@@ -20,6 +21,7 @@ interface APIProviderType {
  proxyLogs: string;
  upstreamLogs: string;
  metrics: Metrics[];
+  connectionStatus: ConnectionState;
 }

 interface Metrics {
@@ -52,6 +54,7 @@ export function APIProvider({ children, autoStartAPIEvents = true }: APIProvider
  const [proxyLogs, setProxyLogs] = useState("");
  const [upstreamLogs, setUpstreamLogs] = useState("");
  const [metrics, setMetrics] = useState<Metrics[]>([]);
+  const [connectionStatus, setConnectionState] = useState<ConnectionState>("disconnected");
  const apiEventSource = useRef<EventSource | null>(null);

  const [models, setModels] = useState<Model[]>([]);
@@ -75,7 +78,20 @@ export function APIProvider({ children, autoStartAPIEvents = true }: APIProvider
    const initialDelay = 1000; // 1 second

    const connect = () => {
+      apiEventSource.current = null;
      const eventSource = new EventSource("/api/events");
+      setConnectionState("connecting");
+
+      eventSource.onopen = () => {
+        // clear everything out on connect to keep things in sync
+        setProxyLogs("");
+        setUpstreamLogs("");
+        setMetrics([]); // clear metrics on reconnect
+        setModels([]); // clear models on reconnect
+        apiEventSource.current = eventSource;
+        retryCount = 0;
+        setConnectionState("connected");
+      };

      eventSource.onmessage = (e: MessageEvent) => {
        try {
@@ -108,9 +124,9 @@ export function APIProvider({ children, autoStartAPIEvents = true }: APIProvider

            case "metrics":
              {
-                const newMetric = JSON.parse(message.data) as Metrics;
+                const newMetrics = JSON.parse(message.data) as Metrics[];
                setMetrics((prevMetrics) => {
-                  return [newMetric, ...prevMetrics];
+                  return [...newMetrics, ...prevMetrics];
                });
              }
              break;
@@ -119,14 +135,14 @@ export function APIProvider({ children, autoStartAPIEvents = true }: APIProvider
          console.error(e.data, err);
        }
      };
+
      eventSource.onerror = () => {
        eventSource.close();
        retryCount++;
        const delay = Math.min(initialDelay * Math.pow(2, retryCount - 1), 5000);
+        setConnectionState("disconnected");
        setTimeout(connect, delay);
      };
-
-      apiEventSource.current = eventSource;
    };

    connect();
@@ -194,6 +210,7 @@ export function APIProvider({ children, autoStartAPIEvents = true }: APIProvider
      proxyLogs,
      upstreamLogs,
      metrics,
+      connectionStatus,
    }),
    [models, listModels, unloadAllModels, loadModel, enableAPIEvents, proxyLogs, upstreamLogs, metrics]
  );
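
Reviewer note: the `onerror` handler in the hunks above computes its reconnect delay inline. A minimal sketch of that schedule follows, assuming the 1000 ms initial delay and 5000 ms cap shown in the diff. The names `reconnectDelay`, `INITIAL_DELAY_MS` and `MAX_DELAY_MS` are hypothetical and do not appear in the source.

```typescript
// Hypothetical helper: the diff inlines this expression inside eventSource.onerror.
const INITIAL_DELAY_MS = 1000; // matches `initialDelay` in the diff
const MAX_DELAY_MS = 5000; // matches the cap passed to Math.min in the diff

// retryCount is 1 for the first failed attempt, 2 for the second, and so on.
function reconnectDelay(retryCount: number): number {
  return Math.min(INITIAL_DELAY_MS * Math.pow(2, retryCount - 1), MAX_DELAY_MS);
}

// The delay doubles on each retry until it reaches the cap.
const delays = [1, 2, 3, 4, 5].map(reconnectDelay);
console.log(delays); // [1000, 2000, 4000, 5000, 5000]
```

Because `onopen` resets `retryCount` to 0, a successful connection restarts the schedule at 1000 ms.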
@@ -1,5 +1,6 @@
 import { createContext, useContext, useEffect, type ReactNode, useMemo, useState } from "react";
 import { usePersistentState } from "../hooks/usePersistentState";
+import type { ConnectionState } from "../lib/types";

 type ScreenWidth = "xs" | "sm" | "md" | "lg" | "xl" | "2xl";
 type ThemeContextType = {
@@ -7,6 +8,11 @@ type ThemeContextType = {
  screenWidth: ScreenWidth;
  isNarrow: boolean;
  toggleTheme: () => void;
+
+  // for managing the window title and connection state information
+  appTitle: string;
+  setAppTitle: (title: string) => void;
+  setConnectionState: (state: ConnectionState) => void;
 };

 const ThemeContext = createContext<ThemeContextType | undefined>(undefined);
@@ -16,6 +22,17 @@ type ThemeProviderProps = {
 };

 export function ThemeProvider({ children }: ThemeProviderProps) {
+  const [appTitle, setAppTitle] = usePersistentState("app-title", "llama-swap");
+  const [connectionState, setConnectionState] = useState<ConnectionState>("disconnected");
+
+  /**
+   * Set the document.title with informative information
+   */
+  useEffect(() => {
+    const connectionIcon = connectionState === "connecting" ? "🟡" : connectionState === "connected" ? "🟢" : "🔴";
+    document.title = connectionIcon + " " + appTitle; // Set initial title
+  }, [appTitle, connectionState]);
+
  const [isDarkMode, setIsDarkMode] = usePersistentState<boolean>("theme", false);
  const [screenWidth, setScreenWidth] = useState<ScreenWidth>("md"); // Default to md

@@ -55,7 +72,19 @@ export function ThemeProvider({ children }: ThemeProviderProps) {
  }, [screenWidth]);

  return (
-    <ThemeContext.Provider value={{ isDarkMode, toggleTheme, screenWidth, isNarrow }}>{children}</ThemeContext.Provider>
+    <ThemeContext.Provider
+      value={{
+        isDarkMode,
+        toggleTheme,
+        screenWidth,
+        isNarrow,
+        appTitle,
+        setAppTitle,
+        setConnectionState,
+      }}
+    >
+      {children}
+    </ThemeContext.Provider>
  );
 }

@@ -0,0 +1 @@
+export type ConnectionState = "connected" | "connecting" | "disconnected";
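
Reviewer note: `ThemeProvider` maps this union to a title icon with a nested ternary. The sketch below shows one possible alternative, a lookup table typed as `Record<ConnectionState, string>`, which makes the compiler flag a missing entry if the union ever grows. The names `TITLE_ICONS` and `titleFor` are hypothetical, and this is not how the diff implements it.

```typescript
type ConnectionState = "connected" | "connecting" | "disconnected";

// Hypothetical lookup table; the icons are the ones used in the diff.
const TITLE_ICONS: Record<ConnectionState, string> = {
  connected: "🟢",
  connecting: "🟡",
  disconnected: "🔴",
};

// Hypothetical helper that builds the string assigned to document.title.
function titleFor(state: ConnectionState, appTitle: string): string {
  return TITLE_ICONS[state] + " " + appTitle;
}

console.log(titleFor("connecting", "llama-swap")); // "🟡 llama-swap"
```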
@@ -3,11 +3,14 @@ import { createRoot } from "react-dom/client";
 import "./index.css";
 import App from "./App.tsx";
 import { ThemeProvider } from "./contexts/ThemeProvider";
+import { APIProvider } from "./contexts/APIProvider";

 createRoot(document.getElementById("root")!).render(
  <StrictMode>
    <ThemeProvider>
-      <App />
+      <APIProvider>
+        <App />
+      </APIProvider>
    </ThemeProvider>
  </StrictMode>
 );
@@ -1,4 +1,4 @@
-import { useState, useEffect } from "react";
+import { useMemo } from "react";
 import { useAPI } from "../contexts/APIProvider";

 const formatTimestamp = (timestamp: string): string => {
@@ -15,25 +15,10 @@ const formatDuration = (ms: number): string => {

 const ActivityPage = () => {
  const { metrics } = useAPI();
-  const [error, setError] = useState<string | null>(null);
-
-  useEffect(() => {
-    if (metrics.length > 0) {
-      setError(null);
-    }
+  const sortedMetrics = useMemo(() => {
+    return [...metrics].sort((a, b) => b.id - a.id);
  }, [metrics]);

-  if (error) {
-    return (
-      <div className="p-6">
-        <h1 className="text-2xl font-bold mb-4">Activity</h1>
-        <div className="bg-red-50 border border-red-200 rounded-md p-4">
-          <p className="text-red-800">{error}</p>
-        </div>
-      </div>
-    );
-  }
-
  return (
    <div className="p-6">
      <h1 className="text-2xl font-bold mb-4">Activity</h1>
@@ -47,6 +32,7 @@ const ActivityPage = () => {
          <table className="min-w-full divide-y">
            <thead>
              <tr>
+                <th className="px-4 py-3 text-left text-xs font-medium uppercase tracking-wider">Id</th>
                <th className="px-6 py-3 text-left text-xs font-medium uppercase tracking-wider">Timestamp</th>
                <th className="px-6 py-3 text-left text-xs font-medium uppercase tracking-wider">Model</th>
                <th className="px-6 py-3 text-left text-xs font-medium uppercase tracking-wider">Input Tokens</th>
@@ -57,8 +43,9 @@ const ActivityPage = () => {
              </tr>
            </thead>
            <tbody className="divide-y">
-              {metrics.map((metric, index) => (
-                <tr key={`${metric.id}-${index}`}>
+              {sortedMetrics.map((metric) => (
+                <tr key={`metric_${metric.id}`}>
+                  <td className="px-4 py-4 whitespace-nowrap text-sm">{metric.id + 1 /* un-zero index */}</td>
                  <td className="px-6 py-4 whitespace-nowrap text-sm">{formatTimestamp(metric.timestamp)}</td>
                  <td className="px-6 py-4 whitespace-nowrap text-sm">{metric.model}</td>
                  <td className="px-6 py-4 whitespace-nowrap text-sm">{metric.input_tokens.toLocaleString()}</td>
Author	SHA1	Message	Date
Benson Wong	a533aec736	small tweak to example config	2025-09-01 21:26:58 -07:00
Brett Profitt	97b17fc47d	Add ${MODEL_ID} macro (#226) The automatic ${MODEL_ID} macro includes the name of the model and can be used in Cmd and CmdStop.	2025-09-01 21:21:37 -07:00
Benson Wong	2457840698	Update README.md [skip ci]	2025-08-28 23:44:37 -07:00
Benson Wong	7f55494151	Update README.md [skip ci]	2025-08-28 22:47:28 -07:00
Benson Wong	831a90d3b0	Add different timeout scenarios to Process.checkHealthEndpoint #276 (#278) - add a TCP connection timeout of 500ms - increase HTTP client timeout to 5000ms In this new behaviour the upstream has 500ms to accept a tcp connection and 5000ms to respond to the HTTP request.	2025-08-28 22:03:14 -07:00
Yandrik	977f1856bb	add /completion endpoint (#275) * feat: add /completion endpoint * chore: reformat using gofmt	2025-08-28 21:41:02 -07:00
Benson Wong	52b329f7bc	Fix #277 race condition in ProcessGroup.ProxyRequest when swap=true	2025-08-28 21:38:40 -07:00
Benson Wong	57803fd3aa	Support llama-server's /infill endpoint (#272) Add support for llama-server's /infill endpoint and metrics gathering on the Activities page.	2025-08-27 08:36:05 -07:00
Benson Wong	c55d0cc842	Add docs for model.concurrencyLimit #263 [skip ci]	2025-08-22 16:08:37 -07:00
Benson Wong	7acbaf4712	Add connection status indicator in UI (#260) * show connection status as icon in UI title * make connection status event driven	2025-08-20 13:58:24 -07:00
Benson Wong	fcc5ad135a	UI: Allow editing of title (#246) - make <h1> title contentEditable - title setting persists across reloads in localStorage	2025-08-17 09:42:06 -07:00
Benson Wong	305e5a0031	improve example config [skip ci]	2025-08-17 09:19:04 -07:00
Benson Wong	04fc67354a	Improve Activity event handling in the UI (#254) Improve Activity event handling in the UI - fixes #252 found that the Activity page showed activity inconsistent with /api/metrics - Change data structure for event metrics to array. - Add Event stream connections status indicator	2025-08-15 21:44:08 -07:00
Benson Wong	4662cf7699	add 'unconfirmed bug' as default label in bug-report.md	2025-08-15 15:38:12 -07:00
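
Reviewer note on 04fc67354a: since metrics now arrive as arrays and are prepended in batches, `ActivityPage` sorts a copy by `id` before rendering. The sketch below isolates that step, assuming only that each metric carries a numeric `id`. The names `MetricLike` and `sortNewestFirst` are hypothetical; the diff performs the sort inline inside `useMemo`.

```typescript
// Hypothetical trimmed shape: only the field the sort relies on, plus one for display.
interface MetricLike {
  id: number;
  model: string;
}

// Copy before sorting: Array.prototype.sort works in place, and React state
// arrays should not be mutated.
function sortNewestFirst<T extends { id: number }>(metrics: readonly T[]): T[] {
  return [...metrics].sort((a, b) => b.id - a.id);
}

const received: MetricLike[] = [
  { id: 0, model: "a" },
  { id: 2, model: "c" },
  { id: 1, model: "b" },
];

const sorted = sortNewestFirst(received);
console.log(sorted.map((m) => m.id)); // [2, 1, 0]
console.log(received.map((m) => m.id)); // [0, 2, 1] (input left untouched)
```

With a stable `id` per metric, the row key can be derived from the `id` alone, as the diff does with `metric_${metric.id}`, rather than from the array index.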