Support llama-server's /infill endpoint (#272)

Add support for llama-server's /infill endpoint and metrics gathering on the Activity page.
Add docs for model.concurrencyLimit #263 [skip ci]
2025-08-27 08:36:05 -07:00 · 2025-08-22 16:08:37 -07:00 · 2025-08-20 13:58:24 -07:00 · 2025-08-17 09:42:06 -07:00 · 2025-08-17 09:19:04 -07:00 · 2025-08-15 21:44:08 -07:00
22 changed files with 605 additions and 129 deletions
@@ -2,7 +2,7 @@
 name: Bug Report
 about: I found a defect
 title: ''
-labels: bug
+labels: 'unconfirmed bug'
 assignees: ''

 ---
@@ -4,3 +4,4 @@ build/
 dist/
 .vscode
 .DS_Store
+.dev/
@@ -18,9 +18,11 @@ Written in golang, it is very easy to install (single binary with no dependencie
  - `v1/completions`
  - `v1/chat/completions`
  - `v1/embeddings`
-  - `v1/rerank`, `v1/reranking`, `rerank`
  - `v1/audio/speech` ([#36](https://github.com/mostlygeek/llama-swap/issues/36))
  - `v1/audio/transcriptions` ([docs](https://github.com/mostlygeek/llama-swap/issues/41#issuecomment-2722637867))
+- ✅ llama-server (llama.cpp) supported endpoints:
+  - `v1/rerank`, `v1/reranking`, `/rerank`
+  - `/infill` - for code infilling
 - ✅ llama-swap custom API endpoints
  - `/ui` - web UI
  - `/log` - remote log monitoring
@@ -31,8 +33,9 @@ Written in golang, it is very easy to install (single binary with no dependencie
 - ✅ Run multiple models at once with `Groups` ([#107](https://github.com/mostlygeek/llama-swap/issues/107))
 - ✅ Automatic unloading of models after timeout by setting a `ttl`
 - ✅ Use any local OpenAI compatible server (llama.cpp, vllm, tabbyAPI, etc)
-- ✅ Docker and Podman support
+- ✅ Reliable Docker and Podman support with `cmdStart` and `cmdStop`
 - ✅ Full control over server settings per model
+- ✅ Preload models on startup with `hooks` ([#235](https://github.com/mostlygeek/llama-swap/pull/235))

 ## How does llama-swap work?

@@ -42,9 +45,9 @@ In the most basic configuration llama-swap handles one model at a time. For more

 ## config.yaml

-llama-swap is managed entirely through a yaml configuration file. 
+llama-swap is managed entirely through a yaml configuration file.

-It can be very minimal to start: 
+It can be very minimal to start:

 ```yaml
 models:
@@ -55,7 +58,7 @@ models:
      --port ${PORT}
 ```

-However, there are many more capabilities that llama-swap supports: 
+However, there are many more capabilities that llama-swap supports:

 - `groups` to run multiple models at once
 - `ttl` to automatically unload models
@@ -71,9 +74,13 @@ See the [configuration documentation](https://github.com/mostlygeek/llama-swap/w

 ## Web UI

-llama-swap ships with a real time web interface to monitor logs and status of models:
+llama-swap includes a real time web interface for monitoring logs and models:

-<img width="1786" height="1334" alt="image" src="https://github.com/user-attachments/assets/d6258cb9-1dad-40db-828f-2be860aec8fe" />
+<img width="1360" height="963" alt="image" src="https://github.com/user-attachments/assets/adef4a8e-de0b-49db-885a-8f6dedae6799" />
+
+The Activity Page shows recent requests:
+
+<img width="1360" height="963" alt="image" src="https://github.com/user-attachments/assets/5f3edee6-d03a-4ae5-ae06-b20ac1f135bd" />

 ## Installation

@@ -86,7 +93,7 @@ llama-swap can be installed in multiple ways

 ### Docker Install ([download images](https://github.com/mostlygeek/llama-swap/pkgs/container/llama-swap))

-Docker images with llama-swap and llama-server are built nightly. 
+Docker images with llama-swap and llama-server are built nightly.

 ```shell
 # use CPU inference comes with the example config above
@@ -133,10 +140,10 @@ $ docker run -it --rm --runtime nvidia -p 9292:8080 \

 ### Homebrew Install (macOS/Linux)

-The latest release of `llama-swap` can be installed via [Homebrew](https://brew.sh). 
+The latest release of `llama-swap` can be installed via [Homebrew](https://brew.sh).

 ```shell
-# Set up tap and install formula 
+# Set up tap and install formula
 brew tap mostlygeek/llama-swap
 brew install llama-swap
 # Run llama-swap
@@ -1,9 +1,17 @@
 # llama-swap YAML configuration example
 # -------------------------------------
 #
+# 💡 Tip - Use an LLM with this file!
+# ====================================
+#  This example configuration is written to be LLM friendly. Try
+#  copying this file into an LLM and asking it to explain or generate
+#  sections for you.
+# ====================================
+
+# Usage notes:
 # - Below are all the available configuration options for llama-swap.
-# - Settings with a default value, or noted as optional can be omitted.
-# - Settings that are marked required must be in your configuration file
+# - Settings noted as "required" must be in your configuration file
+# - Settings noted as "optional" can be omitted

 # healthCheckTimeout: number of seconds to wait for a model to be ready to serve requests
 # - optional, default: 120
@@ -27,9 +35,9 @@ metricsMaxInMemory: 1000
 # - it is automatically incremented for every model that uses it
 startPort: 10001

-# macros: sets a dictionary of string:string pairs
+# macros: a dictionary of string substitutions
 # - optional, default: empty dictionary
-# - these are reusable snippets
+# - macros are reusable snippets
 # - used in a model's cmd, cmdStop, proxy and checkEndpoint
 # - useful for reducing common configuration settings
 macros:
@@ -92,44 +100,55 @@ models:

    # checkEndpoint: URL path to check if the server is ready
    # - optional, default: /health
-    # - use "none" to skip endpoint ready checking
    # - endpoint is expected to return an HTTP 200 response
-    # - all requests wait until the endpoint is ready (or fails)
+    # - all requests wait until the endpoint is ready or fails
+    # - use "none" to skip endpoint health checking
    checkEndpoint: /custom-endpoint

-    # ttl: automatically unload the model after this many seconds
+    # ttl: automatically unload the model after ttl seconds
    # - optional, default: 0
    # - ttl values must be a value greater than 0
    # - a value of 0 disables automatic unloading of the model
    ttl: 60

-    # useModelName: overrides the model name that is sent to upstream server
+    # useModelName: override the model name that is sent to upstream server
    # - optional, default: ""
-    # - useful when the upstream server expects a specific model name or format
+    # - useful for when the upstream server expects a specific model name that
+    #   is different from the model's ID
    useModelName: "qwen:qwq"

    # filters: a dictionary of filter settings
    # - optional, default: empty dictionary
+    # - only strip_params is currently supported
    filters:
      # strip_params: a comma separated list of parameters to remove from the request
      # - optional, default: ""
-      # - useful for preventing overriding of default server params by requests
-      # - `model` parameter is never removed
+      # - useful for server side enforcement of sampling parameters
+      # - the `model` parameter can never be removed
      # - can be any JSON key in the request body
      # - recommended to stick to sampling parameters
      strip_params: "temperature, top_p, top_k"

+    # concurrencyLimit: overrides the allowed number of active parallel requests to a model
+    # - optional, default: 0
+    # - useful for limiting the number of active parallel requests a model can process
+    # - must be set per model
+    # - any number greater than 0 will override the internal default value of 10
+    # - any request that exceeds the limit will receive an HTTP 429 Too Many Requests response
+    # - recommended to be omitted and the default used
+    concurrencyLimit: 0
+
  # Unlisted model example:
  "qwen-unlisted":
-    # unlisted: true or false
+    # unlisted: boolean, true or false
    # - optional, default: false
-    # - unlisted models do not show up in /v1/models or /upstream lists
+    # - unlisted models do not show up in /v1/models api requests
    # - can be requested as normal through all apis
    unlisted: true
    cmd: llama-server --port ${PORT} -m Llama-3.2-1B-Instruct-Q4_K_M.gguf -ngl 0

  # Docker example:
-  # container run times like Docker and Podman can also be used with a
+  # container runtimes like Docker and Podman can be used reliably with
  # a combination of cmd and cmdStop.
  "docker-llama":
    proxy: "http://127.0.0.1:${PORT}"
@@ -142,24 +161,26 @@ models:
    # cmdStop: command to run to stop the model gracefully
    # - optional, default: ""
    # - useful for stopping commands managed by another system
-    # - on POSIX systems: a SIGTERM is sent for graceful shutdown
-    # - on Windows, taskkill is used
-    # - processes are given 5 seconds to shutdown until they are forcefully killed
    # - the upstream's process id is available in the ${PID} macro
+    #
+    # When empty, llama-swap has this default behaviour:
+    # - on POSIX systems: a SIGTERM signal is sent
+    # - on Windows, calls taskkill to stop the process
+    # - processes have 5 seconds to shut down before forceful termination is attempted
    cmdStop: docker stop dockertest

 # groups: a dictionary of group settings
 # - optional, default: empty dictionary
-# - provide advanced controls over model swapping behaviour.
-# - Using groups some models can be kept loaded indefinitely, while others are swapped out.
-# - model ids must be defined in the Models section
+# - provides advanced controls over model swapping behaviour
+# - using groups some models can be kept loaded indefinitely, while others are swapped out
+# - model IDs must be defined in the Models section
 # - a model can only be a member of one group
 # - group behaviour is controlled via the `swap`, `exclusive` and `persistent` fields
 # - see issue #109 for details
 #
 # NOTE: the example below uses model names that are not defined above for demonstration purposes
 groups:
-  # group1 is same as the default behaviour of llama-swap where only one model is allowed
+  # group1 works the same as the default behaviour of llama-swap where only one model is allowed
   # to run at a time across the whole llama-swap instance
  "group1":
     # swap: controls the model swapping behaviour within the group
@@ -181,10 +202,13 @@ groups:
      - "qwen-unlisted"

  # Example:
-  # - in this group all the models can run at the same time
-  # - when a different group loads all running models in this group are unloaded
+  # - in group2 all models can run at the same time
+  # - when a different group is loaded it causes all running models in this group to unload
  "group2":
    swap: false
+
+    # exclusive: false does not unload other groups when a model in group2 is requested
+    # - the models in group2 will be loaded but will not unload any other groups
    exclusive: false
    members:
      - "docker-llama"
@@ -207,3 +231,19 @@ groups:
      - "forever-modelA"
      - "forever-modelB"
      - "forever-modelc"
+
+# hooks: a dictionary of event triggers and actions
+# - optional, default: empty dictionary
+# - the only supported hook is on_startup
+hooks:
+  # on_startup: a dictionary of actions to perform on startup
+  # - optional, default: empty dictionary
+  # - the only supported action is preload
+  on_startup:
+    # preload: a list of model ids to load on startup
+    # - optional, default: empty list
+    # - model names must match keys in the models section
+    # - when preloading multiple models at once, define a group
+    #   otherwise models will be loaded and swapped out
+    preload:
+      - "llama"
@@ -0,0 +1,159 @@
+package main
+
+// created for issue: #252 https://github.com/mostlygeek/llama-swap/issues/252
+// this simple benchmark tool sends a lot of small chat completion requests to llama-swap
+// to make sure all the requests are accounted for.
+//
+// requests can be sent in parallel, and the tool will report the results.
+// usage: go run main.go -baseurl http://localhost:8080/v1 -model llama3 -requests 1000 -par 5
+
+import (
+	"bytes"
+	"flag"
+	"fmt"
+	"io"
+	"log"
+	"net/http"
+	"os"
+	"sync"
+	"time"
+)
+
+func main() {
+	// ----- CLI arguments ----------------------------------------------------
+	var (
+		baseurl         string
+		modelName       string
+		totalRequests   int
+		parallelization int
+	)
+
+	flag.StringVar(&baseurl, "baseurl", "http://localhost:8080/v1", "Base URL of the API (e.g., https://api.example.com)")
+	flag.StringVar(&modelName, "model", "", "Model name to use")
+	flag.IntVar(&totalRequests, "requests", 1, "Total number of requests to send")
+	flag.IntVar(&parallelization, "par", 1, "Maximum number of concurrent requests")
+	flag.Parse()
+
+	if baseurl == "" || modelName == "" {
+		fmt.Println("Error: both -baseurl and -model are required.")
+		flag.Usage()
+		os.Exit(1)
+	}
+	if totalRequests <= 0 {
+		fmt.Println("Error: -requests must be greater than 0.")
+		os.Exit(1)
+	}
+	if parallelization <= 0 {
+		fmt.Println("Error: -par must be greater than 0.")
+		os.Exit(1)
+	}
+
+	// ----- HTTP client -------------------------------------------------------
+	client := &http.Client{
+		Timeout: 30 * time.Second,
+	}
+
+	// ----- Tracking response codes -------------------------------------------
+	statusCounts := make(map[int]int) // map[statusCode]count
+	var mu sync.Mutex                 // protects statusCounts
+
+	// ----- Request queue (buffered channel) ----------------------------------
+	requests := make(chan int, 10) // Buffered channel with capacity 10
+
+	// Goroutine to fill the request queue
+	go func() {
+		for i := 0; i < totalRequests; i++ {
+			requests <- i + 1
+		}
+		close(requests)
+	}()
+
+	// ----- Worker pool -------------------------------------------------------
+	var wg sync.WaitGroup
+	for i := 0; i < parallelization; i++ {
+		wg.Add(1)
+		go func(workerID int) {
+			defer wg.Done()
+
+			for reqID := range requests {
+				// Build request payload as a single line JSON string
+				payload := `{"model":"` + modelName + `","max_tokens":100,"stream":false,"messages":[{"role":"user","content":"write a snake game in python"}]}`
+
+				// Send POST request
+				req, err := http.NewRequest(http.MethodPost,
+					fmt.Sprintf("%s/chat/completions", baseurl),
+					bytes.NewReader([]byte(payload)))
+				if err != nil {
+					log.Printf("[worker %d][req %d] request creation error: %v", workerID, reqID, err)
+					mu.Lock()
+					statusCounts[-1]++
+					mu.Unlock()
+					continue
+				}
+				req.Header.Set("Content-Type", "application/json")
+
+				resp, err := client.Do(req)
+				if err != nil {
+					log.Printf("[worker %d][req %d] HTTP request error: %v", workerID, reqID, err)
+					mu.Lock()
+					statusCounts[-1]++
+					mu.Unlock()
+					continue
+				}
+				io.Copy(io.Discard, resp.Body)
+				resp.Body.Close()
+
+				// Record status code
+				mu.Lock()
+				statusCounts[resp.StatusCode]++
+				mu.Unlock()
+			}
+		}(i + 1)
+	}
+
+	// ----- Status ticker (prints every second) -------------------------------
+	done := make(chan struct{})
+	tickerDone := make(chan struct{})
+	go func() {
+		ticker := time.NewTicker(1 * time.Second)
+		startTime := time.Now()
+		for {
+			select {
+			case <-ticker.C:
+				mu.Lock()
+				// Compute how many requests have completed so far
+				completed := 0
+				for _, cnt := range statusCounts {
+					completed += cnt
+				}
+				// Calculate duration and progress
+				duration := time.Since(startTime)
+				progress := completed * 100 / totalRequests
+				fmt.Printf("Duration: %v, Completed: %d%% requests\n", duration, progress)
+				mu.Unlock()
+			case <-done:
+				duration := time.Since(startTime)
+				fmt.Printf("Duration: %v, Completed: %d%% requests\n", duration, 100)
+				close(tickerDone)
+				return
+			}
+		}
+	}()
+
+	// Wait for all workers to finish
+	wg.Wait()
+	close(done)  // stops the status-update goroutine
+	<-tickerDone // give ticker time to finish / print
+
+	// ----- Summary ------------------------------------------------------------
+	fmt.Println("\n\n=== HTTP response code summary ===")
+	mu.Lock()
+	for code, cnt := range statusCounts {
+		if code == -1 {
+			fmt.Printf("Client-side errors (no HTTP response): %d\n", cnt)
+		} else {
+			fmt.Printf("%d : %d\n", code, cnt)
+		}
+	}
+	mu.Unlock()
+}
@@ -138,6 +138,14 @@ func (c *GroupConfig) UnmarshalYAML(unmarshal func(interface{}) error) error {
 	return nil
 }

+type HooksConfig struct {
+	OnStartup HookOnStartup `yaml:"on_startup"`
+}
+
+type HookOnStartup struct {
+	Preload []string `yaml:"preload"`
+}
+
 type Config struct {
 	HealthCheckTimeout int                    `yaml:"healthCheckTimeout"`
 	LogRequests        bool                   `yaml:"logRequests"`
@@ -155,6 +163,9 @@ type Config struct {

 	// automatic port assignments
 	StartPort int `yaml:"startPort"`
+
+	// hooks, see: #209
+	Hooks HooksConfig `yaml:"hooks"`
 }

 func (c *Config) RealModelName(search string) (string, bool) {
@@ -330,6 +341,22 @@ func LoadConfigFromReader(r io.Reader) (Config, error) {
 		}
 	}

+	// clean up hooks preload
+	if len(config.Hooks.OnStartup.Preload) > 0 {
+		var toPreload []string
+		for _, modelID := range config.Hooks.OnStartup.Preload {
+			modelID = strings.TrimSpace(modelID)
+			if modelID == "" {
+				continue
+			}
+			if real, found := config.RealModelName(modelID); found {
+				toPreload = append(toPreload, real)
+			}
+		}
+
+		config.Hooks.OnStartup.Preload = toPreload
+	}
+
 	return config, nil
 }
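
The preload clean-up in this hunk trims each entry, skips blanks, and keeps only names that resolve to a configured model. A standalone sketch of the same filtering follows. The `realNames` map is a hypothetical stand-in for `Config.RealModelName`, which is assumed here to map an ID or alias to the real model ID.

```go
package main

import (
	"fmt"
	"strings"
)

// cleanPreload returns the real model IDs for the preload entries that can be
// resolved. Blank entries and unknown names are dropped without an error.
func cleanPreload(preload []string, realNames map[string]string) []string {
	var out []string
	for _, id := range preload {
		id = strings.TrimSpace(id)
		if id == "" {
			continue
		}
		if real, ok := realNames[id]; ok {
			out = append(out, real)
		}
	}
	return out
}

func main() {
	// Hypothetical lookup table: "l3" is an alias for the model "llama".
	realNames := map[string]string{"llama": "llama", "l3": "llama"}
	fmt.Println(cleanPreload([]string{" llama ", "", "missing", "l3"}, realNames))
}
```

Note that with this approach a misspelled model name in `preload` disappears silently instead of failing configuration loading, and an ID listed alongside its alias produces a duplicate entry.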

@@ -100,6 +100,9 @@ func TestConfig_LoadPosix(t *testing.T) {
 	content := `
 macros:
  svr-path: "path/to/server"
+hooks:
+  on_startup:
+    preload: ["model1", "model2"]
 models:
  model1:
    cmd: path/to/cmd --arg1 one
@@ -163,6 +166,11 @@ groups:
 		Macros: map[string]string{
 			"svr-path": "path/to/server",
 		},
+		Hooks: HooksConfig{
+			OnStartup: HookOnStartup{
+				Preload: []string{"model1", "model2"},
+			},
+		},
 		Models: map[string]ModelConfig{
 			"model1": {
 				Cmd:           "path/to/cmd --arg1 one",
@@ -0,0 +1,27 @@
+package proxy
+
+import "net/http"
+
+// Custom discard writer that implements http.ResponseWriter but just discards everything
+type DiscardWriter struct {
+	header http.Header
+	status int
+}
+
+func (w *DiscardWriter) Header() http.Header {
+	if w.header == nil {
+		w.header = make(http.Header)
+	}
+	return w.header
+}
+
+func (w *DiscardWriter) Write(data []byte) (int, error) {
+	return len(data), nil
+}
+
+func (w *DiscardWriter) WriteHeader(code int) {
+	w.status = code
+}
+
+// Satisfy the http.Flusher interface for streaming responses
+func (w *DiscardWriter) Flush() {}
@@ -7,6 +7,7 @@ const ChatCompletionStatsEventID = 0x02
 const ConfigFileChangedEventID = 0x03
 const LogDataEventID = 0x04
 const TokenMetricsEventID = 0x05
+const ModelPreloadedEventID = 0x06

 type ProcessStateChangeEvent struct {
 	ProcessName string
@@ -48,3 +49,12 @@ type LogDataEvent struct {
 func (e LogDataEvent) Type() uint32 {
 	return LogDataEventID
 }
+
+type ModelPreloadedEvent struct {
+	ModelName string
+	Success   bool
+}
+
+func (e ModelPreloadedEvent) Type() uint32 {
+	return ModelPreloadedEventID
+}
@@ -13,9 +13,10 @@ import (
 )

 var (
-	nextTestPort int = 12000
-	portMutex    sync.Mutex
-	testLogger   = NewLogMonitorWriter(os.Stdout)
+	nextTestPort        int = 12000
+	portMutex           sync.Mutex
+	testLogger          = NewLogMonitorWriter(os.Stdout)
+	simpleResponderPath = getSimpleResponderPath()
 )

 // Check if the binary exists
@@ -69,13 +70,11 @@ func getTestSimpleResponderConfig(expectedMessage string) ModelConfig {
 }

 func getTestSimpleResponderConfigPort(expectedMessage string, port int) ModelConfig {
-	binaryPath := getSimpleResponderPath()
-
 	// Create a YAML string with just the values we want to set
 	yamlStr := fmt.Sprintf(`
 cmd: '%s --port %d --silent --respond %s'
 proxy: "http://127.0.0.1:%d"
-`, binaryPath, port, expectedMessage, port)
+`, simpleResponderPath, port, expectedMessage, port)

 	var cfg ModelConfig
 	if err := yaml.Unmarshal([]byte(yamlStr), &cfg); err != nil {
@@ -5,12 +5,20 @@ import (
 	"fmt"
 	"io"
 	"net/http"
+	"strings"
 	"time"

 	"github.com/gin-gonic/gin"
 	"github.com/tidwall/gjson"
 )

+type MetricsRecorder struct {
+	metricsMonitor *MetricsMonitor
+	realModelName  string
+	//	isStreaming    bool
+	startTime time.Time
+}
+
 // MetricsMiddleware sets up the MetricsResponseWriter for capturing upstream requests
 func MetricsMiddleware(pm *ProxyManager) gin.HandlerFunc {
 	return func(c *gin.Context) {
@@ -41,48 +49,48 @@ func MetricsMiddleware(pm *ProxyManager) gin.HandlerFunc {
 			metricsRecorder: &MetricsRecorder{
 				metricsMonitor: pm.metricsMonitor,
 				realModelName:  realModelName,
-				isStreaming:    gjson.GetBytes(bodyBytes, "stream").Bool(),
 				startTime:      time.Now(),
 			},
 		}
 		c.Writer = writer
 		c.Next()

-		rec := writer.metricsRecorder
-		rec.processBody(writer.body)
-	}
-}
+		// check for streaming response
+		if strings.Contains(c.Writer.Header().Get("Content-Type"), "text/event-stream") {
+			writer.metricsRecorder.processStreamingResponse(writer.body)
+		} else {
+			writer.metricsRecorder.processNonStreamingResponse(writer.body)
+		}

-type MetricsRecorder struct {
-	metricsMonitor *MetricsMonitor
-	realModelName  string
-	isStreaming    bool
-	startTime      time.Time
-}
-
-// processBody handles response processing after request completes
-func (rec *MetricsRecorder) processBody(body []byte) {
-	if rec.isStreaming {
-		rec.processStreamingResponse(body)
-	} else {
-		rec.processNonStreamingResponse(body)
 	}
 }

 func (rec *MetricsRecorder) parseAndRecordMetrics(jsonData gjson.Result) bool {
 	usage := jsonData.Get("usage")
-	if !usage.Exists() {
+	timings := jsonData.Get("timings")
+	if !usage.Exists() && !timings.Exists() {
 		return false
 	}

 	// default values
-	outputTokens := int(jsonData.Get("usage.completion_tokens").Int())
-	inputTokens := int(jsonData.Get("usage.prompt_tokens").Int())
+	outputTokens := 0
+	inputTokens := 0
+
+	// timings data
 	tokensPerSecond := -1.0
+	promptPerSecond := -1.0
 	durationMs := int(time.Since(rec.startTime).Milliseconds())

+	if usage.Exists() {
+		outputTokens = int(jsonData.Get("usage.completion_tokens").Int())
+		inputTokens = int(jsonData.Get("usage.prompt_tokens").Int())
+	}
+
 	// use llama-server's timing data for tok/sec and duration as it is more accurate
-	if timings := jsonData.Get("timings"); timings.Exists() {
+	if timings.Exists() {
+		inputTokens = int(jsonData.Get("timings.prompt_n").Int())
+		outputTokens = int(jsonData.Get("timings.predicted_n").Int())
+		promptPerSecond = jsonData.Get("timings.prompt_per_second").Float()
 		tokensPerSecond = jsonData.Get("timings.predicted_per_second").Float()
 		durationMs = int(jsonData.Get("timings.prompt_ms").Float() + jsonData.Get("timings.predicted_ms").Float())
 	}
@@ -92,6 +100,7 @@ func (rec *MetricsRecorder) parseAndRecordMetrics(jsonData gjson.Result) bool {
 		Model:           rec.realModelName,
 		InputTokens:     inputTokens,
 		OutputTokens:    outputTokens,
+		PromptPerSecond: promptPerSecond,
 		TokensPerSecond: tokensPerSecond,
 		DurationMs:      durationMs,
 	})
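
The change to `parseAndRecordMetrics` means a response with only a `timings` object (as `/infill` is expected to return) still yields metrics, with `timings` values taking precedence over `usage`. The sketch below restates that precedence using only `encoding/json` rather than gjson. The type and function names are hypothetical, and the sample payload is invented for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

type usageBlock struct {
	PromptTokens     int `json:"prompt_tokens"`
	CompletionTokens int `json:"completion_tokens"`
}

type timingsBlock struct {
	PromptN            int     `json:"prompt_n"`
	PredictedN         int     `json:"predicted_n"`
	PromptPerSecond    float64 `json:"prompt_per_second"`
	PredictedPerSecond float64 `json:"predicted_per_second"`
	PromptMs           float64 `json:"prompt_ms"`
	PredictedMs        float64 `json:"predicted_ms"`
}

// Pointers distinguish "object missing" from "object present but zero".
type upstreamResponse struct {
	Usage   *usageBlock   `json:"usage"`
	Timings *timingsBlock `json:"timings"`
}

type metrics struct {
	InputTokens, OutputTokens        int
	PromptPerSecond, TokensPerSecond float64
	DurationMs                       int
}

// extractMetrics returns false when neither usage nor timings is present.
// fallbackMs is the wall-clock duration used when timings are unavailable.
func extractMetrics(body []byte, fallbackMs int) (metrics, bool) {
	var r upstreamResponse
	if err := json.Unmarshal(body, &r); err != nil {
		return metrics{}, false
	}
	if r.Usage == nil && r.Timings == nil {
		return metrics{}, false
	}
	m := metrics{PromptPerSecond: -1, TokensPerSecond: -1, DurationMs: fallbackMs}
	if r.Usage != nil {
		m.InputTokens = r.Usage.PromptTokens
		m.OutputTokens = r.Usage.CompletionTokens
	}
	if r.Timings != nil { // timings override usage when both exist
		m.InputTokens = r.Timings.PromptN
		m.OutputTokens = r.Timings.PredictedN
		m.PromptPerSecond = r.Timings.PromptPerSecond
		m.TokensPerSecond = r.Timings.PredictedPerSecond
		m.DurationMs = int(r.Timings.PromptMs + r.Timings.PredictedMs)
	}
	return m, true
}

func main() {
	// Invented payload with timings only and no usage object.
	body := []byte(`{"content":"x","timings":{"prompt_n":12,"predicted_n":40,
		"prompt_per_second":300.5,"predicted_per_second":25.0,
		"prompt_ms":40.0,"predicted_ms":1600.0}}`)
	m, ok := extractMetrics(body, 2000)
	fmt.Printf("%v %+v\n", ok, m)
}
```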
@@ -15,6 +15,7 @@ type TokenMetrics struct {
 	Model           string    `json:"model"`
 	InputTokens     int       `json:"input_tokens"`
 	OutputTokens    int       `json:"output_tokens"`
+	PromptPerSecond float64   `json:"prompt_per_second"`
 	TokensPerSecond float64   `json:"tokens_per_second"`
 	DurationMs      int       `json:"duration_ms"`
 }
@@ -15,6 +15,7 @@ import (
 	"time"

 	"github.com/gin-gonic/gin"
+	"github.com/mostlygeek/llama-swap/event"
 	"github.com/tidwall/gjson"
 	"github.com/tidwall/sjson"
 )
@@ -96,6 +97,35 @@ func New(config Config) *ProxyManager {
 	}

 	pm.setupGinEngine()
+
+	// run any startup hooks
+	if len(config.Hooks.OnStartup.Preload) > 0 {
+		// do it in the background, don't block startup -- not sure if good idea yet
+		go func() {
+			discardWriter := &DiscardWriter{}
+			for _, realModelName := range config.Hooks.OnStartup.Preload {
+				proxyLogger.Infof("Preloading model: %s", realModelName)
+				processGroup, _, err := pm.swapProcessGroup(realModelName)
+
+				if err != nil {
+					event.Emit(ModelPreloadedEvent{
+						ModelName: realModelName,
+						Success:   false,
+					})
+					proxyLogger.Errorf("Failed to preload model %s: %v", realModelName, err)
+					continue
+				} else {
+					req, _ := http.NewRequest("GET", "/", nil)
+					processGroup.ProxyRequest(realModelName, discardWriter, req)
+					event.Emit(ModelPreloadedEvent{
+						ModelName: realModelName,
+						Success:   true,
+					})
+				}
+			}
+		}()
+	}
+
 	return pm
 }

@@ -161,11 +191,17 @@ func (pm *ProxyManager) setupGinEngine() {
 	// Support legacy /v1/completions api, see issue #12
 	pm.ginEngine.POST("/v1/completions", mm, pm.proxyOAIHandler)

-	// Support embeddings
+	// Support embeddings and reranking
 	pm.ginEngine.POST("/v1/embeddings", mm, pm.proxyOAIHandler)
+
+	// llama-server's /reranking endpoint + aliases
+	pm.ginEngine.POST("/reranking", mm, pm.proxyOAIHandler)
+	pm.ginEngine.POST("/rerank", mm, pm.proxyOAIHandler)
 	pm.ginEngine.POST("/v1/rerank", mm, pm.proxyOAIHandler)
 	pm.ginEngine.POST("/v1/reranking", mm, pm.proxyOAIHandler)
-	pm.ginEngine.POST("/rerank", mm, pm.proxyOAIHandler)
+
+	// llama-server's /infill endpoint for code infilling
+	pm.ginEngine.POST("/infill", mm, pm.proxyOAIHandler)

 	// Support audio/speech endpoint
 	pm.ginEngine.POST("/v1/audio/speech", pm.proxyOAIHandler)
@@ -132,7 +132,7 @@ func (pm *ProxyManager) apiSendEvents(c *gin.Context) {
 		}
 	}

-	sendMetrics := func(metrics TokenMetrics) {
+	sendMetrics := func(metrics []TokenMetrics) {
 		jsonData, err := json.Marshal(metrics)
 		if err == nil {
 			select {
@@ -168,16 +168,14 @@ func (pm *ProxyManager) apiSendEvents(c *gin.Context) {
 	 * Send Metrics data
 	 */
 	defer event.On(func(e TokenMetricsEvent) {
-		sendMetrics(e.Metrics)
+		sendMetrics([]TokenMetrics{e.Metrics})
 	})()

 	// send initial batch of data
 	sendLogData("proxy", pm.proxyLogger.GetHistory())
 	sendLogData("upstream", pm.upstreamLogger.GetHistory())
 	sendModels()
-	for _, metrics := range pm.metricsMonitor.GetMetrics() {
-		sendMetrics(metrics)
-	}
+	sendMetrics(pm.metricsMonitor.GetMetrics())

 	for {
 		select {
@@ -14,6 +14,7 @@ import (
 	"testing"
 	"time"

+	"github.com/mostlygeek/llama-swap/event"
 	"github.com/stretchr/testify/assert"
 	"github.com/tidwall/gjson"
 )
@@ -832,3 +833,62 @@ func TestProxyManager_HealthEndpoint(t *testing.T) {
 	assert.Equal(t, http.StatusOK, rec.Code)
 	assert.Equal(t, "OK", rec.Body.String())
 }
+
+func TestProxyManager_StartupHooks(t *testing.T) {
+
+	// Use real YAML here: the configuration has gotten more complex and
+	// LoadConfigFromReader() now does a lot more than parse YAML.
+	// Eventually migrate all tests to this approach.
+	configStr := strings.Replace(`
+logLevel: error
+hooks:
+  on_startup:
+    preload:
+      - model1
+      - model2
+groups:
+  preloadTestGroup:
+    swap: false
+    members:
+       - model1
+       - model2
+models:
+  model1:
+    cmd: ${simpleresponderpath} --port ${PORT} --silent --respond model1
+  model2:
+    cmd: ${simpleresponderpath} --port ${PORT} --silent --respond model2
+`, "${simpleresponderpath}", simpleResponderPath, -1)
+
+	// Create a test model configuration
+	config, err := LoadConfigFromReader(strings.NewReader(configStr))
+	if !assert.NoError(t, err, "Invalid configuration") {
+		return
+	}
+
+	preloadChan := make(chan ModelPreloadedEvent, 2) // buffer for 2 expected events
+
+	unsub := event.On(func(e ModelPreloadedEvent) {
+		preloadChan <- e
+	})
+
+	defer unsub()
+
+	// Create the proxy which should trigger preloading
+	proxy := New(config)
+	defer proxy.StopProcesses(StopWaitForInflightRequest)
+
+	for i := 0; i < 2; i++ {
+		select {
+		case <-preloadChan:
+		case <-time.After(5 * time.Second):
+			t.Fatal("timed out waiting for models to preload")
+		}
+	}
+	// make sure they are both loaded
+	_, foundGroup := proxy.processGroups["preloadTestGroup"]
+	if !assert.True(t, foundGroup, "preloadTestGroup should exist") {
+		return
+	}
+	assert.Equal(t, StateReady, proxy.processGroups["preloadTestGroup"].processes["model1"].CurrentState())
+	assert.Equal(t, StateReady, proxy.processGroups["preloadTestGroup"].processes["model2"].CurrentState())
+}
@@ -1,50 +1,78 @@
+import { useEffect, useCallback } from "react";
 import { BrowserRouter as Router, Routes, Route, Navigate, NavLink } from "react-router-dom";
 import { useTheme } from "./contexts/ThemeProvider";
-import { APIProvider } from "./contexts/APIProvider";
+import { useAPI } from "./contexts/APIProvider";
 import LogViewerPage from "./pages/LogViewer";
 import ModelPage from "./pages/Models";
 import ActivityPage from "./pages/Activity";
+import ConnectionStatusIcon from "./components/ConnectionStatus";
 import { RiSunFill, RiMoonFill } from "react-icons/ri";

 function App() {
-  const { isNarrow, toggleTheme, isDarkMode } = useTheme();
+  const { isNarrow, toggleTheme, isDarkMode, appTitle, setAppTitle, setConnectionState } = useTheme();
+  const handleTitleChange = useCallback(
+    (newTitle: string) => {
+      setAppTitle(newTitle.replace(/\n/g, "").trim().substring(0, 64) || "llama-swap");
+    },
+    [setAppTitle]
+  );
+
+  const { connectionStatus } = useAPI();
+
+  // Synchronize the window.title connection state with the actual connection state
+  useEffect(() => {
+    setConnectionState(connectionStatus);
+  }, [connectionStatus]);

  return (
    <Router basename="/ui/">
-      <APIProvider>
-        <div className="flex flex-col h-screen">
-          <nav className="bg-surface border-b border-border p-2 h-[75px]">
-            <div className="flex items-center justify-between mx-auto px-4 h-full">
-              {!isNarrow && <h1 className="flex items-center p-0">llama-swap</h1>}
-              <div className="flex items-center space-x-4">
-                <NavLink to="/" className={({ isActive }) => (isActive ? "navlink active" : "navlink")}>
-                  Logs
-                </NavLink>
-
-                <NavLink to="/models" className={({ isActive }) => (isActive ? "navlink active" : "navlink")}>
-                  Models
-                </NavLink>
-
-                <NavLink to="/activity" className={({ isActive }) => (isActive ? "navlink active" : "navlink")}>
-                  Activity
-                </NavLink>
-                <button className="" onClick={toggleTheme}>
-                  {isDarkMode ? <RiMoonFill /> : <RiSunFill />}
-                </button>
-              </div>
+      <div className="flex flex-col h-screen">
+        <nav className="bg-surface border-b border-border p-2 h-[75px]">
+          <div className="flex items-center justify-between mx-auto px-4 h-full">
+            {!isNarrow && (
+              <h1
+                contentEditable
+                suppressContentEditableWarning
+                className="flex items-center p-0 outline-none hover:bg-gray-100 dark:hover:bg-gray-700 rounded px-1"
+                onBlur={(e) => handleTitleChange(e.currentTarget.textContent || "(set title)")}
+                onKeyDown={(e) => {
+                  if (e.key === "Enter") {
+                    e.preventDefault();
+                    handleTitleChange(e.currentTarget.textContent || "(set title)");
+                    e.currentTarget.blur();
+                  }
+                }}
+              >
+                {appTitle}
+              </h1>
+            )}
+            <div className="flex items-center space-x-4">
+              <NavLink to="/" className={({ isActive }) => (isActive ? "navlink active" : "navlink")}>
+                Logs
+              </NavLink>
+              <NavLink to="/models" className={({ isActive }) => (isActive ? "navlink active" : "navlink")}>
+                Models
+              </NavLink>
+              <NavLink to="/activity" className={({ isActive }) => (isActive ? "navlink active" : "navlink")}>
+                Activity
+              </NavLink>
+              <button className="" onClick={toggleTheme}>
+                {isDarkMode ? <RiMoonFill /> : <RiSunFill />}
+              </button>
+              <ConnectionStatusIcon />
            </div>
-          </nav>
+          </div>
+        </nav>

-          <main className="flex-1 overflow-auto p-4">
-            <Routes>
-              <Route path="/" element={<LogViewerPage />} />
-              <Route path="/models" element={<ModelPage />} />
-              <Route path="/activity" element={<ActivityPage />} />
-              <Route path="*" element={<Navigate to="/" replace />} />
-            </Routes>
-          </main>
-        </div>
-      </APIProvider>
+        <main className="flex-1 overflow-auto p-4">
+          <Routes>
+            <Route path="/" element={<LogViewerPage />} />
+            <Route path="/models" element={<ModelPage />} />
+            <Route path="/activity" element={<ActivityPage />} />
+            <Route path="*" element={<Navigate to="/" replace />} />
+          </Routes>
+        </main>
+      </div>
    </Router>
  );
 }
@@ -0,0 +1,26 @@
+import { useAPI } from "../contexts/APIProvider";
+import { useMemo } from "react";
+
+const ConnectionStatusIcon = () => {
+  const { connectionStatus } = useAPI();
+
+  const eventStatusColor = useMemo(() => {
+    switch (connectionStatus) {
+      case "connected":
+        return "bg-green-500";
+      case "connecting":
+        return "bg-yellow-500";
+      case "disconnected":
+      default:
+        return "bg-red-500";
+    }
+  }, [connectionStatus]);
+
+  return (
+    <div className="flex items-center" title={`event stream: ${connectionStatus}`}>
+      <span className={`inline-block w-3 h-3 rounded-full ${eventStatusColor} mr-2`}></span>
+    </div>
+  );
+};
+
+export default ConnectionStatusIcon;
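The color selection in `ConnectionStatusIcon` can be read on its own as a pure function. The sketch below is illustrative only: `statusColor` is a hypothetical helper name that does not appear in the change, and the `ConnectionState` type is repeated so the snippet is self-contained.

```typescript
// Same union as the ConnectionState type added in this change.
type ConnectionState = "connected" | "connecting" | "disconnected";

// Hypothetical helper (not part of the change): maps a connection state
// to the Tailwind background class used for the status dot.
function statusColor(status: ConnectionState): string {
  switch (status) {
    case "connected":
      return "bg-green-500";
    case "connecting":
      return "bg-yellow-500";
    case "disconnected":
    default:
      // Anything unexpected is treated as disconnected.
      return "bg-red-500";
  }
}
```

Keeping the mapping outside the component would make it testable without rendering, at the cost of one more exported name.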
@@ -1,4 +1,5 @@
 import { useRef, createContext, useState, useContext, useEffect, useCallback, useMemo, type ReactNode } from "react";
+import type { ConnectionState } from "../lib/types";

 type ModelStatus = "ready" | "starting" | "stopping" | "stopped" | "shutdown" | "unknown";
 const LOG_LENGTH_LIMIT = 1024 * 100; /* 100KB of log data */
@@ -20,6 +21,7 @@ interface APIProviderType {
  proxyLogs: string;
  upstreamLogs: string;
  metrics: Metrics[];
+  connectionStatus: ConnectionState;
 }

 interface Metrics {
@@ -28,6 +30,7 @@ interface Metrics {
  model: string;
  input_tokens: number;
  output_tokens: number;
+  prompt_per_second: number;
  tokens_per_second: number;
  duration_ms: number;
 }
@@ -51,6 +54,7 @@ export function APIProvider({ children, autoStartAPIEvents = true }: APIProvider
  const [proxyLogs, setProxyLogs] = useState("");
  const [upstreamLogs, setUpstreamLogs] = useState("");
  const [metrics, setMetrics] = useState<Metrics[]>([]);
+  const [connectionStatus, setConnectionState] = useState<ConnectionState>("disconnected");
  const apiEventSource = useRef<EventSource | null>(null);

  const [models, setModels] = useState<Model[]>([]);
@@ -74,7 +78,20 @@ export function APIProvider({ children, autoStartAPIEvents = true }: APIProvider
    const initialDelay = 1000; // 1 second

    const connect = () => {
+      apiEventSource.current = null;
      const eventSource = new EventSource("/api/events");
+      setConnectionState("connecting");
+
+      eventSource.onopen = () => {
+        // clear everything out on connect to keep things in sync
+        setProxyLogs("");
+        setUpstreamLogs("");
+        setMetrics([]); // clear metrics on reconnect
+        setModels([]); // clear models on reconnect
+        apiEventSource.current = eventSource;
+        retryCount = 0;
+        setConnectionState("connected");
+      };

      eventSource.onmessage = (e: MessageEvent) => {
        try {
@@ -107,9 +124,9 @@ export function APIProvider({ children, autoStartAPIEvents = true }: APIProvider

            case "metrics":
              {
-                const newMetric = JSON.parse(message.data) as Metrics;
+                const newMetrics = JSON.parse(message.data) as Metrics[];
                setMetrics((prevMetrics) => {
-                  return [newMetric, ...prevMetrics];
+                  return [...newMetrics, ...prevMetrics];
                });
              }
              break;
@@ -118,14 +135,14 @@ export function APIProvider({ children, autoStartAPIEvents = true }: APIProvider
          console.error(e.data, err);
        }
      };
+
      eventSource.onerror = () => {
        eventSource.close();
        retryCount++;
        const delay = Math.min(initialDelay * Math.pow(2, retryCount - 1), 5000);
+        setConnectionState("disconnected");
        setTimeout(connect, delay);
      };
-
-      apiEventSource.current = eventSource;
    };

    connect();
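The retry logic in `onerror` is a capped exponential backoff. A minimal sketch of the delay calculation, assuming the same constants as the hunk above (1 second initial delay, 5 second cap); `reconnectDelay` is a hypothetical name introduced here for illustration:

```typescript
// Hypothetical helper: delay in milliseconds before reconnect attempt
// number `retryCount` (1-based, i.e. after retryCount has been incremented).
function reconnectDelay(retryCount: number, initialDelay = 1000, maxDelay = 5000): number {
  // Doubles on each attempt: 1000, 2000, 4000, ... then clamps at maxDelay.
  return Math.min(initialDelay * Math.pow(2, retryCount - 1), maxDelay);
}
```

With these values the first three retries wait 1, 2, and 4 seconds, and every later retry waits 5 seconds until `onopen` resets the counter.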
@@ -193,6 +210,7 @@ export function APIProvider({ children, autoStartAPIEvents = true }: APIProvider
      proxyLogs,
      upstreamLogs,
      metrics,
+      connectionStatus,
    }),
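-    [models, listModels, unloadAllModels, loadModel, enableAPIEvents, proxyLogs, upstreamLogs, metrics]
+    [models, listModels, unloadAllModels, loadModel, enableAPIEvents, proxyLogs, upstreamLogs, metrics, connectionStatus]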
  );
@@ -1,5 +1,6 @@
 import { createContext, useContext, useEffect, type ReactNode, useMemo, useState } from "react";
 import { usePersistentState } from "../hooks/usePersistentState";
+import type { ConnectionState } from "../lib/types";

 type ScreenWidth = "xs" | "sm" | "md" | "lg" | "xl" | "2xl";
 type ThemeContextType = {
@@ -7,6 +8,11 @@ type ThemeContextType = {
  screenWidth: ScreenWidth;
  isNarrow: boolean;
  toggleTheme: () => void;
+
+  // for managing the window title and connection state information
+  appTitle: string;
+  setAppTitle: (title: string) => void;
+  setConnectionState: (state: ConnectionState) => void;
 };

 const ThemeContext = createContext<ThemeContextType | undefined>(undefined);
@@ -16,6 +22,17 @@ type ThemeProviderProps = {
 };

 export function ThemeProvider({ children }: ThemeProviderProps) {
+  const [appTitle, setAppTitle] = usePersistentState("app-title", "llama-swap");
+  const [connectionState, setConnectionState] = useState<ConnectionState>("disconnected");
+
+  /**
+   * Keep document.title in sync with the app title and connection state
+   */
+  useEffect(() => {
+    const connectionIcon = connectionState === "connecting" ? "🟡" : connectionState === "connected" ? "🟢" : "🔴";
+    document.title = connectionIcon + " " + appTitle; // runs on mount and whenever appTitle or connectionState changes
+  }, [appTitle, connectionState]);
+
  const [isDarkMode, setIsDarkMode] = usePersistentState<boolean>("theme", false);
  const [screenWidth, setScreenWidth] = useState<ScreenWidth>("md"); // Default to md

@@ -55,7 +72,19 @@ export function ThemeProvider({ children }: ThemeProviderProps) {
  }, [screenWidth]);

  return (
-    <ThemeContext.Provider value={{ isDarkMode, toggleTheme, screenWidth, isNarrow }}>{children}</ThemeContext.Provider>
+    <ThemeContext.Provider
+      value={{
+        isDarkMode,
+        toggleTheme,
+        screenWidth,
+        isNarrow,
+        appTitle,
+        setAppTitle,
+        setConnectionState,
+      }}
+    >
+      {children}
+    </ThemeContext.Provider>
  );
 }
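The title effect in `ThemeProvider` combines a status emoji with the stored app title. As a sketch, the same rule can be written as a lookup table; `windowTitle` and `ICONS` are hypothetical names not found in the change, and the `Record` type is used so the compiler flags a missing state if the union ever grows.

```typescript
// Same union as the ConnectionState type added in this change.
type ConnectionState = "connected" | "connecting" | "disconnected";

// Hypothetical lookup table: one icon per connection state.
const ICONS: Record<ConnectionState, string> = {
  connected: "🟢",
  connecting: "🟡",
  disconnected: "🔴",
};

// Hypothetical helper: builds the string assigned to document.title.
function windowTitle(appTitle: string, state: ConnectionState): string {
  return `${ICONS[state]} ${appTitle}`;
}
```

The effect in the hunk above would then reduce to assigning `windowTitle(appTitle, connectionState)` to `document.title`.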

@@ -0,0 +1 @@
+export type ConnectionState = "connected" | "connecting" | "disconnected";
@@ -3,11 +3,14 @@ import { createRoot } from "react-dom/client";
 import "./index.css";
 import App from "./App.tsx";
 import { ThemeProvider } from "./contexts/ThemeProvider";
+import { APIProvider } from "./contexts/APIProvider";

 createRoot(document.getElementById("root")!).render(
  <StrictMode>
    <ThemeProvider>
-      <App />
+      <APIProvider>
+        <App />
+      </APIProvider>
    </ThemeProvider>
  </StrictMode>
 );
@@ -1,4 +1,4 @@
-import { useState, useEffect } from "react";
+import { useMemo } from "react";
 import { useAPI } from "../contexts/APIProvider";

 const formatTimestamp = (timestamp: string): string => {
@@ -15,25 +15,10 @@ const formatDuration = (ms: number): string => {

 const ActivityPage = () => {
  const { metrics } = useAPI();
-  const [error, setError] = useState<string | null>(null);
-
-  useEffect(() => {
-    if (metrics.length > 0) {
-      setError(null);
-    }
+  const sortedMetrics = useMemo(() => {
+    return [...metrics].sort((a, b) => b.id - a.id);
  }, [metrics]);

-  if (error) {
-    return (
-      <div className="p-6">
-        <h1 className="text-2xl font-bold mb-4">Activity</h1>
-        <div className="bg-red-50 border border-red-200 rounded-md p-4">
-          <p className="text-red-800">{error}</p>
-        </div>
-      </div>
-    );
-  }
-
  return (
    <div className="p-6">
      <h1 className="text-2xl font-bold mb-4">Activity</h1>
@@ -47,21 +32,25 @@ const ActivityPage = () => {
          <table className="min-w-full divide-y">
            <thead>
              <tr>
+                <th className="px-4 py-3 text-left text-xs font-medium uppercase tracking-wider">Id</th>
                <th className="px-6 py-3 text-left text-xs font-medium uppercase tracking-wider">Timestamp</th>
                <th className="px-6 py-3 text-left text-xs font-medium uppercase tracking-wider">Model</th>
                <th className="px-6 py-3 text-left text-xs font-medium uppercase tracking-wider">Input Tokens</th>
                <th className="px-6 py-3 text-left text-xs font-medium uppercase tracking-wider">Output Tokens</th>
+                <th className="px-6 py-3 text-left text-xs font-medium uppercase tracking-wider">Prompt Processing</th>
                <th className="px-6 py-3 text-left text-xs font-medium uppercase tracking-wider">Generation Speed</th>
                <th className="px-6 py-3 text-left text-xs font-medium uppercase tracking-wider">Duration</th>
              </tr>
            </thead>
            <tbody className="divide-y">
-              {metrics.map((metric, index) => (
-                <tr key={`${metric.id}-${index}`}>
+              {sortedMetrics.map((metric) => (
+                <tr key={`metric_${metric.id}`}>
+                  <td className="px-4 py-4 whitespace-nowrap text-sm">{metric.id + 1 /* un-zero index */}</td>
                  <td className="px-6 py-4 whitespace-nowrap text-sm">{formatTimestamp(metric.timestamp)}</td>
                  <td className="px-6 py-4 whitespace-nowrap text-sm">{metric.model}</td>
                  <td className="px-6 py-4 whitespace-nowrap text-sm">{metric.input_tokens.toLocaleString()}</td>
                  <td className="px-6 py-4 whitespace-nowrap text-sm">{metric.output_tokens.toLocaleString()}</td>
+                  <td className="px-6 py-4 whitespace-nowrap text-sm">{formatSpeed(metric.prompt_per_second)}</td>
                  <td className="px-6 py-4 whitespace-nowrap text-sm">{formatSpeed(metric.tokens_per_second)}</td>
                  <td className="px-6 py-4 whitespace-nowrap text-sm">{formatDuration(metric.duration_ms)}</td>
                </tr>
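Two data-handling steps in this change are easy to miss: metrics events now arrive as arrays that are prepended to the existing list, and the Activity page sorts a copy by `id` so the newest row comes first. A minimal sketch of both, assuming a trimmed-down `Metric` shape (only `id` and `model` are kept here) and the hypothetical helper names `prependBatch` and `sortNewestFirst`:

```typescript
// Trimmed-down stand-in for the Metrics interface; the real one has more fields.
interface Metric {
  id: number;
  model: string;
}

// Hypothetical helper: mirrors the setMetrics updater, which puts the
// incoming batch in front of the previously stored metrics.
function prependBatch(prev: Metric[], batch: Metric[]): Metric[] {
  return [...batch, ...prev];
}

// Hypothetical helper: mirrors the useMemo in ActivityPage. The spread makes
// a copy first because Array.prototype.sort sorts in place.
function sortNewestFirst(metrics: Metric[]): Metric[] {
  return [...metrics].sort((a, b) => b.id - a.id);
}
```

Sorting a copy matters because the `metrics` array lives in React state and should not be mutated during render.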
Author	SHA1	Message	Date
Benson Wong	57803fd3aa	Support llama-server's /infill endpoint (#272): add support for llama-server's /infill endpoint and metrics gathering on the Activities page.	2025-08-27 08:36:05 -07:00
Benson Wong	c55d0cc842	Add docs for model.concurrencyLimit #263 [skip ci]	2025-08-22 16:08:37 -07:00
Benson Wong	7acbaf4712	Add connection status indicator in UI (#260): show connection status as icon in UI title; make connection status event driven	2025-08-20 13:58:24 -07:00
Benson Wong	fcc5ad135a	UI: Allow editing of title (#246): make <h1> title contentEditable; title setting persists across reloads in localStorage	2025-08-17 09:42:06 -07:00
Benson Wong	305e5a0031	improve example config [skip ci]	2025-08-17 09:19:04 -07:00
Benson Wong	04fc67354a	Improve Activity event handling in the UI (#254): fixes #252 (the Activity page showed activity inconsistent with /api/metrics); change data structure for event metrics to array; add event stream connection status indicator	2025-08-15 21:44:08 -07:00
Benson Wong	4662cf7699	add 'unconfirmed bug' as default label in bug-report.md	2025-08-15 15:38:12 -07:00
Benson Wong	5dc6b3e6d9	Add barebones but working implementation of model preload (#209, #235): add config test for Preload hook; improve TestProxyManager_StartupHooks; docs for new hook configuration; add a .dev to .gitignore	2025-08-14 10:27:28 -07:00
Benson Wong	74c69f39ef	Add prompt processing metrics (#250): capture prompt processing metrics; display prompt processing metrics on UI Activity page	2025-08-14 10:02:16 -07:00
Benson Wong	a186318892	Update Readme, Add screenshot for Activities page [skip ci]	2025-08-08 13:39:46 -07:00
Benson Wong	c4e4d5e1e9	Update Readme UI Screenshot [skip ci]	2025-08-08 13:33:47 -07:00