Improve LogMonitor to handle empty writes and ensure buffer immutability

- Add a check to return immediately if the write buffer is empty - Create a copy of new history data to ensure it is immutable - Update the `GetHistory` method to use the `any` type for the buffer interface - Add a test case to verify that the buffer remains unchanged even if the original message is modified after writing
Merge pull request #8 from mostlygeek/improve-upstream-monitoring-issue5
2024-11-02 10:41:23 -07:00 · 2024-11-01 15:28:06 -07:00 · 2024-11-01 14:32:39 -07:00 · 2024-11-01 09:42:37 -07:00 · 2024-10-31 12:16:54 -07:00 · 2024-10-30 21:02:30 -07:00
7 changed files with 330 additions and 30 deletions
@@ -9,6 +9,9 @@ all: mac linux simple-responder
 clean:
 	rm -rf $(BUILD_DIR)

+test:
+	go test -v ./proxy
+
 # Build OSX binary
 mac:
 	@echo "Building Mac binary..."
@@ -1,12 +1,14 @@
 # llama-swap

-[llama.cpp's server](https://github.com/ggerganov/llama.cpp/tree/master/examples/server) can't swap models, so let's swap llama-server instead!
+![llama-swap header image](header.jpeg)
+
+[llama.cpp's server](https://github.com/ggerganov/llama.cpp/tree/master/examples/server) can't swap models on demand. So let's swap the server on demand instead!

 llama-swap is a proxy server that sits in front of llama-server. When a request for `/v1/chat/completions` comes in it will extract the `model` requested and change the underlying llama-server automatically.

 - ✅ easy to deploy: single binary with no dependencies
 - ✅ full control over llama-server's startup settings
- ✅ ❤️ for nvidia P40 users who are rely on llama.cpp for inference
+- ✅ ❤️ for users who are rely on llama.cpp for LLM inference

 ## config.yaml

@@ -20,10 +22,10 @@ healthCheckTimeout: 60
 # define valid model values and the upstream server start
 models:
  "llama":
-    cmd: "llama-server --port 8999 -m Llama-3.2-1B-Instruct-Q4_K_M.gguf"
+    cmd: llama-server --port 8999 -m Llama-3.2-1B-Instruct-Q4_K_M.gguf

-    # Where to proxy to, important it matches this format
-    proxy: "http://127.0.0.1:8999"
+    # where to reach the server started by cmd
+    proxy: http://127.0.0.1:8999

    # aliases model names to use this configuration for
    aliases:
@@ -35,14 +37,19 @@ models:
    #
    # use "none" to skip endpoint checking. This may cause requests to fail
    # until the server is ready
-    checkEndpoint: "/custom-endpoint"
+    checkEndpoint: /custom-endpoint

  "qwen":
    # environment variables to pass to the command
    env:
      - "CUDA_VISIBLE_DEVICES=0"
-    cmd: "llama-server --port 8999 -m path/to/Qwen2.5-1.5B-Instruct-Q4_K_M.gguf"
-    proxy: "http://127.0.0.1:8999"
+
+    # multiline for readability
+    cmd: >
+      llama-server --port 8999
+      --model path/to/Qwen2.5-1.5B-Instruct-Q4_K_M.gguf
+
+    proxy: http://127.0.0.1:8999
 ```

 ## Installation
@@ -52,6 +59,26 @@ models:
    * _Note: Windows currently untested._
 1. Run the binary with `llama-swap --config path/to/config.yaml`

+## Monitoring Logs
+
+The `/logs` endpoint is available to monitor what llama-swap is doing. It will send the last 10KB of logs. Useful for monitoring the output of llama-server. It also supports streaming of logs.
+
+Usage:
+
+```
+# basic, sends up to the last 10KB of logs
+curl http://host/logs'
+
+# add `stream` to stream new logs as they come in
+curl -Ns 'http://host/logs?stream'
+
+# add `skip` to skip history (only useful if used with stream)
+curl -Ns 'http://host/logs?stream&skip'
+
+# will output nothing :)
+curl -Ns 'http://host/logs?skip'
+```
+
 ## Systemd Unit Files

 Use this unit file to start llama-swap on boot. This is only tested on Ubuntu.
@@ -1,6 +1,6 @@
 # Seconds to wait for llama.cpp to be available to serve requests
 # Default (and minimum): 15 seconds
-healthCheckTimeout: 60
+healthCheckTimeout: 15

 models:
  "llama":
@@ -35,7 +35,10 @@ models:
    # until the upstream server is ready for traffic
    checkEndpoint: none

-  # don't use this, just for testing if things are broken
+  # don't use these, just for testing if things are broken
  "broken":
    cmd: models/llama-server-osx --port 8999 -m models/doesnotexist.gguf
-    proxy: http://127.0.0.1:8999
+    proxy: http://127.0.0.1:8999
+  "broken_timeout":
+    cmd: models/llama-server-osx --port 8999 -m models/Qwen2.5-1.5B-Instruct-Q4_K_M.gguf
+    proxy: http://127.0.0.1:9000
@@ -0,0 +1,96 @@
+package proxy
+
+import (
+	"container/ring"
+	"io"
+	"os"
+	"sync"
+)
+
+type LogMonitor struct {
+	clients  map[chan []byte]bool
+	mu       sync.RWMutex
+	buffer   *ring.Ring
+	bufferMu sync.RWMutex
+
+	// typically this can be os.Stdout
+	stdout io.Writer
+}
+
+func NewLogMonitor() *LogMonitor {
+	return NewLogMonitorWriter(os.Stdout)
+}
+
+func NewLogMonitorWriter(stdout io.Writer) *LogMonitor {
+	return &LogMonitor{
+		clients: make(map[chan []byte]bool),
+		buffer:  ring.New(10 * 1024), // keep 10KB of buffered logs
+		stdout:  stdout,
+	}
+}
+
+func (w *LogMonitor) Write(p []byte) (n int, err error) {
+	if len(p) == 0 {
+		return 0, nil
+	}
+
+	n, err = w.stdout.Write(p)
+	if err != nil {
+		return n, err
+	}
+
+	w.bufferMu.Lock()
+	bufferCopy := make([]byte, len(p))
+	copy(bufferCopy, p)
+	w.buffer.Value = bufferCopy
+	w.buffer = w.buffer.Next()
+	w.bufferMu.Unlock()
+
+	w.broadcast(p)
+	return n, nil
+}
+
+func (w *LogMonitor) GetHistory() []byte {
+	w.bufferMu.RLock()
+	defer w.bufferMu.RUnlock()
+
+	var history []byte
+	w.buffer.Do(func(p any) {
+		if p != nil {
+			if content, ok := p.([]byte); ok {
+				history = append(history, content...)
+			}
+		}
+	})
+	return history
+}
+
+func (w *LogMonitor) Subscribe() chan []byte {
+	w.mu.Lock()
+	defer w.mu.Unlock()
+
+	ch := make(chan []byte, 100)
+	w.clients[ch] = true
+	return ch
+}
+
+func (w *LogMonitor) Unsubscribe(ch chan []byte) {
+	w.mu.Lock()
+	defer w.mu.Unlock()
+
+	delete(w.clients, ch)
+	close(ch)
+}
+
+func (w *LogMonitor) broadcast(msg []byte) {
+	w.mu.RLock()
+	defer w.mu.RUnlock()
+
+	for client := range w.clients {
+		select {
+		case client <- msg:
+		default:
+			// If client buffer is full, skip
+		}
+	}
+}
@@ -0,0 +1,95 @@
+package proxy
+
+import (
+	"bytes"
+	"io"
+	"sync"
+	"testing"
+)
+
+func TestLogMonitor(t *testing.T) {
+	logMonitor := NewLogMonitorWriter(io.Discard)
+
+	// Test subscription
+	client1 := logMonitor.Subscribe()
+	client2 := logMonitor.Subscribe()
+
+	defer logMonitor.Unsubscribe(client1)
+	defer logMonitor.Unsubscribe(client2)
+
+	client1Messages := make([]byte, 0)
+	client2Messages := make([]byte, 0)
+
+	var wg sync.WaitGroup
+	wg.Add(1)
+
+	go func() {
+		defer wg.Done()
+		for {
+			select {
+			case data := <-client1:
+				client1Messages = append(client1Messages, data...)
+			case data := <-client2:
+				client2Messages = append(client2Messages, data...)
+			default:
+				return
+			}
+		}
+	}()
+
+	logMonitor.Write([]byte("1"))
+	logMonitor.Write([]byte("2"))
+	logMonitor.Write([]byte("3"))
+
+	// Wait for the goroutine to finish
+	wg.Wait()
+
+	// Check the buffer
+	expectedHistory := "123"
+	history := string(logMonitor.GetHistory())
+
+	if history != expectedHistory {
+		t.Errorf("Expected history: %s, got: %s", expectedHistory, history)
+	}
+
+	c1Data := string(client1Messages)
+	if c1Data != expectedHistory {
+		t.Errorf("Client1 expected %s, got: %s", expectedHistory, c1Data)
+	}
+
+	c2Data := string(client2Messages)
+	if c2Data != expectedHistory {
+		t.Errorf("Client2 expected %s, got: %s", expectedHistory, c2Data)
+	}
+}
+
+func TestWrite_ImmutableBuffer(t *testing.T) {
+	// Create a new LogMonitor instance
+	lm := NewLogMonitorWriter(io.Discard)
+
+	// Prepare a message to write
+	msg := []byte("Hello, World!")
+	lenmsg := len(msg)
+
+	// Write the message to the LogMonitor
+	n, err := lm.Write(msg)
+	if err != nil {
+		t.Fatalf("Write failed: %v", err)
+	}
+
+	if n != lenmsg {
+		t.Errorf("Expected %d bytes written but got %d", lenmsg, n)
+	}
+
+	// Change the original message
+	msg[0] = 'B' // This should not affect the buffer
+
+	// Get the history from the LogMonitor
+	history := lm.GetHistory()
+
+	// Check that the history contains the original message, not the modified one
+	expected := []byte("Hello, World!")
+	if !bytes.Equal(history, expected) {
+		t.Errorf("Expected history to be %q, got %q", expected, history)
+	}
+}
@@ -8,7 +8,6 @@ import (
 	"io"
 	"net/http"
 	"net/url"
-	"os"
 	"os/exec"
 	"strings"
 	"sync"
@@ -22,10 +21,11 @@ type ProxyManager struct {
 	config        *Config
 	currentCmd    *exec.Cmd
 	currentConfig ModelConfig
+	logMonitor    *LogMonitor
 }

 func New(config *Config) *ProxyManager {
-	return &ProxyManager{config: config}
+	return &ProxyManager{config: config, logMonitor: NewLogMonitor()}
 }

 func (pm *ProxyManager) HandleFunc(w http.ResponseWriter, r *http.Request) {
@@ -36,12 +36,55 @@ func (pm *ProxyManager) HandleFunc(w http.ResponseWriter, r *http.Request) {
 		pm.proxyChatRequest(w, r)
 	} else if r.URL.Path == "/v1/models" {
 		pm.listModels(w, r)
+	} else if r.URL.Path == "/logs" {
+		pm.streamLogs(w, r)
 	} else {
 		pm.proxyRequest(w, r)
 	}
 }

-func (pm *ProxyManager) listModels(w http.ResponseWriter, r *http.Request) {
+func (pm *ProxyManager) streamLogs(w http.ResponseWriter, r *http.Request) {
+	w.Header().Set("Content-Type", "text/plain")
+	w.Header().Set("Transfer-Encoding", "chunked")
+	w.Header().Set("X-Content-Type-Options", "nosniff")
+
+	ch := pm.logMonitor.Subscribe()
+	defer pm.logMonitor.Unsubscribe(ch)
+
+	notify := r.Context().Done()
+	flusher, ok := w.(http.Flusher)
+	if !ok {
+		http.Error(w, "Streaming unsupported", http.StatusInternalServerError)
+		return
+	}
+
+	skipHistory := r.URL.Query().Has("skip")
+	if !skipHistory {
+		// Send history first
+		history := pm.logMonitor.GetHistory()
+		if len(history) != 0 {
+			w.Write(history)
+			flusher.Flush()
+		}
+	}
+
+	if !r.URL.Query().Has("stream") {
+		return
+	}
+
+	// Stream new logs
+	for {
+		select {
+		case msg := <-ch:
+			w.Write(msg)
+			flusher.Flush()
+		case <-notify:
+			return
+		}
+	}
+}
+
+func (pm *ProxyManager) listModels(w http.ResponseWriter, _ *http.Request) {
 	data := []interface{}{}
 	for id := range pm.config.Models {
 		data = append(data, map[string]interface{}{
@@ -92,8 +135,12 @@ func (pm *ProxyManager) swapModel(requestedModel string) error {
 		return fmt.Errorf("unable to get sanitized command: %v", err)
 	}
 	cmd := exec.Command(args[0], args[1:]...)
-	cmd.Stdout = os.Stdout
-	cmd.Stderr = os.Stderr
+
+	// logMonitor only writes to stdout
+	// so the upstream's stderr will go to os.Stdout
+	cmd.Stdout = pm.logMonitor
+	cmd.Stderr = pm.logMonitor
+
 	cmd.Env = modelConfig.Env

 	err = cmd.Start()
@@ -102,14 +149,28 @@ func (pm *ProxyManager) swapModel(requestedModel string) error {
 	}
 	pm.currentCmd = cmd

-	if err := pm.checkHealthEndpoint(); err != nil {
+	// watch for the command to exist
+	cmdCtx, cancel := context.WithCancelCause(context.Background())
+
+	// monitor the command's exist status
+	go func() {
+		err := cmd.Wait()
+		if err != nil {
+			cancel(fmt.Errorf("command [%s] %s", strings.Join(cmd.Args, " "), err.Error()))
+		} else {
+			cancel(nil)
+		}
+	}()
+
+	// wait for checkHealthEndpoint
+	if err := pm.checkHealthEndpoint(cmdCtx); err != nil {
 		return err
 	}

 	return nil
 }

-func (pm *ProxyManager) checkHealthEndpoint() error {
+func (pm *ProxyManager) checkHealthEndpoint(cmdCtx context.Context) error {

 	if pm.currentConfig.Proxy == "" {
 		return fmt.Errorf("no upstream available to check /health")
@@ -132,6 +193,7 @@ func (pm *ProxyManager) checkHealthEndpoint() error {
 	if err != nil {
 		return fmt.Errorf("failed to create health url with with %s and path %s", proxyTo, checkEndpoint)
 	}
+
 	client := &http.Client{}
 	startTime := time.Now()

@@ -140,33 +202,47 @@ func (pm *ProxyManager) checkHealthEndpoint() error {
 		if err != nil {
 			return err
 		}
-		ctx, cancel := context.WithTimeout(req.Context(), 250*time.Millisecond)
+
+		ctx, cancel := context.WithTimeout(cmdCtx, 250*time.Millisecond)
 		defer cancel()
 		req = req.WithContext(ctx)
 		resp, err := client.Do(req)
-		if err != nil {
-			if strings.Contains(err.Error(), "connection refused") {

-				// if TCP dial can't connect any HTTP response after 5 seconds
-				// exit quickly.
-				if time.Since(startTime) > 5*time.Second {
-					return fmt.Errorf("health check endpoint took more than 5 seconds to respond")
-				}
+		ttl := (maxDuration - time.Since(startTime)).Seconds()
+
+		if err != nil {
+			// check if the context was cancelled
+			select {
+			case <-ctx.Done():
+				return context.Cause(ctx)
+			default:
 			}

-			if time.Since(startTime) >= maxDuration {
+			// wait a bit longer for TCP connection issues
+			if strings.Contains(err.Error(), "connection refused") {
+				fmt.Fprintf(pm.logMonitor, "Connection refused on %s, ttl %.0fs\n", healthURL, ttl)
+
+				time.Sleep(5 * time.Second)
+			} else {
+				time.Sleep(time.Second)
+			}
+
+			if ttl < 0 {
 				return fmt.Errorf("failed to check health from: %s", healthURL)
 			}
-			time.Sleep(time.Second)
+
 			continue
 		}
+
 		defer resp.Body.Close()
 		if resp.StatusCode == http.StatusOK {
 			return nil
 		}
-		if time.Since(startTime) >= maxDuration {
+
+		if ttl < 0 {
 			return fmt.Errorf("failed to check health from: %s", healthURL)
 		}
+
 		time.Sleep(time.Second)
 	}
 }
@@ -189,7 +265,7 @@ func (pm *ProxyManager) proxyChatRequest(w http.ResponseWriter, r *http.Request)
 	}

 	if err := pm.swapModel(model); err != nil {
-		http.Error(w, fmt.Sprintf("unable to swap to model: %s", err.Error()), http.StatusNotFound)
+		http.Error(w, fmt.Sprintf("unable to swap to model, %s", err.Error()), http.StatusNotFound)
 		return
 	}
Author	SHA1	Message	Date
Benson Wong	63d4a7d0eb	Improve LogMonitor to handle empty writes and ensure buffer immutability - Add a check to return immediately if the write buffer is empty - Create a copy of new history data to ensure it is immutable - Update the `GetHistory` method to use the `any` type for the buffer interface - Add a test case to verify that the buffer remains unchanged even if the original message is modified after writing	2024-11-02 10:41:23 -07:00
Benson Wong	f45469f7ff	Merge pull request #8 from mostlygeek/improve-upstream-monitoring-issue5 Improvements to handling of the upstream process so errors happen whenever one of these is first: the health check timeout is reached waiting for the upstream process to be ready the upstream process exits unexpectedly With this change llama-swap is more compatible with use cases like containerized upstream services (#5) which pull the container before HTTP endpoints are ready.	2024-11-01 15:28:06 -07:00
Benson Wong	34f9fd7340	Improve timeout and exit handling of child processes. fix #3 and #5 llama-swap only waited a maximum of 5 seconds for an upstream HTTP server to be available. If it took longer than that it will error out the request. Now it will wait up to the configured healthCheckTimeout or the upstream process unexpectedly exits.	2024-11-01 14:32:39 -07:00
Benson Wong	8448efa7fc	revise health check logic to not error on 5 second timeout	2024-11-01 09:42:37 -07:00
Benson Wong	8cf2a389d8	Refactor log implementation - use []byte instead of unnecessary string conversions - make LogManager.Broadcast private - make LogManager.GetHistory public - add tests	2024-10-31 12:16:54 -07:00
Benson Wong	0f133f5b74	Add /logs endpoint to monitor upstream processes - outputs last 10KB of logs from upstream processes - supports streaming	2024-10-30 21:02:30 -07:00
Benson Wong	1510b3fbd9	clean up README	2024-10-22 10:37:45 -07:00
Benson Wong	0f8a8e70f1	add header image	2024-10-22 10:30:30 -07:00