Merge pull request #8 from mostlygeek/improve-upstream-monitoring-issue5

Improvements to handling of the upstream process so errors happen whenever one of these is first: the health check timeout is reached waiting for the upstream process to be ready the upstream process exits unexpectedly With this change llama-swap is more compatible with use cases like containerized upstream services (#5) which pull the container before HTTP endpoints are ready.
Improve timeout and exit handling of child processes. fix #3 and #5
2024-11-01 15:28:06 -07:00 · 2024-11-01 14:32:39 -07:00 · 2024-11-01 09:42:37 -07:00 · 2024-10-31 12:16:54 -07:00 · 2024-10-30 21:02:30 -07:00 · 2024-10-22 10:37:45 -07:00
7 changed files with 314 additions and 30 deletions
@@ -9,6 +9,9 @@ all: mac linux simple-responder
 clean:
 	rm -rf $(BUILD_DIR)
 test:
 	go test -v ./proxy
 # Build OSX binary
 mac:
 	@echo "Building Mac binary..."
@@ -1,12 +1,14 @@
 # llama-swap
-[llama.cpp's server](https://github.com/ggerganov/llama.cpp/tree/master/examples/server) can't swap models, so let's swap llama-server instead!
+![llama-swap header image](header.jpeg)
 [llama.cpp's server](https://github.com/ggerganov/llama.cpp/tree/master/examples/server) can't swap models on demand. So let's swap the server on demand instead!
 llama-swap is a proxy server that sits in front of llama-server. When a request for `/v1/chat/completions` comes in it will extract the `model` requested and change the underlying llama-server automatically.
 - ✅ easy to deploy: single binary with no dependencies
 - ✅ full control over llama-server's startup settings
- ✅ ❤️ for nvidia P40 users who are rely on llama.cpp for inference
+- ✅ ❤️ for users who are rely on llama.cpp for LLM inference
 ## config.yaml
@@ -20,10 +22,10 @@ healthCheckTimeout: 60
 # define valid model values and the upstream server start
 models:
  "llama":
-    cmd: "llama-server --port 8999 -m Llama-3.2-1B-Instruct-Q4_K_M.gguf"
+    cmd: llama-server --port 8999 -m Llama-3.2-1B-Instruct-Q4_K_M.gguf
-    # Where to proxy to, important it matches this format
+    # where to reach the server started by cmd
-    proxy: "http://127.0.0.1:8999"
+    proxy: http://127.0.0.1:8999
    # aliases model names to use this configuration for
    aliases:
@@ -35,14 +37,19 @@ models:
    #
    # use "none" to skip endpoint checking. This may cause requests to fail
    # until the server is ready
-    checkEndpoint: "/custom-endpoint"
+    checkEndpoint: /custom-endpoint
  "qwen":
    # environment variables to pass to the command
    env:
      - "CUDA_VISIBLE_DEVICES=0"
-    cmd: "llama-server --port 8999 -m path/to/Qwen2.5-1.5B-Instruct-Q4_K_M.gguf"
+
-    proxy: "http://127.0.0.1:8999"
+    # multiline for readability
    cmd: >
      llama-server --port 8999
      --model path/to/Qwen2.5-1.5B-Instruct-Q4_K_M.gguf
    proxy: http://127.0.0.1:8999
 ```
 ## Installation
@@ -52,6 +59,26 @@ models:
    * _Note: Windows currently untested._
 1. Run the binary with `llama-swap --config path/to/config.yaml`
 ## Monitoring Logs
 The `/logs` endpoint is available to monitor what llama-swap is doing. It will send the last 10KB of logs. Useful for monitoring the output of llama-server. It also supports streaming of logs.
 Usage:
 ```
 # basic, sends up to the last 10KB of logs
 curl http://host/logs'
 # add `stream` to stream new logs as they come in
 curl -Ns 'http://host/logs?stream'
 # add `skip` to skip history (only useful if used with stream)
 curl -Ns 'http://host/logs?stream&skip'
 # will output nothing :)
 curl -Ns 'http://host/logs?skip'
 ```
 ## Systemd Unit Files
 Use this unit file to start llama-swap on boot. This is only tested on Ubuntu.
@@ -1,6 +1,6 @@
 # Seconds to wait for llama.cpp to be available to serve requests
 # Default (and minimum): 15 seconds
-healthCheckTimeout: 60
+healthCheckTimeout: 15
 models:
  "llama":
@@ -35,7 +35,10 @@ models:
    # until the upstream server is ready for traffic
    checkEndpoint: none
-  # don't use this, just for testing if things are broken
+  # don't use these, just for testing if things are broken
  "broken":
    cmd: models/llama-server-osx --port 8999 -m models/doesnotexist.gguf
    proxy: http://127.0.0.1:8999
  "broken_timeout":
    cmd: models/llama-server-osx --port 8999 -m models/Qwen2.5-1.5B-Instruct-Q4_K_M.gguf
    proxy: http://127.0.0.1:9000
@@ -0,0 +1,90 @@
 package proxy
 import (
 	"container/ring"
 	"io"
 	"os"
 	"sync"
 )
 type LogMonitor struct {
 	clients  map[chan []byte]bool
 	mu       sync.RWMutex
 	buffer   *ring.Ring
 	bufferMu sync.RWMutex
 	// typically this can be os.Stdout
 	stdout io.Writer
 }
 func NewLogMonitor() *LogMonitor {
 	return NewLogMonitorWriter(os.Stdout)
 }
 func NewLogMonitorWriter(stdout io.Writer) *LogMonitor {
 	return &LogMonitor{
 		clients: make(map[chan []byte]bool),
 		buffer:  ring.New(10 * 1024), // keep 10KB of buffered logs
 		stdout:  stdout,
 	}
 }
 func (w *LogMonitor) Write(p []byte) (n int, err error) {
 	n, err = w.stdout.Write(p)
 	if err != nil {
 		return n, err
 	}
 	w.bufferMu.Lock()
 	w.buffer.Value = p
 	w.buffer = w.buffer.Next()
 	w.bufferMu.Unlock()
 	w.broadcast(p)
 	return n, nil
 }
 func (w *LogMonitor) GetHistory() []byte {
 	w.bufferMu.RLock()
 	defer w.bufferMu.RUnlock()
 	var history []byte
 	w.buffer.Do(func(p interface{}) {
 		if p != nil {
 			if content, ok := p.([]byte); ok {
 				history = append(history, content...)
 			}
 		}
 	})
 	return history
 }
 func (w *LogMonitor) Subscribe() chan []byte {
 	w.mu.Lock()
 	defer w.mu.Unlock()
 	ch := make(chan []byte, 100)
 	w.clients[ch] = true
 	return ch
 }
 func (w *LogMonitor) Unsubscribe(ch chan []byte) {
 	w.mu.Lock()
 	defer w.mu.Unlock()
 	delete(w.clients, ch)
 	close(ch)
 }
 func (w *LogMonitor) broadcast(msg []byte) {
 	w.mu.RLock()
 	defer w.mu.RUnlock()
 	for client := range w.clients {
 		select {
 		case client <- msg:
 		default:
 			// If client buffer is full, skip
 		}
 	}
 }
@@ -0,0 +1,63 @@
 package proxy
 import (
 	"io"
 	"sync"
 	"testing"
 )
 func TestLogMonitor(t *testing.T) {
 	logMonitor := NewLogMonitorWriter(io.Discard)
 	// Test subscription
 	client1 := logMonitor.Subscribe()
 	client2 := logMonitor.Subscribe()
 	defer logMonitor.Unsubscribe(client1)
 	defer logMonitor.Unsubscribe(client2)
 	client1Messages := make([]byte, 0)
 	client2Messages := make([]byte, 0)
 	var wg sync.WaitGroup
 	wg.Add(1)
 	go func() {
 		defer wg.Done()
 		for {
 			select {
 			case data := <-client1:
 				client1Messages = append(client1Messages, data...)
 			case data := <-client2:
 				client2Messages = append(client2Messages, data...)
 			default:
 				return
 			}
 		}
 	}()
 	logMonitor.Write([]byte("1"))
 	logMonitor.Write([]byte("2"))
 	logMonitor.Write([]byte("3"))
 	// Wait for the goroutine to finish
 	wg.Wait()
 	// Check the buffer
 	expectedHistory := "123"
 	history := string(logMonitor.GetHistory())
 	if history != expectedHistory {
 		t.Errorf("Expected history: %s, got: %s", expectedHistory, history)
 	}
 	c1Data := string(client1Messages)
 	if c1Data != expectedHistory {
 		t.Errorf("Client1 expected %s, got: %s", expectedHistory, c1Data)
 	}
 	c2Data := string(client2Messages)
 	if c2Data != expectedHistory {
 		t.Errorf("Client2 expected %s, got: %s", expectedHistory, c2Data)
 	}
 }
@@ -8,7 +8,6 @@ import (
 	"io"
 	"net/http"
 	"net/url"
 	"os"
 	"os/exec"
 	"strings"
 	"sync"
@@ -22,24 +21,90 @@ type ProxyManager struct {
 	config        *Config
 	currentCmd    *exec.Cmd
 	currentConfig ModelConfig
 	logMonitor    *LogMonitor
 }
 func New(config *Config) *ProxyManager {
-	return &ProxyManager{config: config}
+	return &ProxyManager{config: config, logMonitor: NewLogMonitor()}
 }
 func (pm *ProxyManager) HandleFunc(w http.ResponseWriter, r *http.Request) {
 	// https://github.com/ggerganov/llama.cpp/blob/master/examples/server/README.md#api-endpoints
 	if r.URL.Path == "/v1/chat/completions" {
 		// extracts the `model` from json body
 		pm.proxyChatRequest(w, r)
 	} else if r.URL.Path == "/v1/models" {
 		pm.listModels(w, r)
 	} else if r.URL.Path == "/logs" {
 		pm.streamLogs(w, r)
 	} else {
 		pm.proxyRequest(w, r)
 	}
 }
 func (pm *ProxyManager) streamLogs(w http.ResponseWriter, r *http.Request) {
 	w.Header().Set("Content-Type", "text/plain")
 	w.Header().Set("Transfer-Encoding", "chunked")
 	w.Header().Set("X-Content-Type-Options", "nosniff")
 	ch := pm.logMonitor.Subscribe()
 	defer pm.logMonitor.Unsubscribe(ch)
 	notify := r.Context().Done()
 	flusher, ok := w.(http.Flusher)
 	if !ok {
 		http.Error(w, "Streaming unsupported", http.StatusInternalServerError)
 		return
 	}
 	skipHistory := r.URL.Query().Has("skip")
 	if !skipHistory {
 		// Send history first
 		history := pm.logMonitor.GetHistory()
 		if len(history) != 0 {
 			w.Write(history)
 			flusher.Flush()
 		}
 	}
 	if !r.URL.Query().Has("stream") {
 		return
 	}
 	// Stream new logs
 	for {
 		select {
 		case msg := <-ch:
 			w.Write(msg)
 			flusher.Flush()
 		case <-notify:
 			return
 		}
 	}
 }
 func (pm *ProxyManager) listModels(w http.ResponseWriter, _ *http.Request) {
 	data := []interface{}{}
 	for id := range pm.config.Models {
 		data = append(data, map[string]interface{}{
 			"id":       id,
 			"object":   "model",
 			"created":  time.Now().Unix(),
 			"owned_by": "llama-swap",
 		})
 	}
 	// Set the Content-Type header to application/json
 	w.Header().Set("Content-Type", "application/json")
 	// Encode the data as JSON and write it to the response writer
 	if err := json.NewEncoder(w).Encode(map[string]interface{}{"data": data}); err != nil {
 		http.Error(w, "Error encoding JSON", http.StatusInternalServerError)
 		return
 	}
 }
 func (pm *ProxyManager) swapModel(requestedModel string) error {
 	pm.Lock()
 	defer pm.Unlock()
@@ -70,8 +135,12 @@ func (pm *ProxyManager) swapModel(requestedModel string) error {
 		return fmt.Errorf("unable to get sanitized command: %v", err)
 	}
 	cmd := exec.Command(args[0], args[1:]...)
-	cmd.Stdout = os.Stdout
+
-	cmd.Stderr = os.Stderr
+	// logMonitor only writes to stdout
 	// so the upstream's stderr will go to os.Stdout
 	cmd.Stdout = pm.logMonitor
 	cmd.Stderr = pm.logMonitor
 	cmd.Env = modelConfig.Env
 	err = cmd.Start()
@@ -80,14 +149,28 @@ func (pm *ProxyManager) swapModel(requestedModel string) error {
 	}
 	pm.currentCmd = cmd
-	if err := pm.checkHealthEndpoint(); err != nil {
+	// watch for the command to exist
 	cmdCtx, cancel := context.WithCancelCause(context.Background())
 	// monitor the command's exist status
 	go func() {
 		err := cmd.Wait()
 		if err != nil {
 			cancel(fmt.Errorf("command [%s] %s", strings.Join(cmd.Args, " "), err.Error()))
 		} else {
 			cancel(nil)
 		}
 	}()
 	// wait for checkHealthEndpoint
 	if err := pm.checkHealthEndpoint(cmdCtx); err != nil {
 		return err
 	}
 	return nil
 }
-func (pm *ProxyManager) checkHealthEndpoint() error {
+func (pm *ProxyManager) checkHealthEndpoint(cmdCtx context.Context) error {
 	if pm.currentConfig.Proxy == "" {
 		return fmt.Errorf("no upstream available to check /health")
@@ -110,6 +193,7 @@ func (pm *ProxyManager) checkHealthEndpoint() error {
 	if err != nil {
 		return fmt.Errorf("failed to create health url with with %s and path %s", proxyTo, checkEndpoint)
 	}
 	client := &http.Client{}
 	startTime := time.Now()
@@ -118,33 +202,47 @@ func (pm *ProxyManager) checkHealthEndpoint() error {
 		if err != nil {
 			return err
 		}
-		ctx, cancel := context.WithTimeout(req.Context(), 250*time.Millisecond)
+
 		ctx, cancel := context.WithTimeout(cmdCtx, 250*time.Millisecond)
 		defer cancel()
 		req = req.WithContext(ctx)
 		resp, err := client.Do(req)
 		ttl := (maxDuration - time.Since(startTime)).Seconds()
 		if err != nil {
 			// check if the context was cancelled
 			select {
 			case <-ctx.Done():
 				return context.Cause(ctx)
 			default:
 			}
 			// wait a bit longer for TCP connection issues
 			if strings.Contains(err.Error(), "connection refused") {
 				fmt.Fprintf(pm.logMonitor, "Connection refused on %s, ttl %.0fs\n", healthURL, ttl)
-				// if TCP dial can't connect any HTTP response after 5 seconds
+				time.Sleep(5 * time.Second)
-				// exit quickly.
+			} else {
-				if time.Since(startTime) > 5*time.Second {
+				time.Sleep(time.Second)
 					return fmt.Errorf("health check endpoint took more than 5 seconds to respond")
 				}
 			}
-			if time.Since(startTime) >= maxDuration {
+			if ttl < 0 {
 				return fmt.Errorf("failed to check health from: %s", healthURL)
 			}
-			time.Sleep(time.Second)
+
 			continue
 		}
 		defer resp.Body.Close()
 		if resp.StatusCode == http.StatusOK {
 			return nil
 		}
-		if time.Since(startTime) >= maxDuration {
+
 		if ttl < 0 {
 			return fmt.Errorf("failed to check health from: %s", healthURL)
 		}
 		time.Sleep(time.Second)
 	}
 }
@@ -167,7 +265,7 @@ func (pm *ProxyManager) proxyChatRequest(w http.ResponseWriter, r *http.Request)
 	}
 	if err := pm.swapModel(model); err != nil {
-		http.Error(w, fmt.Sprintf("unable to swap to model: %s", err.Error()), http.StatusNotFound)
+		http.Error(w, fmt.Sprintf("unable to swap to model, %s", err.Error()), http.StatusNotFound)
 		return
 	}
Author	SHA1	Message	Date
Benson Wong	f45469f7ff	Merge pull request #8 from mostlygeek/improve-upstream-monitoring-issue5 Improvements to handling of the upstream process so errors happen whenever one of these is first: the health check timeout is reached waiting for the upstream process to be ready the upstream process exits unexpectedly With this change llama-swap is more compatible with use cases like containerized upstream services (#5) which pull the container before HTTP endpoints are ready.	2024-11-01 15:28:06 -07:00
Benson Wong	34f9fd7340	Improve timeout and exit handling of child processes. fix #3 and #5 llama-swap only waited a maximum of 5 seconds for an upstream HTTP server to be available. If it took longer than that it will error out the request. Now it will wait up to the configured healthCheckTimeout or the upstream process unexpectedly exits.	2024-11-01 14:32:39 -07:00
Benson Wong	8448efa7fc	revise health check logic to not error on 5 second timeout	2024-11-01 09:42:37 -07:00
Benson Wong	8cf2a389d8	Refactor log implementation - use []byte instead of unnecessary string conversions - make LogManager.Broadcast private - make LogManager.GetHistory public - add tests	2024-10-31 12:16:54 -07:00
Benson Wong	0f133f5b74	Add /logs endpoint to monitor upstream processes - outputs last 10KB of logs from upstream processes - supports streaming	2024-10-30 21:02:30 -07:00
Benson Wong	1510b3fbd9	clean up README	2024-10-22 10:37:45 -07:00
Benson Wong	0f8a8e70f1	add header image	2024-10-22 10:30:30 -07:00
Benson Wong	6c3819022c	Add compatibility with OpenAI /v1/models endpoint to list models	2024-10-21 15:38:12 -07:00