change profile split character to : (colon) (#21 )

- change from `/` to `:` for multiple models loaded as part of a profile - breaking change now, but allows for more compatibility with other inference engines that may have model references like `coding:Qwen/Qwen-2.5-Coder-32B`
improve cmd parsing (#22 )
2024-12-01 09:10:50 -08:00 · 2024-12-01 09:02:58 -08:00 · 2024-11-30 15:24:42 -08:00 · 2024-11-28 22:12:44 -08:00 · 2024-11-28 22:07:22 -08:00 · 2024-11-28 22:06:29 -08:00
13 changed files with 534 additions and 96 deletions
@@ -8,12 +8,13 @@ Features:

 - ✅ Easy to deploy: single binary with no dependencies
 - ✅ Single yaml configuration file
- ✅ Automatically switching between models
+- ✅ Automatic switching between models
 - ✅ Full control over llama.cpp server settings per model
 - ✅ OpenAI API support (`v1/completions` and `v1/chat/completions`)
 - ✅ Multiple GPU support
 - ✅ Run multiple models at once with `profiles`
 - ✅ Remote log monitoring at `/log`
+- ✅ Automatic unloading of models from GPUs after timeout

 ## config.yaml

@@ -71,6 +72,8 @@ profiles:
    - "llama"
 ```

+More [examples](examples/README.md) are available for different use cases.
+
 ## Installation

 1. Create a configuration file, see [config.example.yaml](config.example.yaml)
@@ -0,0 +1,9 @@
+# Example Configurations
+
+Learning by example is best.
+
+Here in the `examples/` folder are llama-swap configurations that can be used on your local LLM server.
+
+## List
+
+* [Speculative Decoding](speculative-decoding/README.md) - using a small draft model can increase inference speeds from 20% to 40%. This example includes a configurations Qwen2.5-Coder-32B (2.5x increase) and Llama-3.1-70B (1.4x increase) in the best cases.
@@ -0,0 +1,124 @@
+# Speculative Decoding
+
+Speculative decoding can significantly improve the tokens per second. However, this comes at the cost of increased VRAM usage for the draft model. The examples provided are based on a server with three P40s and one 3090.
+
+## Coding Use Case
+
+This example uses Qwen2.5 Coder 32B with the 0.5B model as a draft. A quantization of Q8_0 was chosen for the draft model, as quantization has a greater impact on smaller models.
+
+The models used are:
+
+* [Bartowski Qwen2.5-Coder-32B-Instruct](https://huggingface.co/bartowski/Qwen2.5-Coder-32B-Instruct-GGUF)
+* [Bartowski Qwen2.5-Coder-0.5B-Instruct](https://huggingface.co/bartowski/Qwen2.5-Coder-0.5B-Instruct-GGUF)
+
+The llama-swap configuration is as follows:
+
+```yaml
+models:
+  "qwen-coder-32b-q4":
+    # main model on 3090, draft on P40 #1
+    cmd: >
+      /mnt/nvme/llama-server/llama-server-be0e35
+      --host 127.0.0.1 --port 9503
+      --flash-attn --metrics
+      --slots
+      --model /mnt/nvme/models/Qwen2.5-Coder-32B-Instruct-Q4_K_M.gguf
+      -ngl 99
+      --ctx-size 19000
+      --model-draft /mnt/nvme/models/Qwen2.5-Coder-0.5B-Instruct-Q8_0.gguf
+      -ngld 99
+      --draft-max 16
+      --draft-min 4
+      --draft-p-min 0.4
+      --device CUDA0
+      --device-draft CUDA1
+    proxy: "http://127.0.0.1:9503"
+```
+
+In this configuration, two GPUs are used: a 3090 (CUDA0) for the main model and a P40 (CUDA1) for the draft model. Although both models can fit on the 3090, relocating the draft model to the P40 freed up space for a larger context size. Despite the P40 being about 1/3rd the speed of the 3090, the small model still improved tokens per second.
+
+Multiple tests were run with various parameters, and the fastest result was chosen for the configuration. In all tests, the 0.5B model produced the largest improvements to tokens per second.
+
+Baseline: 33.92 tokens/second on 3090 without a draft model.
+
+| draft-max | draft-min | draft-p-min | python | TS | swift |
+|-----------|-----------|-------------|--------|----|-------|
+| 16 | 1 | 0.9 | 71.64 | 55.55 | 48.06 |
+| 16 | 1 | 0.4 | 83.21 | 58.55 | 45.50 |
+| 16 | 1 | 0.1 | 79.72 | 55.66 | 43.94 |
+| 16 | 2 | 0.9 | 68.47 | 55.13 | 43.12 |
+| 16 | 2 | 0.4 | 82.82 | 57.42 | 48.83 |
+| 16 | 2 | 0.1 | 81.68 | 51.37 | 45.72 |
+| 16 | 4 | 0.9 | 66.44 | 48.49 | 42.40 |
+| 16 | 4 | 0.4 | _83.62_ (fastest)| _58.29_ | _50.17_ |
+| 16 | 4 | 0.1 | 82.46 | 51.45 | 40.71 |
+| 8 | 1 | 0.4 | 67.07 | 55.17 | 48.46 |
+| 4 | 1 | 0.4 | 50.13 | 44.96 | 40.79 |
+
+The test script can be found in this [gist](https://gist.github.com/mostlygeek/da429769796ac8a111142e75660820f1). It is a simple curl script that prompts generating a snake game in Python, TypeScript, or Swift. Evaluation metrics were pulled from llama.cpp's logs.
+
+```bash
+for lang in "python" "typescript" "swift"; do
+    echo "Generating Snake Game in $lang using $model"
+    curl -s --url http://localhost:8080/v1/chat/completions -d "{\"messages\": [{\"role\": \"system\", \"content\": \"you only write code.\"}, {\"role\": \"user\", \"content\": \"write snake game in $lang\"}], \"temperature\": 0.1, \"model\":\"$model\"}" > /dev/null
+done
+```
+
+Python consistently outperformed Swift in all tests, likely due to the 0.5B draft model being more proficient in generating Python code accepted by the larger 32B model.
+
+## Chat
+
+This configuration is for a regular chat use case. It produces approximately 13 tokens/second in typical use, up from ~9 tokens/second with only the 3xP40s. This is great news for P40 owners.
+
+The models used are:
+
+* [Bartowski Meta-Llama-3.1-70B-Instruct-GGUF](https://huggingface.co/bartowski/Meta-Llama-3.1-70B-Instruct-GGUF)
+* [Bartowski Llama-3.2-3B-Instruct-GGUF](https://huggingface.co/bartowski/Llama-3.2-3B-Instruct-GGUF)
+
+```yaml
+models:
+  "llama-70B":
+    cmd: >
+      /mnt/nvme/llama-server/llama-server-be0e35
+      --host 127.0.0.1 --port 9602
+      --flash-attn --metrics
+      --split-mode row
+      --ctx-size 80000
+      --model /mnt/nvme/models/Meta-Llama-3.1-70B-Instruct-Q4_K_L.gguf
+      -ngl 99
+      --model-draft /mnt/nvme/models/Llama-3.2-3B-Instruct-Q4_K_M.gguf
+      -ngld 99
+      --draft-max 16
+      --draft-min 1
+      --draft-p-min 0.4
+      --device-draft CUDA0
+      --tensor-split 0,1,1,1
+```
+
+In this configuration, Llama-3.1-70B is split across three P40s, and Llama-3.2-3B is on the 3090.
+
+Some flags deserve further explanation:
+
+* `--split-mode row` - increases inference speeds using multiple P40s by about 30%. This is a P40-specific feature.
+* `--tensor-split 0,1,1,1` - controls how the main model is split across the GPUs. This means 0% on the 3090 and an even split across the P40s. A value of `--tensor-split 0,5,4,1` would mean 0% on the 3090, 50%, 40%, and 10% respectively across the other P40s. However, this would exceed the available VRAM.
+* `--ctx-size 80000` - the maximum context size that can fit in the remaining VRAM.
+
+## What is CUDA0, CUDA1, CUDA2, CUDA3?
+
+These devices are the IDs used by llama.cpp.
+
+```bash
+$ ./llama-server --list-devices
+ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
+ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ggml_cuda_init: found 4 CUDA devices:
+  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+  Device 1: Tesla P40, compute capability 6.1, VMM: yes
+  Device 2: Tesla P40, compute capability 6.1, VMM: yes
+  Device 3: Tesla P40, compute capability 6.1, VMM: yes
+Available devices:
+  CUDA0: NVIDIA GeForce RTX 3090 (24154 MiB, 23892 MiB free)
+  CUDA1: Tesla P40 (24438 MiB, 24290 MiB free)
+  CUDA2: Tesla P40 (24438 MiB, 24290 MiB free)
+  CUDA3: Tesla P40 (24438 MiB, 24290 MiB free)
+```
@@ -20,6 +20,7 @@ require (
 	github.com/go-playground/universal-translator v0.18.1 // indirect
 	github.com/go-playground/validator/v10 v10.20.0 // indirect
 	github.com/goccy/go-json v0.10.2 // indirect
+	github.com/google/shlex v0.0.0-20191202100458-e7afc7fbc510 // indirect
 	github.com/json-iterator/go v1.1.12 // indirect
 	github.com/klauspost/cpuid/v2 v2.2.7 // indirect
 	github.com/leodido/go-urn v1.4.0 // indirect
@@ -24,6 +24,8 @@ github.com/go-playground/validator/v10 v10.20.0/go.mod h1:dbuPbCMFw/DrkbEynArYaC
 github.com/goccy/go-json v0.10.2 h1:CrxCmQqYDkv1z7lO7Wbh2HN93uovUHgrECaO5ZrCXAU=
 github.com/goccy/go-json v0.10.2/go.mod h1:6MelG93GURQebXPDq3khkgXZkazVtN9CRI+MGFi0w8I=
 github.com/google/gofuzz v1.0.0/go.mod h1:dBl0BpW6vV/+mYPU4Po3pmUjxk6FQPldtuIdl/M65Eg=
+github.com/google/shlex v0.0.0-20191202100458-e7afc7fbc510 h1:El6M4kTTCOh6aBiKaUGG7oYTSPP8MxqL4YI3kZKwcP4=
+github.com/google/shlex v0.0.0-20191202100458-e7afc7fbc510/go.mod h1:pupxD2MaaD3pAXIBCelhxNneeOaAeabZDe5s4K6zSpQ=
 github.com/json-iterator/go v1.1.12 h1:PV8peI4a0ysnczrg+LtxykD8LfKY9ML6u2jnxaEnrnM=
 github.com/json-iterator/go v1.1.12/go.mod h1:e30LSqwooZae/UwlEbR2852Gd8hjQvJoHmT4TnhNGBo=
 github.com/klauspost/cpuid/v2 v2.0.9/go.mod h1:FInQzS24/EEf25PyTYn52gqo7WaD8xa0213Md/qVLRg=
@@ -3,60 +3,137 @@ package main
 import (
 	"flag"
 	"fmt"
+	"io"
+	"log"
 	"net/http"
 	"os"
+	"os/signal"
+	"syscall"
+	"time"
+
+	"github.com/gin-gonic/gin"
 )

 func main() {
+	gin.SetMode(gin.TestMode)
 	// Define a command-line flag for the port
 	port := flag.String("port", "8080", "port to listen on")

 	// Define a command-line flag for the response message
 	responseMessage := flag.String("respond", "hi", "message to respond with")

+	silent := flag.Bool("silent", false, "disable all logging")
+
 	flag.Parse() // Parse the command-line flags

-	responseMessageHandler := func(w http.ResponseWriter, r *http.Request) {
-		// Set the header to text/plain
-		w.Header().Set("Content-Type", "text/plain")
-		fmt.Fprintln(w, *responseMessage)
-	}
+	// Create a new Gin router
+	r := gin.New()

 	// Set up the handler function using the provided response message
-	http.HandleFunc("/v1/chat/completions", responseMessageHandler)
-	http.HandleFunc("/v1/completions", responseMessageHandler)
-	http.HandleFunc("/test", responseMessageHandler)
+	r.POST("/v1/chat/completions", func(c *gin.Context) {
+		c.Header("Content-Type", "text/plain")

-	http.HandleFunc("/env", func(w http.ResponseWriter, r *http.Request) {
-		w.Header().Set("Content-Type", "text/plain")
-		fmt.Fprintln(w, *responseMessage)
+		// add a wait to simulate a slow query
+		if wait, err := time.ParseDuration(c.Query("wait")); err == nil {
+			time.Sleep(wait)
+		}
+
+		c.String(200, *responseMessage)
+	})
+
+	r.POST("/v1/completions", func(c *gin.Context) {
+		c.Header("Content-Type", "text/plain")
+		c.String(200, *responseMessage)
+	})
+
+	r.GET("/slow-respond", func(c *gin.Context) {
+		echo := c.Query("echo")
+		delay := c.Query("delay")
+
+		if echo == "" {
+			echo = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"
+		}
+
+		// Parse the duration
+		if delay == "" {
+			delay = "100ms"
+		}
+
+		t, err := time.ParseDuration(delay)
+		if err != nil {
+			c.Header("Content-Type", "text/plain")
+			c.String(http.StatusBadRequest, fmt.Sprintf("Invalid duration: %s", err))
+			return
+		}
+
+		c.Header("Content-Type", "text/plain")
+		for _, char := range echo {
+			c.Writer.Write([]byte(string(char)))
+			c.Writer.Flush()
+
+			// wait
+			<-time.After(t)
+		}
+	})
+
+	r.GET("/test", func(c *gin.Context) {
+		c.Header("Content-Type", "text/plain")
+		c.String(200, *responseMessage)
+	})
+
+	r.GET("/env", func(c *gin.Context) {
+		c.Header("Content-Type", "text/plain")
+		c.String(200, *responseMessage)

 		// Get environment variables
 		envVars := os.Environ()

 		// Write each environment variable to the response
 		for _, envVar := range envVars {
-			fmt.Fprintln(w, envVar)
+			c.String(200, envVar)
 		}
 	})

 	// Set up the /health endpoint handler function
-	http.HandleFunc("/health", func(w http.ResponseWriter, r *http.Request) {
-		w.Header().Set("Content-Type", "application/json")
-		response := `{"status": "ok"}`
-		w.Write([]byte(response))
+	r.GET("/health", func(c *gin.Context) {
+		c.Header("Content-Type", "application/json")
+		c.JSON(200, gin.H{"status": "ok"})
 	})

-	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
-		w.Header().Set("Content-Type", "text/plain")
-		fmt.Fprintf(w, "%s %s", r.Method, r.URL.Path)
+	r.GET("/", func(c *gin.Context) {
+		c.Header("Content-Type", "text/plain")
+		c.String(200, fmt.Sprintf("%s %s", c.Request.Method, c.Request.URL.Path))
 	})

 	address := "127.0.0.1:" + *port // Address with the specified port
-	fmt.Printf("Server is listening on port %s\n", *port)

-	// Start the server and log any error if it occurs
-	if err := http.ListenAndServe(address, nil); err != nil {
-		fmt.Printf("Error starting server: %s\n", err)
+	srv := &http.Server{
+		Addr:    address,
+		Handler: r.Handler(),
 	}
+
+	// Disable logging if the --silent flag is set
+	if *silent {
+		gin.SetMode(gin.ReleaseMode)
+		gin.DefaultWriter = io.Discard
+		log.SetOutput(io.Discard)
+	}
+
+	go func() {
+		log.Printf("simple-responder listening on %s\n", address)
+		// service connections
+		if err := srv.ListenAndServe(); err != nil && err != http.ErrServerClosed {
+			log.Fatalf("simple-responder err: %s\n", err)
+		}
+	}()
+
+	// Wait for interrupt signal to gracefully shutdown the server with
+	// a timeout of 5 seconds.
+	quit := make(chan os.Signal, 1)
+	// kill (no param) default send syscall.SIGTERM
+	// kill -2 is syscall.SIGINT
+	// kill -9 is syscall.SIGKILL but can't be catch, so don't need add it
+	signal.Notify(quit, syscall.SIGINT, syscall.SIGTERM)
+	<-quit
+	log.Println("simple-responder shutting down")
 }
@@ -5,6 +5,7 @@ import (
 	"os"
 	"strings"

+	"github.com/google/shlex"
 	"gopkg.in/yaml.v3"
 )

@@ -81,7 +82,10 @@ func SanitizeCommand(cmdStr string) ([]string, error) {
 	cmdStr = strings.ReplaceAll(cmdStr, "\\\n", " ")

 	// Split the command into arguments
-	args := strings.Fields(cmdStr)
+	args, err := shlex.Split(cmdStr)
+	if err != nil {
+		return nil, err
+	}

 	// Ensure the command is not empty
 	if len(args) == 0 {
@@ -148,17 +148,26 @@ func TestConfig_FindConfig(t *testing.T) {
 }

 func TestConfig_SanitizeCommand(t *testing.T) {
-	// Test a simple command
-	args, err := SanitizeCommand("python model1.py")
-	assert.NoError(t, err)
-	assert.Equal(t, []string{"python", "model1.py"}, args)

 	// Test a command with spaces and newlines
-	args, err = SanitizeCommand(`python model1.py \
-    --arg1 value1 \
-    --arg2 value2`)
+	args, err := SanitizeCommand(`python model1.py \
+    -a "double quotes" \
+    --arg2 'single quotes'
+	-s
+	--arg3 123 \
+	--arg4 '"string in string"'
+	-c "'single quoted'"
+	`)
 	assert.NoError(t, err)
-	assert.Equal(t, []string{"python", "model1.py", "--arg1", "value1", "--arg2", "value2"}, args)
+	assert.Equal(t, []string{
+		"python", "model1.py",
+		"-a", "double quotes",
+		"--arg2", "single quotes",
+		"-s",
+		"--arg3", "123",
+		"--arg4", `"string in string"`,
+		"-c", `'single quoted'`,
+	}, args)

 	// Test an empty command
 	args, err = SanitizeCommand("")
@@ -51,7 +51,7 @@ func getTestSimpleResponderConfigPort(expectedMessage string, port int) ModelCon

 	// Create a process configuration
 	return ModelConfig{
-		Cmd:           fmt.Sprintf("%s --port %d --respond '%s'", binaryPath, port, expectedMessage),
+		Cmd:           fmt.Sprintf("%s --port %d --silent --respond %s", binaryPath, port, expectedMessage),
 		Proxy:         fmt.Sprintf("http://127.0.0.1:%d", port),
 		CheckEndpoint: "/health",
 	}
@@ -14,6 +14,14 @@ import (
 	"time"
 )

+type ProcessState string
+
+const (
+	StateStopped ProcessState = ProcessState("stopped")
+	StateReady   ProcessState = ProcessState("ready")
+	StateFailed  ProcessState = ProcessState("failed")
+)
+
 type Process struct {
 	sync.Mutex

@@ -23,8 +31,12 @@ type Process struct {
 	logMonitor         *LogMonitor
 	healthCheckTimeout int

-	isRunning          bool
 	lastRequestHandled time.Time
+
+	stateMutex sync.RWMutex
+	state      ProcessState
+
+	inFlightRequests sync.WaitGroup
 }

 func NewProcess(ID string, healthCheckTimeout int, config ModelConfig, logMonitor *LogMonitor) *Process {
@@ -34,16 +46,22 @@ func NewProcess(ID string, healthCheckTimeout int, config ModelConfig, logMonito
 		cmd:                nil,
 		logMonitor:         logMonitor,
 		healthCheckTimeout: healthCheckTimeout,
+		state:              StateStopped,
 	}
 }

-// start the process and check it for errors
+// start the process and returns when it is ready
 func (p *Process) start() error {
-	p.Lock()
-	defer p.Unlock()

-	if p.isRunning {
-		return fmt.Errorf("process already running")
+	p.stateMutex.Lock()
+	defer p.stateMutex.Unlock()
+
+	if p.state == StateReady {
+		return nil
+	}
+
+	if p.state == StateFailed {
+		return fmt.Errorf("process is in a failed state and can not be restarted")
 	}

 	args, err := p.config.SanitizedCommand()
@@ -57,34 +75,47 @@ func (p *Process) start() error {
 	p.cmd.Env = p.config.Env

 	err = p.cmd.Start()
-	p.isRunning = true

 	if err != nil {
 		return err
 	}

-	// watch for the command to exit
-	cmdCtx, cancel := context.WithCancelCause(context.Background())
+	// One of three things can happen at this stage:
+	// 1. The command exits unexpectedly
+	// 2. The health check fails
+	// 3. The health check passes
+	//
+	// only in the third case will the process be considered Ready to accept
+	healthCheckContext, cancelHealthCheck := context.WithCancelCause(context.Background())
+	defer cancelHealthCheck(nil) // clean up
+	cmdWaitChan := make(chan error, 1)
+	healthCheckChan := make(chan error, 1)

-	// monitor the command's exit status. Usually this happens if
-	// the process exited unexpectedly
 	go func() {
-		err := p.cmd.Wait()
-		if err != nil {
-			cancel(fmt.Errorf("command [%s] %s", strings.Join(p.cmd.Args, " "), err.Error()))
-		} else {
-			cancel(nil)
-		}
-
-		p.isRunning = false
+		// possible cmd exits early
+		cmdWaitChan <- p.cmd.Wait()
 	}()

-	// wait a bit for process to start before checking the health endpoint
-	time.Sleep(250 * time.Millisecond)
+	go func() {
+		<-time.After(250 * time.Millisecond) // give process a bit of time to start
+		healthCheckChan <- p.checkHealthEndpoint(healthCheckContext)
+	}()

-	// wait for checkHealthEndpoint
-	if err := p.checkHealthEndpoint(cmdCtx); err != nil {
+	select {
+	case err := <-cmdWaitChan:
+		p.state = StateFailed
+		if err != nil {
+			err = fmt.Errorf("command [%s] %s", strings.Join(p.cmd.Args, " "), err.Error())
+		} else {
+			err = fmt.Errorf("command [%s] exited unexpected", strings.Join(p.cmd.Args, " "))
+		}
+		cancelHealthCheck(err)
 		return err
+	case err := <-healthCheckChan:
+		if err != nil {
+			p.state = StateFailed
+			return err
+		}
 	}

 	if p.config.UnloadAfter > 0 {
@@ -106,27 +137,64 @@ func (p *Process) start() error {
 		}()
 	}

+	p.state = StateReady
 	return nil
 }

 func (p *Process) Stop() {
-	p.Lock()
-	defer p.Unlock()
+	// wait for any inflight requests before proceeding
+	p.inFlightRequests.Wait()

-	if !p.isRunning || p.cmd == nil || p.cmd.Process == nil {
+	p.stateMutex.Lock()
+	defer p.stateMutex.Unlock()
+
+	if p.state != StateReady {
 		return
 	}

+	if p.cmd == nil || p.cmd.Process == nil {
+		// this situation should never happen... but if it does just update the state
+		fmt.Fprintf(p.logMonitor, "!!! State is Ready but Command is nil.")
+		p.state = StateStopped
+		return
+	}
+
+	// Pretty sure this stopping code needs some work for windows and
+	// will be a source of pain in the future.
+
 	p.cmd.Process.Signal(syscall.SIGTERM)
-	p.cmd.Process.Wait()
-	p.isRunning = false
+	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
+	defer cancel()
+
+	done := make(chan error, 1)
+	go func() {
+		done <- p.cmd.Wait()
+	}()
+
+	select {
+	case <-ctx.Done():
+		fmt.Printf("!!! process for %s timed out waiting to stop\n", p.ID)
+		p.cmd.Process.Kill()
+		p.cmd.Wait()
+	case err := <-done:
+		if err != nil {
+			if err.Error() != "wait: no child processes" {
+				// possible that simple-responder for testing is just not
+				// existing right, so suppress those errors.
+				fmt.Printf("!!! process for %s stopped with error > %v\n", p.ID, err)
+			}
+		}
+	}
+	p.state = StateStopped
 }

-func (p *Process) IsRunning() bool {
-	return p.isRunning
+func (p *Process) CurrentState() ProcessState {
+	p.stateMutex.RLock()
+	defer p.stateMutex.RUnlock()
+	return p.state
 }

-func (p *Process) checkHealthEndpoint(cmdCtx context.Context) error {
+func (p *Process) checkHealthEndpoint(ctxFromStart context.Context) error {
 	if p.config.Proxy == "" {
 		return fmt.Errorf("no upstream available to check /health")
 	}
@@ -158,7 +226,7 @@ func (p *Process) checkHealthEndpoint(cmdCtx context.Context) error {
 			return err
 		}

-		ctx, cancel := context.WithTimeout(cmdCtx, time.Second)
+		ctx, cancel := context.WithTimeout(ctxFromStart, time.Second)
 		defer cancel()
 		req = req.WithContext(ctx)
 		resp, err := client.Do(req)
@@ -205,7 +273,11 @@ func (p *Process) checkHealthEndpoint(cmdCtx context.Context) error {
 }

 func (p *Process) ProxyRequest(w http.ResponseWriter, r *http.Request) {
-	if !p.isRunning {
+
+	p.inFlightRequests.Add(1)
+	defer p.inFlightRequests.Done()
+
+	if p.CurrentState() != StateReady {
 		if err := p.start(); err != nil {
 			errstr := fmt.Sprintf("unable to start process: %s", err)
 			http.Error(w, errstr, http.StatusInternalServerError)
@@ -1,9 +1,12 @@
 package proxy

 import (
+	"fmt"
 	"io"
 	"net/http"
 	"net/http/httptest"
+	"os"
+	"sync"
 	"testing"
 	"time"

@@ -23,9 +26,9 @@ func TestProcess_AutomaticallyStartsUpstream(t *testing.T) {
 	w := httptest.NewRecorder()

 	// process is automatically started
-	assert.False(t, process.IsRunning())
+	assert.Equal(t, StateStopped, process.CurrentState())
 	process.ProxyRequest(w, req)
-	assert.True(t, process.IsRunning())
+	assert.Equal(t, StateReady, process.CurrentState())

 	assert.Equal(t, http.StatusOK, w.Code, "Expected status code %d, got %d", http.StatusOK, w.Code)
 	assert.Contains(t, w.Body.String(), expectedMessage)
@@ -49,7 +52,7 @@ func TestProcess_AutomaticallyStartsUpstream(t *testing.T) {
 func TestProcess_BrokenModelConfig(t *testing.T) {
 	// Create a process configuration
 	config := ModelConfig{
-		Cmd:           "nonexistant-command",
+		Cmd:           "nonexistent-command",
 		Proxy:         "http://127.0.0.1:9913",
 		CheckEndpoint: "/health",
 	}
@@ -84,13 +87,67 @@ func TestProcess_UnloadAfterTTL(t *testing.T) {

 	// Proxy the request (auto start)
 	process.ProxyRequest(w, req)
+
 	assert.Equal(t, http.StatusOK, w.Code, "Expected status code %d, got %d", http.StatusOK, w.Code)
 	assert.Contains(t, w.Body.String(), expectedMessage)

-	assert.True(t, process.IsRunning())
+	assert.Equal(t, StateReady, process.CurrentState())

 	// wait 5 seconds
 	time.Sleep(5 * time.Second)
-
-	assert.False(t, process.IsRunning())
+	assert.Equal(t, StateStopped, process.CurrentState())
+}
+
+// issue #19
+func TestProcess_HTTPRequestsHaveTimeToFinish(t *testing.T) {
+	if testing.Short() {
+		t.Skip("skipping long test")
+	}
+
+	expectedMessage := "12345"
+	config := getTestSimpleResponderConfig(expectedMessage)
+	process := NewProcess("t", 10, config, NewLogMonitorWriter(os.Stdout))
+	defer process.Stop()
+
+	results := map[string]string{
+		"12345": "",
+		"abcde": "",
+		"fghij": "",
+	}
+
+	var wg sync.WaitGroup
+	var mu sync.Mutex
+
+	for key := range results {
+		wg.Add(1)
+		go func(key string) {
+			defer wg.Done()
+			// send a request that should take 5 * 200ms (1 second) to complete
+			req := httptest.NewRequest("GET", fmt.Sprintf("/slow-respond?echo=%s&delay=200ms", key), nil)
+			w := httptest.NewRecorder()
+
+			process.ProxyRequest(w, req)
+
+			if w.Code != http.StatusOK {
+				t.Errorf("Expected status OK, got %d for key %s", w.Code, key)
+			}
+
+			mu.Lock()
+			results[key] = w.Body.String()
+			mu.Unlock()
+
+		}(key)
+	}
+
+	// stop the requests in the middle
+	go func() {
+		<-time.After(500 * time.Millisecond)
+		process.Stop()
+	}()
+
+	wg.Wait()
+
+	for key, result := range results {
+		assert.Equal(t, key, result)
+	}
 }
@@ -14,6 +14,10 @@ import (
 	"github.com/gin-gonic/gin"
 )

+const (
+	PROFILE_SPLIT_CHAR = ":"
+)
+
 type ProxyManager struct {
 	sync.Mutex

@@ -69,9 +73,14 @@ func (pm *ProxyManager) StopProcesses() {

 // for internal usage
 func (pm *ProxyManager) stopProcesses() {
+	if len(pm.currentProcesses) == 0 {
+		return
+	}
+
 	for _, process := range pm.currentProcesses {
 		process.Stop()
 	}
+
 	pm.currentProcesses = make(map[string]*Process)
 }

@@ -101,15 +110,15 @@ func (pm *ProxyManager) swapModel(requestedModel string) (*Process, error) {
 	defer pm.Unlock()

 	// Check if requestedModel contains a /
-	groupName, modelName := "", requestedModel
-	if idx := strings.Index(requestedModel, "/"); idx != -1 {
-		groupName = requestedModel[:idx]
+	profileName, modelName := "", requestedModel
+	if idx := strings.Index(requestedModel, PROFILE_SPLIT_CHAR); idx != -1 {
+		profileName = requestedModel[:idx]
 		modelName = requestedModel[idx+1:]
 	}

-	if groupName != "" {
-		if _, found := pm.config.Profiles[groupName]; !found {
-			return nil, fmt.Errorf("model group not found %s", groupName)
+	if profileName != "" {
+		if _, found := pm.config.Profiles[profileName]; !found {
+			return nil, fmt.Errorf("model group not found %s", profileName)
 		}
 	}

@@ -120,7 +129,8 @@ func (pm *ProxyManager) swapModel(requestedModel string) (*Process, error) {
 	}

 	// exit early when already running, otherwise stop everything and swap
-	requestedProcessKey := groupName + "/" + realModelName
+	requestedProcessKey := ProcessKeyName(profileName, realModelName)
+
 	if process, found := pm.currentProcesses[requestedProcessKey]; found {
 		return process, nil
 	}
@@ -128,25 +138,25 @@ func (pm *ProxyManager) swapModel(requestedModel string) (*Process, error) {
 	// stop all running models
 	pm.stopProcesses()

-	if groupName == "" {
+	if profileName == "" {
 		modelConfig, modelID, found := pm.config.FindConfig(realModelName)
 		if !found {
 			return nil, fmt.Errorf("could not find configuration for %s", realModelName)
 		}

 		process := NewProcess(modelID, pm.config.HealthCheckTimeout, modelConfig, pm.logMonitor)
-		processKey := groupName + "/" + modelID
+		processKey := ProcessKeyName(profileName, modelID)
 		pm.currentProcesses[processKey] = process
 	} else {
-		for _, modelName := range pm.config.Profiles[groupName] {
+		for _, modelName := range pm.config.Profiles[profileName] {
 			if realModelName, found := pm.config.RealModelName(modelName); found {
 				modelConfig, modelID, found := pm.config.FindConfig(realModelName)
 				if !found {
-					return nil, fmt.Errorf("could not find configuration for %s in group %s", realModelName, groupName)
+					return nil, fmt.Errorf("could not find configuration for %s in group %s", realModelName, profileName)
 				}

 				process := NewProcess(modelID, pm.config.HealthCheckTimeout, modelConfig, pm.logMonitor)
-				processKey := groupName + "/" + modelID
+				processKey := ProcessKeyName(profileName, modelID)
 				pm.currentProcesses[processKey] = process
 			}
 		}
@@ -185,7 +195,6 @@ func (pm *ProxyManager) proxyChatRequestHandler(c *gin.Context) {

 		process.ProxyRequest(c.Writer, c.Request)
 	}
-
 }

 func (pm *ProxyManager) proxyNoRouteHandler(c *gin.Context) {
@@ -197,3 +206,7 @@ func (pm *ProxyManager) proxyNoRouteHandler(c *gin.Context) {

 	c.AbortWithError(http.StatusBadRequest, fmt.Errorf("no strategy to handle request"))
 }
+
+func ProcessKeyName(groupName, modelName string) string {
+	return groupName + PROFILE_SPLIT_CHAR + modelName
+}
@@ -5,7 +5,9 @@ import (
 	"fmt"
 	"net/http"
 	"net/http/httptest"
+	"sync"
 	"testing"
+	"time"

 	"github.com/stretchr/testify/assert"
 )
@@ -31,7 +33,7 @@ func TestProxyManager_SwapProcessCorrectly(t *testing.T) {
 		assert.Equal(t, http.StatusOK, w.Code)
 		assert.Contains(t, w.Body.String(), modelName)

-		_, exists := proxy.currentProcesses["/"+modelName]
+		_, exists := proxy.currentProcesses[ProcessKeyName("", modelName)]
 		assert.True(t, exists, "expected %s key in currentProcesses", modelName)

 	}
@@ -41,21 +43,31 @@ func TestProxyManager_SwapProcessCorrectly(t *testing.T) {
 }

 func TestProxyManager_SwapMultiProcess(t *testing.T) {
+
+	model1 := "path1/model1"
+	model2 := "path2/model2"
+
+	profileModel1 := ProcessKeyName("test", model1)
+	profileModel2 := ProcessKeyName("test", model2)
+
 	config := &Config{
 		HealthCheckTimeout: 15,
 		Models: map[string]ModelConfig{
-			"model1": getTestSimpleResponderConfig("model1"),
-			"model2": getTestSimpleResponderConfig("model2"),
+			model1: getTestSimpleResponderConfig("model1"),
+			model2: getTestSimpleResponderConfig("model2"),
 		},
 		Profiles: map[string][]string{
-			"test": {"model1", "model2"},
+			"test": {model1, model2},
 		},
 	}

 	proxy := New(config)
 	defer proxy.StopProcesses()

-	for modelID, requestedModel := range map[string]string{"model1": "test/model1", "model2": "test/model2"} {
+	for modelID, requestedModel := range map[string]string{
+		"model1": profileModel1,
+		"model2": profileModel2,
+	} {
 		reqBody := fmt.Sprintf(`{"model":"%s"}`, requestedModel)
 		req := httptest.NewRequest("POST", "/v1/chat/completions", bytes.NewBufferString(reqBody))
 		w := httptest.NewRecorder()
@@ -67,10 +79,65 @@ func TestProxyManager_SwapMultiProcess(t *testing.T) {

 	// make sure there's two loaded models
 	assert.Len(t, proxy.currentProcesses, 2)
-	_, exists := proxy.currentProcesses["test/model1"]
-	assert.True(t, exists, "expected test/model1 key in currentProcesses")
-
-	_, exists = proxy.currentProcesses["test/model2"]
-	assert.True(t, exists, "expected test/model2 key in currentProcesses")
+	_, exists := proxy.currentProcesses[profileModel1]
+	assert.True(t, exists, "expected "+profileModel1+" key in currentProcesses")

+	_, exists = proxy.currentProcesses[profileModel2]
+	assert.True(t, exists, "expected "+profileModel2+" key in currentProcesses")
+}
+
+// When a request for a different model comes in ProxyManager should wait until
+// the first request is complete before swapping. Both requests should complete
+func TestProxyManager_SwapMultiProcessParallelRequests(t *testing.T) {
+	if testing.Short() {
+		t.Skip("skipping slow test")
+	}
+
+	config := &Config{
+		HealthCheckTimeout: 15,
+		Models: map[string]ModelConfig{
+			"model1": getTestSimpleResponderConfig("model1"),
+			"model2": getTestSimpleResponderConfig("model2"),
+			"model3": getTestSimpleResponderConfig("model3"),
+		},
+	}
+
+	proxy := New(config)
+	defer proxy.StopProcesses()
+
+	results := map[string]string{}
+
+	var wg sync.WaitGroup
+	var mu sync.Mutex
+
+	for key := range config.Models {
+		wg.Add(1)
+		go func(key string) {
+			defer wg.Done()
+
+			reqBody := fmt.Sprintf(`{"model":"%s"}`, key)
+			req := httptest.NewRequest("POST", "/v1/chat/completions?wait=1000ms", bytes.NewBufferString(reqBody))
+			w := httptest.NewRecorder()
+
+			proxy.HandlerFunc(w, req)
+
+			if w.Code != http.StatusOK {
+				t.Errorf("Expected status OK, got %d for key %s", w.Code, key)
+			}
+
+			mu.Lock()
+
+			results[key] = w.Body.String()
+			mu.Unlock()
+		}(key)
+
+		<-time.After(time.Millisecond)
+	}
+
+	wg.Wait()
+	assert.Len(t, results, len(config.Models))
+
+	for key, result := range results {
+		assert.Equal(t, key, result)
+	}
 }
Author	SHA1	Message	Date
Benson Wong	04b4760e7e	change profile split character to : (colon) (#21 ) - change from `/` to `:` for multiple models loaded as part of a profile - breaking change now, but allows for more compatibility with other inference engines that may have model references like `coding:Qwen/Qwen-2.5-Coder-32B`	2024-12-01 09:10:50 -08:00
Benson Wong	9fc5d5b5eb	improve cmd parsing (#22 ) Switch from using a naive strings.Fields() to shlex.Split() for parsing the model startup command into a string[]. This makes parsing much more reliable around newlines, quotes, etc.	2024-12-01 09:02:58 -08:00
Benson Wong	cf82b3c633	Improve Concurrency and Parallel Request Handling (#19 ) Rewrite the swap behaviour so that in-flight requests block process swapping until they are completed. Additionally: - add tests for parallel requests with proxy.ProxyManager and proxy.Process - improve Process startup behaviour and simplified the code - stopping of processes are sent SIGTERM and have 5 seconds to terminate, before they are killed	2024-11-30 15:24:42 -08:00
Benson Wong	e363f8f498	clean up writing with AI :b	2024-11-28 22:12:44 -08:00
Benson Wong	c9629cf3a2	add speculative decoding example	2024-11-28 22:07:22 -08:00
Benson Wong	50426935a4	.	2024-11-28 22:06:29 -08:00
Benson Wong	2fceb78e8d	Add examples	2024-11-28 22:05:41 -08:00
Ikko Eltociear Ashimine	9a81c53664	chore: update process_test.go (#17 ) nonexistant -> nonexistent	2024-11-26 10:20:16 -08:00
Benson Wong	716d37de82	Update README.md fix grammar	2024-11-25 12:35:00 -08:00