Add Access-Control-Allow-Origin CORS header to /v1/models endpoint

- match behavior of llama.cpp where the Origin in request is used
- add test for listModelsHandler
add example: optimizing code generation
2024-12-03 15:53:59 -08:00 · 2024-12-03 10:25:43 -08:00 · 2024-12-01 10:13:31 -08:00
7 changed files with 244 additions and 8 deletions
@@ -64,7 +64,7 @@ models:
 #
 # Tips:
 #  - each model must be listening on a unique address and port
-#  - the model name is in this format: "profile_name/model", like "coding/qwen"
+#  - the model name is in this format: "profile_name:model", like "coding:qwen"
 #  - the profile will load and unload all models in the profile at the same time
 profiles:
  coding:
@@ -1,9 +1,6 @@
-# Example Configurations
+# Example Configs and Use Cases

-Learning by example is best.
-
-Here in the `examples/` folder are llama-swap configurations that can be used on your local LLM server.
-
-## List
+A collection of use cases and examples for getting the most out of llama-swap.

 * [Speculative Decoding](speculative-decoding/README.md) - using a small draft model can increase inference speeds from 20% to 40%. This example includes configurations for Qwen2.5-Coder-32B (2.5x increase) and Llama-3.1-70B (1.4x increase) in the best cases.
+* [Optimizing Code Generation](benchmark-snakegame/README.md) - find the optimal settings for your machine. This example demonstrates defining multiple configurations and testing which one is fastest.
@@ -0,0 +1,123 @@
+# Optimizing Code Generation with llama-swap
+
+Finding the best mix of settings for your hardware can be time-consuming. This example demonstrates using a custom configuration file to automate testing different scenarios to find an optimal configuration.
+
+The benchmark writes a snake game in Python, TypeScript, and Swift using the Qwen 2.5 Coder models. The experiments were done using a 3090 and a P40.
+
+**Benchmark Scenarios**
+
+Three scenarios are tested:
+
+- 3090-only: Just the main model on the 3090
+- 3090-with-draft: the main and draft models on the 3090
+- 3090-P40-draft: the main model on the 3090 with the draft model offloaded to the P40
+
+**Available Devices**
+
+Use the following command to list the available device IDs for the configuration:
+
+```
+$ /mnt/nvme/llama-server/llama-server-f3252055 --list-devices
+ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
+ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ggml_cuda_init: found 4 CUDA devices:
+  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+  Device 1: Tesla P40, compute capability 6.1, VMM: yes
+  Device 2: Tesla P40, compute capability 6.1, VMM: yes
+  Device 3: Tesla P40, compute capability 6.1, VMM: yes
+Available devices:
+  CUDA0: NVIDIA GeForce RTX 3090 (24154 MiB, 406 MiB free)
+  CUDA1: Tesla P40 (24438 MiB, 22942 MiB free)
+  CUDA2: Tesla P40 (24438 MiB, 24144 MiB free)
+  CUDA3: Tesla P40 (24438 MiB, 24144 MiB free)
+```
+
+**Configuration**
+
+The configuration file, `benchmark-config.yaml`, defines the three scenarios:
+
+```yaml
+models:
+  "3090-only":
+    proxy: "http://127.0.0.1:9503"
+    cmd: >
+      /mnt/nvme/llama-server/llama-server-f3252055
+      --host 127.0.0.1 --port 9503
+      --flash-attn
+      --slots
+
+      --model /mnt/nvme/models/Qwen2.5-Coder-32B-Instruct-Q4_K_M.gguf
+      -ngl 99
+      --device CUDA0
+
+      --ctx-size 32768
+      --cache-type-k q8_0 --cache-type-v q8_0
+
+  "3090-with-draft":
+    proxy: "http://127.0.0.1:9503"
+    # --ctx-size 28500 max that can fit on 3090 after draft model
+    cmd: >
+      /mnt/nvme/llama-server/llama-server-f3252055
+      --host 127.0.0.1 --port 9503
+      --flash-attn
+      --slots
+
+      --model /mnt/nvme/models/Qwen2.5-Coder-32B-Instruct-Q4_K_M.gguf
+      -ngl 99
+      --device CUDA0
+
+      --model-draft /mnt/nvme/models/Qwen2.5-Coder-0.5B-Instruct-Q8_0.gguf
+      -ngld 99
+      --draft-max 16
+      --draft-min 4
+      --draft-p-min 0.4
+      --device-draft CUDA0
+
+      --ctx-size 28500
+      --cache-type-k q8_0 --cache-type-v q8_0
+
+  "3090-P40-draft":
+    proxy: "http://127.0.0.1:9503"
+    cmd: >
+      /mnt/nvme/llama-server/llama-server-f3252055
+      --host 127.0.0.1 --port 9503
+      --flash-attn --metrics
+      --slots
+      --model /mnt/nvme/models/Qwen2.5-Coder-32B-Instruct-Q4_K_M.gguf
+      -ngl 99
+      --device CUDA0
+
+      --model-draft /mnt/nvme/models/Qwen2.5-Coder-0.5B-Instruct-Q8_0.gguf
+      -ngld 99
+      --draft-max 16
+      --draft-min 4
+      --draft-p-min 0.4
+      --device-draft CUDA1
+
+      --ctx-size 32768
+      --cache-type-k q8_0 --cache-type-v q8_0
+```
+
+> Note: in the `3090-with-draft` scenario the `--ctx-size` had to be reduced from 32768 to 28500 to accommodate the draft model.
+
+
+**Running the Benchmark**
+
+To run the benchmark, execute the following commands:
+
+1. `llama-swap -config benchmark-config.yaml`
+1. `./run-benchmark.sh http://localhost:8080 "3090-only" "3090-with-draft" "3090-P40-draft"`
+
+The [benchmark script](run-benchmark.sh) generates a CSV output of the results, which can be converted to a Markdown table for readability.
+
+**Results (tokens/second)**
+
+| model           | python | typescript | swift |
+|-----------------|--------|------------|-------|
+| 3090-only       | 34.03  | 34.01      | 34.01 |
+| 3090-with-draft | 106.65 | 70.48      | 57.89 |
+| 3090-P40-draft  | 81.54  | 60.35      | 46.50 |
+
+Many different factors, like the programming language, can have big impacts on the performance gains. However, with a custom configuration file for benchmarking it is easy to test the different variations to discover what's best for your hardware.
+
+Happy coding!
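The README above says the CSV output "can be converted to a Markdown table" but does not show how. One possible way to do that conversion is sketched below in Go. The function name `csvToMarkdown` is hypothetical and not part of llama-swap; the sample rows are copied from the results table in the README.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strings"
)

// csvToMarkdown is a hypothetical helper (not part of llama-swap) that turns
// CSV text, with the first row as the header, into a Markdown table.
func csvToMarkdown(input string) (string, error) {
	rows, err := csv.NewReader(strings.NewReader(input)).ReadAll()
	if err != nil {
		return "", err
	}
	if len(rows) == 0 {
		return "", nil
	}

	var sb strings.Builder
	for i, row := range rows {
		sb.WriteString("| " + strings.Join(row, " | ") + " |\n")
		if i == 0 {
			// Markdown needs a separator row after the header:
			// one "---" cell per column.
			sep := make([]string, len(row))
			for j := range sep {
				sep[j] = "---"
			}
			sb.WriteString("| " + strings.Join(sep, " | ") + " |\n")
		}
	}
	return sb.String(), nil
}

func main() {
	// Sample rows taken from the README's results table.
	sample := "model,python,typescript,swift\n" +
		"3090-only,34.03,34.01,34.01\n" +
		"3090-with-draft,106.65,70.48,57.89\n"

	table, err := csvToMarkdown(sample)
	if err != nil {
		panic(err)
	}
	fmt.Print(table)
}
```

The sketch does not pad columns, so the raw Markdown is less tidy than the hand-aligned table in the README, but it renders the same way.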
@@ -0,0 +1,43 @@
+#!/usr/bin/env bash
+
+# This script prints CSV output showing the tokens/second for generating a Snake Game in Python, TypeScript and Swift
+# It was created to test the effects of speculative decoding and the various draft settings on performance.
+#
+# Writing code with a low temperature seems to provide fairly consistent logic.
+#
+# Usage: ./run-benchmark.sh <url> <model1> [model2 ...]
+# Example: ./run-benchmark.sh http://localhost:8080 model1 model2
+
+if [ "$#" -lt 2 ]; then
+    echo "Usage: $0 <url> <model1> [model2 ...]"
+    exit 1
+fi
+
+url=$1; shift
+
+echo "model,python,typescript,swift"
+
+for model in "$@"; do
+
+    echo -n "$model,"
+
+    for lang in "python" "typescript" "swift"; do
+        response=$(curl -s --url "$url/v1/chat/completions" -d "{\"messages\": [{\"role\": \"system\", \"content\": \"you only write code.\"}, {\"role\": \"user\", \"content\": \"write snake game in $lang\"}], \"temperature\": 0.1, \"model\":\"$model\"}")
+        if [ $? -ne 0 ]; then
+            time="error"
+        else
+            time=$(curl -s --url "$url/logs" | grep -oE '[0-9]+(\.[0-9]+)? tokens per second' | awk '{print $1}' | tail -n 1)
+            if [ -z "$time" ]; then  # the pipeline's status is tail's, so check for empty output instead
+                time="error"
+            fi
+        fi
+
+        if [ "$lang" != "swift" ]; then
+            echo -n "$time,"
+        else
+            echo -n "$time"
+        fi
+    done
+
+    echo ""
+done
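The script above scrapes the last "N tokens per second" figure from the `/logs` output. The same extraction step could be written in Go, whose `regexp` package accepts `\d` and non-capturing groups. This is a sketch only: `lastTokensPerSecond` is a hypothetical name, and the log lines in `main` are invented placeholders, not real llama-server output.

```go
package main

import (
	"fmt"
	"regexp"
)

// tokensPerSecondRe captures the number that precedes "tokens per second".
var tokensPerSecondRe = regexp.MustCompile(`(\d+(?:\.\d+)?) tokens per second`)

// lastTokensPerSecond is a hypothetical helper that returns the last
// "N tokens per second" value found in logs, or false if there is none.
func lastTokensPerSecond(logs string) (string, bool) {
	matches := tokensPerSecondRe.FindAllStringSubmatch(logs, -1)
	if len(matches) == 0 {
		return "", false
	}
	// Each match is [full match, captured number]; keep the final number.
	return matches[len(matches)-1][1], true
}

func main() {
	// Placeholder log text for illustration only.
	logs := "run 1: 34.03 tokens per second\n" +
		"run 2: 106.65 tokens per second\n"

	if value, ok := lastTokensPerSecond(logs); ok {
		fmt.Println(value)
	} else {
		fmt.Println("error")
	}
}
```

Returning an explicit found/not-found flag avoids relying on a pipeline's exit status, which in shell reflects only the last command.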
@@ -101,7 +101,7 @@ func TestProcess_UnloadAfterTTL(t *testing.T) {
 // issue #19
 func TestProcess_HTTPRequestsHaveTimeToFinish(t *testing.T) {
 	if testing.Short() {
-		t.Skip("skipping long test")
+		t.Skip("skipping slow test")
 	}

 	expectedMessage := "12345"
@@ -98,6 +98,10 @@ func (pm *ProxyManager) listModelsHandler(c *gin.Context) {
 	// Set the Content-Type header to application/json
 	c.Header("Content-Type", "application/json")

+	if origin := c.Request.Header.Get("Origin"); origin != "" {
+		c.Header("Access-Control-Allow-Origin", origin)
+	}
+
 	// Encode the data as JSON and write it to the response writer
 	if err := json.NewEncoder(c.Writer).Encode(map[string]interface{}{"data": data}); err != nil {
 		c.AbortWithError(http.StatusInternalServerError, fmt.Errorf("error encoding JSON"))
@@ -2,6 +2,7 @@ package proxy

 import (
 	"bytes"
+	"encoding/json"
 	"fmt"
 	"net/http"
 	"net/http/httptest"
@@ -141,3 +142,71 @@ func TestProxyManager_SwapMultiProcessParallelRequests(t *testing.T) {
 		assert.Equal(t, key, result)
 	}
 }
+
+func TestProxyManager_ListModelsHandler(t *testing.T) {
+	config := &Config{
+		HealthCheckTimeout: 15,
+		Models: map[string]ModelConfig{
+			"model1": getTestSimpleResponderConfig("model1"),
+			"model2": getTestSimpleResponderConfig("model2"),
+			"model3": getTestSimpleResponderConfig("model3"),
+		},
+	}
+
+	proxy := New(config)
+
+	// Create a test request
+	req := httptest.NewRequest("GET", "/v1/models", nil)
+	req.Header.Add("Origin", "i-am-the-origin")
+	w := httptest.NewRecorder()
+
+	// Call the listModelsHandler
+	proxy.HandlerFunc(w, req)
+
+	// Check the response status code
+	assert.Equal(t, http.StatusOK, w.Code)
+
+	// Check for Access-Control-Allow-Origin
+	assert.Equal(t, req.Header.Get("Origin"), w.Result().Header.Get("Access-Control-Allow-Origin"))
+
+	// Parse the JSON response
+	var response struct {
+		Data []map[string]interface{} `json:"data"`
+	}
+	if err := json.Unmarshal(w.Body.Bytes(), &response); err != nil {
+		t.Fatalf("Failed to parse JSON response: %v", err)
+	}
+
+	// Check the number of models returned
+	assert.Len(t, response.Data, 3)
+
+	// Check the details of each model
+	expectedModels := map[string]struct{}{
+		"model1": {},
+		"model2": {},
+		"model3": {},
+	}
+
+	for _, model := range response.Data {
+		modelID, ok := model["id"].(string)
+		assert.True(t, ok, "model ID should be a string")
+		_, exists := expectedModels[modelID]
+		assert.True(t, exists, "unexpected model ID: %s", modelID)
+		delete(expectedModels, modelID)
+
+		object, ok := model["object"].(string)
+		assert.True(t, ok, "object should be a string")
+		assert.Equal(t, "model", object)
+
+		created, ok := model["created"].(float64)
+		assert.True(t, ok, "created should be a number")
+		assert.Greater(t, created, float64(0)) // Assuming the timestamp is positive
+
+		ownedBy, ok := model["owned_by"].(string)
+		assert.True(t, ok, "owned_by should be a string")
+		assert.Equal(t, "llama-swap", ownedBy)
+	}
+
+	// Ensure all expected models were returned
+	assert.Empty(t, expectedModels, "not all expected models were returned")
+}
Author	SHA1	Message	Date
Benson Wong	18c134624d	Add Access-Control-Allow-Origin CORS header to /v1/models endpoint - match behavior of llama.cpp where the Origin in request is used - add test for listModelsHandler	2024-12-03 15:53:59 -08:00
Benson Wong	da2326bdc7	add example: optimizing code generation	2024-12-03 10:25:43 -08:00
Benson Wong	da46545630	fix profile example in README	2024-12-01 10:13:31 -08:00
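
The CORS commit listed above echoes the request's `Origin` header back as `Access-Control-Allow-Origin`. The commit itself does this inside a gin handler; the sketch below shows the same idea using only the standard library's `net/http`, with hypothetical names (`echoOrigin`, `allowOriginFor`) that are not part of llama-swap.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// echoOrigin is a hypothetical middleware: when the request carries an Origin
// header, it copies that value into Access-Control-Allow-Origin.
func echoOrigin(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if origin := r.Header.Get("Origin"); origin != "" {
			w.Header().Set("Access-Control-Allow-Origin", origin)
		}
		next.ServeHTTP(w, r)
	})
}

// allowOriginFor sends a GET /v1/models request with the given Origin through
// the middleware and returns the resulting Access-Control-Allow-Origin value.
func allowOriginFor(origin string) string {
	handler := echoOrigin(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "application/json")
		fmt.Fprint(w, `{"data":[]}`)
	}))

	req := httptest.NewRequest(http.MethodGet, "/v1/models", nil)
	if origin != "" {
		req.Header.Set("Origin", origin)
	}
	rec := httptest.NewRecorder()
	handler.ServeHTTP(rec, req)

	return rec.Result().Header.Get("Access-Control-Allow-Origin")
}

func main() {
	fmt.Printf("%q\n", allowOriginFor("http://localhost:3000"))
	fmt.Printf("%q\n", allowOriginFor(""))
}
```

Requests without an `Origin` header get no CORS header at all, matching the `origin != ""` guard in the commit. A server that varies this header per request may also want to send `Vary: Origin` so shared caches do not reuse a response across origins; that is a general suggestion and not something this change does.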