Compare commits

...

8 Commits

Author SHA1 Message Date
Benson Wong 574fdfabb4 UI improvements (#213)
* use two column for logs view on wider screens

* hide log controls when panel is minimized
2025-07-31 11:59:21 -07:00
Benson Wong 5172cb2e12 Update docs in Readme [skip ci] 2025-07-30 11:51:14 -07:00
Benson Wong 5672cb03fd Update github actions for notifying homebrew build (#212)
Combine homebrew-llama-swap event with the release action
2025-07-30 11:29:03 -07:00
Benson Wong 0f583163f7 add /health (#211) 2025-07-30 10:37:10 -07:00
Benson Wong 7905fa9ea3 Update trigger-homebrew-update.yml [skip ci] 2025-07-30 10:13:49 -07:00
Ian Sebastian Mathew bbaf172956 add trigger to rebuild homebrew formula (#210) 2025-07-30 10:12:21 -07:00
Benson Wong fd50932dbc Decouple MetricsMiddleware from downstream handlers (#206)
* Decouple MetricsMiddleware from downstream handlers

Remove ls-real-model-name optimization. Within proxyOAIHandler the
request body's bytes are required for various rewriting features
anyways. This negated any benefits from trying not to parse it twice.
2025-07-27 10:36:06 -07:00
Gaël James 8c693e7fcf Add endpoint aliases for reranking models (#201)
* Add endpoint aliases for reranking models
* Add MetricsMiddleware to the previous reranking endpoint
* Fix the embeddings endpoint not having model set
2025-07-24 08:32:47 -07:00
6 changed files with 127 additions and 40 deletions
+33 -3
@@ -7,6 +7,10 @@ on:
   # Allows manual triggering of the workflow
   workflow_dispatch:
+    inputs:
+      tag:
+        description: 'Tag version to release (e.g. v144)'
+        required: true
 permissions:
   contents: write
@@ -20,15 +24,15 @@ jobs:
         uses: actions/checkout@v4
         with:
           fetch-depth: 0
+          ref: ${{ github.event.inputs.tag || github.ref }}
       -
         name: Set up Go
         uses: actions/setup-go@v5
       -
         name: Set up Node.js
         uses: actions/setup-node@v4
         with:
-          node-version: '23' # or your preferred version
+          node-version: '23'
       -
         name: Install dependencies and build UI
         run: |
@@ -46,4 +50,30 @@ jobs:
           version: '~> v2'
           args: release --clean
         env:
           GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
+  trigger-tap-update:
+    runs-on: ubuntu-latest
+    needs: goreleaser
+    steps:
+      - name: "Resolve tag to dispatch"
+        id: tag
+        run: |
+          if [[ "${{ github.event_name }}" == "workflow_dispatch" ]]; then
+            echo "tag=${{ github.event.inputs.tag }}" >> "$GITHUB_OUTPUT"
+          else
+            echo "tag=${{ github.ref_name }}" >> "$GITHUB_OUTPUT"
+          fi
+      - name: "Trigger tap repository update"
+        uses: peter-evans/repository-dispatch@v2
+        with:
+          token: ${{ secrets.TAP_REPO_PAT }}
+          repository: mostlygeek/homebrew-llama-swap
+          event-type: new-release
+          client-payload: |
+            {
+              "release": {
+                "tag_name": "${{ steps.tag.outputs.tag }}"
+              }
+            }
+29 -6
@@ -18,7 +18,7 @@ Written in golang, it is very easy to install (single binary with no dependencie
   - `v1/completions`
   - `v1/chat/completions`
   - `v1/embeddings`
-  - `v1/rerank`
+  - `v1/rerank`, `v1/reranking`, `rerank`
   - `v1/audio/speech` ([#36](https://github.com/mostlygeek/llama-swap/issues/36))
   - `v1/audio/transcriptions` ([docs](https://github.com/mostlygeek/llama-swap/issues/41#issuecomment-2722637867))
 - ✅ llama-swap custom API endpoints
@@ -27,6 +27,7 @@ Written in golang, it is very easy to install (single binary with no dependencie
   - `/upstream/:model_id` - direct access to upstream HTTP server ([demo](https://github.com/mostlygeek/llama-swap/pull/31))
   - `/unload` - manually unload running models ([#58](https://github.com/mostlygeek/llama-swap/issues/58))
   - `/running` - list currently running models ([#61](https://github.com/mostlygeek/llama-swap/issues/61))
+  - `/health` - just returns "OK"
 - ✅ Run multiple models at once with `Groups` ([#107](https://github.com/mostlygeek/llama-swap/issues/107))
 - ✅ Automatic unloading of models after timeout by setting a `ttl`
 - ✅ Use any local OpenAI compatible server (llama.cpp, vllm, tabbyAPI, etc)
@@ -74,10 +75,18 @@ llama-swap ships with a real time web interface to monitor logs and status of mo
 <img width="1786" height="1334" alt="image" src="https://github.com/user-attachments/assets/d6258cb9-1dad-40db-828f-2be860aec8fe" />
-## Docker Install ([download images](https://github.com/mostlygeek/llama-swap/pkgs/container/llama-swap))
-Docker is the quickest way to try out llama-swap:
+## Installation
+llama-swap can be installed in multiple ways
+1. Docker
+2. Homebrew (OSX and Linux)
+3. From release binaries
+4. From source
+### Docker Install ([download images](https://github.com/mostlygeek/llama-swap/pkgs/container/llama-swap))
+Docker images with llama-swap and llama-server are built nightly.
 ```shell
 # use CPU inference comes with the example config above
@@ -99,7 +108,7 @@ $ curl -s http://localhost:9292/v1/chat/completions \
 ```
 <details>
-<summary>Docker images are built nightly for cuda, intel, vulcan, etc ...</summary>
+<summary>Docker images are built nightly with llama-server for cuda, intel, vulcan and musa.</summary>
 They include:
@@ -122,9 +131,23 @@ $ docker run -it --rm --runtime nvidia -p 9292:8080 \
 </details>
-## Bare metal Install ([download](https://github.com/mostlygeek/llama-swap/releases))
-Pre-built binaries are available for Linux, Mac, Windows and FreeBSD. These are automatically published and are likely a few hours ahead of the docker releases. The baremetal install works with any OpenAI compatible server, not just llama-server.
+### Homebrew Install (macOS/Linux)
+The latest release of `llama-swap` can be installed via [Homebrew](https://brew.sh).
+```shell
+# Set up tap and install formula
+brew tap mostlygeek/llama-swap
+brew install llama-swap
+# Run llama-swap
+llama-swap --config path/to/config.yaml --listen localhost:8080
+```
+This will install the `llama-swap` binary and make it available in your path. See the [configuration documentation](https://github.com/mostlygeek/llama-swap/wiki/Configuration)
+### Pre-built Binaries ([download](https://github.com/mostlygeek/llama-swap/releases))
+Binaries are available for Linux, Mac, Windows and FreeBSD. These are automatically published and are likely a few hours ahead of the docker releases. The binary install works with any OpenAI compatible server, not just llama-server.
 1. Download a [release](https://github.com/mostlygeek/llama-swap/releases) appropriate for your OS and architecture.
 1. Create a configuration file, see the [configuration documentation](https://github.com/mostlygeek/llama-swap/wiki/Configuration).
+3 -1
@@ -17,6 +17,7 @@ func MetricsMiddleware(pm *ProxyManager) gin.HandlerFunc {
 		bodyBytes, err := io.ReadAll(c.Request.Body)
 		if err != nil {
 			pm.sendErrorResponse(c, http.StatusBadRequest, "could not ready request body")
+			c.Abort()
 			return
 		}
 		c.Request.Body = io.NopCloser(bytes.NewBuffer(bodyBytes))
@@ -24,15 +25,16 @@ func MetricsMiddleware(pm *ProxyManager) gin.HandlerFunc {
 		requestedModel := gjson.GetBytes(bodyBytes, "model").String()
 		if requestedModel == "" {
 			pm.sendErrorResponse(c, http.StatusBadRequest, "missing or invalid 'model' key")
+			c.Abort()
 			return
 		}
 		realModelName, found := pm.config.RealModelName(requestedModel)
 		if !found {
 			pm.sendErrorResponse(c, http.StatusBadRequest, fmt.Sprintf("could not find real modelID for %s", requestedModel))
+			c.Abort()
 			return
 		}
-		c.Set("ls-real-model-name", realModelName)
 		writer := &MetricsResponseWriter{
 			ResponseWriter: c.Writer,
+17 -5
@@ -14,6 +14,7 @@ import (
 	"time"
 	"github.com/gin-gonic/gin"
+	"github.com/tidwall/gjson"
 	"github.com/tidwall/sjson"
 )
@@ -160,8 +161,10 @@ func (pm *ProxyManager) setupGinEngine() {
 	pm.ginEngine.POST("/v1/completions", mm, pm.proxyOAIHandler)
 	// Support embeddings
-	pm.ginEngine.POST("/v1/embeddings", pm.proxyOAIHandler)
-	pm.ginEngine.POST("/v1/rerank", pm.proxyOAIHandler)
+	pm.ginEngine.POST("/v1/embeddings", mm, pm.proxyOAIHandler)
+	pm.ginEngine.POST("/v1/rerank", mm, pm.proxyOAIHandler)
+	pm.ginEngine.POST("/v1/reranking", mm, pm.proxyOAIHandler)
+	pm.ginEngine.POST("/rerank", mm, pm.proxyOAIHandler)
 	// Support audio/speech endpoint
 	pm.ginEngine.POST("/v1/audio/speech", pm.proxyOAIHandler)
@@ -188,6 +191,9 @@ func (pm *ProxyManager) setupGinEngine() {
 	pm.ginEngine.GET("/unload", pm.unloadAllModelsHandler)
 	pm.ginEngine.GET("/running", pm.listRunningProcessesHandler)
+	pm.ginEngine.GET("/health", func(c *gin.Context) {
+		c.String(http.StatusOK, "OK")
+	})
 	pm.ginEngine.GET("/favicon.ico", func(c *gin.Context) {
 		if data, err := reactStaticFS.ReadFile("ui_dist/favicon.ico"); err == nil {
@@ -365,9 +371,15 @@ func (pm *ProxyManager) proxyOAIHandler(c *gin.Context) {
 		return
 	}
-	realModelName := c.GetString("ls-real-model-name") // Should be set in MetricsMiddleware
-	if realModelName == "" {
-		pm.sendErrorResponse(c, http.StatusInternalServerError, "ls-real-model-name not set")
+	requestedModel := gjson.GetBytes(bodyBytes, "model").String()
+	if requestedModel == "" {
+		pm.sendErrorResponse(c, http.StatusBadRequest, "missing or invalid 'model' key")
+		return
+	}
+	realModelName, found := pm.config.RealModelName(requestedModel)
+	if !found {
+		pm.sendErrorResponse(c, http.StatusBadRequest, fmt.Sprintf("could not find real modelID for %s", requestedModel))
 		return
 	}
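After this change `proxyOAIHandler` no longer depends on a context value set by the middleware; it derives the real model name from the request body itself. A minimal sketch of that lookup, under stated assumptions: `resolveModel` and the `aliases` map are hypothetical stand-ins for `pm.config.RealModelName` (whose internals are not shown in this diff), and `encoding/json` replaces `gjson` so the example needs only the standard library. The model names are made up.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// errMissingModel mirrors the error message used in the diff.
var errMissingModel = errors.New("missing or invalid 'model' key")

// resolveModel extracts "model" from a JSON request body and maps it to a
// real model ID. The aliases map is a hypothetical stand-in for the lookup
// that pm.config.RealModelName performs.
func resolveModel(body []byte, aliases map[string]string) (string, error) {
	var payload struct {
		Model string `json:"model"`
	}
	if err := json.Unmarshal(body, &payload); err != nil || payload.Model == "" {
		return "", errMissingModel
	}
	realName, found := aliases[payload.Model]
	if !found {
		return "", fmt.Errorf("could not find real modelID for %s", payload.Model)
	}
	return realName, nil
}

func main() {
	// Hypothetical configuration: "m1" is an alias for "model1".
	aliases := map[string]string{"model1": "model1", "m1": "model1"}

	for _, body := range []string{`{"model":"m1"}`, `{"model":"unknown"}`, `{}`} {
		name, err := resolveModel([]byte(body), aliases)
		if err != nil {
			fmt.Println("400:", err)
			continue
		}
		fmt.Println("route to:", name)
	}
}
```

Both failure cases map to a client error, which matches the switch from `http.StatusInternalServerError` to `http.StatusBadRequest` in the hunk above.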
+18
@@ -755,3 +755,21 @@ func TestProxyManager_MiddlewareWritesMetrics_Streaming(t *testing.T) {
 	assert.Greater(t, lastMetric.TokensPerSecond, 0.0, "tokens per second should be greater than 0")
 	assert.Greater(t, lastMetric.DurationMs, 0, "duration should be greater than 0")
 }
+func TestProxyManager_HealthEndpoint(t *testing.T) {
+	config := AddDefaultGroupToConfig(Config{
+		HealthCheckTimeout: 15,
+		Models: map[string]ModelConfig{
+			"model1": getTestSimpleResponderConfig("model1"),
+		},
+		LogLevel: "error",
+	})
+	proxy := New(config)
+	defer proxy.StopProcesses(StopWaitForInflightRequest)
+	req := httptest.NewRequest("GET", "/health", nil)
+	rec := httptest.NewRecorder()
+	proxy.ServeHTTP(rec, req)
+	assert.Equal(t, http.StatusOK, rec.Code)
+	assert.Equal(t, "OK", rec.Body.String())
+}
+27 -25
@@ -6,7 +6,7 @@ const LogViewer = () => {
   const { proxyLogs, upstreamLogs } = useAPI();
   return (
-    <div className="flex flex-col gap-5" style={{ height: "calc(100vh - 125px)" }}>
+    <div className="flex flex-col lg:flex-row gap-5" style={{ height: "calc(100vh - 125px)" }}>
       <LogPanel id="proxy" title="Proxy Logs" logData={proxyLogs} />
       <LogPanel id="upstream" title="Upstream Logs" logData={upstreamLogs} />
     </div>
@@ -90,34 +90,36 @@ export const LogPanel = ({ id, title, logData, className }: LogPanelProps) => {
   <div className="flex flex-col md:flex-row md:items-center md:justify-between gap-4">
     {/* Title - Always full width on mobile, normal on desktop */}
     <div className="w-full md:w-auto" onClick={() => setIsCollapsed(!isCollapsed)}>
-      <h3 className="m-0 text-lg">{title}</h3>
+      <h3 className="m-0 text-lg p-0">{title}</h3>
     </div>
-    <div className="flex flex-col sm:flex-row gap-4 w-full md:w-auto">
-      {/* Sizing Buttons - Stacks vertically on mobile */}
-      <div className="flex flex-wrap gap-2">
-        <button className="btn" onClick={toggleFontSize}>
-          font: {fontSize}
-        </button>
-        <button className="btn" onClick={() => setTextWrap((prev) => !prev)}>
-          {wrapText ? "wrap" : "wrap off"}
-        </button>
-      </div>
-      {/* Filtering Options - Full width on mobile, normal on desktop */}
-      <div className="flex flex-1 min-w-0 gap-2">
-        <input
-          type="text"
-          className="flex-1 min-w-[120px] text-sm border p-2 rounded"
-          placeholder="Filter logs..."
-          value={filterRegex}
-          onChange={(e) => setFilterRegex(e.target.value)}
-        />
-        <button className="btn" onClick={() => setFilterRegex("")}>
-          Clear
-        </button>
+    {!isCollapsed && (
+      <div className="flex flex-col sm:flex-row gap-4 w-full md:w-auto">
+        {/* Sizing Buttons - Stacks vertically on mobile */}
+        <div className="flex flex-wrap gap-2">
+          <button className="btn" onClick={toggleFontSize}>
+            font: {fontSize}
+          </button>
+          <button className="btn" onClick={() => setTextWrap((prev) => !prev)}>
+            {wrapText ? "wrap" : "wrap off"}
+          </button>
+        </div>
+        {/* Filtering Options - Full width on mobile, normal on desktop */}
+        <div className="flex flex-1 min-w-0 gap-2">
+          <input
+            type="text"
+            className="flex-1 min-w-[120px] text-sm border p-2 rounded"
+            placeholder="Filter logs..."
+            value={filterRegex}
+            onChange={(e) => setFilterRegex(e.target.value)}
+          />
+          <button className="btn" onClick={() => setFilterRegex("")}>
+            Clear
+          </button>
+        </div>
       </div>
-    </div>
+    )}
   </div>
 </div>