Compare commits

12 Commits

Author SHA1 Message Date
Benson Wong d3f329f924 proxy: Improve logging performance and allow separate log streaming (#421)
Replace container/ring.Ring with a custom circularBuffer that uses a
single contiguous []byte slice. This fixes the original implementation
which created 10,240 ring elements instead of 10KB of storage.

GetHistory is now 139x faster (145μs → 1μs) and uses 117x less memory
(1.2MB → 10KB). Allocations reduced from 2 to 1 per write operation.

Create a LogMonitor per proxy.Process, replacing the usage
of a shared one. The buffer in LogMonitor is lazily allocated on the first
call to Write and freed when the Process is stopped. This reduces
unnecessary memory usage when a model is not active.

The /logs/stream/{model_id} endpoint was added to stream logs from a
specific process.
2025-12-18 21:49:25 -08:00
Benson Wong 98879b38c1 docker: add /app to $PATH (#424)
Make it so llama-server can be called directly instead of with the full
path at /app/llama-server.

Fixes #423
Ref: #233
2025-12-06 22:58:29 -08:00
Benson Wong 7b3b0f5eae move header images around [skip ci] 2025-12-02 19:40:42 -08:00
Benson Wong 021ccceef1 README: update hero image 2025-12-02 19:37:03 -08:00
Benson Wong f03871c50a Update README.md
- add supported anthropic API 
- add example for docker hot reload support
2025-12-02 19:03:01 -08:00
Ryan Steed dc00d17abe docs: add documentation for non-root container images and security considerations (#416)
* docs: add documentation for non-root container images and security considerations
* docs: move container security section to dedicated file and update README links
2025-12-02 08:52:26 -08:00
Benson Wong dea98733c3 proxy: extract metrics for v1/messages (#419) 2025-11-29 23:51:20 -08:00
Benson Wong bccce5fa19 go.mod,ui/package-lock.json: dependency and security updates (#418) 2025-11-29 22:27:22 -08:00
Benson Wong c968da1b73 proxy: add support for anthropic v1/messages api (#417)
* proxy: add support for anthropic v1/messages api
* proxy: restrict loading message to /v1/chat/completions
2025-11-29 22:09:07 -08:00
Ryan Steed a883d68d4f feat: Add support for custom llama.cpp base image and forked llama-swap repositories (#396)
* feat: Add support for custom llama.cpp base image and forked llama-swap repositories

- Introduce BASE_LLAMACPP_IMAGE env var to customize llama.cpp base image
- Introduce LS_REPO env var to customize llama-swap source
- Use GITHUB_REPOSITORY env var to automatically detect forked repos
- Update container tagging to use dynamic repo paths
- Pass build args for BASE_IMAGE and LS_REPO to Containerfile
- Enable flexible release downloads from forked repositories

* chore: quote entire curl options, appease coderabbitai
2025-11-29 20:59:15 -08:00
Ryan Steed b1dec8b735 docker: build both root and non-root container images (#412)
Change the user back to root for containers. Additionally, build a "non-root" labeled container for users who want the additional security of running llama-swap as a lower-privileged user.
2025-11-25 10:44:13 -08:00
Nikesh Parajuli 06523d8c1e feat: add platform-specific process attributes support (#411)
Fixes an issue on Windows where a new window was shown for every process llama-swap spawns.
2025-11-24 21:39:56 -08:00
20 changed files with 664 additions and 157 deletions
+27 -12
@@ -1,4 +1,4 @@
![llama-swap header image](header2.png) ![llama-swap header image](docs/assets/hero3.webp)
![GitHub Downloads (all assets, all releases)](https://img.shields.io/github/downloads/mostlygeek/llama-swap/total) ![GitHub Downloads (all assets, all releases)](https://img.shields.io/github/downloads/mostlygeek/llama-swap/total)
![GitHub Actions Workflow Status](https://img.shields.io/github/actions/workflow/status/mostlygeek/llama-swap/go-ci.yml) ![GitHub Actions Workflow Status](https://img.shields.io/github/actions/workflow/status/mostlygeek/llama-swap/go-ci.yml)
![GitHub Repo stars](https://img.shields.io/github/stars/mostlygeek/llama-swap) ![GitHub Repo stars](https://img.shields.io/github/stars/mostlygeek/llama-swap)
@@ -13,7 +13,7 @@ Built in Go for performance and simplicity, llama-swap has zero dependencies and
- ✅ Easy to deploy and configure: one binary, one configuration file. no external dependencies - ✅ Easy to deploy and configure: one binary, one configuration file. no external dependencies
- ✅ On-demand model switching - ✅ On-demand model switching
- ✅ Use any local OpenAI compatible server (llama.cpp, vllm, tabbyAPI, etc) - ✅ Use any local OpenAI compatible server (llama.cpp, vllm, tabbyAPI, etc.)
- future proof, upgrade your inference servers at any time. - future proof, upgrade your inference servers at any time.
- ✅ OpenAI API supported endpoints: - ✅ OpenAI API supported endpoints:
- `v1/completions` - `v1/completions`
@@ -21,6 +21,8 @@ Built in Go for performance and simplicity, llama-swap has zero dependencies and
- `v1/embeddings` - `v1/embeddings`
- `v1/audio/speech` ([#36](https://github.com/mostlygeek/llama-swap/issues/36)) - `v1/audio/speech` ([#36](https://github.com/mostlygeek/llama-swap/issues/36))
- `v1/audio/transcriptions` ([docs](https://github.com/mostlygeek/llama-swap/issues/41#issuecomment-2722637867)) - `v1/audio/transcriptions` ([docs](https://github.com/mostlygeek/llama-swap/issues/41#issuecomment-2722637867))
- ✅ Anthropic API supported endpoints:
- `v1/messages`
- ✅ llama-server (llama.cpp) supported endpoints - ✅ llama-server (llama.cpp) supported endpoints
- `v1/rerank`, `v1/reranking`, `/rerank` - `v1/rerank`, `v1/reranking`, `/rerank`
- `/infill` - for code infilling - `/infill` - for code infilling
@@ -44,7 +46,6 @@ llama-swap includes a real time web interface for monitoring logs and controllin
<img width="1164" height="745" alt="image" src="https://github.com/user-attachments/assets/bacf3f9d-819f-430b-9ed2-1bfaa8d54579" /> <img width="1164" height="745" alt="image" src="https://github.com/user-attachments/assets/bacf3f9d-819f-430b-9ed2-1bfaa8d54579" />
The Activity Page shows recent requests: The Activity Page shows recent requests:
<img width="1360" height="963" alt="image" src="https://github.com/user-attachments/assets/5f3edee6-d03a-4ae5-ae06-b20ac1f135bd" /> <img width="1360" height="963" alt="image" src="https://github.com/user-attachments/assets/5f3edee6-d03a-4ae5-ae06-b20ac1f135bd" />
@@ -61,7 +62,7 @@ llama-swap can be installed in multiple ways
### Docker Install ([download images](https://github.com/mostlygeek/llama-swap/pkgs/container/llama-swap)) ### Docker Install ([download images](https://github.com/mostlygeek/llama-swap/pkgs/container/llama-swap))
Nightly container images with llama-swap and llama-server are built for multiple platforms (cuda, vulkan, intel, etc). Nightly container images with llama-swap and llama-server are built for multiple platforms (cuda, vulkan, intel, etc.) including [non-root variants with improved security](docs/container-security.md).
```shell ```shell
$ docker pull ghcr.io/mostlygeek/llama-swap:cuda $ docker pull ghcr.io/mostlygeek/llama-swap:cuda
@@ -71,6 +72,14 @@ $ docker run -it --rm --runtime nvidia -p 9292:8080 \
-v /path/to/models:/models \ -v /path/to/models:/models \
-v /path/to/custom/config.yaml:/app/config.yaml \ -v /path/to/custom/config.yaml:/app/config.yaml \
ghcr.io/mostlygeek/llama-swap:cuda ghcr.io/mostlygeek/llama-swap:cuda
# configuration hot reload supported with a
# directory volume mount
$ docker run -it --rm --runtime nvidia -p 9292:8080 \
-v /path/to/models:/models \
-v /path/to/custom/config.yaml:/app/config.yaml \
-v /path/to/config:/config \
ghcr.io/mostlygeek/llama-swap:cuda -config /config/config.yaml -watch-config
``` ```
<details> <details>
@@ -89,6 +98,9 @@ docker pull ghcr.io/mostlygeek/llama-swap:musa
# tagged llama-swap, platform and llama-server version images # tagged llama-swap, platform and llama-server version images
docker pull ghcr.io/mostlygeek/llama-swap:v166-cuda-b6795 docker pull ghcr.io/mostlygeek/llama-swap:v166-cuda-b6795
# non-root cuda
docker pull ghcr.io/mostlygeek/llama-swap:cuda-non-root
``` ```
</details> </details>
@@ -191,23 +203,26 @@ As a safeguard, llama-swap also sets `X-Accel-Buffering: no` on SSE responses. H
## Monitoring Logs on the CLI ## Monitoring Logs on the CLI
```shell ```sh
# sends up to the last 10KB of logs # sends up to the last 10KB of logs
curl http://host/logs' $ curl http://host/logs
# streams combined logs # streams combined logs
curl -Ns 'http://host/logs/stream' curl -Ns http://host/logs/stream
# just llama-swap's logs # stream llama-swap's proxy status logs
curl -Ns 'http://host/logs/stream/proxy' curl -Ns http://host/logs/stream/proxy
# just upstream's logs # stream logs from upstream processes that llama-swap loads
curl -Ns 'http://host/logs/stream/upstream' curl -Ns http://host/logs/stream/upstream
# stream logs only from a specific model
curl -Ns http://host/logs/stream/{model_id}
# stream and filter logs with linux pipes # stream and filter logs with linux pipes
curl -Ns http://host/logs/stream | grep 'eval time' curl -Ns http://host/logs/stream | grep 'eval time'
# skips history and just streams new log entries # appending ?no-history will disable sending buffered history first
curl -Ns 'http://host/logs/stream?no-history' curl -Ns 'http://host/logs/stream?no-history'
``` ```
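The same per-model stream can also be consumed from a program. Below is a minimal Go sketch, assuming the endpoint returns newline-delimited plain text as the curl examples suggest. `streamLogs`, `fakeLogServer`, and the model ID `my-model` are hypothetical names for illustration, and the fake server only stands in for a running llama-swap instance.

```go
package main

import (
	"bufio"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// streamLogs requests {baseURL}/logs/stream/{modelID} and passes each
// received line to handle until the server closes the connection.
// Assumption: the stream is newline-delimited plain text.
func streamLogs(baseURL, modelID string, handle func(line string)) error {
	resp, err := http.Get(baseURL + "/logs/stream/" + modelID)
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("unexpected status: %s", resp.Status)
	}

	scanner := bufio.NewScanner(resp.Body)
	for scanner.Scan() {
		handle(scanner.Text())
	}
	return scanner.Err()
}

// fakeLogServer is a stand-in for llama-swap, used only to make this
// sketch runnable. It answers any /logs/stream/* path with the given lines.
func fakeLogServer(lines ...string) *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if !strings.HasPrefix(r.URL.Path, "/logs/stream/") {
			http.NotFound(w, r)
			return
		}
		for _, line := range lines {
			fmt.Fprintln(w, line)
		}
	}))
}

func main() {
	srv := fakeLogServer("model loaded", "eval time = 12 ms")
	defer srv.Close()

	// Against a real instance, baseURL would be something like "http://host".
	err := streamLogs(srv.URL, "my-model", func(line string) {
		fmt.Println(line)
	})
	if err != nil {
		fmt.Println("stream error:", err)
	}
}
```

A real log stream stays open until the client disconnects, so a longer-lived caller would probably build the request with a cancellable context instead of calling `http.Get` directly.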
@@ -0,0 +1,85 @@
# Replace ring.Ring with Efficient Circular Byte Buffer
## Overview
Replace the inefficient `container/ring.Ring` implementation in `logMonitor.go` with a simple circular byte buffer that uses a single contiguous `[]byte` slice. This eliminates per-write allocations, improves cache locality, and correctly implements a 10KB buffer.
## Current Issues
1. `ring.New(10 * 1024)` creates 10,240 ring **elements**, not 10KB of storage
2. Every `Write()` call allocates a new `[]byte` slice inside the lock
3. `GetHistory()` iterates all 10,240 elements and appends repeatedly (geometric reallocs)
4. Linked list structure has poor cache locality and pointer overhead
## Design Requirements
### New CircularBuffer Type
Create a simple circular byte buffer with:
- Single pre-allocated `[]byte` of fixed capacity (10KB)
- `head` and `size` integers to track write position and data length
- No per-write allocations
### API Requirements
The new buffer must support:
1. **Write(p []byte)** - Append bytes, overwriting oldest data when full
2. **GetHistory() []byte** - Return all buffered data in correct order (oldest to newest)
### Implementation Details
```go
type circularBuffer struct {
data []byte // pre-allocated capacity
head int // next write position
size int // current number of bytes stored (0 to cap)
}
```
**Write logic:**
- If `len(p) >= capacity`: just keep the last `capacity` bytes
- Otherwise: write bytes at `head`, wrapping around if needed
- Update `head` and `size` accordingly
- Data is copied into the internal buffer (not stored by reference)
**GetHistory logic:**
- Calculate start position: `(head - size + cap) % cap`
- If not wrapped: single slice copy
- If wrapped: two copies (end of buffer + beginning)
- Returns a **new slice** (copy), not a view into internal buffer
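The start-position arithmetic above can be checked with a small Go sketch. `span` and `historySpans` are hypothetical helper names introduced here, not part of the change. The sample state (capacity 10, `head` 5, `size` 10) is what a 10-byte buffer would hold after writing `hello`, `world`, and `12345` in that order.

```go
package main

import "fmt"

// span is a half-open range [from, to) into the backing slice.
type span struct{ from, to int }

// historySpans returns the one or two ranges that, read in order,
// yield the buffered bytes from oldest to newest.
func historySpans(head, size, capacity int) []span {
	if size == 0 {
		return nil
	}
	// Oldest byte sits size positions behind head, modulo capacity.
	start := (head - size + capacity) % capacity
	if start+size <= capacity {
		// Not wrapped: one contiguous range.
		return []span{{start, start + size}}
	}
	// Wrapped: tail of the slice first, then the beginning.
	return []span{{start, capacity}, {0, size - (capacity - start)}}
}

func main() {
	// Backing slice after the three writes: "12345" overwrote "hello".
	data := []byte("12345world")

	var history []byte
	for _, s := range historySpans(5, 10, len(data)) {
		history = append(history, data[s.from:s.to]...)
	}
	fmt.Println(string(history)) // world12345

	// A half-full, unwrapped buffer needs only one range.
	fmt.Println(historySpans(5, 5, 10)) // [{0 5}]
}
```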
### Immutability Guarantees (must preserve)
Per existing tests:
1. Modifying input `[]byte` after `Write()` must not affect stored data
2. `GetHistory()` returns independent copy - modifications don't affect buffer
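Both guarantees amount to copying on the way in and on the way out. The sketch below illustrates them with a deliberately simplified, hypothetical `copyingStore` type; it is not the buffer described in this plan, only a stand-in that shows what the existing tests check for.

```go
package main

import "fmt"

// copyingStore is a hypothetical stand-in for the log buffer.
type copyingStore struct{ data []byte }

// Write copies the bytes of p into the store, so later changes to the
// caller's slice are not visible.
func (s *copyingStore) Write(p []byte) {
	s.data = append(s.data, p...)
}

// GetHistory returns a fresh copy, so callers cannot modify stored data.
func (s *copyingStore) GetHistory() []byte {
	out := make([]byte, len(s.data))
	copy(out, s.data)
	return out
}

func main() {
	s := &copyingStore{}

	input := []byte("hello")
	s.Write(input)
	input[0] = 'J' // guarantee 1: must not affect stored data

	history := s.GetHistory()
	history[1] = 'E' // guarantee 2: must not affect stored data

	fmt.Println(string(s.GetHistory())) // hello
}
```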
## Files to Modify
- `proxy/logMonitor.go` - Replace `buffer *ring.Ring` with new circular buffer
## Testing Plan
Existing tests in `logMonitor_test.go` should continue to pass:
- `TestLogMonitor` - Basic write/read and subscriber notification
- `TestWrite_ImmutableBuffer` - Verify writes don't affect returned history
- `TestWrite_LogTimeFormat` - Timestamp formatting
Add new tests:
- Test buffer wrap-around behavior
- Test large writes that exceed buffer capacity
- Test exact capacity boundary conditions
## Checklist
- [ ] Create `circularBuffer` struct in `logMonitor.go`
- [ ] Implement `Write()` method for circular buffer
- [ ] Implement `GetHistory()` method for circular buffer
- [ ] Update `LogMonitor` struct to use new buffer
- [ ] Update `NewLogMonitorWriter()` to initialize new buffer
- [ ] Update `LogMonitor.Write()` to use new buffer
- [ ] Update `LogMonitor.GetHistory()` to use new buffer
- [ ] Remove `"container/ring"` import
- [ ] Run `make test-dev` to verify existing tests pass
- [ ] Add wrap-around test case
- [ ] Run `make test-all` for final validation
+31 -7
@@ -20,9 +20,17 @@ if [[ -z "$GITHUB_TOKEN" ]]; then
exit 1 exit 1
fi fi
# Set llama.cpp base image, customizable using the BASE_LLAMACPP_IMAGE environment
# variable, this permits testing with forked llama.cpp repositories
BASE_IMAGE=${BASE_LLAMACPP_IMAGE:-ghcr.io/ggml-org/llama.cpp}
# Set llama-swap repository, automatically uses GITHUB_REPOSITORY variable
# to enable easy container builds on forked repos
LS_REPO=${GITHUB_REPOSITORY:-mostlygeek/llama-swap}
# the most recent llama-swap tag # the most recent llama-swap tag
# have to strip out the 'v' due to .tar.gz file naming # have to strip out the 'v' due to .tar.gz file naming
LS_VER=$(curl -s https://api.github.com/repos/mostlygeek/llama-swap/releases/latest | jq -r .tag_name | sed 's/v//') LS_VER=$(curl -s https://api.github.com/repos/${LS_REPO}/releases/latest | jq -r .tag_name | sed 's/v//')
if [ "$ARCH" == "cpu" ]; then if [ "$ARCH" == "cpu" ]; then
# cpu only containers just use the server tag # cpu only containers just use the server tag
@@ -45,11 +53,27 @@ if [[ -z "$LCPP_TAG" ]]; then
exit 1 exit 1
fi fi
CONTAINER_TAG="ghcr.io/mostlygeek/llama-swap:v${LS_VER}-${ARCH}-${LCPP_TAG}" for CONTAINER_TYPE in non-root root; do
CONTAINER_LATEST="ghcr.io/mostlygeek/llama-swap:${ARCH}" CONTAINER_TAG="ghcr.io/${LS_REPO}:v${LS_VER}-${ARCH}-${LCPP_TAG}"
echo "Building ${CONTAINER_TAG} $LS_VER" CONTAINER_LATEST="ghcr.io/${LS_REPO}:${ARCH}"
docker build -f llama-swap.Containerfile --build-arg BASE_TAG=${BASE_TAG} --build-arg LS_VER=${LS_VER} -t ${CONTAINER_TAG} -t ${CONTAINER_LATEST} . USER_UID=0
if [ "$PUSH_IMAGES" == "true" ]; then USER_GID=0
USER_HOME=/root
if [ "$CONTAINER_TYPE" == "non-root" ]; then
CONTAINER_TAG="${CONTAINER_TAG}-non-root"
CONTAINER_LATEST="${CONTAINER_LATEST}-non-root"
USER_UID=10001
USER_GID=10001
USER_HOME=/app
fi
echo "Building $CONTAINER_TYPE $CONTAINER_TAG $LS_VER"
docker build -f llama-swap.Containerfile --build-arg BASE_TAG=${BASE_TAG} --build-arg LS_VER=${LS_VER} --build-arg UID=${USER_UID} \
--build-arg LS_REPO=${LS_REPO} --build-arg GID=${USER_GID} --build-arg USER_HOME=${USER_HOME} -t ${CONTAINER_TAG} -t ${CONTAINER_LATEST} \
--build-arg BASE_IMAGE=${BASE_IMAGE} .
if [ "$PUSH_IMAGES" == "true" ]; then
docker push ${CONTAINER_TAG} docker push ${CONTAINER_TAG}
docker push ${CONTAINER_LATEST} docker push ${CONTAINER_LATEST}
fi fi
done
+10 -4
@@ -1,8 +1,10 @@
ARG BASE_IMAGE=ghcr.io/ggml-org/llama.cpp
ARG BASE_TAG=server-cuda ARG BASE_TAG=server-cuda
FROM ghcr.io/ggml-org/llama.cpp:${BASE_TAG} FROM ${BASE_IMAGE}:${BASE_TAG}
# has to be after the FROM # has to be after the FROM
ARG LS_VER=170 ARG LS_VER=170
ARG LS_REPO=mostlygeek/llama-swap
# Set default UID/GID arguments # Set default UID/GID arguments
ARG UID=10001 ARG UID=10001
@@ -27,10 +29,14 @@ RUN chown --recursive $UID:$GID $HOME /app
USER $UID:$GID USER $UID:$GID
WORKDIR /app WORKDIR /app
# Add /app to PATH
ENV PATH="/app:${PATH}"
RUN \ RUN \
curl -LO https://github.com/mostlygeek/llama-swap/releases/download/v"${LS_VER}"/llama-swap_"${LS_VER}"_linux_amd64.tar.gz && \ curl -LO "https://github.com/${LS_REPO}/releases/download/v${LS_VER}/llama-swap_${LS_VER}_linux_amd64.tar.gz" && \
tar -zxf llama-swap_"${LS_VER}"_linux_amd64.tar.gz && \ tar -zxf "llama-swap_${LS_VER}_linux_amd64.tar.gz" && \
rm llama-swap_"${LS_VER}"_linux_amd64.tar.gz rm "llama-swap_${LS_VER}_linux_amd64.tar.gz"
COPY --chown=$UID:$GID config.example.yaml /app/config.yaml COPY --chown=$UID:$GID config.example.yaml /app/config.yaml

Binary files not shown. Image changes: one image at 261 KiB before and after, one image at 351 KiB before and after, and one image with only an "After" version at 198 KiB.

+9
@@ -0,0 +1,9 @@
## Container Security
For convenience, the default container images use the **root** user within the container. This permits simplified access to host resources including volume mounts and hardware devices under `/dev/dri` (_for Vulkan support_). However, this can widen the attack surface for privilege-escalation exploits.
Alternative images, tagged as `non-root`, are also available. For example, `llama-swap:cpu-non-root` uses the unprivileged **app** user by default. Depending on deployment requirements, additional configuration may be necessary to ensure that the container retains access to required host resources. This might entail customizing host filesystem permissions/ownership appropriately or injecting host group membership into the container.
Docker offers a [system-wide option enabling user namespace remapping](https://docs.docker.com/engine/security/userns-remap/) to accommodate situations where a **root** container user is required but also mentions that _"The best way to prevent privilege-escalation attacks from within a container is to configure your container's applications to run as unprivileged users."_ Podman offers a similar capability, per-container, to [set UID/GID mapping in a new user namespace](https://docs.podman.io/en/latest/markdown/podman-run.1.html#set-uid-gid-mapping-in-a-new-user-namespace).
The Large Language Model (_LLM/AI_) ecosystem is rapidly evolving and [serious security vulnerabilities have surfaced in the past](https://huggingface.co/docs/hub/security-pickle). These alternative _non-root_ images could reduce the impact of future unknown problems. However, proper planning and configuration are recommended before adopting them.
+5 -5
@@ -1,6 +1,6 @@
module github.com/mostlygeek/llama-swap module github.com/mostlygeek/llama-swap
go 1.23.0 go 1.25.4
require ( require (
github.com/billziss-gh/golib v0.2.0 github.com/billziss-gh/golib v0.2.0
@@ -37,9 +37,9 @@ require (
github.com/twitchyliquid64/golang-asm v0.15.1 // indirect github.com/twitchyliquid64/golang-asm v0.15.1 // indirect
github.com/ugorji/go/codec v1.2.12 // indirect github.com/ugorji/go/codec v1.2.12 // indirect
golang.org/x/arch v0.8.0 // indirect golang.org/x/arch v0.8.0 // indirect
golang.org/x/crypto v0.36.0 // indirect golang.org/x/crypto v0.45.0 // indirect
golang.org/x/net v0.38.0 // indirect golang.org/x/net v0.47.0 // indirect
golang.org/x/sys v0.31.0 // indirect golang.org/x/sys v0.38.0 // indirect
golang.org/x/text v0.23.0 // indirect golang.org/x/text v0.31.0 // indirect
google.golang.org/protobuf v1.34.1 // indirect google.golang.org/protobuf v1.34.1 // indirect
) )
+8 -8
@@ -80,16 +80,16 @@ github.com/ugorji/go/codec v1.2.12/go.mod h1:UNopzCgEMSXjBc6AOMqYvWC1ktqTAfzJZUZ
golang.org/x/arch v0.0.0-20210923205945-b76863e36670/go.mod h1:5om86z9Hs0C8fWVUuoMHwpExlXzs5Tkyp9hOrfG7pp8= golang.org/x/arch v0.0.0-20210923205945-b76863e36670/go.mod h1:5om86z9Hs0C8fWVUuoMHwpExlXzs5Tkyp9hOrfG7pp8=
golang.org/x/arch v0.8.0 h1:3wRIsP3pM4yUptoR96otTUOXI367OS0+c9eeRi9doIc= golang.org/x/arch v0.8.0 h1:3wRIsP3pM4yUptoR96otTUOXI367OS0+c9eeRi9doIc=
golang.org/x/arch v0.8.0/go.mod h1:FEVrYAQjsQXMVJ1nsMoVVXPZg6p2JE2mx8psSWTDQys= golang.org/x/arch v0.8.0/go.mod h1:FEVrYAQjsQXMVJ1nsMoVVXPZg6p2JE2mx8psSWTDQys=
golang.org/x/crypto v0.36.0 h1:AnAEvhDddvBdpY+uR+MyHmuZzzNqXSe/GvuDeob5L34= golang.org/x/crypto v0.45.0 h1:jMBrvKuj23MTlT0bQEOBcAE0mjg8mK9RXFhRH6nyF3Q=
golang.org/x/crypto v0.36.0/go.mod h1:Y4J0ReaxCR1IMaabaSMugxJES1EpwhBHhv2bDHklZvc= golang.org/x/crypto v0.45.0/go.mod h1:XTGrrkGJve7CYK7J8PEww4aY7gM3qMCElcJQ8n8JdX4=
golang.org/x/net v0.38.0 h1:vRMAPTMaeGqVhG5QyLJHqNDwecKTomGeqbnfZyKlBI8= golang.org/x/net v0.47.0 h1:Mx+4dIFzqraBXUugkia1OOvlD6LemFo1ALMHjrXDOhY=
golang.org/x/net v0.38.0/go.mod h1:ivrbrMbzFq5J41QOQh0siUuly180yBYtLp+CKbEaFx8= golang.org/x/net v0.47.0/go.mod h1:/jNxtkgq5yWUGYkaZGqo27cfGZ1c5Nen03aYrrKpVRU=
golang.org/x/sys v0.5.0/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg= golang.org/x/sys v0.5.0/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg=
golang.org/x/sys v0.6.0/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg= golang.org/x/sys v0.6.0/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg=
golang.org/x/sys v0.31.0 h1:ioabZlmFYtWhL+TRYpcnNlLwhyxaM9kWTDEmfnprqik= golang.org/x/sys v0.38.0 h1:3yZWxaJjBmCWXqhN1qh02AkOnCQ1poK6oF+a7xWL6Gc=
golang.org/x/sys v0.31.0/go.mod h1:BJP2sWEmIv4KK5OTEluFJCKSidICx8ciO85XgH3Ak8k= golang.org/x/sys v0.38.0/go.mod h1:OgkHotnGiDImocRcuBABYBEXf8A9a87e/uXjp9XT3ks=
golang.org/x/text v0.23.0 h1:D71I7dUrlY+VX0gQShAThNGHFxZ13dGLBHQLVl1mJlY= golang.org/x/text v0.31.0 h1:aC8ghyu4JhP8VojJ2lEHBnochRno1sgL6nEi9WGFGMM=
golang.org/x/text v0.23.0/go.mod h1:/BLNzu4aZCJ1+kcD0DNRotWKage4q2rGVAg4o22unh4= golang.org/x/text v0.31.0/go.mod h1:tKRAlv61yKIjGGHX/4tP1LTbc13YSec1pxVEWXzfoeM=
golang.org/x/xerrors v0.0.0-20191204190536-9bdfabe68543 h1:E7g+9GITq07hpfrRu66IVDexMakfv52eLZ2CXBWiKr4= golang.org/x/xerrors v0.0.0-20191204190536-9bdfabe68543 h1:E7g+9GITq07hpfrRu66IVDexMakfv52eLZ2CXBWiKr4=
golang.org/x/xerrors v0.0.0-20191204190536-9bdfabe68543/go.mod h1:I/5z698sn9Ka8TeJc9MKroUUfqBBauWjQqLJ2OPfmY0= golang.org/x/xerrors v0.0.0-20191204190536-9bdfabe68543/go.mod h1:I/5z698sn9Ka8TeJc9MKroUUfqBBauWjQqLJ2OPfmY0=
google.golang.org/protobuf v1.34.1 h1:9ddQBjfCyZPOHPUiPxpYESBLc+T8P3E+Vo4IbKZgFWg= google.golang.org/protobuf v1.34.1 h1:9ddQBjfCyZPOHPUiPxpYESBLc+T8P3E+Vo4IbKZgFWg=
+101 -16
@@ -1,7 +1,6 @@
package proxy package proxy
import ( import (
"container/ring"
"context" "context"
"fmt" "fmt"
"io" "io"
@@ -12,6 +11,85 @@ import (
"github.com/mostlygeek/llama-swap/event" "github.com/mostlygeek/llama-swap/event"
) )
// circularBuffer is a fixed-size circular byte buffer that overwrites
// oldest data when full. It provides O(1) writes and O(n) reads.
type circularBuffer struct {
data []byte // pre-allocated capacity
head int // next write position
size int // current number of bytes stored (0 to cap)
}
func newCircularBuffer(capacity int) *circularBuffer {
return &circularBuffer{
data: make([]byte, capacity),
head: 0,
size: 0,
}
}
// Write appends bytes to the buffer, overwriting oldest data when full.
// Data is copied into the internal buffer (not stored by reference).
func (cb *circularBuffer) Write(p []byte) {
if len(p) == 0 {
return
}
cap := len(cb.data)
// If input is larger than capacity, only keep the last cap bytes
if len(p) >= cap {
copy(cb.data, p[len(p)-cap:])
cb.head = 0
cb.size = cap
return
}
// Calculate how much space is available from head to end of buffer
firstPart := cap - cb.head
if firstPart >= len(p) {
// All data fits without wrapping
copy(cb.data[cb.head:], p)
cb.head = (cb.head + len(p)) % cap
} else {
// Data wraps around
copy(cb.data[cb.head:], p[:firstPart])
copy(cb.data[:len(p)-firstPart], p[firstPart:])
cb.head = len(p) - firstPart
}
// Update size
cb.size += len(p)
if cb.size > cap {
cb.size = cap
}
}
// GetHistory returns all buffered data in correct order (oldest to newest).
// Returns a new slice (copy), not a view into internal buffer.
func (cb *circularBuffer) GetHistory() []byte {
if cb.size == 0 {
return nil
}
result := make([]byte, cb.size)
cap := len(cb.data)
// Calculate start position (oldest data)
start := (cb.head - cb.size + cap) % cap
if start+cb.size <= cap {
// Data is contiguous, single copy
copy(result, cb.data[start:start+cb.size])
} else {
// Data wraps around, two copies
firstPart := cap - start
copy(result[:firstPart], cb.data[start:])
copy(result[firstPart:], cb.data[:cb.size-firstPart])
}
return result
}
type LogLevel int type LogLevel int
const ( const (
@@ -19,12 +97,14 @@ const (
LevelInfo LevelInfo
LevelWarn LevelWarn
LevelError LevelError
LogBufferSize = 100 * 1024
) )
type LogMonitor struct { type LogMonitor struct {
eventbus *event.Dispatcher eventbus *event.Dispatcher
mu sync.RWMutex mu sync.RWMutex
buffer *ring.Ring buffer *circularBuffer
bufferMu sync.RWMutex bufferMu sync.RWMutex
// typically this can be os.Stdout // typically this can be os.Stdout
@@ -45,7 +125,7 @@ func NewLogMonitor() *LogMonitor {
func NewLogMonitorWriter(stdout io.Writer) *LogMonitor { func NewLogMonitorWriter(stdout io.Writer) *LogMonitor {
return &LogMonitor{ return &LogMonitor{
eventbus: event.NewDispatcherConfig(1000), eventbus: event.NewDispatcherConfig(1000),
buffer: ring.New(10 * 1024), // keep 10KB of buffered logs buffer: nil, // lazy initialized on first Write
stdout: stdout, stdout: stdout,
level: LevelInfo, level: LevelInfo,
prefix: "", prefix: "",
@@ -64,12 +144,15 @@ func (w *LogMonitor) Write(p []byte) (n int, err error) {
} }
w.bufferMu.Lock() w.bufferMu.Lock()
bufferCopy := make([]byte, len(p)) if w.buffer == nil {
copy(bufferCopy, p) w.buffer = newCircularBuffer(LogBufferSize)
w.buffer.Value = bufferCopy }
w.buffer = w.buffer.Next() w.buffer.Write(p)
w.bufferMu.Unlock() w.bufferMu.Unlock()
// Make a copy for broadcast to preserve immutability
bufferCopy := make([]byte, len(p))
copy(bufferCopy, p)
w.broadcast(bufferCopy) w.broadcast(bufferCopy)
return n, nil return n, nil
} }
@@ -77,16 +160,18 @@ func (w *LogMonitor) Write(p []byte) (n int, err error) {
func (w *LogMonitor) GetHistory() []byte { func (w *LogMonitor) GetHistory() []byte {
w.bufferMu.RLock() w.bufferMu.RLock()
defer w.bufferMu.RUnlock() defer w.bufferMu.RUnlock()
if w.buffer == nil {
return nil
}
return w.buffer.GetHistory()
}
var history []byte // Clear releases the buffer memory, making it eligible for GC.
w.buffer.Do(func(p any) { // The buffer will be lazily re-allocated on the next Write.
if p != nil { func (w *LogMonitor) Clear() {
if content, ok := p.([]byte); ok { w.bufferMu.Lock()
history = append(history, content...) w.buffer = nil
} w.bufferMu.Unlock()
}
})
return history
} }
func (w *LogMonitor) OnLogData(callback func(data []byte)) context.CancelFunc { func (w *LogMonitor) OnLogData(callback func(data []byte)) context.CancelFunc {
+201
@@ -113,3 +113,204 @@ func TestWrite_LogTimeFormat(t *testing.T) {
t.Fatalf("Cannot find timestamp: %v", err) t.Fatalf("Cannot find timestamp: %v", err)
} }
} }
func TestCircularBuffer_WrapAround(t *testing.T) {
// Create a small buffer to test wrap-around
cb := newCircularBuffer(10)
// Write "hello" (5 bytes)
cb.Write([]byte("hello"))
if got := string(cb.GetHistory()); got != "hello" {
t.Errorf("Expected 'hello', got %q", got)
}
// Write "world" (5 bytes) - buffer now full
cb.Write([]byte("world"))
if got := string(cb.GetHistory()); got != "helloworld" {
t.Errorf("Expected 'helloworld', got %q", got)
}
// Write "12345" (5 bytes) - should overwrite "hello"
cb.Write([]byte("12345"))
if got := string(cb.GetHistory()); got != "world12345" {
t.Errorf("Expected 'world12345', got %q", got)
}
// Write data larger than buffer capacity
cb.Write([]byte("abcdefghijklmnop")) // 16 bytes, only last 10 kept
if got := string(cb.GetHistory()); got != "ghijklmnop" {
t.Errorf("Expected 'ghijklmnop', got %q", got)
}
}
func TestCircularBuffer_BoundaryConditions(t *testing.T) {
// Test empty buffer
cb := newCircularBuffer(10)
if got := cb.GetHistory(); got != nil {
t.Errorf("Expected nil for empty buffer, got %q", got)
}
// Test exact capacity
cb.Write([]byte("1234567890"))
if got := string(cb.GetHistory()); got != "1234567890" {
t.Errorf("Expected '1234567890', got %q", got)
}
// Test write exactly at capacity boundary
cb = newCircularBuffer(10)
cb.Write([]byte("12345"))
cb.Write([]byte("67890"))
if got := string(cb.GetHistory()); got != "1234567890" {
t.Errorf("Expected '1234567890', got %q", got)
}
}
func TestLogMonitor_LazyInit(t *testing.T) {
lm := NewLogMonitorWriter(io.Discard)
// Buffer should be nil before any writes
if lm.buffer != nil {
t.Error("Expected buffer to be nil before first write")
}
// GetHistory should return nil when buffer is nil
if got := lm.GetHistory(); got != nil {
t.Errorf("Expected nil history before first write, got %q", got)
}
// Write should lazily initialize the buffer
lm.Write([]byte("test"))
if lm.buffer == nil {
t.Error("Expected buffer to be initialized after write")
}
if got := string(lm.GetHistory()); got != "test" {
t.Errorf("Expected 'test', got %q", got)
}
}
func TestLogMonitor_Clear(t *testing.T) {
lm := NewLogMonitorWriter(io.Discard)
// Write some data
lm.Write([]byte("hello"))
if got := string(lm.GetHistory()); got != "hello" {
t.Errorf("Expected 'hello', got %q", got)
}
// Clear should release the buffer
lm.Clear()
if lm.buffer != nil {
t.Error("Expected buffer to be nil after Clear")
}
if got := lm.GetHistory(); got != nil {
t.Errorf("Expected nil history after Clear, got %q", got)
}
}
func TestLogMonitor_ClearAndReuse(t *testing.T) {
lm := NewLogMonitorWriter(io.Discard)
// Write, clear, then write again
lm.Write([]byte("first"))
lm.Clear()
lm.Write([]byte("second"))
if got := string(lm.GetHistory()); got != "second" {
t.Errorf("Expected 'second' after clear and reuse, got %q", got)
}
}
func BenchmarkLogMonitorWrite(b *testing.B) {
// Test data of varying sizes
smallMsg := []byte("small message\n")
mediumMsg := []byte(strings.Repeat("medium message content ", 10) + "\n")
largeMsg := []byte(strings.Repeat("large message content for benchmarking ", 100) + "\n")
b.Run("SmallWrite", func(b *testing.B) {
lm := NewLogMonitorWriter(io.Discard)
b.ResetTimer()
for i := 0; i < b.N; i++ {
lm.Write(smallMsg)
}
})
b.Run("MediumWrite", func(b *testing.B) {
lm := NewLogMonitorWriter(io.Discard)
b.ResetTimer()
for i := 0; i < b.N; i++ {
lm.Write(mediumMsg)
}
})
b.Run("LargeWrite", func(b *testing.B) {
lm := NewLogMonitorWriter(io.Discard)
b.ResetTimer()
for i := 0; i < b.N; i++ {
lm.Write(largeMsg)
}
})
b.Run("WithSubscribers", func(b *testing.B) {
lm := NewLogMonitorWriter(io.Discard)
// Add some subscribers
for i := 0; i < 5; i++ {
lm.OnLogData(func(data []byte) {})
}
b.ResetTimer()
for i := 0; i < b.N; i++ {
lm.Write(mediumMsg)
}
})
b.Run("GetHistory", func(b *testing.B) {
lm := NewLogMonitorWriter(io.Discard)
// Pre-populate with data
for i := 0; i < 1000; i++ {
lm.Write(mediumMsg)
}
b.ResetTimer()
for i := 0; i < b.N; i++ {
lm.GetHistory()
}
})
}
/*
Benchmark Results - MBP M1 Pro
Before (ring.Ring):
| Benchmark | ns/op | bytes/op | allocs/op |
|---------------------------------|------------|----------|-----------|
| SmallWrite (14B) | 43 ns | 40 B | 2 |
| MediumWrite (241B) | 76 ns | 264 B | 2 |
| LargeWrite (4KB) | 504 ns | 4,120 B | 2 |
| WithSubscribers (5 subs) | 355 ns | 264 B | 2 |
| GetHistory (after 1000 writes) | 145,000 ns | 1.2 MB | 22 |
After (circularBuffer 10KB):
| Benchmark | ns/op | bytes/op | allocs/op |
|---------------------------------|------------|----------|-----------|
| SmallWrite (14B) | 26 ns | 16 B | 1 |
| MediumWrite (241B) | 67 ns | 240 B | 1 |
| LargeWrite (4KB) | 774 ns | 4,096 B | 1 |
| WithSubscribers (5 subs) | 325 ns | 240 B | 1 |
| GetHistory (after 1000 writes) | 1,042 ns | 10,240 B | 1 |
After (circularBuffer 100KB):
| Benchmark | ns/op | bytes/op | allocs/op |
|---------------------------------|------------|-----------|-----------|
| SmallWrite (14B) | 26 ns | 16 B | 1 |
| MediumWrite (241B) | 66 ns | 240 B | 1 |
| LargeWrite (4KB) | 753 ns | 4,096 B | 1 |
| WithSubscribers (5 subs) | 309 ns | 240 B | 1 |
| GetHistory (after 1000 writes) | 7,788 ns | 106,496 B | 1 |
Summary:
- GetHistory: 139x faster (10KB), 18x faster (100KB)
- Allocations: reduced from 2 to 1 across all operations
- Small/medium writes: ~1.1-1.6x faster
*/
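The headline ratios in the summary can be re-derived from the benchmark tables. The following Go sketch does that arithmetic; `ratio` is a hypothetical helper, and the 1.2 MB figure is taken as 1,200,000 bytes, which is an assumption because the table only gives the rounded value.

```go
package main

import "fmt"

// ratio reports how many times larger before is than after.
func ratio(before, after float64) float64 {
	return before / after
}

func main() {
	// GetHistory time, ring.Ring vs. circularBuffer (ns/op from the tables).
	fmt.Printf("GetHistory 10KB:  %.2fx faster\n", ratio(145_000, 1_042))
	fmt.Printf("GetHistory 100KB: %.2fx faster\n", ratio(145_000, 7_788))

	// GetHistory memory, assuming 1.2 MB means 1,200,000 bytes.
	fmt.Printf("GetHistory 10KB:  %.2fx less memory\n", ratio(1_200_000, 10_240))

	// LargeWrite went from 504 ns to 774 ns, so this ratio is below 1.
	fmt.Printf("LargeWrite 10KB:  %.2fx\n", ratio(504, 774))
}
```

Truncating the first three results gives the 139x, 18x, and 117x figures quoted above. The last line shows that, per the same tables, the 4KB write case became slower, not faster.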
+40 -16
@@ -122,11 +122,18 @@ func (mp *metricsMonitor) wrapHandler(
} }
} else { } else {
if gjson.ValidBytes(body) { if gjson.ValidBytes(body) {
if tm, err := parseMetrics(modelID, recorder.StartTime(), gjson.ParseBytes(body)); err != nil { parsed := gjson.ParseBytes(body)
usage := parsed.Get("usage")
timings := parsed.Get("timings")
if usage.Exists() || timings.Exists() {
if tm, err := parseMetrics(modelID, recorder.StartTime(), usage, timings); err != nil {
mp.logger.Warnf("error parsing metrics: %v, path=%s", err, request.URL.Path) mp.logger.Warnf("error parsing metrics: %v, path=%s", err, request.URL.Path)
} else { } else {
mp.addMetrics(tm) mp.addMetrics(tm)
} }
}
} else { } else {
mp.logger.Warnf("metrics skipped, invalid JSON in response body path=%s", request.URL.Path) mp.logger.Warnf("metrics skipped, invalid JSON in response body path=%s", request.URL.Path)
} }
@@ -174,19 +181,20 @@ func processStreamingResponse(modelID string, start time.Time, body []byte) (Tok
} }
if gjson.ValidBytes(data) { if gjson.ValidBytes(data) {
return parseMetrics(modelID, start, gjson.ParseBytes(data)) parsed := gjson.ParseBytes(data)
usage := parsed.Get("usage")
timings := parsed.Get("timings")
if usage.Exists() || timings.Exists() {
return parseMetrics(modelID, start, usage, timings)
}
} }
} }
return TokenMetrics{}, fmt.Errorf("no valid JSON data found in stream") return TokenMetrics{}, fmt.Errorf("no valid JSON data found in stream")
} }
func parseMetrics(modelID string, start time.Time, jsonData gjson.Result) (TokenMetrics, error) { func parseMetrics(modelID string, start time.Time, usage, timings gjson.Result) (TokenMetrics, error) {
usage := jsonData.Get("usage")
timings := jsonData.Get("timings")
if !usage.Exists() && !timings.Exists() {
return TokenMetrics{}, fmt.Errorf("no usage or timings data found")
}
// default values // default values
cachedTokens := -1 // unknown or missing data cachedTokens := -1 // unknown or missing data
outputTokens := 0 outputTokens := 0
@@ -198,19 +206,35 @@ func parseMetrics(modelID string, start time.Time, jsonData gjson.Result) (Token
durationMs := int(time.Since(start).Milliseconds()) durationMs := int(time.Since(start).Milliseconds())
if usage.Exists() { if usage.Exists() {
outputTokens = int(jsonData.Get("usage.completion_tokens").Int()) if pt := usage.Get("prompt_tokens"); pt.Exists() {
inputTokens = int(jsonData.Get("usage.prompt_tokens").Int()) // v1/chat/completions
inputTokens = int(pt.Int())
} else if it := usage.Get("input_tokens"); it.Exists() {
// v1/messages
inputTokens = int(it.Int())
}
if ct := usage.Get("completion_tokens"); ct.Exists() {
// v1/chat/completions
outputTokens = int(ct.Int())
} else if ot := usage.Get("output_tokens"); ot.Exists() {
// v1/messages
outputTokens = int(ot.Int())
}
if ct := usage.Get("cache_read_input_tokens"); ct.Exists() {
cachedTokens = int(ct.Int())
}
} }
// use llama-server's timing data for tok/sec and duration as it is more accurate // use llama-server's timing data for tok/sec and duration as it is more accurate
if timings.Exists() { if timings.Exists() {
inputTokens = int(jsonData.Get("timings.prompt_n").Int()) inputTokens = int(timings.Get("prompt_n").Int())
outputTokens = int(jsonData.Get("timings.predicted_n").Int()) outputTokens = int(timings.Get("predicted_n").Int())
promptPerSecond = jsonData.Get("timings.prompt_per_second").Float() promptPerSecond = timings.Get("prompt_per_second").Float()
tokensPerSecond = jsonData.Get("timings.predicted_per_second").Float() tokensPerSecond = timings.Get("predicted_per_second").Float()
durationMs = int(jsonData.Get("timings.prompt_ms").Float() + jsonData.Get("timings.predicted_ms").Float()) durationMs = int(timings.Get("prompt_ms").Float() + timings.Get("predicted_ms").Float())
if cachedValue := jsonData.Get("timings.cache_n"); cachedValue.Exists() { if cachedValue := timings.Get("cache_n"); cachedValue.Exists() {
cachedTokens = int(cachedValue.Int()) cachedTokens = int(cachedValue.Int())
} }
} }
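The `usage` handling in this hunk accepts both the `v1/chat/completions` field names (`prompt_tokens`, `completion_tokens`) and the `v1/messages` field names (`input_tokens`, `output_tokens`, `cache_read_input_tokens`). The sketch below illustrates only that fallback order. It swaps `gjson` for the standard library's `encoding/json` so it is self-contained, it does not cover the `timings` override, and `tokenCounts`, `intField`, and `extractUsage` are hypothetical names that do not exist in the project.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// tokenCounts is a hypothetical result type for this sketch.
type tokenCounts struct {
	Input  int
	Output int
	Cached int // -1 means unknown or missing
}

// intField reads a numeric field from a decoded JSON object.
func intField(m map[string]any, key string) (int, bool) {
	v, ok := m[key].(float64) // encoding/json decodes numbers as float64
	if !ok {
		return 0, false
	}
	return int(v), true
}

// extractUsage returns false when the body has no usable "usage" object.
func extractUsage(body []byte) (tokenCounts, bool) {
	var resp struct {
		Usage map[string]any `json:"usage"`
	}
	if err := json.Unmarshal(body, &resp); err != nil || resp.Usage == nil {
		return tokenCounts{}, false
	}

	tc := tokenCounts{Cached: -1}
	if v, ok := intField(resp.Usage, "prompt_tokens"); ok {
		tc.Input = v // v1/chat/completions
	} else if v, ok := intField(resp.Usage, "input_tokens"); ok {
		tc.Input = v // v1/messages
	}
	if v, ok := intField(resp.Usage, "completion_tokens"); ok {
		tc.Output = v // v1/chat/completions
	} else if v, ok := intField(resp.Usage, "output_tokens"); ok {
		tc.Output = v // v1/messages
	}
	if v, ok := intField(resp.Usage, "cache_read_input_tokens"); ok {
		tc.Cached = v
	}
	return tc, true
}

func main() {
	bodies := []string{
		`{"usage":{"prompt_tokens":12,"completion_tokens":34}}`,
		`{"usage":{"input_tokens":5,"output_tokens":7,"cache_read_input_tokens":3}}`,
		`{"id":"no-usage"}`,
	}
	for _, b := range bodies {
		tc, ok := extractUsage([]byte(b))
		fmt.Printf("ok=%v counts=%+v\n", ok, tc)
	}
}
```

Checking for the field's existence before reading it, as the hunk does with `Exists()`, is what lets the second name act as a fallback instead of silently overwriting a valid count with zero.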
+14 -2
@@ -256,6 +256,7 @@ func (p *Process) start() error {
p.cmd.Env = append(p.cmd.Environ(), p.config.Env...) p.cmd.Env = append(p.cmd.Environ(), p.config.Env...)
p.cmd.Cancel = p.cmdStopUpstreamProcess p.cmd.Cancel = p.cmdStopUpstreamProcess
p.cmd.WaitDelay = p.gracefulStopTimeout p.cmd.WaitDelay = p.gracefulStopTimeout
setProcAttributes(p.cmd)
p.cmdMutex.Lock() p.cmdMutex.Lock()
p.cancelUpstream = ctxCancelUpstream p.cancelUpstream = ctxCancelUpstream
@@ -413,6 +414,9 @@ func (p *Process) stopCommand() {
stopStartTime := time.Now() stopStartTime := time.Now()
defer func() { defer func() {
p.proxyLogger.Debugf("<%s> stopCommand took %v", p.ID, time.Since(stopStartTime)) p.proxyLogger.Debugf("<%s> stopCommand took %v", p.ID, time.Since(stopStartTime))
// free the buffer in processLogger so the memory can be recovered
p.processLogger.Clear()
}() }()
p.cmdMutex.RLock() p.cmdMutex.RLock()
@@ -506,7 +510,10 @@ func (p *Process) ProxyRequest(w http.ResponseWriter, r *http.Request) {
// add a sync so the streaming client only runs when the goroutine has exited // add a sync so the streaming client only runs when the goroutine has exited
isStreaming, _ := r.Context().Value(proxyCtxKey("streaming")).(bool) isStreaming, _ := r.Context().Value(proxyCtxKey("streaming")).(bool)
if p.config.SendLoadingState != nil && *p.config.SendLoadingState && isStreaming {
// PR #417 (no support for anthropic v1/messages yet)
isChatCompletions := strings.HasPrefix(r.URL.Path, "/v1/chat/completions")
if p.config.SendLoadingState != nil && *p.config.SendLoadingState && isStreaming && isChatCompletions {
srw = newStatusResponseWriter(p, w) srw = newStatusResponseWriter(p, w)
go srw.statusUpdates(swapCtx) go srw.statusUpdates(swapCtx)
} else { } else {
@@ -625,6 +632,7 @@ func (p *Process) cmdStopUpstreamProcess() error {
stopCmd := exec.Command(stopArgs[0], stopArgs[1:]...) stopCmd := exec.Command(stopArgs[0], stopArgs[1:]...)
stopCmd.Stdout = p.processLogger stopCmd.Stdout = p.processLogger
stopCmd.Stderr = p.processLogger stopCmd.Stderr = p.processLogger
setProcAttributes(stopCmd)
stopCmd.Env = p.cmd.Env stopCmd.Env = p.cmd.Env
if err := stopCmd.Run(); err != nil { if err := stopCmd.Run(); err != nil {
@@ -641,6 +649,11 @@ func (p *Process) cmdStopUpstreamProcess() error {
return nil return nil
} }
// Logger returns the logger for this process.
func (p *Process) Logger() *LogMonitor {
return p.processLogger
}
var loadingRemarks = []string{ var loadingRemarks = []string{
"Still faster than your last standup meeting...", "Still faster than your last standup meeting...",
"Reticulating splines...", "Reticulating splines...",
@@ -859,7 +872,6 @@ func (s *statusResponseWriter) WriteHeader(statusCode int) {
s.Flush() s.Flush()
} }
// Add Flush method
func (s *statusResponseWriter) Flush() { func (s *statusResponseWriter) Flush() {
if flusher, ok := s.writer.(http.Flusher); ok { if flusher, ok := s.writer.(http.Flusher); ok {
flusher.Flush() flusher.Flush()
+12
@@ -0,0 +1,12 @@
//go:build !windows
package proxy
import (
"os/exec"
)
// setProcAttributes sets platform-specific process attributes
func setProcAttributes(cmd *exec.Cmd) {
// No-op on Unix systems
}
+16
@@ -0,0 +1,16 @@
//go:build windows
package proxy
import (
"os/exec"
"syscall"
)
// setProcAttributes sets platform-specific process attributes
func setProcAttributes(cmd *exec.Cmd) {
cmd.SysProcAttr = &syscall.SysProcAttr{
HideWindow: true,
CreationFlags: 0x08000000, // CREATE_NO_WINDOW
}
}
+9 -1
@@ -46,7 +46,8 @@ func NewProcessGroup(id string, config config.Config, proxyLogger *LogMonitor, u
// Create a Process for each member in the group // Create a Process for each member in the group
for _, modelID := range groupConfig.Members { for _, modelID := range groupConfig.Members {
modelConfig, modelID, _ := pg.config.FindConfig(modelID) modelConfig, modelID, _ := pg.config.FindConfig(modelID)
process := NewProcess(modelID, pg.config.HealthCheckTimeout, modelConfig, pg.upstreamLogger, pg.proxyLogger) processLogger := NewLogMonitorWriter(upstreamLogger)
process := NewProcess(modelID, pg.config.HealthCheckTimeout, modelConfig, processLogger, pg.proxyLogger)
pg.processes[modelID] = process pg.processes[modelID] = process
} }
@@ -88,6 +89,13 @@ func (pg *ProcessGroup) HasMember(modelName string) bool {
return slices.Contains(pg.config.Groups[pg.id].Members, modelName) return slices.Contains(pg.config.Groups[pg.id].Members, modelName)
} }
func (pg *ProcessGroup) GetMember(modelName string) (*Process, bool) {
if pg.HasMember(modelName) {
return pg.processes[modelName], true
}
return nil, false
}
func (pg *ProcessGroup) StopProcess(modelID string, strategy StopStrategy) error { func (pg *ProcessGroup) StopProcess(modelID string, strategy StopStrategy) error {
pg.Lock() pg.Lock()
+13 -11
@@ -236,27 +236,29 @@ func (pm *ProxyManager) setupGinEngine() {
}) })
// Set up routes using the Gin engine // Set up routes using the Gin engine
pm.ginEngine.POST("/v1/chat/completions", pm.proxyOAIHandler) pm.ginEngine.POST("/v1/chat/completions", pm.proxyInferenceHandler)
// Support legacy /v1/completions api, see issue #12 // Support legacy /v1/completions api, see issue #12
pm.ginEngine.POST("/v1/completions", pm.proxyOAIHandler) pm.ginEngine.POST("/v1/completions", pm.proxyInferenceHandler)
// Support anthropic /v1/messages (added https://github.com/ggml-org/llama.cpp/pull/17570)
pm.ginEngine.POST("/v1/messages", pm.proxyInferenceHandler)
// Support embeddings and reranking // Support embeddings and reranking
pm.ginEngine.POST("/v1/embeddings", pm.proxyOAIHandler) pm.ginEngine.POST("/v1/embeddings", pm.proxyInferenceHandler)
// llama-server's /reranking endpoint + aliases // llama-server's /reranking endpoint + aliases
pm.ginEngine.POST("/reranking", pm.proxyOAIHandler) pm.ginEngine.POST("/reranking", pm.proxyInferenceHandler)
pm.ginEngine.POST("/rerank", pm.proxyOAIHandler) pm.ginEngine.POST("/rerank", pm.proxyInferenceHandler)
pm.ginEngine.POST("/v1/rerank", pm.proxyOAIHandler) pm.ginEngine.POST("/v1/rerank", pm.proxyInferenceHandler)
pm.ginEngine.POST("/v1/reranking", pm.proxyOAIHandler) pm.ginEngine.POST("/v1/reranking", pm.proxyInferenceHandler)
// llama-server's /infill endpoint for code infilling // llama-server's /infill endpoint for code infilling
pm.ginEngine.POST("/infill", pm.proxyOAIHandler) pm.ginEngine.POST("/infill", pm.proxyInferenceHandler)
// llama-server's /completion endpoint // llama-server's /completion endpoint
pm.ginEngine.POST("/completion", pm.proxyOAIHandler) pm.ginEngine.POST("/completion", pm.proxyInferenceHandler)
// Support audio/speech endpoint // Support audio/speech endpoint
pm.ginEngine.POST("/v1/audio/speech", pm.proxyOAIHandler) pm.ginEngine.POST("/v1/audio/speech", pm.proxyInferenceHandler)
pm.ginEngine.POST("/v1/audio/transcriptions", pm.proxyOAIPostFormHandler) pm.ginEngine.POST("/v1/audio/transcriptions", pm.proxyOAIPostFormHandler)
pm.ginEngine.GET("/v1/models", pm.listModelsHandler) pm.ginEngine.GET("/v1/models", pm.listModelsHandler)
@@ -545,7 +547,7 @@ func (pm *ProxyManager) proxyToUpstream(c *gin.Context) {
} }
} }
func (pm *ProxyManager) proxyOAIHandler(c *gin.Context) { func (pm *ProxyManager) proxyInferenceHandler(c *gin.Context) {
bodyBytes, err := io.ReadAll(c.Request.Body) bodyBytes, err := io.ReadAll(c.Request.Body)
if err != nil { if err != nil {
pm.sendErrorResponse(c, http.StatusBadRequest, "could not read request body") pm.sendErrorResponse(c, http.StatusBadRequest, "could not read request body")
+17 -11
@@ -83,18 +83,24 @@ func (pm *ProxyManager) streamLogsHandler(c *gin.Context) {
// getLogger searches for the appropriate logger based on the logMonitorId // getLogger searches for the appropriate logger based on the logMonitorId
func (pm *ProxyManager) getLogger(logMonitorId string) (*LogMonitor, error) { func (pm *ProxyManager) getLogger(logMonitorId string) (*LogMonitor, error) {
var logger *LogMonitor switch logMonitorId {
case "":
if logMonitorId == "" {
// maintain the default // maintain the default
logger = pm.muxLogger return pm.muxLogger, nil
} else if logMonitorId == "proxy" { case "proxy":
logger = pm.proxyLogger return pm.proxyLogger, nil
} else if logMonitorId == "upstream" { case "upstream":
logger = pm.upstreamLogger return pm.upstreamLogger, nil
} else { default:
return nil, fmt.Errorf("invalid logger. Use 'proxy' or 'upstream'") // search for a model-specific logger
if name, found := pm.config.RealModelName(logMonitorId); found {
for _, group := range pm.processGroups {
if process, found := group.GetMember(name); found {
return process.Logger(), nil
}
}
} }
return logger, nil return nil, fmt.Errorf("invalid logger. Use 'proxy' or 'upstream'")
}
} }
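The rewritten `getLogger` resolves the fixed IDs first and then falls back to a per-model logger, which is what allows `/logs/stream/{model_id}` to stream a single process's logs. The sketch below mirrors that lookup order with simplified stand-in types. `logMonitor`, `manager`, `realModelName`, and `newDemoManager` are hypothetical and only approximate `LogMonitor`, `ProxyManager`, `RealModelName`, and the process groups; the error text is also this sketch's own wording.

```go
package main

import (
	"errors"
	"fmt"
)

// logMonitor stands in for the proxy's LogMonitor type.
type logMonitor struct{ name string }

// manager is a hypothetical, trimmed-down stand-in for ProxyManager.
type manager struct {
	mux, proxy, upstream *logMonitor
	aliases              map[string]string        // alias -> real model name
	groups               []map[string]*logMonitor // per group: real model name -> logger
}

// realModelName resolves a model ID or alias to the configured model name.
func (m *manager) realModelName(id string) (string, bool) {
	for _, g := range m.groups {
		if _, ok := g[id]; ok {
			return id, true
		}
	}
	name, ok := m.aliases[id]
	return name, ok
}

// getLogger checks the fixed IDs first, then searches the groups for a
// logger that belongs to one model.
func (m *manager) getLogger(id string) (*logMonitor, error) {
	switch id {
	case "":
		return m.mux, nil // keep the combined log as the default
	case "proxy":
		return m.proxy, nil
	case "upstream":
		return m.upstream, nil
	}
	if name, ok := m.realModelName(id); ok {
		for _, g := range m.groups {
			if l, ok := g[name]; ok {
				return l, nil
			}
		}
	}
	return nil, errors.New("invalid logger: use 'proxy', 'upstream', or a model ID")
}

func newDemoManager() *manager {
	return &manager{
		mux:      &logMonitor{name: "mux"},
		proxy:    &logMonitor{name: "proxy"},
		upstream: &logMonitor{name: "upstream"},
		aliases:  map[string]string{"llama-alias": "llama"},
		groups: []map[string]*logMonitor{
			{"llama": {name: "llama"}},
			{"qwen": {name: "qwen"}},
		},
	}
}

func main() {
	m := newDemoManager()
	for _, id := range []string{"", "proxy", "llama", "llama-alias", "nope"} {
		l, err := m.getLogger(id)
		if err != nil {
			fmt.Printf("%q -> error: %v\n", id, err)
			continue
		}
		fmt.Printf("%q -> %s\n", id, l.name)
	}
}
```

Returning from each `case` directly, as the diff does, removes the intermediate `logger` variable and leaves the error return as the single fall-through path.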
+61 -59
@@ -752,9 +752,9 @@
} }
}, },
"node_modules/@eslint-community/eslint-utils": { "node_modules/@eslint-community/eslint-utils": {
"version": "4.7.0", "version": "4.9.0",
"resolved": "https://registry.npmjs.org/@eslint-community/eslint-utils/-/eslint-utils-4.7.0.tgz", "resolved": "https://registry.npmjs.org/@eslint-community/eslint-utils/-/eslint-utils-4.9.0.tgz",
"integrity": "sha512-dyybb3AcajC7uha6CvhdVRJqaKyn7w2YKqKyAN37NKYgZT36w+iRb0Dymmc5qEJ549c/S31cMMSFd75bteCpCw==", "integrity": "sha512-ayVFHdtZ+hsq1t2Dy24wCmGXGe4q9Gu3smhLYALJrr473ZH27MsnSL+LKUlimp4BWJqMDMLmPpx/Q9R3OAlL4g==",
"dev": true, "dev": true,
"license": "MIT", "license": "MIT",
"dependencies": { "dependencies": {
@@ -794,13 +794,13 @@
} }
}, },
"node_modules/@eslint/config-array": { "node_modules/@eslint/config-array": {
"version": "0.20.0", "version": "0.21.1",
"resolved": "https://registry.npmjs.org/@eslint/config-array/-/config-array-0.20.0.tgz", "resolved": "https://registry.npmjs.org/@eslint/config-array/-/config-array-0.21.1.tgz",
"integrity": "sha512-fxlS1kkIjx8+vy2SjuCB94q3htSNrufYTXubwiBFeaQHbH6Ipi43gFJq2zCMt6PHhImH3Xmr0NksKDvchWlpQQ==", "integrity": "sha512-aw1gNayWpdI/jSYVgzN5pL0cfzU02GT3NBpeT/DXbx1/1x7ZKxFPd9bwrzygx/qiwIQiJ1sw/zD8qY/kRvlGHA==",
"dev": true, "dev": true,
"license": "Apache-2.0", "license": "Apache-2.0",
"dependencies": { "dependencies": {
"@eslint/object-schema": "^2.1.6", "@eslint/object-schema": "^2.1.7",
"debug": "^4.3.1", "debug": "^4.3.1",
"minimatch": "^3.1.2" "minimatch": "^3.1.2"
}, },
@@ -809,19 +809,22 @@
} }
}, },
"node_modules/@eslint/config-helpers": { "node_modules/@eslint/config-helpers": {
"version": "0.2.2", "version": "0.4.2",
"resolved": "https://registry.npmjs.org/@eslint/config-helpers/-/config-helpers-0.2.2.tgz", "resolved": "https://registry.npmjs.org/@eslint/config-helpers/-/config-helpers-0.4.2.tgz",
"integrity": "sha512-+GPzk8PlG0sPpzdU5ZvIRMPidzAnZDl/s9L+y13iodqvb8leL53bTannOrQ/Im7UkpsmFU5Ily5U60LWixnmLg==", "integrity": "sha512-gBrxN88gOIf3R7ja5K9slwNayVcZgK6SOUORm2uBzTeIEfeVaIhOpCtTox3P6R7o2jLFwLFTLnC7kU/RGcYEgw==",
"dev": true, "dev": true,
"license": "Apache-2.0", "license": "Apache-2.0",
"dependencies": {
"@eslint/core": "^0.17.0"
},
"engines": { "engines": {
"node": "^18.18.0 || ^20.9.0 || >=21.1.0" "node": "^18.18.0 || ^20.9.0 || >=21.1.0"
} }
}, },
"node_modules/@eslint/core": { "node_modules/@eslint/core": {
"version": "0.14.0", "version": "0.17.0",
"resolved": "https://registry.npmjs.org/@eslint/core/-/core-0.14.0.tgz", "resolved": "https://registry.npmjs.org/@eslint/core/-/core-0.17.0.tgz",
"integrity": "sha512-qIbV0/JZr7iSDjqAc60IqbLdsj9GDt16xQtWD+B78d/HAlvysGdZZ6rpJHGAc2T0FQx1X6thsSPdnoiGKdNtdg==", "integrity": "sha512-yL/sLrpmtDaFEiUj1osRP4TI2MDz1AddJL+jZ7KSqvBuliN4xqYY54IfdN8qD8Toa6g1iloph1fxQNkjOxrrpQ==",
"dev": true, "dev": true,
"license": "Apache-2.0", "license": "Apache-2.0",
"dependencies": { "dependencies": {
@@ -869,9 +872,9 @@
} }
}, },
"node_modules/@eslint/js": { "node_modules/@eslint/js": {
"version": "9.28.0", "version": "9.39.1",
"resolved": "https://registry.npmjs.org/@eslint/js/-/js-9.28.0.tgz", "resolved": "https://registry.npmjs.org/@eslint/js/-/js-9.39.1.tgz",
"integrity": "sha512-fnqSjGWd/CoIp4EXIxWVK/sHA6DOHN4+8Ix2cX5ycOY7LG0UY8nHCU5pIp2eaE1Mc7Qd8kHspYNzYXT2ojPLzg==", "integrity": "sha512-S26Stp4zCy88tH94QbBv3XCuzRQiZ9yXofEILmglYTh/Ug/a9/umqvgFtYBAo3Lp0nsI/5/qH1CCrbdK3AP1Tw==",
"dev": true, "dev": true,
"license": "MIT", "license": "MIT",
"engines": { "engines": {
@@ -882,9 +885,9 @@
} }
}, },
"node_modules/@eslint/object-schema": { "node_modules/@eslint/object-schema": {
"version": "2.1.6", "version": "2.1.7",
"resolved": "https://registry.npmjs.org/@eslint/object-schema/-/object-schema-2.1.6.tgz", "resolved": "https://registry.npmjs.org/@eslint/object-schema/-/object-schema-2.1.7.tgz",
"integrity": "sha512-RBMg5FRL0I0gs51M/guSAj5/e14VQ4tpZnQNWwuDT66P14I43ItmPfIZRhO9fUVIPOAQXU47atlywZ/czoqFPA==", "integrity": "sha512-VtAOaymWVfZcmZbp6E2mympDIHvyjXs/12LqWYjVw6qjrfF+VK+fyG33kChz3nnK+SU5/NeHOqrTEHS8sXO3OA==",
"dev": true, "dev": true,
"license": "Apache-2.0", "license": "Apache-2.0",
"engines": { "engines": {
@@ -892,13 +895,13 @@
} }
}, },
"node_modules/@eslint/plugin-kit": { "node_modules/@eslint/plugin-kit": {
"version": "0.3.1", "version": "0.4.1",
"resolved": "https://registry.npmjs.org/@eslint/plugin-kit/-/plugin-kit-0.3.1.tgz", "resolved": "https://registry.npmjs.org/@eslint/plugin-kit/-/plugin-kit-0.4.1.tgz",
"integrity": "sha512-0J+zgWxHN+xXONWIyPWKFMgVuJoZuGiIFu8yxk7RJjxkzpGmyja5wRFqZIVtjDVOQpV+Rw0iOAjYPE2eQyjr0w==", "integrity": "sha512-43/qtrDUokr7LJqoF2c3+RInu/t4zfrpYdoSDfYyhg52rwLV6TnOvdG4fXm7IkSB3wErkcmJS9iEhjVtOSEjjA==",
"dev": true, "dev": true,
"license": "Apache-2.0", "license": "Apache-2.0",
"dependencies": { "dependencies": {
"@eslint/core": "^0.14.0", "@eslint/core": "^0.17.0",
"levn": "^0.4.1" "levn": "^0.4.1"
}, },
"engines": { "engines": {
@@ -1908,9 +1911,9 @@
} }
}, },
"node_modules/@typescript-eslint/typescript-estree/node_modules/brace-expansion": { "node_modules/@typescript-eslint/typescript-estree/node_modules/brace-expansion": {
"version": "2.0.1", "version": "2.0.2",
"resolved": "https://registry.npmjs.org/brace-expansion/-/brace-expansion-2.0.1.tgz", "resolved": "https://registry.npmjs.org/brace-expansion/-/brace-expansion-2.0.2.tgz",
"integrity": "sha512-XnAIvQ8eM+kC6aULx6wuQiwVsnzsi9d3WxzV3FpWTGA19F621kwdbsAcFKXgKUHZWsy+mY6iL1sHTxWEFCytDA==", "integrity": "sha512-Jt0vHyM+jmUBqojB7E1NIYadt0vI0Qxjxd2TErW94wDz+E2LAm5vKMXXwg6ZZBTHPuUlDgQHKXvjGBdfcF1ZDQ==",
"dev": true, "dev": true,
"license": "MIT", "license": "MIT",
"dependencies": { "dependencies": {
@@ -2010,9 +2013,9 @@
} }
}, },
"node_modules/acorn": { "node_modules/acorn": {
"version": "8.14.1", "version": "8.15.0",
"resolved": "https://registry.npmjs.org/acorn/-/acorn-8.14.1.tgz", "resolved": "https://registry.npmjs.org/acorn/-/acorn-8.15.0.tgz",
"integrity": "sha512-OvQ/2pUDKmgfCg++xsTX1wGxfTaszcHVcTctW4UJB4hibJx2HXxxO5UmVgyjMa+ZDsiaf5wWLXYpRWMmBI0QHg==", "integrity": "sha512-NZyJarBfL7nWwIq+FDL6Zp/yHEhePMNnnJ0y3qfieCrmNvYct8uvtiV41UvlSe6apAfk0fY1FbWx+NwfmpvtTg==",
"dev": true, "dev": true,
"license": "MIT", "license": "MIT",
"bin": { "bin": {
@@ -2080,9 +2083,9 @@
"license": "MIT" "license": "MIT"
}, },
"node_modules/brace-expansion": { "node_modules/brace-expansion": {
"version": "1.1.11", "version": "1.1.12",
"resolved": "https://registry.npmjs.org/brace-expansion/-/brace-expansion-1.1.11.tgz", "resolved": "https://registry.npmjs.org/brace-expansion/-/brace-expansion-1.1.12.tgz",
"integrity": "sha512-iCuPHDFgrHX7H2vEI/5xpz07zSHB00TpugqhmYtVmMO6518mCuRMoOYFldEBl0g187ufozdaHgWKcYFb61qGiA==", "integrity": "sha512-9T9UjW3r0UW5c1Q7GTwllptXwhvYmEzFhzMfZ9H7FQWt+uZePjZPjBP/W1ZEyZ1twGWom5/56TF4lPcqjnDHcg==",
"dev": true, "dev": true,
"license": "MIT", "license": "MIT",
"dependencies": { "dependencies": {
@@ -2380,33 +2383,32 @@
} }
}, },
"node_modules/eslint": { "node_modules/eslint": {
"version": "9.28.0", "version": "9.39.1",
"resolved": "https://registry.npmjs.org/eslint/-/eslint-9.28.0.tgz", "resolved": "https://registry.npmjs.org/eslint/-/eslint-9.39.1.tgz",
"integrity": "sha512-ocgh41VhRlf9+fVpe7QKzwLj9c92fDiqOj8Y3Sd4/ZmVA4Btx4PlUYPq4pp9JDyupkf1upbEXecxL2mwNV7jPQ==", "integrity": "sha512-BhHmn2yNOFA9H9JmmIVKJmd288g9hrVRDkdoIgRCRuSySRUHH7r/DI6aAXW9T1WwUuY3DFgrcaqB+deURBLR5g==",
"dev": true, "dev": true,
"license": "MIT", "license": "MIT",
"dependencies": { "dependencies": {
"@eslint-community/eslint-utils": "^4.2.0", "@eslint-community/eslint-utils": "^4.8.0",
"@eslint-community/regexpp": "^4.12.1", "@eslint-community/regexpp": "^4.12.1",
"@eslint/config-array": "^0.20.0", "@eslint/config-array": "^0.21.1",
"@eslint/config-helpers": "^0.2.1", "@eslint/config-helpers": "^0.4.2",
"@eslint/core": "^0.14.0", "@eslint/core": "^0.17.0",
"@eslint/eslintrc": "^3.3.1", "@eslint/eslintrc": "^3.3.1",
"@eslint/js": "9.28.0", "@eslint/js": "9.39.1",
"@eslint/plugin-kit": "^0.3.1", "@eslint/plugin-kit": "^0.4.1",
"@humanfs/node": "^0.16.6", "@humanfs/node": "^0.16.6",
"@humanwhocodes/module-importer": "^1.0.1", "@humanwhocodes/module-importer": "^1.0.1",
"@humanwhocodes/retry": "^0.4.2", "@humanwhocodes/retry": "^0.4.2",
"@types/estree": "^1.0.6", "@types/estree": "^1.0.6",
"@types/json-schema": "^7.0.15",
"ajv": "^6.12.4", "ajv": "^6.12.4",
"chalk": "^4.0.0", "chalk": "^4.0.0",
"cross-spawn": "^7.0.6", "cross-spawn": "^7.0.6",
"debug": "^4.3.2", "debug": "^4.3.2",
"escape-string-regexp": "^4.0.0", "escape-string-regexp": "^4.0.0",
"eslint-scope": "^8.3.0", "eslint-scope": "^8.4.0",
"eslint-visitor-keys": "^4.2.0", "eslint-visitor-keys": "^4.2.1",
"espree": "^10.3.0", "espree": "^10.4.0",
"esquery": "^1.5.0", "esquery": "^1.5.0",
"esutils": "^2.0.2", "esutils": "^2.0.2",
"fast-deep-equal": "^3.1.3", "fast-deep-equal": "^3.1.3",
@@ -2464,9 +2466,9 @@
} }
}, },
"node_modules/eslint-scope": { "node_modules/eslint-scope": {
"version": "8.3.0", "version": "8.4.0",
"resolved": "https://registry.npmjs.org/eslint-scope/-/eslint-scope-8.3.0.tgz", "resolved": "https://registry.npmjs.org/eslint-scope/-/eslint-scope-8.4.0.tgz",
"integrity": "sha512-pUNxi75F8MJ/GdeKtVLSbYg4ZI34J6C0C7sbL4YOp2exGwen7ZsuBqKzUhXd0qMQ362yET3z+uPwKeg/0C2XCQ==", "integrity": "sha512-sNXOfKCn74rt8RICKMvJS7XKV/Xk9kA7DyJr8mJik3S7Cwgy3qlkkmyS2uQB3jiJg6VNdZd/pDBJu0nvG2NlTg==",
"dev": true, "dev": true,
"license": "BSD-2-Clause", "license": "BSD-2-Clause",
"dependencies": { "dependencies": {
@@ -2481,9 +2483,9 @@
} }
}, },
"node_modules/eslint-visitor-keys": { "node_modules/eslint-visitor-keys": {
"version": "4.2.0", "version": "4.2.1",
"resolved": "https://registry.npmjs.org/eslint-visitor-keys/-/eslint-visitor-keys-4.2.0.tgz", "resolved": "https://registry.npmjs.org/eslint-visitor-keys/-/eslint-visitor-keys-4.2.1.tgz",
"integrity": "sha512-UyLnSehNt62FFhSwjZlHmeokpRK59rcz29j+F1/aDgbkbRTk7wIc9XzdoasMUbRNKDM0qQt/+BJ4BrpFeABemw==", "integrity": "sha512-Uhdk5sfqcee/9H/rCOJikYz67o0a2Tw2hGRPOG2Y1R2dg7brRe1uG0yaNQDHu+TO/uQPF/5eCapvYSmHUjt7JQ==",
"dev": true, "dev": true,
"license": "Apache-2.0", "license": "Apache-2.0",
"engines": { "engines": {
@@ -2494,15 +2496,15 @@
} }
}, },
"node_modules/espree": { "node_modules/espree": {
"version": "10.3.0", "version": "10.4.0",
"resolved": "https://registry.npmjs.org/espree/-/espree-10.3.0.tgz", "resolved": "https://registry.npmjs.org/espree/-/espree-10.4.0.tgz",
"integrity": "sha512-0QYC8b24HWY8zjRnDTL6RiHfDbAWn63qb4LMj1Z4b076A4une81+z03Kg7l7mn/48PUTqoLptSXez8oknU8Clg==", "integrity": "sha512-j6PAQ2uUr79PZhBjP5C5fhl8e39FmRnOjsD5lGnWrFU8i2G776tBK7+nP8KuQUTTyAZUwfQqXAgrVH5MbH9CYQ==",
"dev": true, "dev": true,
"license": "BSD-2-Clause", "license": "BSD-2-Clause",
"dependencies": { "dependencies": {
"acorn": "^8.14.0", "acorn": "^8.15.0",
"acorn-jsx": "^5.3.2", "acorn-jsx": "^5.3.2",
"eslint-visitor-keys": "^4.2.0" "eslint-visitor-keys": "^4.2.1"
}, },
"engines": { "engines": {
"node": "^18.18.0 || ^20.9.0 || >=21.1.0" "node": "^18.18.0 || ^20.9.0 || >=21.1.0"
@@ -2852,9 +2854,9 @@
"license": "MIT" "license": "MIT"
}, },
"node_modules/js-yaml": { "node_modules/js-yaml": {
"version": "4.1.0", "version": "4.1.1",
"resolved": "https://registry.npmjs.org/js-yaml/-/js-yaml-4.1.0.tgz", "resolved": "https://registry.npmjs.org/js-yaml/-/js-yaml-4.1.1.tgz",
"integrity": "sha512-wpxZs9NoxZaJESJGIZTyDEaYpl0FKSA+FB9aJiyemKhMwkxQg63h4T1KJgUGHpTqPDNRcmmYLugrRjJlBtWvRA==", "integrity": "sha512-qQKT4zQxXl8lLwBtHMWwaTcGfFOZviOJet3Oy/xmGk2gZH677CJM9EvtfdSkgWcATZhj/55JZ0rmy3myCT5lsA==",
"dev": true, "dev": true,
"license": "MIT", "license": "MIT",
"dependencies": { "dependencies": {