llama-swap

Author	SHA1	Message	Date
Benson Wong	fe71e8a6ea	proxy,ui-svelte: improve support for v1/messages and v1/responses (#758) This improves the support for activity logging from the v1/responses and v1/messages endpoints. - add chat endpoint selection to Playground > Chat > Settings - improve metrics extraction for streaming v1/messages and v1/responses endpoints (tested with llama-server) Fixes #742	2026-05-14 21:53:57 -07:00
Benson Wong	430166d5eb	proxy: fix zero duration for non-streaming responses (#723) Updates #654	2026-04-30 19:51:28 -07:00
Benson Wong	fd3c28ffc5	Refactor Activity Page (#710) - inference handles to store an activity record for all inference endpoints - add path, status code, and content type to Activities page - toggle on/off columns on Activities page - add configurable capture level for inference endpoints so large binary blobs are not stored in memory - store captures in compressed binary format	2026-04-28 20:33:03 -07:00
Benson Wong	5e3c646829	proxy: compress captures with zstd (#668) The previous captures were saved uncompressed in memory. In agentic workflows there can be many turns, with each request containing the previous context in the body and a lot of redundant data. Use zstd to compress the request and response data before keeping a copy in memory. Results: - Average Percentage Saved: 73.19% - Average Compression Factor: ~6.77:1	2026-04-17 23:29:37 -07:00
Leoy	06bc6a614c	proxy: preserve wall-clock duration in metrics (#629) Keep request duration from being underreported when upstream timings only cover part of the full request lifecycle. - compare wall-clock and upstream timing durations - keep token and throughput values from timings - add regression coverage for underreported timings fixes #602	2026-04-07 01:52:41 -07:00
Benson Wong	c3c258a55d	proxy: fix metrics capture for v1/responses (#586) properly parse anthropic compatible usage data from streaming responses. closes: #577	2026-03-13 16:50:12 -07:00
Benson Wong	8d6d949ec3	proxy: support timings for /infill from llama-server (#510) fixes: #463	2026-02-07 17:16:27 -08:00
Benson Wong	b5fde8eb6d	proxy,ui-svelte: add request/response capturing (#508) Save request and response headers and bodies that go through llama-swap in memory. - captureBuffer added to configuration. Captures are enabled by default. - 5MB of memory is allocated for req/response captures in a ring buffer. Setting captureBuffer to 0 will disable captures. - UI elements to view captured data added to Activity page. Includes some QOL features like JSON formatting and recombining SSE chat streams - capture saving is done at the byte level and has minimal impact on llama-swap performance Fixes #464 Ref #503	2026-02-07 15:40:01 -08:00
Benson Wong	22e098ac8b	Add Peer Model Support (#438) This PR allows a single llama-swap to be the central proxy for models served by other inference servers. The peer servers can be another llama-swap or any API that supports the /v1/* inference endpoint. Updates: #433, #299 Closes: #296	2025-12-27 20:18:06 -08:00
Benson Wong	e250e71e59	Include metrics from upstream chat requests (#361) * proxy: refactor metrics recording - remove metrics_middleware.go as this wrapper is no longer needed. This also eliminates double body parsing for the modelID - move metrics parsing to be part of MetricsMonitor - refactor how metrics are recorded in ProxyManager - add MetricsMonitor tests - improve mem efficiency of processStreamingResponse - add benchmarks for MetricsMonitor.addMetrics - proxy: refactor MetricsMonitor to handle errors more safely	2025-10-25 17:38:18 -07:00
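
The capture ring buffer described in #508 (a fixed memory budget, with captureBuffer set to 0 disabling captures) can be illustrated with a minimal sketch. This is not llama-swap's implementation: the class name `CaptureRing`, its methods, and the oldest-first eviction policy are assumptions made for illustration, and the sketch is in Python rather than the project's own language.

```python
from collections import deque


class CaptureRing:
    """Hypothetical byte-budgeted ring buffer; oldest captures are evicted first."""

    def __init__(self, max_bytes=5 * 1024 * 1024):
        self.max_bytes = max_bytes  # a budget of 0 disables capturing
        self.used = 0               # bytes currently held
        self.items = deque()        # (capture_id, data) pairs, oldest on the left

    def add(self, capture_id, data):
        """Store a capture; return False if it cannot be stored."""
        if self.max_bytes <= 0 or len(data) > self.max_bytes:
            return False
        # Evict from the oldest end until the new capture fits in the budget.
        while self.used + len(data) > self.max_bytes:
            _, old = self.items.popleft()
            self.used -= len(old)
        self.items.append((capture_id, data))
        self.used += len(data)
        return True

    def get(self, capture_id):
        """Return the stored bytes, or None if the capture was evicted or never stored."""
        for cid, data in self.items:
            if cid == capture_id:
                return data
        return None
```

With a 10-byte budget, adding three 4-byte captures evicts the first one, so `get(1)` returns `None` while captures 2 and 3 remain available.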

10 Commits
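
The duration fix in #629 compares the wall-clock duration with the duration reported in upstream timings, while keeping token and throughput values from the timings. A minimal sketch of that idea follows. The names `Metrics` and `merge_duration` are hypothetical, and taking the larger of the two durations is an assumption; the commit message only states that the two are compared so the duration is not underreported.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    duration_ms: float        # reported request duration
    output_tokens: int        # taken from upstream timings
    tokens_per_second: float  # taken from upstream timings


def merge_duration(wall_clock_ms, upstream_ms, output_tokens, upstream_tps):
    """Combine wall-clock and upstream timing data into one metrics record.

    Assumption: the reported duration is the larger of the two values, so an
    upstream timing that covers only part of the request cannot shrink it.
    """
    duration = max(wall_clock_ms, upstream_ms)
    # Token count and throughput are passed through from upstream unchanged.
    return Metrics(duration, output_tokens, upstream_tps)
```

For example, `merge_duration(1500.0, 900.0, 45, 50.0)` keeps the 1500.0 ms wall-clock duration while leaving the token count and throughput as reported by upstream.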