The previous captures were saved uncompressed in memory. In agentic
workflows there can be many turns with each request containing the
previous context in the body with a lot of redundant data. Use zstd to
compress the request and response data before keeping a copy of memory.
Results:
- Average Percentage Saved: 73.19%
- Average Compression Factor: ~6.77:1
Keep request duration from being underreported when upstream timings
only cover part of the full request lifecycle.
- compare wall-clock and upstream timing durations
- keep token and throughput values from timings
- add regression coverage for underreported timings
fixes#602
Add saving request and response headers and bodies that go through
llama-swap in memory.
- captureBuffer added to configuration. Captures are enabled by default.
- 5MB of memory is allocated for req/response captures in a ring buffer.
Setting captureBuffer to 0 will disable captures.
- UI elements to view captured data added to Activity page. Includes
some
QOL features like json formatting and recombining SSE chat streams
- capture saving is done at the byte level and has minimal impact on
llama-swap performance
Fixes#464
Ref #503
This PR allows a single llama-swap to be the central proxy for models served by other inference servers. The peer servers can be another llama-swap or any API that supports the /v1/* inference endpoint.
Updates: #433, #299Closes: #296
* proxy: refactor metrics recording
- remove metrics_middleware.go as this wrapper is no longer needed. This
also eliminiates double body parsing for the modelID
- move metrics parsing to be part of MetricsMonitor
- refactor how metrics are recording in ProxyManager
- add MetricsMonitor tests
- improve mem efficiency of processStreamingResponse
- add benchmarks for MetricsMonitor.addMetrics
- proxy: refactor MetricsMonitor to be more safe handling errors