proxy, ui: add pending requests count to the main dashboard (#516 )

add a real time counter of pending (inflight) requests to the UI.
.coderabbit.yaml,AGENTS.md: small tweaks
2026-02-16 09:41:15 -08:00 · 2026-02-15 21:31:30 -08:00 · 2026-02-15 21:30:52 -08:00 · 2026-02-15 11:00:44 -08:00 · 2026-02-14 10:23:08 -08:00 · 2026-02-14 10:06:05 -08:00
34 changed files with 1832 additions and 152 deletions
@@ -4,7 +4,7 @@ early_access: false
 reviews:
  profile: "chill"
  request_changes_workflow: false
-  high_level_summary: true
+  high_level_summary: false
  poem: false
  review_status: true
  collapse_walkthrough: false
@@ -17,6 +17,13 @@ on:
      - 'docker/build-container.sh'
      - 'docker/*.Containerfile'

+# grant permissions on GITHUB_TOKEN to publish packages
+# ref: https://docs.github.com/en/packages/managing-github-packages-using-github-actions-workflows/publishing-and-installing-a-package-with-github-actions#publishing-a-package-using-an-action
+permissions:
+  contents: read
+  packages: write
+  id-token: write
+
 jobs:
  build-and-push:
    runs-on: ubuntu-latest
@@ -0,0 +1,50 @@
+## Project Description:
+
+llama-swap is a light weight, transparent proxy server that provides automatic model swapping to llama.cpp's server.
+
+## Tech stack
+
+- golang
+- typescript, vite and svelt5 for UI (located in ui/)
+
+## Workflow Tasks
+
+- when summarizing changes only include details that require further action
+- just say "Done." when there is no further action
+- use the github CLI `gh` to create pull requests and work with github
+- Rules for creating pull requests:
+  - keep them short and focused on changes.
+  - never include a test plan
+  - write the summary using the same style rules as commit message
+
+## Testing
+
+- Follow test naming conventions like `TestProxyManager_<test name>`, `TestProcessGroup_<test name>`, etc.
+- Use `go test -v -run <name pattern for new tests>` to run any new tests you've written.
+- Use `make test-dev` after running new tests for a quick over all test run. This runs `go test` and `staticcheck`. Fix any static checking errors. Use this only when changes are made to any code under the `proxy/` directory
+- Use `make test-all` before completing work. This includes long running concurrency tests.
+
+### Commit message example format:
+
+```
+proxy: add new feature
+
+Add new feature that implements functionality X and Y.
+
+- key change 1
+- key change 2
+- key change 3
+
+fixes #123
+```
+
+## Code Reviews
+
+- use three levels High, Medium, Low severity
+- label each discovered issue with a label like H1, M2, L3 respectively
+- High severity are must fix issues (security, race conditions, critical bugs)
+- Medium severity are recommended improvements (coding style, missing functionality, inconsistencies)
+- Low severity are nice to have changes and nits
+- Include a suggestion with each discovered item
+- Limit your code review to three items with the highest priority first
+- Double check your discovered items and recommended remediations
@@ -1,49 +1 @@
-## Project Description:
-
-llama-swap is a light weight, transparent proxy server that provides automatic model swapping to llama.cpp's server.
-
-## Tech stack
-
- golang
- typescript, vite and react for UI (located in ui/)
-
-## Workflow Tasks
-
- when summarizing changes only include details that require further action
- just say "Done." when there is no further action
- use `gh` to create PRs and load issues
- do include Co-Authored-By or created by when committing changes or creating PRs
- keep PR descriptions short and focused on changes.
-  - never include a test plan
-
-## Testing
-
- Follow test naming conventions like `TestProxyManager_<test name>`, `TestProcessGroup_<test name>`, etc.
- Use `go test -v -run <name pattern for new tests>` to run any new tests you've written.
- Use `make test-dev` after running new tests for a quick over all test run. This runs `go test` and `staticcheck`. Fix any static checking errors. Use this only when changes are made to any code under the `proxy/` directory
- Use `make test-all` before completing work. This includes long running concurrency tests.
-
-### Commit message example format:
-
-```
-proxy: add new feature
-
-Add new feature that implements functionality X and Y.
-
- key change 1
- key change 2
- key change 3
-
-fixes #123
-```
-
-## Code Reviews
-
- use three levels High, Medium, Low severity
- label each discovered issue with a label like H1, M2, L3 respectively
- High severity are must fix issues (security, race conditions, critical bugs)
- Medium severity are recommended improvements (coding style, missing functionality, inconsistencies)
- Low severity are nice to have changes and nits
- Include a suggestion with each discovered item
- Limit your code review to three items with the highest priority first
- Double check your discovered items and recommended remediations
+@AGENTS.md
@@ -13,7 +13,7 @@ Built in Go for performance and simplicity, llama-swap has zero dependencies and

 - ✅ Easy to deploy and configure: one binary, one configuration file. no external dependencies
 - ✅ On-demand model switching
- ✅ Use any local OpenAI compatible server (llama.cpp, vllm, tabbyAPI, etc.)
+- ✅ Use any local OpenAI compatible server (llama.cpp, vllm, tabbyAPI, stable-diffusion.cpp, etc.)
  - future proof, upgrade your inference servers at any time.
 - ✅ OpenAI API supported endpoints:
  - `v1/completions`
@@ -69,6 +69,7 @@ llama-swap can be installed in multiple ways
 ### Docker Install ([download images](https://github.com/mostlygeek/llama-swap/pkgs/container/llama-swap))

 Nightly container images with llama-swap and llama-server are built for multiple platforms (cuda, vulkan, intel, etc.) including [non-root variants with improved security](docs/container-security.md).
+The stable-diffusion.cpp server is also included for the musa and vulkan platforms.

 ```shell
 $ docker pull ghcr.io/mostlygeek/llama-swap:cuda
@@ -87,6 +87,12 @@
            "default": 1000,
            "description": "Maximum number of metrics to keep in memory. Controls how many metrics are stored before older ones are discarded."
        },
+        "captureBuffer": {
+            "type": "integer",
+            "minimum": 0,
+            "default": 5,
+            "description": "Size in megabytes of the buffer for storing request/response captures. Set to 0 to disable captures."
+        },
        "startPort": {
            "type": "integer",
            "default": 5800,
@@ -50,6 +50,11 @@ logToStdout: "proxy"
 # - useful for limiting memory usage when processing large volumes of metrics
 metricsMaxInMemory: 1000

+# captureBuffer: how many MBs to allocate for storing request/response captures
+# - optional, default: 10
+# - set to 0 to disable
+captureBuffer: 15
+
 # startPort: sets the starting port number for the automatic ${PORT} macro.
 # - optional, default: 5800
 # - the ${PORT} macro can be used in model.cmd and model.proxy settings
@@ -142,7 +142,7 @@ for CONTAINER_TYPE in non-root root; do
  fi

  log_info "Building $CONTAINER_TYPE $CONTAINER_TAG $LS_VER"
-  docker build -f llama-swap.Containerfile --build-arg BASE_TAG=${BASE_TAG} --build-arg LS_VER=${LS_VER} --build-arg UID=${USER_UID} \
+  docker build --provenance=false -f llama-swap.Containerfile --build-arg BASE_TAG=${BASE_TAG} --build-arg LS_VER=${LS_VER} --build-arg UID=${USER_UID} \
    --build-arg LS_REPO=${LS_REPO} --build-arg GID=${USER_GID} --build-arg USER_HOME=${USER_HOME} -t ${CONTAINER_TAG} -t ${CONTAINER_LATEST} \
    --build-arg BASE_IMAGE=${BASE_IMAGE} .

@@ -150,7 +150,7 @@ for CONTAINER_TYPE in non-root root; do
  case "$ARCH" in
    "musa" | "vulkan")
      log_info "Adding sd-server to $CONTAINER_TAG"
-      docker build -f llama-swap-sd.Containerfile \
+      docker build --provenance=false -f llama-swap-sd.Containerfile \
        --build-arg BASE=${CONTAINER_TAG} \
        --build-arg SD_IMAGE=${SD_IMAGE} --build-arg SD_TAG=${SD_TAG} \
        --build-arg UID=${USER_UID} --build-arg GID=${USER_GID} \
@@ -123,6 +123,7 @@ type Config struct {
 	LogTimeFormat      string                 `yaml:"logTimeFormat"`
 	LogToStdout        string                 `yaml:"logToStdout"`
 	MetricsMaxInMemory int                    `yaml:"metricsMaxInMemory"`
+	CaptureBuffer      int                    `yaml:"captureBuffer"`
 	Models             map[string]ModelConfig `yaml:"models"` /* key is model ID */
 	Profiles           map[string][]string    `yaml:"profiles"`
 	Groups             map[string]GroupConfig `yaml:"groups"` /* key is group ID */
@@ -201,6 +202,7 @@ func LoadConfigFromReader(r io.Reader) (Config, error) {
 		LogTimeFormat:      "",
 		LogToStdout:        LogToStdoutProxy,
 		MetricsMaxInMemory: 1000,
+		CaptureBuffer:      5,
 	}
 	if err = yaml.Unmarshal([]byte(yamlStr), &config); err != nil {
 		return Config{}, err
@@ -215,6 +215,7 @@ groups:
 		},
 		HealthCheckTimeout: 15,
 		MetricsMaxInMemory: 1000,
+		CaptureBuffer:      5,
 		Profiles: map[string][]string{
 			"test": {"model1", "model2"},
 		},
@@ -204,6 +204,7 @@ groups:
 		},
 		HealthCheckTimeout: 15,
 		MetricsMaxInMemory: 1000,
+		CaptureBuffer:      5,
 		Profiles: map[string][]string{
 			"test": {"model1", "model2"},
 		},
@@ -8,6 +8,7 @@ const ConfigFileChangedEventID = 0x03
 const LogDataEventID = 0x04
 const TokenMetricsEventID = 0x05
 const ModelPreloadedEventID = 0x06
+const InFlightRequestsEventID = 0x07

 type ProcessStateChangeEvent struct {
 	ProcessName string
@@ -58,3 +59,11 @@ type ModelPreloadedEvent struct {
 func (e ModelPreloadedEvent) Type() uint32 {
 	return ModelPreloadedEventID
 }
+
+type InFlightRequestsEvent struct {
+	Total int
+}
+
+func (e InFlightRequestsEvent) Type() uint32 {
+	return InFlightRequestsEventID
+}
@@ -28,6 +28,28 @@ type TokenMetrics struct {
 	PromptPerSecond float64   `json:"prompt_per_second"`
 	TokensPerSecond float64   `json:"tokens_per_second"`
 	DurationMs      int       `json:"duration_ms"`
+	HasCapture      bool      `json:"has_capture"`
+}
+
+type ReqRespCapture struct {
+	ID          int               `json:"id"`
+	ReqPath     string            `json:"req_path"`
+	ReqHeaders  map[string]string `json:"req_headers"`
+	ReqBody     []byte            `json:"req_body"`
+	RespHeaders map[string]string `json:"resp_headers"`
+	RespBody    []byte            `json:"resp_body"`
+}
+
+// Size returns the approximate memory usage of this capture in bytes
+func (c *ReqRespCapture) Size() int {
+	size := len(c.ReqPath) + len(c.ReqBody) + len(c.RespBody)
+	for k, v := range c.ReqHeaders {
+		size += len(k) + len(v)
+	}
+	for k, v := range c.RespHeaders {
+		size += len(k) + len(v)
+	}
+	return size
 }

 // TokenMetricsEvent represents a token metrics event
@@ -46,19 +68,32 @@ type metricsMonitor struct {
 	maxMetrics int
 	nextID     int
 	logger     *LogMonitor
+
+	// capture fields
+	enableCaptures bool
+	captures       map[int]ReqRespCapture // map for O(1) lookup by ID
+	captureOrder   []int                  // track insertion order for FIFO eviction
+	captureSize    int                    // current total size in bytes
+	maxCaptureSize int                    // max bytes for captures
 }

-func newMetricsMonitor(logger *LogMonitor, maxMetrics int) *metricsMonitor {
-	mp := &metricsMonitor{
-		logger:     logger,
-		maxMetrics: maxMetrics,
+// newMetricsMonitor creates a new metricsMonitor. captureBufferMB is the
+// capture buffer size in megabytes; 0 disables captures.
+func newMetricsMonitor(logger *LogMonitor, maxMetrics int, captureBufferMB int) *metricsMonitor {
+	return &metricsMonitor{
+		logger:         logger,
+		maxMetrics:     maxMetrics,
+		enableCaptures: captureBufferMB > 0,
+		captures:       make(map[int]ReqRespCapture),
+		captureOrder:   make([]int, 0),
+		captureSize:    0,
+		maxCaptureSize: captureBufferMB * 1024 * 1024,
 	}
-
-	return mp
 }

-// addMetrics adds a new metric to the collection and publishes an event
-func (mp *metricsMonitor) addMetrics(metric TokenMetrics) {
+// addMetrics adds a new metric to the collection and publishes an event.
+// Returns the assigned metric ID.
+func (mp *metricsMonitor) addMetrics(metric TokenMetrics) int {
 	mp.mu.Lock()
 	defer mp.mu.Unlock()

@@ -69,6 +104,49 @@ func (mp *metricsMonitor) addMetrics(metric TokenMetrics) {
 		mp.metrics = mp.metrics[len(mp.metrics)-mp.maxMetrics:]
 	}
 	event.Emit(TokenMetricsEvent{Metrics: metric})
+	return metric.ID
+}
+
+// addCapture adds a new capture to the buffer with size-based eviction.
+// Captures are skipped if enableCaptures is false or if capture exceeds maxCaptureSize.
+func (mp *metricsMonitor) addCapture(capture ReqRespCapture) {
+	if !mp.enableCaptures {
+		return
+	}
+
+	mp.mu.Lock()
+	defer mp.mu.Unlock()
+
+	captureSize := capture.Size()
+	if captureSize > mp.maxCaptureSize {
+		mp.logger.Warnf("capture size %d exceeds max %d, skipping", captureSize, mp.maxCaptureSize)
+		return
+	}
+
+	// Evict oldest (FIFO) until room available
+	for mp.captureSize+captureSize > mp.maxCaptureSize && len(mp.captureOrder) > 0 {
+		oldestID := mp.captureOrder[0]
+		mp.captureOrder = mp.captureOrder[1:]
+		if evicted, exists := mp.captures[oldestID]; exists {
+			mp.captureSize -= evicted.Size()
+			delete(mp.captures, oldestID)
+		}
+	}
+
+	mp.captures[capture.ID] = capture
+	mp.captureOrder = append(mp.captureOrder, capture.ID)
+	mp.captureSize += captureSize
+}
+
+// getCaptureByID returns a capture by its ID, or nil if not found.
+func (mp *metricsMonitor) getCaptureByID(id int) *ReqRespCapture {
+	mp.mu.RLock()
+	defer mp.mu.RUnlock()
+
+	if capture, exists := mp.captures[id]; exists {
+		return &capture
+	}
+	return nil
 }

 // getMetrics returns a copy of the current metrics
@@ -97,6 +175,28 @@ func (mp *metricsMonitor) wrapHandler(
 	request *http.Request,
 	next func(modelID string, w http.ResponseWriter, r *http.Request) error,
 ) error {
+	// Capture request body and headers if captures enabled
+	var reqBody []byte
+	var reqHeaders map[string]string
+	if mp.enableCaptures {
+		if request.Body != nil {
+			var err error
+			reqBody, err = io.ReadAll(request.Body)
+			if err != nil {
+				return fmt.Errorf("failed to read request body for capture: %w", err)
+			}
+			request.Body.Close()
+			request.Body = io.NopCloser(bytes.NewBuffer(reqBody))
+		}
+		reqHeaders = make(map[string]string)
+		for key, values := range request.Header {
+			if len(values) > 0 {
+				reqHeaders[key] = values[0]
+			}
+		}
+		redactHeaders(reqHeaders)
+	}
+
 	recorder := newBodyCopier(writer)

 	// Filter Accept-Encoding to only include encodings we can decompress for metrics
@@ -140,7 +240,6 @@ func (mp *metricsMonitor) wrapHandler(
 			return nil
 		}
 	}
-
 	if strings.Contains(recorder.Header().Get("Content-Type"), "text/event-stream") {
 		if parsed, err := processStreamingResponse(modelID, recorder.StartTime(), body); err != nil {
 			mp.logger.Warnf("error processing streaming response: %v, path=%s, recording minimal metrics", err, request.URL.Path)
@@ -153,6 +252,14 @@ func (mp *metricsMonitor) wrapHandler(
 			usage := parsed.Get("usage")
 			timings := parsed.Get("timings")

+			// extract timings for infill - response is an array, timings are in the last element
+			// see #463
+			if strings.HasPrefix(request.URL.Path, "/infill") {
+				if arr := parsed.Array(); len(arr) > 0 {
+					timings = arr[len(arr)-1].Get("timings")
+				}
+			}
+
 			if usage.Exists() || timings.Exists() {
 				if parsedMetrics, err := parseMetrics(modelID, recorder.StartTime(), usage, timings); err != nil {
 					mp.logger.Warnf("error parsing metrics: %v, path=%s, recording minimal metrics", err, request.URL.Path)
@@ -165,7 +272,38 @@ func (mp *metricsMonitor) wrapHandler(
 		}
 	}

-	mp.addMetrics(tm)
+	// Build capture if enabled and determine if it will be stored
+	var capture *ReqRespCapture
+	if mp.enableCaptures {
+		respHeaders := make(map[string]string)
+		for key, values := range recorder.Header() {
+			if len(values) > 0 {
+				respHeaders[key] = values[0]
+			}
+		}
+		redactHeaders(respHeaders)
+		delete(respHeaders, "Content-Encoding")
+		capture = &ReqRespCapture{
+			ReqPath:     request.URL.Path,
+			ReqHeaders:  reqHeaders,
+			ReqBody:     reqBody,
+			RespHeaders: respHeaders,
+			RespBody:    body,
+		}
+		// Only set HasCapture if the capture will actually be stored (not too large)
+		if capture.Size() <= mp.maxCaptureSize {
+			tm.HasCapture = true
+		}
+	}
+
+	metricID := mp.addMetrics(tm)
+
+	// Store capture if enabled
+	if capture != nil {
+		capture.ID = metricID
+		mp.addCapture(*capture)
+	}
+
 	return nil
 }

@@ -336,6 +474,24 @@ func (w *responseBodyCopier) StartTime() time.Time {
 	return w.start
 }

+// sensitiveHeaders lists headers that should be redacted in captures
+var sensitiveHeaders = map[string]bool{
+	"authorization":       true,
+	"proxy-authorization": true,
+	"cookie":              true,
+	"set-cookie":          true,
+	"x-api-key":           true,
+}
+
+// redactHeaders replaces sensitive header values in-place with "[REDACTED]"
+func redactHeaders(headers map[string]string) {
+	for key := range headers {
+		if sensitiveHeaders[strings.ToLower(key)] {
+			headers[key] = "[REDACTED]"
+		}
+	}
+}
+
 // filterAcceptEncoding filters the Accept-Encoding header to only include
 // encodings we can decompress (gzip, deflate). This respects the client's
 // preferences while ensuring we can parse response bodies for metrics.
@@ -18,7 +18,7 @@ import (

 func TestMetricsMonitor_AddMetrics(t *testing.T) {
 	t.Run("adds metrics and assigns ID", func(t *testing.T) {
-		mm := newMetricsMonitor(testLogger, 10)
+		mm := newMetricsMonitor(testLogger, 10, 0)

 		metric := TokenMetrics{
 			Model:        "test-model",
@@ -37,7 +37,7 @@ func TestMetricsMonitor_AddMetrics(t *testing.T) {
 	})

 	t.Run("increments ID for each metric", func(t *testing.T) {
-		mm := newMetricsMonitor(testLogger, 10)
+		mm := newMetricsMonitor(testLogger, 10, 0)

 		for i := 0; i < 5; i++ {
 			mm.addMetrics(TokenMetrics{Model: "model"})
@@ -51,7 +51,7 @@ func TestMetricsMonitor_AddMetrics(t *testing.T) {
 	})

 	t.Run("respects max metrics limit", func(t *testing.T) {
-		mm := newMetricsMonitor(testLogger, 3)
+		mm := newMetricsMonitor(testLogger, 3, 0)

 		// Add 5 metrics
 		for i := 0; i < 5; i++ {
@@ -71,7 +71,7 @@ func TestMetricsMonitor_AddMetrics(t *testing.T) {
 	})

 	t.Run("emits TokenMetricsEvent", func(t *testing.T) {
-		mm := newMetricsMonitor(testLogger, 10)
+		mm := newMetricsMonitor(testLogger, 10, 0)

 		receivedEvent := make(chan TokenMetricsEvent, 1)
 		cancel := event.On(func(e TokenMetricsEvent) {
@@ -101,14 +101,14 @@ func TestMetricsMonitor_AddMetrics(t *testing.T) {

 func TestMetricsMonitor_GetMetrics(t *testing.T) {
 	t.Run("returns empty slice when no metrics", func(t *testing.T) {
-		mm := newMetricsMonitor(testLogger, 10)
+		mm := newMetricsMonitor(testLogger, 10, 0)
 		metrics := mm.getMetrics()
 		assert.NotNil(t, metrics)
 		assert.Equal(t, 0, len(metrics))
 	})

 	t.Run("returns copy of metrics", func(t *testing.T) {
-		mm := newMetricsMonitor(testLogger, 10)
+		mm := newMetricsMonitor(testLogger, 10, 0)
 		mm.addMetrics(TokenMetrics{Model: "model1"})
 		mm.addMetrics(TokenMetrics{Model: "model2"})

@@ -128,7 +128,7 @@ func TestMetricsMonitor_GetMetrics(t *testing.T) {

 func TestMetricsMonitor_GetMetricsJSON(t *testing.T) {
 	t.Run("returns valid JSON for empty metrics", func(t *testing.T) {
-		mm := newMetricsMonitor(testLogger, 10)
+		mm := newMetricsMonitor(testLogger, 10, 0)
 		jsonData, err := mm.getMetricsJSON()
 		assert.NoError(t, err)
 		assert.NotNil(t, jsonData)
@@ -140,7 +140,7 @@ func TestMetricsMonitor_GetMetricsJSON(t *testing.T) {
 	})

 	t.Run("returns valid JSON with metrics", func(t *testing.T) {
-		mm := newMetricsMonitor(testLogger, 10)
+		mm := newMetricsMonitor(testLogger, 10, 0)
 		mm.addMetrics(TokenMetrics{
 			Model:           "model1",
 			InputTokens:     100,
@@ -168,7 +168,7 @@ func TestMetricsMonitor_GetMetricsJSON(t *testing.T) {

 func TestMetricsMonitor_WrapHandler(t *testing.T) {
 	t.Run("successful non-streaming request with usage data", func(t *testing.T) {
-		mm := newMetricsMonitor(testLogger, 10)
+		mm := newMetricsMonitor(testLogger, 10, 0)

 		responseBody := `{
 			"usage": {
@@ -199,7 +199,7 @@ func TestMetricsMonitor_WrapHandler(t *testing.T) {
 	})

 	t.Run("successful request with timings data", func(t *testing.T) {
-		mm := newMetricsMonitor(testLogger, 10)
+		mm := newMetricsMonitor(testLogger, 10, 0)

 		responseBody := `{
 			"timings": {
@@ -239,7 +239,7 @@ func TestMetricsMonitor_WrapHandler(t *testing.T) {
 	})

 	t.Run("streaming request with SSE format", func(t *testing.T) {
-		mm := newMetricsMonitor(testLogger, 10)
+		mm := newMetricsMonitor(testLogger, 10, 0)

 		// Note: SSE format requires proper line breaks - each data line followed by blank line
 		responseBody := `data: {"choices":[{"text":"Hello"}]}
@@ -275,7 +275,7 @@ data: [DONE]
 	})

 	t.Run("non-OK status code does not record metrics", func(t *testing.T) {
-		mm := newMetricsMonitor(testLogger, 10)
+		mm := newMetricsMonitor(testLogger, 10, 0)

 		nextHandler := func(modelID string, w http.ResponseWriter, r *http.Request) error {
 			w.WriteHeader(http.StatusBadRequest)
@@ -295,7 +295,7 @@ data: [DONE]
 	})

 	t.Run("empty response body records minimal metrics", func(t *testing.T) {
-		mm := newMetricsMonitor(testLogger, 10)
+		mm := newMetricsMonitor(testLogger, 10, 0)

 		nextHandler := func(modelID string, w http.ResponseWriter, r *http.Request) error {
 			w.WriteHeader(http.StatusOK)
@@ -317,7 +317,7 @@ data: [DONE]
 	})

 	t.Run("invalid JSON records minimal metrics", func(t *testing.T) {
-		mm := newMetricsMonitor(testLogger, 10)
+		mm := newMetricsMonitor(testLogger, 10, 0)

 		nextHandler := func(modelID string, w http.ResponseWriter, r *http.Request) error {
 			w.Header().Set("Content-Type", "application/json")
@@ -341,7 +341,7 @@ data: [DONE]
 	})

 	t.Run("next handler error is propagated", func(t *testing.T) {
-		mm := newMetricsMonitor(testLogger, 10)
+		mm := newMetricsMonitor(testLogger, 10, 0)

 		expectedErr := assert.AnError
 		nextHandler := func(modelID string, w http.ResponseWriter, r *http.Request) error {
@@ -360,7 +360,7 @@ data: [DONE]
 	})

 	t.Run("response without usage or timings records minimal metrics", func(t *testing.T) {
-		mm := newMetricsMonitor(testLogger, 10)
+		mm := newMetricsMonitor(testLogger, 10, 0)

 		responseBody := `{"result": "ok"}`

@@ -384,6 +384,75 @@ data: [DONE]
 		assert.Equal(t, 0, metrics[0].InputTokens)
 		assert.Equal(t, 0, metrics[0].OutputTokens)
 	})
+
+	t.Run("infill request extracts timings from last array element", func(t *testing.T) {
+		mm := newMetricsMonitor(testLogger, 10, 0)
+
+		// Infill response is an array with timings in the last element
+		responseBody := `[
+			{"content": "first chunk"},
+			{"content": "second chunk"},
+			{"content": "final", "timings": {
+				"prompt_n": 150,
+				"predicted_n": 75,
+				"prompt_per_second": 200.5,
+				"predicted_per_second": 35.5,
+				"prompt_ms": 600.0,
+				"predicted_ms": 1800.0,
+				"cache_n": 30
+			}}
+		]`
+
+		nextHandler := func(modelID string, w http.ResponseWriter, r *http.Request) error {
+			w.Header().Set("Content-Type", "application/json")
+			w.WriteHeader(http.StatusOK)
+			w.Write([]byte(responseBody))
+			return nil
+		}
+
+		req := httptest.NewRequest("POST", "/infill", nil)
+		rec := httptest.NewRecorder()
+		ginCtx, _ := gin.CreateTestContext(rec)
+
+		err := mm.wrapHandler("test-model", ginCtx.Writer, req, nextHandler)
+		assert.NoError(t, err)
+
+		metrics := mm.getMetrics()
+		assert.Equal(t, 1, len(metrics))
+		assert.Equal(t, "test-model", metrics[0].Model)
+		assert.Equal(t, 150, metrics[0].InputTokens)
+		assert.Equal(t, 75, metrics[0].OutputTokens)
+		assert.Equal(t, 30, metrics[0].CachedTokens)
+		assert.Equal(t, 200.5, metrics[0].PromptPerSecond)
+		assert.Equal(t, 35.5, metrics[0].TokensPerSecond)
+		assert.Equal(t, 2400, metrics[0].DurationMs) // 600 + 1800
+	})
+
+	t.Run("infill request with empty array records minimal metrics", func(t *testing.T) {
+		mm := newMetricsMonitor(testLogger, 10, 0)
+
+		responseBody := `[]`
+
+		nextHandler := func(modelID string, w http.ResponseWriter, r *http.Request) error {
+			w.Header().Set("Content-Type", "application/json")
+			w.WriteHeader(http.StatusOK)
+			w.Write([]byte(responseBody))
+			return nil
+		}
+
+		req := httptest.NewRequest("POST", "/infill", nil)
+		rec := httptest.NewRecorder()
+		ginCtx, _ := gin.CreateTestContext(rec)
+
+		err := mm.wrapHandler("test-model", ginCtx.Writer, req, nextHandler)
+		assert.NoError(t, err)
+
+		metrics := mm.getMetrics()
+		assert.Equal(t, 1, len(metrics))
+		assert.Equal(t, "test-model", metrics[0].Model)
+		assert.Equal(t, 0, metrics[0].InputTokens)
+		assert.Equal(t, 0, metrics[0].OutputTokens)
+	})
 }

 func TestMetricsMonitor_ResponseBodyCopier(t *testing.T) {
@@ -437,7 +506,7 @@ func TestMetricsMonitor_ResponseBodyCopier(t *testing.T) {

 func TestMetricsMonitor_Concurrent(t *testing.T) {
 	t.Run("concurrent addMetrics is safe", func(t *testing.T) {
-		mm := newMetricsMonitor(testLogger, 1000)
+		mm := newMetricsMonitor(testLogger, 1000, 0)

 		var wg sync.WaitGroup
 		numGoroutines := 10
@@ -464,7 +533,7 @@ func TestMetricsMonitor_Concurrent(t *testing.T) {
 	})

 	t.Run("concurrent reads and writes are safe", func(t *testing.T) {
-		mm := newMetricsMonitor(testLogger, 100)
+		mm := newMetricsMonitor(testLogger, 100, 0)

 		done := make(chan bool)

@@ -502,7 +571,7 @@ func TestMetricsMonitor_Concurrent(t *testing.T) {

 func TestMetricsMonitor_ParseMetrics(t *testing.T) {
 	t.Run("prefers timings over usage data", func(t *testing.T) {
-		mm := newMetricsMonitor(testLogger, 10)
+		mm := newMetricsMonitor(testLogger, 10, 0)

 		// Timings should take precedence over usage
 		responseBody := `{
@@ -542,7 +611,7 @@ func TestMetricsMonitor_ParseMetrics(t *testing.T) {
 	})

 	t.Run("handles missing cache_n in timings", func(t *testing.T) {
-		mm := newMetricsMonitor(testLogger, 10)
+		mm := newMetricsMonitor(testLogger, 10, 0)

 		responseBody := `{
 			"timings": {
@@ -577,7 +646,7 @@ func TestMetricsMonitor_ParseMetrics(t *testing.T) {

 func TestMetricsMonitor_StreamingResponse(t *testing.T) {
 	t.Run("finds metrics in last valid SSE data", func(t *testing.T) {
-		mm := newMetricsMonitor(testLogger, 10)
+		mm := newMetricsMonitor(testLogger, 10, 0)

 		// Metrics should be found in the last data line before [DONE]
 		responseBody := `data: {"choices":[{"text":"First"}]}
@@ -611,7 +680,7 @@ data: [DONE]
 	})

 	t.Run("handles streaming with no valid JSON records minimal metrics", func(t *testing.T) {
-		mm := newMetricsMonitor(testLogger, 10)
+		mm := newMetricsMonitor(testLogger, 10, 0)

 		responseBody := `data: not json

@@ -641,7 +710,7 @@ data: [DONE]
 	})

 	t.Run("handles empty streaming response records minimal metrics", func(t *testing.T) {
-		mm := newMetricsMonitor(testLogger, 10)
+		mm := newMetricsMonitor(testLogger, 10, 0)

 		responseBody := ``

@@ -669,7 +738,7 @@ data: [DONE]

 // Benchmark tests
 func BenchmarkMetricsMonitor_AddMetrics(b *testing.B) {
-	mm := newMetricsMonitor(testLogger, 1000)
+	mm := newMetricsMonitor(testLogger, 1000, 0)

 	metric := TokenMetrics{
 		Model:           "test-model",
@@ -690,7 +759,7 @@ func BenchmarkMetricsMonitor_AddMetrics(b *testing.B) {

 func BenchmarkMetricsMonitor_AddMetrics_SmallBuffer(b *testing.B) {
 	// Test performance with a smaller buffer where wrapping occurs more frequently
-	mm := newMetricsMonitor(testLogger, 100)
+	mm := newMetricsMonitor(testLogger, 100, 0)

 	metric := TokenMetrics{
 		Model:           "test-model",
@@ -711,7 +780,7 @@ func BenchmarkMetricsMonitor_AddMetrics_SmallBuffer(b *testing.B) {

 func TestMetricsMonitor_WrapHandler_Compression(t *testing.T) {
 	t.Run("gzip encoded response", func(t *testing.T) {
-		mm := newMetricsMonitor(testLogger, 10)
+		mm := newMetricsMonitor(testLogger, 10, 0)

 		responseBody := `{"usage": {"prompt_tokens": 100, "completion_tokens": 50}}`

@@ -745,7 +814,7 @@ func TestMetricsMonitor_WrapHandler_Compression(t *testing.T) {
 	})

 	t.Run("deflate encoded response", func(t *testing.T) {
-		mm := newMetricsMonitor(testLogger, 10)
+		mm := newMetricsMonitor(testLogger, 10, 0)

 		responseBody := `{"usage": {"prompt_tokens": 200, "completion_tokens": 75}}`

@@ -779,7 +848,7 @@ func TestMetricsMonitor_WrapHandler_Compression(t *testing.T) {
 	})

 	t.Run("invalid gzip data records minimal metrics", func(t *testing.T) {
-		mm := newMetricsMonitor(testLogger, 10)
+		mm := newMetricsMonitor(testLogger, 10, 0)

 		// Invalid compressed data
 		invalidData := []byte("this is not gzip data")
@@ -807,7 +876,7 @@ func TestMetricsMonitor_WrapHandler_Compression(t *testing.T) {
 	})

 	t.Run("unknown encoding treated as uncompressed", func(t *testing.T) {
-		mm := newMetricsMonitor(testLogger, 10)
+		mm := newMetricsMonitor(testLogger, 10, 0)

 		responseBody := `{"usage": {"prompt_tokens": 300, "completion_tokens": 100}}`

@@ -832,3 +901,228 @@ func TestMetricsMonitor_WrapHandler_Compression(t *testing.T) {
 		assert.Equal(t, 100, metrics[0].OutputTokens)
 	})
 }
+
+func TestReqRespCapture_Size(t *testing.T) {
+	t.Run("calculates size correctly", func(t *testing.T) {
+		capture := ReqRespCapture{
+			ID:      1,
+			ReqPath: "/v1/chat/completions", // 20 bytes
+			ReqHeaders: map[string]string{
+				"Content-Type": "application/json", // 12 + 16 = 28
+			},
+			ReqBody: []byte("request body"), // 12 bytes
+			RespHeaders: map[string]string{
+				"X-Test": "value", // 6 + 5 = 11
+			},
+			RespBody: []byte("response body"), // 13 bytes
+		}
+
+		// Expected: 20 + 12 + 13 + 28 + 11 = 84
+		assert.Equal(t, 84, capture.Size())
+	})
+
+	t.Run("handles empty capture", func(t *testing.T) {
+		capture := ReqRespCapture{}
+		assert.Equal(t, 0, capture.Size())
+	})
+}
+
+func TestMetricsMonitor_AddCapture(t *testing.T) {
+	t.Run("does nothing when captures disabled", func(t *testing.T) {
+		mm := newMetricsMonitor(testLogger, 10, 0)
+
+		capture := ReqRespCapture{
+			ID:      0,
+			ReqBody: []byte("test"),
+		}
+		mm.addCapture(capture)
+
+		// Should not store capture
+		assert.Nil(t, mm.getCaptureByID(0))
+	})
+
+	t.Run("adds capture when enabled", func(t *testing.T) {
+		mm := newMetricsMonitor(testLogger, 10, 5)
+
+		capture := ReqRespCapture{
+			ID:       0,
+			ReqBody:  []byte("test request"),
+			RespBody: []byte("test response"),
+		}
+		mm.addCapture(capture)
+
+		retrieved := mm.getCaptureByID(0)
+		assert.NotNil(t, retrieved)
+		assert.Equal(t, 0, retrieved.ID)
+		assert.Equal(t, []byte("test request"), retrieved.ReqBody)
+		assert.Equal(t, []byte("test response"), retrieved.RespBody)
+	})
+
+	t.Run("evicts oldest when exceeding max size", func(t *testing.T) {
+		mm := newMetricsMonitor(testLogger, 10, 5)
+		mm.maxCaptureSize = 100 // Set small limit for test
+
+		// Add captures that will exceed the limit
+		capture1 := ReqRespCapture{ID: 0, ReqBody: make([]byte, 40)}
+		capture2 := ReqRespCapture{ID: 1, ReqBody: make([]byte, 40)}
+		capture3 := ReqRespCapture{ID: 2, ReqBody: make([]byte, 40)}
+
+		mm.addCapture(capture1)
+		mm.addCapture(capture2)
+		// Adding capture3 should evict capture1
+		mm.addCapture(capture3)
+
+		assert.Nil(t, mm.getCaptureByID(0), "capture 0 should be evicted")
+		assert.NotNil(t, mm.getCaptureByID(1), "capture 1 should exist")
+		assert.NotNil(t, mm.getCaptureByID(2), "capture 2 should exist")
+	})
+
+	t.Run("skips capture larger than max size", func(t *testing.T) {
+		mm := newMetricsMonitor(testLogger, 10, 5)
+		mm.maxCaptureSize = 100
+
+		// Add a capture larger than max
+		largeCapture := ReqRespCapture{ID: 0, ReqBody: make([]byte, 200)}
+		mm.addCapture(largeCapture)
+
+		assert.Nil(t, mm.getCaptureByID(0), "oversized capture should not be stored")
+	})
+}
+
+func TestMetricsMonitor_GetCaptureByID(t *testing.T) {
+	t.Run("returns nil for non-existent ID", func(t *testing.T) {
+		mm := newMetricsMonitor(testLogger, 10, 5)
+
+		assert.Nil(t, mm.getCaptureByID(999))
+	})
+
+	t.Run("returns capture by ID", func(t *testing.T) {
+		mm := newMetricsMonitor(testLogger, 10, 5)
+
+		capture := ReqRespCapture{
+			ID:      42,
+			ReqBody: []byte("test"),
+		}
+		mm.addCapture(capture)
+
+		retrieved := mm.getCaptureByID(42)
+		assert.NotNil(t, retrieved)
+		assert.Equal(t, 42, retrieved.ID)
+	})
+}
+
+func TestRedactHeaders(t *testing.T) {
+	t.Run("redacts sensitive headers", func(t *testing.T) {
+		headers := map[string]string{
+			"Authorization":       "Bearer secret-token",
+			"Proxy-Authorization": "Basic creds",
+			"Cookie":              "session=abc123",
+			"Set-Cookie":          "session=xyz789",
+			"X-Api-Key":           "sk-12345",
+			"Content-Type":        "application/json",
+			"X-Custom":            "safe-value",
+		}
+
+		redactHeaders(headers)
+
+		assert.Equal(t, "[REDACTED]", headers["Authorization"])
+		assert.Equal(t, "[REDACTED]", headers["Proxy-Authorization"])
+		assert.Equal(t, "[REDACTED]", headers["Cookie"])
+		assert.Equal(t, "[REDACTED]", headers["Set-Cookie"])
+		assert.Equal(t, "[REDACTED]", headers["X-Api-Key"])
+		assert.Equal(t, "application/json", headers["Content-Type"])
+		assert.Equal(t, "safe-value", headers["X-Custom"])
+	})
+
+	t.Run("handles mixed case header names", func(t *testing.T) {
+		headers := map[string]string{
+			"authorization": "Bearer token",
+			"COOKIE":        "session=abc",
+			"x-api-key":     "key123",
+		}
+
+		redactHeaders(headers)
+
+		assert.Equal(t, "[REDACTED]", headers["authorization"])
+		assert.Equal(t, "[REDACTED]", headers["COOKIE"])
+		assert.Equal(t, "[REDACTED]", headers["x-api-key"])
+	})
+
+	t.Run("handles empty headers", func(t *testing.T) {
+		headers := map[string]string{}
+		redactHeaders(headers)
+		assert.Empty(t, headers)
+	})
+}
+
+func TestMetricsMonitor_WrapHandler_Capture(t *testing.T) {
+	t.Run("captures request and response when enabled", func(t *testing.T) {
+		mm := newMetricsMonitor(testLogger, 10, 5)
+
+		requestBody := `{"model": "test", "prompt": "hello"}`
+		responseBody := `{"usage": {"prompt_tokens": 100, "completion_tokens": 50}}`
+
+		nextHandler := func(modelID string, w http.ResponseWriter, r *http.Request) error {
+			w.Header().Set("Content-Type", "application/json")
+			w.Header().Set("X-Custom", "header-value")
+			w.WriteHeader(http.StatusOK)
+			w.Write([]byte(responseBody))
+			return nil
+		}
+
+		req := httptest.NewRequest("POST", "/test", bytes.NewBufferString(requestBody))
+		req.Header.Set("Content-Type", "application/json")
+		req.Header.Set("Authorization", "Bearer secret")
+		rec := httptest.NewRecorder()
+		ginCtx, _ := gin.CreateTestContext(rec)
+
+		err := mm.wrapHandler("test-model", ginCtx.Writer, req, nextHandler)
+		assert.NoError(t, err)
+
+		// Check metric was recorded
+		metrics := mm.getMetrics()
+		assert.Equal(t, 1, len(metrics))
+		metricID := metrics[0].ID
+
+		// Check capture was stored with same ID
+		capture := mm.getCaptureByID(metricID)
+		assert.NotNil(t, capture)
+		assert.Equal(t, metricID, capture.ID)
+		assert.Equal(t, []byte(requestBody), capture.ReqBody)
+		assert.Equal(t, []byte(responseBody), capture.RespBody)
+		assert.Equal(t, "/test", capture.ReqPath)
+		assert.Equal(t, "application/json", capture.ReqHeaders["Content-Type"])
+		assert.Equal(t, "[REDACTED]", capture.ReqHeaders["Authorization"])
+		assert.Equal(t, "application/json", capture.RespHeaders["Content-Type"])
+		assert.Equal(t, "header-value", capture.RespHeaders["X-Custom"])
+	})
+
+	t.Run("does not capture when disabled", func(t *testing.T) {
+		mm := newMetricsMonitor(testLogger, 10, 0)
+
+		requestBody := `{"model": "test"}`
+		responseBody := `{"usage": {"prompt_tokens": 100, "completion_tokens": 50}}`
+
+		nextHandler := func(modelID string, w http.ResponseWriter, r *http.Request) error {
+			w.Header().Set("Content-Type", "application/json")
+			w.WriteHeader(http.StatusOK)
+			w.Write([]byte(responseBody))
+			return nil
+		}
+
+		req := httptest.NewRequest("POST", "/test", bytes.NewBufferString(requestBody))
+		rec := httptest.NewRecorder()
+		ginCtx, _ := gin.CreateTestContext(rec)
+
+		err := mm.wrapHandler("test-model", ginCtx.Writer, req, nextHandler)
+		assert.NoError(t, err)
+
+		// Metrics should still be recorded
+		metrics := mm.getMetrics()
+		assert.Equal(t, 1, len(metrics))
+
+		// But no capture
+		capture := mm.getCaptureByID(metrics[0].ID)
+		assert.Nil(t, capture)
+	})
+}
@@ -28,6 +28,40 @@ const (

 type proxyCtxKey string

+type InflightCounter struct {
+	mu    sync.Mutex
+	total int
+}
+
+func newInflightCounter() *InflightCounter {
+	return &InflightCounter{}
+}
+
+func (ic *InflightCounter) Current() int {
+	ic.mu.Lock()
+	total := ic.total
+	ic.mu.Unlock()
+	return total
+}
+
+func (ic *InflightCounter) Increment() int {
+	ic.mu.Lock()
+	ic.total++
+	total := ic.total
+	ic.mu.Unlock()
+	return total
+}
+
+func (ic *InflightCounter) Decrement() int {
+	ic.mu.Lock()
+	if ic.total > 0 {
+		ic.total--
+	}
+	total := ic.total
+	ic.mu.Unlock()
+	return total
+}
+
 type ProxyManager struct {
 	sync.Mutex

@@ -43,6 +77,8 @@ type ProxyManager struct {

 	processGroups map[string]*ProcessGroup

+	inFlightCounter *InflightCounter
+
 	// shutdown signaling
 	shutdownCtx    context.Context
 	shutdownCancel context.CancelFunc
@@ -151,10 +187,12 @@ func New(proxyConfig config.Config) *ProxyManager {
 		muxLogger:      muxLogger,
 		upstreamLogger: upstreamLogger,

-		metricsMonitor: newMetricsMonitor(proxyLogger, maxMetrics),
+		metricsMonitor: newMetricsMonitor(proxyLogger, maxMetrics, proxyConfig.CaptureBuffer),

 		processGroups: make(map[string]*ProcessGroup),

+		inFlightCounter: newInflightCounter(),
+
 		shutdownCtx:    shutdownCtx,
 		shutdownCancel: shutdownCancel,

@@ -276,37 +314,37 @@ func (pm *ProxyManager) setupGinEngine() {

 	// Set up routes using the Gin engine
 	// Protected routes use pm.apiKeyAuth() middleware
-	pm.ginEngine.POST("/v1/chat/completions", pm.apiKeyAuth(), pm.proxyInferenceHandler)
-	pm.ginEngine.POST("/v1/responses", pm.apiKeyAuth(), pm.proxyInferenceHandler)
+	pm.ginEngine.POST("/v1/chat/completions", pm.apiKeyAuth(), pm.trackInflight(), pm.proxyInferenceHandler)
+	pm.ginEngine.POST("/v1/responses", pm.apiKeyAuth(), pm.trackInflight(), pm.proxyInferenceHandler)
 	// Support legacy /v1/completions api, see issue #12
-	pm.ginEngine.POST("/v1/completions", pm.apiKeyAuth(), pm.proxyInferenceHandler)
+	pm.ginEngine.POST("/v1/completions", pm.apiKeyAuth(), pm.trackInflight(), pm.proxyInferenceHandler)
 	// Support anthropic /v1/messages (added https://github.com/ggml-org/llama.cpp/pull/17570)
-	pm.ginEngine.POST("/v1/messages", pm.apiKeyAuth(), pm.proxyInferenceHandler)
+	pm.ginEngine.POST("/v1/messages", pm.apiKeyAuth(), pm.trackInflight(), pm.proxyInferenceHandler)
 	// Support anthropic count_tokens API (Also added in the above PR)
-	pm.ginEngine.POST("/v1/messages/count_tokens", pm.apiKeyAuth(), pm.proxyInferenceHandler)
+	pm.ginEngine.POST("/v1/messages/count_tokens", pm.apiKeyAuth(), pm.trackInflight(), pm.proxyInferenceHandler)

 	// Support embeddings and reranking
-	pm.ginEngine.POST("/v1/embeddings", pm.apiKeyAuth(), pm.proxyInferenceHandler)
+	pm.ginEngine.POST("/v1/embeddings", pm.apiKeyAuth(), pm.trackInflight(), pm.proxyInferenceHandler)

 	// llama-server's /reranking endpoint + aliases
-	pm.ginEngine.POST("/reranking", pm.apiKeyAuth(), pm.proxyInferenceHandler)
-	pm.ginEngine.POST("/rerank", pm.apiKeyAuth(), pm.proxyInferenceHandler)
-	pm.ginEngine.POST("/v1/rerank", pm.apiKeyAuth(), pm.proxyInferenceHandler)
-	pm.ginEngine.POST("/v1/reranking", pm.apiKeyAuth(), pm.proxyInferenceHandler)
+	pm.ginEngine.POST("/reranking", pm.apiKeyAuth(), pm.trackInflight(), pm.proxyInferenceHandler)
+	pm.ginEngine.POST("/rerank", pm.apiKeyAuth(), pm.trackInflight(), pm.proxyInferenceHandler)
+	pm.ginEngine.POST("/v1/rerank", pm.apiKeyAuth(), pm.trackInflight(), pm.proxyInferenceHandler)
+	pm.ginEngine.POST("/v1/reranking", pm.apiKeyAuth(), pm.trackInflight(), pm.proxyInferenceHandler)

 	// llama-server's /infill endpoint for code infilling
-	pm.ginEngine.POST("/infill", pm.apiKeyAuth(), pm.proxyInferenceHandler)
+	pm.ginEngine.POST("/infill", pm.apiKeyAuth(), pm.trackInflight(), pm.proxyInferenceHandler)

 	// llama-server's /completion endpoint
-	pm.ginEngine.POST("/completion", pm.apiKeyAuth(), pm.proxyInferenceHandler)
+	pm.ginEngine.POST("/completion", pm.apiKeyAuth(), pm.trackInflight(), pm.proxyInferenceHandler)

 	// Support audio/speech endpoint
-	pm.ginEngine.POST("/v1/audio/speech", pm.apiKeyAuth(), pm.proxyInferenceHandler)
-	pm.ginEngine.POST("/v1/audio/voices", pm.apiKeyAuth(), pm.proxyInferenceHandler)
-	pm.ginEngine.GET("/v1/audio/voices", pm.apiKeyAuth(), pm.proxyGETModelHandler)
-	pm.ginEngine.POST("/v1/audio/transcriptions", pm.apiKeyAuth(), pm.proxyOAIPostFormHandler)
-	pm.ginEngine.POST("/v1/images/generations", pm.apiKeyAuth(), pm.proxyInferenceHandler)
-	pm.ginEngine.POST("/v1/images/edits", pm.apiKeyAuth(), pm.proxyOAIPostFormHandler)
+	pm.ginEngine.POST("/v1/audio/speech", pm.apiKeyAuth(), pm.trackInflight(), pm.proxyInferenceHandler)
+	pm.ginEngine.POST("/v1/audio/voices", pm.apiKeyAuth(), pm.trackInflight(), pm.proxyInferenceHandler)
+	pm.ginEngine.GET("/v1/audio/voices", pm.apiKeyAuth(), pm.trackInflight(), pm.proxyGETModelHandler)
+	pm.ginEngine.POST("/v1/audio/transcriptions", pm.apiKeyAuth(), pm.trackInflight(), pm.proxyOAIPostFormHandler)
+	pm.ginEngine.POST("/v1/images/generations", pm.apiKeyAuth(), pm.trackInflight(), pm.proxyInferenceHandler)
+	pm.ginEngine.POST("/v1/images/edits", pm.apiKeyAuth(), pm.trackInflight(), pm.proxyOAIPostFormHandler)

 	pm.ginEngine.GET("/v1/models", pm.apiKeyAuth(), pm.listModelsHandler)

@@ -325,7 +363,7 @@ func (pm *ProxyManager) setupGinEngine() {
 	pm.ginEngine.GET("/upstream", func(c *gin.Context) {
 		c.Redirect(http.StatusFound, "/ui/models")
 	})
-	pm.ginEngine.Any("/upstream/*upstreamPath", pm.apiKeyAuth(), pm.proxyToUpstream)
+	pm.ginEngine.Any("/upstream/*upstreamPath", pm.apiKeyAuth(), pm.trackInflight(), pm.proxyToUpstream)
 	pm.ginEngine.GET("/unload", pm.apiKeyAuth(), pm.unloadAllModelsHandler)
 	pm.ginEngine.GET("/running", pm.apiKeyAuth(), pm.listRunningProcessesHandler)
 	pm.ginEngine.GET("/health", func(c *gin.Context) {
@@ -389,6 +427,14 @@ func (pm *ProxyManager) setupGinEngine() {
 	gin.DisableConsoleColor()
 }

+func (pm *ProxyManager) trackInflight() gin.HandlerFunc {
+	return func(c *gin.Context) {
+		event.Emit(InFlightRequestsEvent{Total: pm.inFlightCounter.Increment()})
+		defer event.Emit(InFlightRequestsEvent{Total: pm.inFlightCounter.Decrement()})
+		c.Next()
+	}
+}
+
 // ServeHTTP implements http.Handler interface
 func (pm *ProxyManager) ServeHTTP(w http.ResponseWriter, r *http.Request) {
 	pm.ginEngine.ServeHTTP(w, r)
@@ -6,6 +6,7 @@ import (
 	"fmt"
 	"net/http"
 	"sort"
+	"strconv"
 	"strings"

 	"github.com/gin-gonic/gin"
@@ -31,6 +32,7 @@ func addApiHandlers(pm *ProxyManager) {
 		apiGroup.GET("/events", pm.apiSendEvents)
 		apiGroup.GET("/metrics", pm.apiGetMetrics)
 		apiGroup.GET("/version", pm.apiGetVersion)
+		apiGroup.GET("/captures/:id", pm.apiGetCapture)
 	}
 }

@@ -105,6 +107,7 @@ const (
 	msgTypeModelStatus messageType = "modelStatus"
 	msgTypeLogData     messageType = "logData"
 	msgTypeMetrics     messageType = "metrics"
+	msgTypeInFlight    messageType = "inflight"
 )

 type messageEnvelope struct {
@@ -164,6 +167,18 @@ func (pm *ProxyManager) apiSendEvents(c *gin.Context) {
 		}
 	}

+	sendInFlight := func(total int) {
+		jsonData, err := json.Marshal(gin.H{"total": total})
+		if err == nil {
+			select {
+			case sendBuffer <- messageEnvelope{Type: msgTypeInFlight, Data: string(jsonData)}:
+			case <-ctx.Done():
+				return
+			default:
+			}
+		}
+	}
+
 	/**
 	 * Send updated models list
 	 */
@@ -191,11 +206,19 @@ func (pm *ProxyManager) apiSendEvents(c *gin.Context) {
 		sendMetrics([]TokenMetrics{e.Metrics})
 	})()

+	/**
+	 * Send in-flight request stats related to token stats "Waiting: N" count.
+	 */
+	defer event.On(func(e InFlightRequestsEvent) {
+		sendInFlight(e.Total)
+	})()
+
 	// send initial batch of data
 	sendLogData("proxy", pm.proxyLogger.GetHistory())
 	sendLogData("upstream", pm.upstreamLogger.GetHistory())
 	sendModels()
 	sendMetrics(pm.metricsMonitor.getMetrics())
+	sendInFlight(pm.inFlightCounter.Current())

 	for {
 		select {
@@ -250,3 +273,20 @@ func (pm *ProxyManager) apiGetVersion(c *gin.Context) {
 		"build_date": pm.buildDate,
 	})
 }
+
+func (pm *ProxyManager) apiGetCapture(c *gin.Context) {
+	idStr := c.Param("id")
+	id, err := strconv.Atoi(idStr)
+	if err != nil {
+		c.JSON(http.StatusBadRequest, gin.H{"error": "invalid capture ID"})
+		return
+	}
+
+	capture := pm.metricsMonitor.getCaptureByID(id)
+	if capture == nil {
+		c.JSON(http.StatusNotFound, gin.H{"error": "capture not found"})
+		return
+	}
+
+	c.JSON(http.StatusOK, capture)
+}
@@ -925,6 +925,7 @@
      "integrity": "sha512-Y1Cs7hhTc+a5E9Va/xwKlAJoariQyHY+5zBgCZg4PFWNYQ1nMN9sjK1zhw1gK69DuqVP++sht/1GZg1aRwmAXQ==",
      "dev": true,
      "license": "MIT",
+      "peer": true,
      "dependencies": {
        "@sveltejs/vite-plugin-svelte-inspector": "^4.0.1",
        "debug": "^4.4.1",
@@ -1307,6 +1308,7 @@
      "integrity": "sha512-t7frlewr6+cbx+9Ohpl0NOTKXZNV9xHRmNOvql47BFJKcEG1CxtxlPEEe+gR9uhVWM4DwhnvTF110mIL4yP9RA==",
      "dev": true,
      "license": "MIT",
+      "peer": true,
      "dependencies": {
        "undici-types": "~7.16.0"
      }
@@ -1439,6 +1441,7 @@
      "resolved": "https://registry.npmjs.org/acorn/-/acorn-8.15.0.tgz",
      "integrity": "sha512-NZyJarBfL7nWwIq+FDL6Zp/yHEhePMNnnJ0y3qfieCrmNvYct8uvtiV41UvlSe6apAfk0fY1FbWx+NwfmpvtTg==",
      "license": "MIT",
+      "peer": true,
      "bin": {
        "acorn": "bin/acorn"
      },
@@ -3449,6 +3452,7 @@
      "integrity": "sha512-e5lPJi/aui4TO1LpAXIRLySmwXSE8k3b9zoGfd42p67wzxog4WHjiZF3M2uheQih4DGyc25QEV4yRBbpueNiUA==",
      "dev": true,
      "license": "MIT",
+      "peer": true,
      "dependencies": {
        "@types/estree": "1.0.8"
      },
@@ -3561,6 +3565,7 @@
      "resolved": "https://registry.npmjs.org/svelte/-/svelte-5.48.5.tgz",
      "integrity": "sha512-NB3o70OxfmnE5UPyLr8uH3IV02Q43qJVAuWigYmsSOYsS0s/rHxP0TF81blG0onF/xkhNvZw4G8NfzIX+By5ZQ==",
      "license": "MIT",
+      "peer": true,
      "dependencies": {
        "@jridgewell/remapping": "^2.3.4",
        "@jridgewell/sourcemap-codec": "^1.5.0",
@@ -3716,6 +3721,7 @@
      "integrity": "sha512-p1diW6TqL9L07nNxvRMM7hMMw4c5XOo/1ibL4aAIGmSAt9slTE1Xgw5KWuof2uTOvCg9BY7ZRi+GaF+7sfgPeQ==",
      "dev": true,
      "license": "Apache-2.0",
+      "peer": true,
      "bin": {
        "tsc": "bin/tsc",
        "tsserver": "bin/tsserver"
@@ -3894,6 +3900,7 @@
      "integrity": "sha512-+Oxm7q9hDoLMyJOYfUYBuHQo+dkAloi33apOPP56pzj+vsdJDzr+j1NISE5pyaAuKL4A3UD34qd0lx5+kfKp2g==",
      "dev": true,
      "license": "MIT",
+      "peer": true,
      "dependencies": {
        "esbuild": "^0.25.0",
        "fdir": "^6.4.4",
@@ -6,23 +6,28 @@
  import Models from "./routes/Models.svelte";
  import Activity from "./routes/Activity.svelte";
  import Playground from "./routes/Playground.svelte";
+  import PlaygroundStub from "./routes/PlaygroundStub.svelte";
  import { enableAPIEvents } from "./stores/api";
  import { initScreenWidth, isDarkMode, appTitle, connectionState } from "./stores/theme";
+  import { currentRoute } from "./stores/route";

  const routes = {
-    "/": Playground,
+    "/": PlaygroundStub,
    "/models": Models,
    "/logs": LogViewer,
    "/activity": Activity,
-    "*": Playground,
+    "*": PlaygroundStub,
  };

-  // Sync theme to document attribute
+  function handleRouteLoaded(event: { detail: { route: string | RegExp } }) {
+    const route = event.detail.route;
+    currentRoute.set(typeof route === "string" ? route : "/");
+  }
+
  $effect(() => {
    document.documentElement.setAttribute("data-theme", $isDarkMode ? "dark" : "light");
  });

-  // Sync title to document
  $effect(() => {
    const icon = $connectionState === "connecting" ? "\u{1F7E1}" : $connectionState === "connected" ? "\u{1F7E2}" : "\u{1F534}";
    document.title = `${icon} ${$appTitle}`;
@@ -43,6 +48,11 @@
  <Header />

  <main class="flex-1 overflow-auto p-4">
-    <Router {routes} />
+    <div class="h-full" class:hidden={$currentRoute !== "/"}>
+      <Playground />
+    </div>
+    <div class="h-full" class:hidden={$currentRoute === "/"}>
+      <Router {routes} on:routeLoaded={handleRouteLoaded} />
+    </div>
  </main>
 </div>
@@ -0,0 +1,452 @@
+<script lang="ts">
+  import type { ReqRespCapture } from "../lib/types";
+
+  interface Props {
+    capture: ReqRespCapture | null;
+    open: boolean;
+    onclose: () => void;
+  }
+
+  let { capture, open, onclose }: Props = $props();
+
+  let dialogEl: HTMLDialogElement | undefined = $state();
+
+  type BodyTab = "raw" | "pretty" | "chat";
+  let reqBodyTab: BodyTab = $state("pretty");
+  let respBodyTab: BodyTab = $state("pretty");
+  let copiedReq = $state(false);
+  let copiedResp = $state(false);
+
+  $effect(() => {
+    if (open && dialogEl) {
+      dialogEl.showModal();
+    } else if (!open && dialogEl) {
+      dialogEl.close();
+    }
+  });
+
+  // Reset tabs when capture changes
+  $effect(() => {
+    if (capture) {
+      const reqCt = getContentType(capture.req_headers);
+      const respCt = getContentType(capture.resp_headers);
+      reqBodyTab = reqCt.includes("json") ? "pretty" : "raw";
+      respBodyTab = respCt.includes("text/event-stream")
+        ? "chat"
+        : respCt.includes("json")
+          ? "pretty"
+          : "raw";
+    }
+  });
+
+  function handleDialogClose() {
+    onclose();
+  }
+
+  function decodeBody(body: string | null | undefined): string {
+    if (!body) return "";
+    try {
+      const binary = atob(body);
+      const bytes = Uint8Array.from(binary, (c) => c.charCodeAt(0));
+      return new TextDecoder().decode(bytes);
+    } catch {
+      return body;
+    }
+  }
+
+  function formatJson(str: string): string {
+    try {
+      const parsed = JSON.parse(str);
+      return JSON.stringify(parsed, null, 2);
+    } catch {
+      return str;
+    }
+  }
+
+  function getContentType(
+    headers: Record<string, string> | null | undefined,
+  ): string {
+    if (!headers) return "";
+    const ct = headers["Content-Type"] || headers["content-type"] || "";
+    return ct.toLowerCase();
+  }
+
+  function isImageContentType(contentType: string): boolean {
+    return contentType.startsWith("image/");
+  }
+
+  function isTextContentType(contentType: string): boolean {
+    return (
+      contentType.startsWith("text/") ||
+      contentType.includes("application/json") ||
+      contentType.includes("application/xml") ||
+      contentType.includes("application/javascript")
+    );
+  }
+
+  function getImageDataUrl(body: string, contentType: string): string {
+    const mimeType = contentType.split(";")[0].trim();
+    return `data:${mimeType};base64,${body}`;
+  }
+
+  interface SSEChat {
+    reasoning: string;
+    content: string;
+  }
+
+  function parseSSEChat(text: string): SSEChat {
+    const result: SSEChat = { reasoning: "", content: "" };
+    for (const line of text.split("\n")) {
+      const trimmed = line.trim();
+      if (!trimmed || !trimmed.startsWith("data: ")) continue;
+      const data = trimmed.slice(6);
+      if (data === "[DONE]") continue;
+      try {
+        const parsed = JSON.parse(data);
+        const delta = parsed.choices?.[0]?.delta;
+        if (delta?.content) result.content += delta.content;
+        if (delta?.reasoning_content) result.reasoning += delta.reasoning_content;
+      } catch {
+        // skip unparseable lines
+      }
+    }
+    return result;
+  }
+
+  async function copyToClipboard(text: string, type: "req" | "resp") {
+    try {
+      await navigator.clipboard.writeText(text);
+      if (type === "req") {
+        copiedReq = true;
+        setTimeout(() => (copiedReq = false), 1500);
+      } else {
+        copiedResp = true;
+        setTimeout(() => (copiedResp = false), 1500);
+      }
+    } catch {
+      // ignore
+    }
+  }
+
+  function getCopyText(): string {
+    if (respBodyTab === "chat") {
+      let text = "";
+      if (sseChat.reasoning) text += sseChat.reasoning + "\n\n";
+      text += sseChat.content;
+      return text;
+    }
+    return displayedResponseBody;
+  }
+
+  // Request body derivations
+  let requestContentType = $derived(
+    capture ? getContentType(capture.req_headers) : "",
+  );
+  let isRequestJson = $derived(requestContentType.includes("json"));
+
+  let requestBodyRaw = $derived.by(() => {
+    if (!capture) return "";
+    return decodeBody(capture.req_body);
+  });
+
+  let requestBodyPretty = $derived.by(() => {
+    if (!isRequestJson) return requestBodyRaw;
+    return formatJson(requestBodyRaw);
+  });
+
+  let displayedRequestBody = $derived(
+    reqBodyTab === "pretty" ? requestBodyPretty : requestBodyRaw,
+  );
+
+  // Response body derivations
+  let responseContentType = $derived(
+    capture ? getContentType(capture.resp_headers) : "",
+  );
+  let isResponseImage = $derived(isImageContentType(responseContentType));
+  let isResponseText = $derived(isTextContentType(responseContentType));
+  let isResponseJson = $derived(responseContentType.includes("json"));
+  let isSSE = $derived(responseContentType.includes("text/event-stream"));
+
+  let responseBodyRaw = $derived.by(() => {
+    if (!capture) return "";
+    return decodeBody(capture.resp_body);
+  });
+
+  let responseBodyPretty = $derived.by(() => {
+    if (!isResponseJson) return responseBodyRaw;
+    return formatJson(responseBodyRaw);
+  });
+
+  let sseChat = $derived.by(() => {
+    if (!isSSE || !responseBodyRaw)
+      return { reasoning: "", content: "" } as SSEChat;
+    return parseSSEChat(responseBodyRaw);
+  });
+
+  let displayedResponseBody = $derived.by(() => {
+    if (respBodyTab === "pretty") return responseBodyPretty;
+    return responseBodyRaw;
+  });
+</script>
+
+<dialog
+  bind:this={dialogEl}
+  onclose={handleDialogClose}
+  class="bg-surface text-txtmain rounded-lg shadow-xl max-w-4xl w-full max-h-[90vh] p-0 backdrop:bg-black/50 m-auto"
+>
+  {#if capture}
+    <div class="flex flex-col max-h-[90vh]">
+      <div
+        class="flex justify-between items-center p-4 border-b border-card-border"
+      >
+        <h2 class="text-xl font-bold pb-0">Capture #{capture.id + 1}{#if capture.req_path} <span class="text-base font-mono font-normal text-txtsecondary">{capture.req_path}</span>{/if}</h2>
+        <button
+          onclick={() => dialogEl?.close()}
+          class="text-txtsecondary hover:text-txtmain text-2xl leading-none"
+        >
+          &times;
+        </button>
+      </div>
+
+      <div class="overflow-y-auto flex-1 p-4 space-y-4">
+        <!-- Request Headers -->
+        <details class="group" open>
+          <summary
+            class="cursor-pointer font-semibold text-sm uppercase tracking-wider text-txtsecondary hover:text-txtmain"
+          >
+            Request Headers
+          </summary>
+          <div
+            class="mt-2 bg-background rounded border border-card-border overflow-auto max-h-48"
+          >
+            <table class="w-full text-sm">
+              <tbody>
+                {#each Object.entries(capture.req_headers || {}) as [key, value]}
+                  <tr class="border-b border-card-border-inner last:border-0">
+                    <td class="px-3 py-1 font-mono text-primary whitespace-nowrap"
+                      >{key}</td
+                    >
+                    <td class="px-3 py-1 font-mono break-all">{value}</td>
+                  </tr>
+                {/each}
+              </tbody>
+            </table>
+          </div>
+        </details>
+
+        <!-- Request Body -->
+        <details class="group" open>
+          <summary
+            class="cursor-pointer font-semibold text-sm uppercase tracking-wider text-txtsecondary hover:text-txtmain"
+          >
+            Request Body
+          </summary>
+          {#if requestBodyRaw}
+            <div class="mt-2 flex items-center justify-between">
+              <div class="flex gap-1">
+                {#if isRequestJson}
+                  <button
+                    class="tab-btn"
+                    class:tab-btn-active={reqBodyTab === "pretty"}
+                    onclick={() => (reqBodyTab = "pretty")}>Pretty</button
+                  >
+                  <button
+                    class="tab-btn"
+                    class:tab-btn-active={reqBodyTab === "raw"}
+                    onclick={() => (reqBodyTab = "raw")}>Raw</button
+                  >
+                {/if}
+              </div>
+              <button
+                class="tab-btn"
+                onclick={() =>
+                  copyToClipboard(displayedRequestBody, "req")}
+              >
+                {#if copiedReq}
+                  Copied!
+                {:else}
+                  Copy
+                {/if}
+              </button>
+            </div>
+            <div
+              class="mt-1 bg-background rounded border border-card-border overflow-auto max-h-96"
+            >
+              <pre
+                class="p-3 text-sm font-mono whitespace-pre-wrap break-all">{displayedRequestBody}</pre>
+            </div>
+          {:else}
+            <div
+              class="mt-2 bg-background rounded border border-card-border overflow-auto max-h-96"
+            >
+              <pre class="p-3 text-sm font-mono whitespace-pre-wrap break-all"
+                >(empty)</pre
+              >
+            </div>
+          {/if}
+        </details>
+
+        <!-- Response Headers -->
+        <details class="group" open>
+          <summary
+            class="cursor-pointer font-semibold text-sm uppercase tracking-wider text-txtsecondary hover:text-txtmain"
+          >
+            Response Headers
+          </summary>
+          <div
+            class="mt-2 bg-background rounded border border-card-border overflow-auto max-h-48"
+          >
+            <table class="w-full text-sm">
+              <tbody>
+                {#each Object.entries(capture.resp_headers || {}) as [key, value]}
+                  <tr class="border-b border-card-border-inner last:border-0">
+                    <td class="px-3 py-1 font-mono text-primary whitespace-nowrap"
+                      >{key}</td
+                    >
+                    <td class="px-3 py-1 font-mono break-all">{value}</td>
+                  </tr>
+                {/each}
+              </tbody>
+            </table>
+          </div>
+        </details>
+
+        <!-- Response Body -->
+        <details class="group" open>
+          <summary
+            class="cursor-pointer font-semibold text-sm uppercase tracking-wider text-txtsecondary hover:text-txtmain"
+          >
+            Response Body
+          </summary>
+          {#if isResponseImage && capture.resp_body}
+            <div
+              class="mt-2 bg-background rounded border border-card-border overflow-auto max-h-96"
+            >
+              <div class="p-3 flex justify-center">
+                <img
+                  src={getImageDataUrl(capture.resp_body, responseContentType)}
+                  alt="Response"
+                  class="max-w-full h-auto"
+                />
+              </div>
+            </div>
+          {:else if isSSE || isResponseText}
+            <div class="mt-2 flex items-center justify-between">
+              <div class="flex gap-1">
+                {#if isSSE}
+                  <button
+                    class="tab-btn"
+                    class:tab-btn-active={respBodyTab === "chat"}
+                    onclick={() => (respBodyTab = "chat")}>Chat</button
+                  >
+                {/if}
+                {#if isResponseJson}
+                  <button
+                    class="tab-btn"
+                    class:tab-btn-active={respBodyTab === "pretty"}
+                    onclick={() => (respBodyTab = "pretty")}>Pretty</button
+                  >
+                {/if}
+                {#if isSSE || isResponseJson}
+                  <button
+                    class="tab-btn"
+                    class:tab-btn-active={respBodyTab === "raw"}
+                    onclick={() => (respBodyTab = "raw")}>Raw</button
+                  >
+                {/if}
+              </div>
+              <button
+                class="tab-btn"
+                onclick={() => copyToClipboard(getCopyText(), "resp")}
+              >
+                {#if copiedResp}
+                  Copied!
+                {:else}
+                  Copy
+                {/if}
+              </button>
+            </div>
+            <div
+              class="mt-1 bg-background rounded border border-card-border overflow-auto max-h-96"
+            >
+              {#if respBodyTab === "chat"}
+                <div class="p-3 text-sm space-y-3">
+                  {#if sseChat.reasoning}
+                    <div>
+                      <div
+                        class="text-xs font-semibold uppercase tracking-wider text-txtsecondary mb-1"
+                      >
+                        Reasoning
+                      </div>
+                      <pre
+                        class="font-mono whitespace-pre-wrap break-all text-txtsecondary">{sseChat.reasoning}</pre>
+                    </div>
+                  {/if}
+                  {#if sseChat.content}
+                    <div>
+                      {#if sseChat.reasoning}
+                        <div
+                          class="text-xs font-semibold uppercase tracking-wider text-txtsecondary mb-1"
+                        >
+                          Response
+                        </div>
+                      {/if}
+                      <pre
+                        class="font-mono whitespace-pre-wrap break-all">{sseChat.content}</pre>
+                    </div>
+                  {/if}
+                  {#if !sseChat.reasoning && !sseChat.content}
+                    <pre class="font-mono">(empty)</pre>
+                  {/if}
+                </div>
+              {:else}
+                <pre
+                  class="p-3 text-sm font-mono whitespace-pre-wrap break-all">{displayedResponseBody || "(empty)"}</pre>
+              {/if}
+            </div>
+          {:else if responseBodyRaw}
+            <div
+              class="mt-2 bg-background rounded border border-card-border overflow-auto max-h-96"
+            >
+              <div class="p-3 text-sm text-txtsecondary italic">
+                (binary data - {responseContentType || "unknown content type"})
+              </div>
+            </div>
+          {:else}
+            <div
+              class="mt-2 bg-background rounded border border-card-border overflow-auto max-h-96"
+            >
+              <pre class="p-3 text-sm font-mono">(empty)</pre>
+            </div>
+          {/if}
+        </details>
+      </div>
+
+      <div class="p-4 border-t border-card-border flex justify-end">
+        <button onclick={() => dialogEl?.close()} class="btn"> Close </button>
+      </div>
+    </div>
+  {/if}
+</dialog>
+
+<style>
+  .tab-btn {
+    padding: 2px 10px;
+    font-size: 0.75rem;
+    border-radius: 4px;
+    color: var(--color-txtsecondary);
+    cursor: pointer;
+    border: 1px solid transparent;
+    background: transparent;
+    transition: all 0.15s;
+  }
+  .tab-btn:hover {
+    color: var(--color-txtmain);
+    background: var(--color-secondary);
+  }
+  .tab-btn-active {
+    color: var(--color-primary);
+    background: color-mix(in srgb, var(--color-primary) 12%, transparent);
+    border-color: color-mix(in srgb, var(--color-primary) 25%, transparent);
+  }
+</style>
@@ -1,6 +1,8 @@
 <script lang="ts">
-  import { link, location } from "svelte-spa-router";
+  import { link } from "svelte-spa-router";
  import { screenWidth, toggleTheme, isDarkMode, appTitle, isNarrow } from "../stores/theme";
+  import { currentRoute } from "../stores/route";
+  import { playgroundActivity } from "../stores/playgroundActivity";
  import ConnectionStatus from "./ConnectionStatus.svelte";

  function handleTitleChange(newTitle: string): void {
@@ -22,9 +24,10 @@
    handleTitleChange(target.textContent || "(set title)");
  }

-  function isActive(path: string, currentLocation: string): boolean {
-    return path === "/" ? currentLocation === "/" : currentLocation.startsWith(path);
+  function isActive(path: string, current: string): boolean {
+    return path === "/" ? current === "/" : current.startsWith(path);
  }
+
 </script>

 <header
@@ -47,8 +50,7 @@
    <a
      href="/"
      use:link
-      class="text-gray-600 hover:text-black dark:text-gray-300 dark:hover:text-gray-100 p-1 whitespace-nowrap"
-      class:font-semibold={isActive("/", $location)}
+      class="p-1 whitespace-nowrap {isActive('/', $currentRoute) ? 'font-semibold' : ''} {$playgroundActivity ? 'activity-link' : 'text-gray-600 hover:text-black dark:text-gray-300 dark:hover:text-gray-100'}"
    >
      Playground
    </a>
@@ -56,7 +58,7 @@
      href="/models"
      use:link
      class="text-gray-600 hover:text-black dark:text-gray-300 dark:hover:text-gray-100 p-1 whitespace-nowrap"
-      class:font-semibold={isActive("/models", $location)}
+      class:font-semibold={isActive("/models", $currentRoute)}
    >
      Models
    </a>
@@ -64,7 +66,7 @@
      href="/activity"
      use:link
      class="text-gray-600 hover:text-black dark:text-gray-300 dark:hover:text-gray-100 p-1 whitespace-nowrap"
-      class:font-semibold={isActive("/activity", $location)}
+      class:font-semibold={isActive("/activity", $currentRoute)}
    >
      Activity
    </a>
@@ -72,7 +74,7 @@
      href="/logs"
      use:link
      class="text-gray-600 hover:text-black dark:text-gray-300 dark:hover:text-gray-100 p-1 whitespace-nowrap"
-      class:font-semibold={isActive("/logs", $location)}
+      class:font-semibold={isActive("/logs", $currentRoute)}
    >
      Logs
    </a>
@@ -96,3 +98,23 @@
    <ConnectionStatus />
  </menu>
 </header>
+
+<style>
+  .activity-link {
+    background: linear-gradient(90deg, #6366f1, #8b5cf6, #a855f7, #8b5cf6, #6366f1);
+    background-size: 200% 100%;
+    -webkit-background-clip: text;
+    background-clip: text;
+    -webkit-text-fill-color: transparent;
+    animation: gradient-shift 2s linear infinite;
+  }
+
+  @keyframes gradient-shift {
+    0% {
+      background-position: 0% 50%;
+    }
+    100% {
+      background-position: 200% 50%;
+    }
+  }
+</style>
@@ -1,5 +1,5 @@
 <script lang="ts">
-  import { metrics } from "../stores/api";
+  import { inFlightRequests, metrics } from "../stores/api";
  import TokenHistogram from "./TokenHistogram.svelte";

  interface HistogramData {
@@ -15,7 +15,14 @@
  let stats = $derived.by(() => {
    const totalRequests = $metrics.length;
    if (totalRequests === 0) {
-      return { totalRequests: 0, totalInputTokens: 0, totalOutputTokens: 0, tokenStats: { p99: "0", p95: "0", p50: "0" }, histogramData: null };
+      return {
+        totalRequests: 0,
+        totalInputTokens: 0,
+        totalOutputTokens: 0,
+        inFlightRequests: $inFlightRequests,
+        tokenStats: { p99: "0", p95: "0", p50: "0" },
+        histogramData: null,
+      };
    }

    const totalInputTokens = $metrics.reduce((sum, m) => sum + m.input_tokens, 0);
@@ -24,7 +31,14 @@
    // Calculate token statistics using output_tokens and duration_ms
    const validMetrics = $metrics.filter((m) => m.duration_ms > 0 && m.output_tokens > 0);
    if (validMetrics.length === 0) {
-      return { totalRequests, totalInputTokens, totalOutputTokens, tokenStats: { p99: "0", p95: "0", p50: "0" }, histogramData: null };
+      return {
+        totalRequests,
+        totalInputTokens,
+        totalOutputTokens,
+        inFlightRequests: $inFlightRequests,
+        tokenStats: { p99: "0", p95: "0", p50: "0" },
+        histogramData: null,
+      };
    }

    // Calculate tokens/second for each valid metric
@@ -63,6 +77,7 @@
      totalRequests,
      totalInputTokens,
      totalOutputTokens,
+      inFlightRequests: $inFlightRequests,
      tokenStats: {
        p99: p99.toFixed(2),
        p95: p95.toFixed(2),
@@ -95,7 +110,12 @@

      <tbody class="bg-surface divide-y divide-card-border-inner">
        <tr class="hover:bg-secondary">
-          <td class="px-4 py-4 text-sm font-semibold text-gray-900 dark:text-white">{stats.totalRequests}</td>
+          <td class="px-4 py-4 text-sm font-semibold text-gray-900 dark:text-white">
+            <div class="flex flex-col gap-1">
+              <span class="text-xs font-medium text-gray-500 dark:text-gray-400">Completed: {nf.format(stats.totalRequests)}</span>
+              <span class="text-xs font-medium text-gray-500 dark:text-gray-400">Waiting: {nf.format(stats.inFlightRequests)}</span>
+            </div>
+          </td>

          <td class="px-4 py-4 text-sm text-gray-700 dark:text-gray-300 border-l border-gray-200 dark:border-white/10">
            <div class="flex items-center gap-2">
@@ -2,6 +2,7 @@
  import { models } from "../../stores/api";
  import { persistentStore } from "../../stores/persistent";
  import { transcribeAudio } from "../../lib/audioApi";
+  import { playgroundStores } from "../../stores/playgroundActivity";
  import ModelSelector from "./ModelSelector.svelte";

  const selectedModelStore = persistentStore<string>("playground-audio-model", "");
@@ -22,6 +23,10 @@

  let canTranscribe = $derived(selectedFile !== null && $selectedModelStore !== "" && !isTranscribing);

+  $effect(() => {
+    playgroundStores.audioTranscribing.set(isTranscribing);
+  });
+
  function validateFile(file: File): { valid: boolean; error?: string } {
    const ext = '.' + file.name.split('.').pop()?.toLowerCase();

@@ -2,6 +2,7 @@
  import { models } from "../../stores/api";
  import { persistentStore } from "../../stores/persistent";
  import { streamChatCompletion } from "../../lib/chatApi";
+  import { playgroundStores } from "../../stores/playgroundActivity";
  import type { ChatMessage, ContentPart } from "../../lib/types";
  import ChatMessageComponent from "./ChatMessage.svelte";
  import ModelSelector from "./ModelSelector.svelte";
@@ -11,7 +12,16 @@
  const systemPromptStore = persistentStore<string>("playground-system-prompt", "");
  const temperatureStore = persistentStore<number>("playground-temperature", 0.7);

-  let messages = $state<ChatMessage[]>([]);
+  function loadMessages(): ChatMessage[] {
+    try {
+      const saved = localStorage.getItem("playground-messages");
+      return saved ? JSON.parse(saved) : [];
+    } catch {
+      return [];
+    }
+  }
+
+  let messages = $state<ChatMessage[]>(loadMessages());
  let userInput = $state("");
  let isStreaming = $state(false);
  let isReasoning = $state(false);
@@ -24,21 +34,52 @@
  let imageError = $state<string | null>(null);

  let hasModels = $derived($models.some((m) => !m.unlisted));
+  let userScrolledUp = $state(false);

-  // Auto-scroll when messages change
  $effect(() => {
-    if (messages.length > 0 && messagesContainer) {
+    playgroundStores.chatStreaming.set(isStreaming);
+  });
+
+  function handleMessagesScroll() {
+    if (!messagesContainer) return;
+    const { scrollTop, scrollHeight, clientHeight } = messagesContainer;
+    // Consider "at bottom" if within 40px of the bottom
+    userScrolledUp = scrollHeight - scrollTop - clientHeight > 40;
+  }
+
+  // Auto-scroll when messages change — skip if user scrolled up
+  $effect(() => {
+    if (messages.length > 0 && messagesContainer && !userScrolledUp) {
      messagesContainer.scrollTo({
        top: messagesContainer.scrollHeight,
-        behavior: "smooth",
+        behavior: isStreaming ? "instant" : "smooth",
      });
    }
  });

+  // Persist messages to localStorage (throttled to once per 2s)
+  let lastSaveTime = 0;
+  $effect(() => {
+    const json = JSON.stringify(messages);
+    const elapsed = Date.now() - lastSaveTime;
+    const save = () => {
+      try { localStorage.setItem("playground-messages", json); } catch {}
+      lastSaveTime = Date.now();
+    };
+    if (elapsed >= 2000) {
+      save();
+      return;
+    }
+    const timer = setTimeout(save, 2000 - elapsed);
+    return () => clearTimeout(timer);
+  });
+
  async function sendMessage() {
    const trimmedInput = userInput.trim();
    if ((!trimmedInput && attachedImages.length === 0) || !$selectedModelStore || isStreaming) return;

+    userScrolledUp = false;
+
    // Build message content (multimodal if images attached)
    let content: string | ContentPart[];
    if (attachedImages.length > 0) {
@@ -321,6 +362,7 @@
    <div
      class="flex-1 overflow-y-auto mb-4 px-2"
      bind:this={messagesContainer}
+      onscroll={handleMessagesScroll}
    >
      {#if messages.length === 0}
        <div class="h-full flex items-center justify-center text-txtsecondary">
@@ -1,5 +1,6 @@
 <script lang="ts">
-  import { renderMarkdown, escapeHtml } from "../../lib/markdown";
+  import { renderMarkdown, escapeHtml, renderStreamingMarkdown, createStreamingCache } from "../../lib/markdown";
+  import type { RenderedBlock } from "../../lib/markdown";
  import { Copy, Check, Pencil, X, Save, RefreshCw, ChevronDown, ChevronRight, Brain, Code } from "lucide-svelte";
  import { getTextContent, getImageUrls } from "../../lib/types";
  import type { ContentPart } from "../../lib/types";
@@ -22,11 +23,17 @@
  let hasImages = $derived(imageUrls.length > 0);
  let canEdit = $derived(onEdit !== undefined && !hasImages);

-  let renderedContent = $derived(
-    role === "assistant" && !isStreaming
-      ? renderMarkdown(textContent)
-      : escapeHtml(textContent).replace(/\n/g, '<br>')
-  );
+  let streamingCache = createStreamingCache();
+  let renderedParts = $derived.by(() => {
+    if (role !== "assistant") {
+      return { blocks: [{ id: -1, html: escapeHtml(textContent).replace(/\n/g, '<br>') }] as RenderedBlock[], pendingHtml: "" };
+    }
+    if (!isStreaming) {
+      streamingCache = createStreamingCache();
+      return { blocks: [{ id: -1, html: renderMarkdown(textContent) }] as RenderedBlock[], pendingHtml: "" };
+    }
+    return renderStreamingMarkdown(textContent, streamingCache);
+  });
  let copied = $state(false);
  let showRaw = $state(false);
  let isEditing = $state(false);
@@ -113,9 +120,9 @@

 <div class="flex {role === 'user' ? 'justify-end' : 'justify-start'} mb-4">
  <div
-    class="relative group max-w-[85%] rounded-lg px-4 py-2 {role === 'user'
-      ? 'bg-primary text-btn-primary-text'
-      : 'bg-surface border border-gray-200 dark:border-white/10'}"
+    class="relative group rounded-lg px-4 py-2 {role === 'user'
+      ? 'max-w-[85%] bg-primary text-btn-primary-text'
+      : 'w-full sm:w-4/5 bg-surface border border-gray-200 dark:border-white/10'}"
  >
    {#if role === "assistant"}
      {#if reasoning_content || isReasoning}
@@ -168,7 +175,10 @@
        <div class="whitespace-pre-wrap font-mono text-sm">{textContent}</div>
      {:else}
        <div class="prose prose-sm dark:prose-invert max-w-none">
-          {@html renderedContent}
+          {#each renderedParts.blocks as block (block.id)}
+            {@html block.html}
+          {/each}
+          {@html renderedParts.pendingHtml}
          {#if isStreaming && !isReasoning}
            <span class="inline-block w-2 h-4 bg-current animate-pulse ml-0.5"></span>
          {/if}
@@ -2,6 +2,7 @@
  import { models } from "../../stores/api";
  import { persistentStore } from "../../stores/persistent";
  import { generateImage } from "../../lib/imageApi";
+  import { playgroundStores } from "../../stores/playgroundActivity";
  import ModelSelector from "./ModelSelector.svelte";
  import ExpandableTextarea from "./ExpandableTextarea.svelte";

@@ -17,6 +18,10 @@

  let hasModels = $derived($models.some((m) => !m.unlisted));

+  $effect(() => {
+    playgroundStores.imageGenerating.set(isGenerating);
+  });
+
  async function generate() {
    const trimmedPrompt = prompt.trim();
    if (!trimmedPrompt || !$selectedModelStore || isGenerating) return;
@@ -2,6 +2,7 @@
  import { models } from "../../stores/api";
  import { persistentStore } from "../../stores/persistent";
  import { generateSpeech } from "../../lib/speechApi";
+  import { playgroundStores } from "../../stores/playgroundActivity";
  import ModelSelector from "./ModelSelector.svelte";
  import ExpandableTextarea from "./ExpandableTextarea.svelte";

@@ -20,11 +21,9 @@
  let availableVoices = $state<string[]>(["coral", "alloy", "echo", "fable", "onyx", "nova", "shimmer"]);
  let isLoadingVoices = $state(false);

-  // Default voices to fall back to if API call fails
  const defaultVoices = ["coral", "alloy", "echo", "fable", "onyx", "nova", "shimmer"];
  const CACHE_KEY = "playground-speech-voices-cache";

-  // Load voices cache from localStorage
  function getVoicesCache(): Record<string, string[]> {
    if (typeof window === "undefined") return {};
    try {
@@ -35,7 +34,6 @@
    }
  }

-  // Save voices cache to localStorage
  function saveVoicesCache(cache: Record<string, string[]>) {
    if (typeof window === "undefined") return;
    try {
@@ -47,9 +45,12 @@

  let hasModels = $derived($models.some((m) => !m.unlisted));

-  // Track if this is the initial page load to avoid fetching on refresh
  let isInitialLoad = $state(true);

+  $effect(() => {
+    playgroundStores.speechGenerating.set(isGenerating);
+  });
+
  // On page load, restore cached voices for the selected model if available
  $effect(() => {
    const model = $selectedModelStore;
@@ -1,5 +1,5 @@
 import { describe, it, expect } from "vitest";
-import { renderMarkdown, escapeHtml } from "./markdown";
+import { renderMarkdown, escapeHtml, splitCompleteBlocks, closePendingBlock, normalizeLatexDelimiters, renderStreamingMarkdown, createStreamingCache } from "./markdown";

 describe("renderMarkdown", () => {
  describe("basic markdown", () => {
@@ -130,6 +130,35 @@ More text here.
      expect(result).toContain("katex");
      expect(result).toContain("sqrt");
    });
+
+    it("renders \\[...\\] display math", () => {
+      const result = renderMarkdown("\\[\nx^2 + y^2 = z^2\n\\]");
+      expect(result).toContain("katex");
+    });
+
+    it("renders \\(...\\) inline math", () => {
+      const result = renderMarkdown("The equation \\(E = mc^2\\) is famous.");
+      expect(result).toContain("katex");
+    });
+  });
+
+  describe("normalizeLatexDelimiters", () => {
+    it("converts \\[...\\] to $$...$$", () => {
+      expect(normalizeLatexDelimiters("\\[\nx^2\n\\]")).toBe("$$\nx^2\n$$");
+    });
+
+    it("converts \\(...\\) to $...$", () => {
+      expect(normalizeLatexDelimiters("\\(x^2\\)")).toBe("$x^2$");
+    });
+
+    it("leaves $$ and $ delimiters unchanged", () => {
+      const text = "$$x^2$$ and $y$";
+      expect(normalizeLatexDelimiters(text)).toBe(text);
+    });
+
+    it("handles multiple occurrences", () => {
+      expect(normalizeLatexDelimiters("\\(a\\) and \\(b\\)")).toBe("$a$ and $b$");
+    });
  });

  describe("escapeHtml", () => {
@@ -158,3 +187,237 @@ More text here.
    });
  });
 });
+
+describe("splitCompleteBlocks", () => {
+  it("returns everything as pending when no blank line", () => {
+    const result = splitCompleteBlocks("Hello world");
+    expect(result.complete).toBe("");
+    expect(result.pending).toBe("Hello world");
+  });
+
+  it("returns empty for empty input", () => {
+    const result = splitCompleteBlocks("");
+    expect(result.complete).toBe("");
+    expect(result.pending).toBe("");
+  });
+
+  it("splits on blank line between paragraphs", () => {
+    const result = splitCompleteBlocks("First paragraph.\n\nSecond paragraph");
+    expect(result.complete).toBe("First paragraph.\n");
+    expect(result.pending).toBe("Second paragraph");
+  });
+
+  it("splits multiple paragraphs at last blank line", () => {
+    const result = splitCompleteBlocks("Para 1.\n\nPara 2.\n\nPara 3");
+    expect(result.complete).toBe("Para 1.\n\nPara 2.\n");
+    expect(result.pending).toBe("Para 3");
+  });
+
+  it("treats closed code fence as complete boundary", () => {
+    const text = "```js\nconst x = 1;\n```\nMore text";
+    const result = splitCompleteBlocks(text);
+    expect(result.complete).toBe("```js\nconst x = 1;\n```");
+    expect(result.pending).toBe("More text");
+  });
+
+  it("treats unclosed code fence as pending", () => {
+    const text = "Done paragraph.\n\n```js\nconst x = 1;";
+    const result = splitCompleteBlocks(text);
+    expect(result.complete).toBe("Done paragraph.\n");
+    expect(result.pending).toBe("```js\nconst x = 1;");
+  });
+
+  it("does not split on blank lines inside code fences", () => {
+    const text = "```\nline1\n\nline2\n```";
+    const result = splitCompleteBlocks(text);
+    expect(result.complete).toBe("```\nline1\n\nline2\n```");
+    expect(result.pending).toBe("");
+  });
+
+  it("handles tilde fences", () => {
+    const text = "~~~py\nprint('hi')\n~~~\nAfter";
+    const result = splitCompleteBlocks(text);
+    expect(result.complete).toBe("~~~py\nprint('hi')\n~~~");
+    expect(result.pending).toBe("After");
+  });
+
+  it("does not close backtick fence with tilde fence", () => {
+    const text = "```\ncode\n~~~\nstill code";
+    const result = splitCompleteBlocks(text);
+    // The ~~~ should not close a backtick fence, so everything from ``` onward is pending
+    expect(result.complete).toBe("");
+    expect(result.pending).toBe("```\ncode\n~~~\nstill code");
+  });
+
+  it("treats closed math block as complete boundary", () => {
+    const text = "$$\nx^2\n$$\nAfter";
+    const result = splitCompleteBlocks(text);
+    expect(result.complete).toBe("$$\nx^2\n$$");
+    expect(result.pending).toBe("After");
+  });
+
+  it("treats unclosed math block as pending", () => {
+    const text = "Before.\n\n$$\nx^2";
+    const result = splitCompleteBlocks(text);
+    expect(result.complete).toBe("Before.\n");
+    expect(result.pending).toBe("$$\nx^2");
+  });
+
+  it("treats closed \\[...\\] math block as complete boundary", () => {
+    const text = "\\[\nx^2\n\\]\nAfter";
+    const result = splitCompleteBlocks(text);
+    expect(result.complete).toBe("\\[\nx^2\n\\]");
+    expect(result.pending).toBe("After");
+  });
+
+  it("treats unclosed \\[ math block as pending", () => {
+    const text = "Before.\n\n\\[\nx^2";
+    const result = splitCompleteBlocks(text);
+    expect(result.complete).toBe("Before.\n");
+    expect(result.pending).toBe("\\[\nx^2");
+  });
+
+  it("handles trailing blank line making everything complete", () => {
+    const text = "Hello world.\n";
+    const result = splitCompleteBlocks(text);
+    // Last line is empty string after split, which is a blank line
+    expect(result.complete).toBe("Hello world.\n");
+    expect(result.pending).toBe("");
+  });
+});
+
+describe("closePendingBlock", () => {
+  it("returns empty string for empty input", () => {
+    expect(closePendingBlock("")).toBe("");
+  });
+
+  it("returns plain text unchanged", () => {
+    expect(closePendingBlock("Hello world")).toBe("Hello world");
+  });
+
+  it("closes an open backtick code fence", () => {
+    const result = closePendingBlock("```python\nprint('hi')");
+    expect(result).toBe("```python\nprint('hi')\n```");
+  });
+
+  it("closes an open tilde code fence", () => {
+    const result = closePendingBlock("~~~js\nconst x = 1;");
+    expect(result).toBe("~~~js\nconst x = 1;\n~~~");
+  });
+
+  it("does not modify already-closed code fence", () => {
+    const text = "```py\ncode\n```";
+    expect(closePendingBlock(text)).toBe(text);
+  });
+
+  it("closes an open math block", () => {
+    const result = closePendingBlock("$$\nx^2 + y^2");
+    expect(result).toBe("$$\nx^2 + y^2\n$$");
+  });
+
+  it("does not modify already-closed math block", () => {
+    const text = "$$\nx^2\n$$";
+    expect(closePendingBlock(text)).toBe(text);
+  });
+
+  it("closes an open \\[ math block with \\]", () => {
+    const result = closePendingBlock("\\[\nx^2 + y^2");
+    expect(result).toBe("\\[\nx^2 + y^2\n\\]");
+  });
+
+  it("does not modify already-closed \\[...\\] math block", () => {
+    const text = "\\[\nx^2\n\\]";
+    expect(closePendingBlock(text)).toBe(text);
+  });
+
+  it("closes code fence when preceded by regular text", () => {
+    const result = closePendingBlock("Some text\n```\ncode");
+    expect(result).toBe("Some text\n```\ncode\n```");
+  });
+
+  it("leaves headers unchanged", () => {
+    expect(closePendingBlock("## Hello")).toBe("## Hello");
+  });
+
+  it("leaves tables unchanged", () => {
+    const table = "| a | b |\n| --- | --- |\n| 1 | 2 |";
+    expect(closePendingBlock(table)).toBe(table);
+  });
+
+  it("leaves lists unchanged", () => {
+    expect(closePendingBlock("- item 1\n- item 2")).toBe("- item 1\n- item 2");
+  });
+});
+
+describe("renderStreamingMarkdown", () => {
+  it("renders complete blocks and pending as markdown", () => {
+    const cache = createStreamingCache();
+    const text = "# Hello\n\nWorld";
+    const { blocks, pendingHtml } = renderStreamingMarkdown(text, cache);
+    expect(blocks).toHaveLength(1);
+    expect(blocks[0].html).toContain("<h1>Hello</h1>");
+    expect(pendingHtml).toContain("World");
+    expect(pendingHtml).toContain("<p>");
+  });
+
+  it("preserves existing blocks when complete portion is unchanged", () => {
+    const cache = createStreamingCache();
+    renderStreamingMarkdown("# Hello\n\nWor", cache);
+    const firstBlocks = cache.blocks;
+
+    const { blocks } = renderStreamingMarkdown("# Hello\n\nWorld", cache);
+    // Same block array reference — nothing changed in the complete section
+    expect(blocks).toBe(firstBlocks);
+    expect(cache.completeKey).toBe("# Hello\n");
+  });
+
+  it("appends a new block when a new section completes", () => {
+    const cache = createStreamingCache();
+    renderStreamingMarkdown("# Hello\n\nParagraph", cache);
+    expect(cache.blocks).toHaveLength(1);
+    const firstBlock = cache.blocks[0];
+
+    renderStreamingMarkdown("# Hello\n\nParagraph.\n\nMore", cache);
+    expect(cache.blocks).toHaveLength(2);
+    // First block is preserved with the same id and html
+    expect(cache.blocks[0].id).toBe(firstBlock.id);
+    expect(cache.blocks[0].html).toBe(firstBlock.html);
+    // Second block contains the new paragraph
+    expect(cache.blocks[1].html).toContain("Paragraph.");
+  });
+
+  it("assigns unique stable ids to each block", () => {
+    const cache = createStreamingCache();
+    renderStreamingMarkdown("A.\n\nB.\n\nC", cache);
+    expect(cache.blocks).toHaveLength(1);
+    const id0 = cache.blocks[0].id;
+
+    renderStreamingMarkdown("A.\n\nB.\n\nC.\n\nD", cache);
+    expect(cache.blocks).toHaveLength(2);
+    expect(cache.blocks[0].id).toBe(id0);
+    expect(cache.blocks[1].id).toBe(id0 + 1);
+  });
+
+  it("renders pending code block with syntax highlighting", () => {
+    const cache = createStreamingCache();
+    const text = "Done.\n\n```python\nprint('hello')";
+    const { pendingHtml } = renderStreamingMarkdown(text, cache);
+    expect(pendingHtml).toContain("<code");
+    expect(pendingHtml).toContain("hljs");
+  });
+
+  it("renders pending table as markdown", () => {
+    const cache = createStreamingCache();
+    const text = "Done.\n\n| a | b |\n| --- | --- |\n| 1 | 2 |";
+    const { pendingHtml } = renderStreamingMarkdown(text, cache);
+    expect(pendingHtml).toContain("<table>");
+    expect(pendingHtml).toContain("<td>");
+  });
+
+  it("renders pending portion through markdown pipeline", () => {
+    const cache = createStreamingCache();
+    const text = "Done.\n\nSome **bold** text";
+    const { pendingHtml } = renderStreamingMarkdown(text, cache);
+    expect(pendingHtml).toContain("<strong>bold</strong>");
+  });
+});
@@ -69,13 +69,189 @@ const processor = unified()
  .use(rehypeHighlight)
  .use(rehypeStringify, { allowDangerousHtml: true });

+export function splitCompleteBlocks(text: string): { complete: string; pending: string } {
+  if (!text) {
+    return { complete: "", pending: "" };
+  }
+
+  const lines = text.split("\n");
+  let lastCompleteBoundary = -1; // index of last line that ends a complete block
+  let inFence = false;
+  let fenceChar = "";
+  let inMathBlock = false;
+
+  for (let i = 0; i < lines.length; i++) {
+    const trimmed = lines[i].trimEnd();
+
+    if (inFence) {
+      // Check for closing fence: same character, at least 3, no other content
+      if (new RegExp(`^\\s*${fenceChar.replace(/~/g, "\\~")}{3,}\\s*$`).test(trimmed)) {
+        inFence = false;
+        fenceChar = "";
+        lastCompleteBoundary = i;
+      }
+      continue;
+    }
+
+    if (inMathBlock) {
+      if (trimmed === "$$" || trimmed === "\\]") {
+        inMathBlock = false;
+        lastCompleteBoundary = i;
+      }
+      continue;
+    }
+
+    // Check for opening fence
+    const fenceMatch = trimmed.match(/^(\s*)(```|~~~)/);
+    if (fenceMatch) {
+      // Check if it's an opening fence (may have language info after)
+      // A line with just ``` or ~~~ could be opening or closing, but since we're not in a fence it's opening
+      fenceChar = fenceMatch[2][0]; // '`' or '~'
+      inFence = true;
+      continue;
+    }
+
+    // Check for opening math block
+    if (trimmed === "$$" || trimmed === "\\[") {
+      inMathBlock = true;
+      continue;
+    }
+
+    // Outside fences/math: blank line marks a complete boundary
+    if (trimmed === "") {
+      lastCompleteBoundary = i;
+    }
+  }
+
+  if (lastCompleteBoundary < 0) {
+    return { complete: "", pending: text };
+  }
+
+  const completeLines = lines.slice(0, lastCompleteBoundary + 1);
+  const pendingLines = lines.slice(lastCompleteBoundary + 1);
+
+  return {
+    complete: completeLines.join("\n"),
+    pending: pendingLines.join("\n"),
+  };
+}
+
+export function closePendingBlock(pending: string): string {
+  if (!pending) return "";
+
+  const lines = pending.split("\n");
+  let inFence = false;
+  let fenceStr = "";
+  let inMathBlock = false;
+  let mathClose = "";
+
+  for (const line of lines) {
+    const trimmed = line.trimEnd();
+
+    if (inFence) {
+      if (new RegExp(`^\\s*${fenceStr[0] === "~" ? "~~~" : "\\`\\`\\`"}\\s*$`).test(trimmed)) {
+        inFence = false;
+        fenceStr = "";
+      }
+      continue;
+    }
+
+    if (inMathBlock) {
+      if (trimmed === "$$" || trimmed === "\\]") {
+        inMathBlock = false;
+        mathClose = "";
+      }
+      continue;
+    }
+
+    const fenceMatch = trimmed.match(/^(\s*)(```|~~~)/);
+    if (fenceMatch) {
+      fenceStr = fenceMatch[2];
+      inFence = true;
+      continue;
+    }
+
+    if (trimmed === "$$") {
+      inMathBlock = true;
+      mathClose = "$$";
+      continue;
+    }
+
+    if (trimmed === "\\[") {
+      inMathBlock = true;
+      mathClose = "\\]";
+      continue;
+    }
+  }
+
+  if (inFence) return pending + "\n" + fenceStr;
+  if (inMathBlock) return pending + "\n" + mathClose;
+  return pending;
+}
+
+export interface RenderedBlock {
+  id: number;
+  html: string;
+}
+
+export interface StreamingCache {
+  blocks: RenderedBlock[];
+  nextId: number;
+  completeKey: string;
+}
+
+export function createStreamingCache(): StreamingCache {
+  return { blocks: [], nextId: 0, completeKey: "" };
+}
+
+export function renderStreamingMarkdown(
+  text: string,
+  cache: StreamingCache,
+): { blocks: RenderedBlock[]; pendingHtml: string } {
+  const { complete, pending } = splitCompleteBlocks(text);
+
+  if (complete) {
+    if (cache.completeKey !== complete) {
+      if (complete.startsWith(cache.completeKey) && cache.completeKey.length > 0) {
+        // Complete section grew — render only the new part as a new block
+        const newPart = complete.slice(cache.completeKey.length);
+        cache.blocks = [...cache.blocks, { id: cache.nextId++, html: renderMarkdown(newPart) }];
+      } else {
+        // Complete section changed unexpectedly — re-render as single block
+        cache.blocks = [{ id: cache.nextId++, html: renderMarkdown(complete) }];
+      }
+      cache.completeKey = complete;
+    }
+  } else if (cache.blocks.length > 0) {
+    cache.blocks = [];
+    cache.completeKey = "";
+  }
+
+  let pendingHtml = "";
+  if (pending) {
+    const closed = closePendingBlock(pending);
+    pendingHtml = renderMarkdown(closed);
+  }
+
+  return { blocks: cache.blocks, pendingHtml };
+}
+
+// Convert \[...\] to $$...$$ and \(...\) to $...$
+export function normalizeLatexDelimiters(text: string): string {
+  // Display math: \[...\] → $$...$$  (may span multiple lines)
+  text = text.replace(/\\\[([\s\S]*?)\\\]/g, (_match, inner) => `$$${inner}$$`);
+  // Inline math: \(...\) → $...$
+  text = text.replace(/\\\(([\s\S]*?)\\\)/g, (_match, inner) => `$${inner}$`);
+  return text;
+}
+
 export function renderMarkdown(content: string): string {
  if (!content) {
    return "";
  }

  try {
-    const result = processor.processSync(content);
+    const result = processor.processSync(normalizeLatexDelimiters(content));
    return String(result);
  } catch {
    // Fallback to escaped plain text if markdown parsing fails
@@ -21,6 +21,16 @@ export interface Metrics {
  prompt_per_second: number;
  tokens_per_second: number;
  duration_ms: number;
+  has_capture: boolean;
+}
+
+export interface ReqRespCapture {
+  id: number;
+  req_path: string;
+  req_headers: Record<string, string>;
+  req_body: string; // base64 encoded bytes
+  resp_headers: Record<string, string>;
+  resp_body: string; // base64 encoded bytes
 }

 export interface LogData {
@@ -28,8 +38,12 @@ export interface LogData {
  data: string;
 }

+export interface InFlightStats {
+  total: number;
+}
+
 export interface APIEventEnvelope {
-  type: "modelStatus" | "logData" | "metrics";
+  type: "modelStatus" | "logData" | "metrics" | "inflight";
  data: string;
 }

@@ -1,6 +1,8 @@
 <script lang="ts">
-  import { metrics } from "../stores/api";
+  import { metrics, getCapture } from "../stores/api";
  import Tooltip from "../components/Tooltip.svelte";
+  import CaptureDialog from "../components/CaptureDialog.svelte";
+  import type { ReqRespCapture } from "../lib/types";

  function formatSpeed(speed: number): string {
    return speed < 0 ? "unknown" : speed.toFixed(2) + " t/s";
@@ -38,6 +40,25 @@
  }

  let sortedMetrics = $derived([...$metrics].sort((a, b) => b.id - a.id));
+
+  let selectedCapture = $state<ReqRespCapture | null>(null);
+  let dialogOpen = $state(false);
+  let loadingCaptureId = $state<number | null>(null);
+
+  async function viewCapture(id: number) {
+    loadingCaptureId = id;
+    const capture = await getCapture(id);
+    loadingCaptureId = null;
+    if (capture) {
+      selectedCapture = capture;
+      dialogOpen = true;
+    }
+  }
+
+  function closeDialog() {
+    dialogOpen = false;
+    selectedCapture = null;
+  }
 </script>

 <div class="p-2">
@@ -65,6 +86,7 @@
            <th class="px-6 py-3">Prompt Processing</th>
            <th class="px-6 py-3">Generation Speed</th>
            <th class="px-6 py-3">Duration</th>
+            <th class="px-6 py-3">Capture</th>
          </tr>
        </thead>
        <tbody class="divide-y">
@@ -79,6 +101,19 @@
              <td class="px-6 py-4">{formatSpeed(metric.prompt_per_second)}</td>
              <td class="px-6 py-4">{formatSpeed(metric.tokens_per_second)}</td>
              <td class="px-6 py-4">{formatDuration(metric.duration_ms)}</td>
+              <td class="px-6 py-4">
+                {#if metric.has_capture}
+                  <button
+                    onclick={() => viewCapture(metric.id)}
+                    disabled={loadingCaptureId === metric.id}
+                    class="btn btn--sm"
+                  >
+                    {loadingCaptureId === metric.id ? "..." : "View"}
+                  </button>
+                {:else}
+                  <span class="text-txtsecondary">-</span>
+                {/if}
+              </td>
            </tr>
          {/each}
        </tbody>
@@ -86,3 +121,5 @@
    </div>
  {/if}
 </div>
+
+<CaptureDialog capture={selectedCapture} open={dialogOpen} onclose={closeDialog} />
@@ -0,0 +1 @@
+<!-- empty: real Playground is always mounted in App.svelte -->
@@ -1,5 +1,5 @@
 import { writable } from "svelte/store";
-import type { Model, Metrics, VersionInfo, LogData, APIEventEnvelope } from "../lib/types";
+import type { Model, Metrics, VersionInfo, LogData, APIEventEnvelope, ReqRespCapture, InFlightStats } from "../lib/types";
 import { connectionState } from "./theme";

 const LOG_LENGTH_LIMIT = 1024 * 100; /* 100KB of log data */
@@ -9,6 +9,7 @@ export const models = writable<Model[]>([]);
 export const proxyLogs = writable<string>("");
 export const upstreamLogs = writable<string>("");
 export const metrics = writable<Metrics[]>([]);
+export const inFlightRequests = writable<number>(0);
 export const versionInfo = writable<VersionInfo>({
  build_date: "unknown",
  commit: "unknown",
@@ -29,6 +30,7 @@ export function enableAPIEvents(enabled: boolean): void {
    apiEventSource?.close();
    apiEventSource = null;
    metrics.set([]);
+    inFlightRequests.set(0);
    return;
  }

@@ -46,6 +48,7 @@ export function enableAPIEvents(enabled: boolean): void {
      proxyLogs.set("");
      upstreamLogs.set("");
      metrics.set([]);
+      inFlightRequests.set(0);
      models.set([]);
      retryCount = 0;
      connectionState.set("connected");
@@ -83,6 +86,11 @@ export function enableAPIEvents(enabled: boolean): void {
            metrics.update((prevMetrics) => [...newMetrics, ...prevMetrics]);
            break;
          }
+          case "inflight": {
+            const stats = JSON.parse(message.data) as InFlightStats;
+            inFlightRequests.set(stats.total ?? 0);
+            break;
+          }
        }
      } catch (err) {
        console.error(e.data, err);
@@ -172,3 +180,19 @@ export async function loadModel(model: string): Promise<void> {
    throw error;
  }
 }
+
+export async function getCapture(id: number): Promise<ReqRespCapture | null> {
+  try {
+    const response = await fetch(`/api/captures/${id}`);
+    if (response.status === 404) {
+      return null;
+    }
+    if (!response.ok) {
+      throw new Error(`Failed to fetch capture: ${response.status}`);
+    }
+    return await response.json();
+  } catch (error) {
+    console.error("Failed to fetch capture:", error);
+    return null;
+  }
+}
@@ -0,0 +1,18 @@
+import { writable, derived } from "svelte/store";
+
+const chatStreaming = writable(false);
+const imageGenerating = writable(false);
+const speechGenerating = writable(false);
+const audioTranscribing = writable(false);
+
+export const playgroundActivity = derived(
+  [chatStreaming, imageGenerating, speechGenerating, audioTranscribing],
+  ([$chat, $image, $speech, $audio]) => $chat || $image || $speech || $audio
+);
+
+export const playgroundStores = {
+  chatStreaming,
+  imageGenerating,
+  speechGenerating,
+  audioTranscribing,
+};
@@ -0,0 +1,3 @@
+import { writable } from "svelte/store";
+
+export const currentRoute = writable("/");
Author	SHA1	Message	Date
Brian Mendonca	1688bdd1e9	proxy, ui: add pending requests count to the main dashboard (#516 ) add a real time counter of pending (inflight) requests to the UI.	2026-02-16 09:41:15 -08:00
Benson Wong	d33d51fa75	.coderabbit.yaml,AGENTS.md: small tweaks	2026-02-15 21:31:30 -08:00
Benson Wong	e3bf065574	ui: persist playground state across route navigation (#525 ) - Keep Playground component mounted when navigating away, preserving streaming/generating state - Add animated gradient effect on Playground nav link when activity is in progress	2026-02-15 21:30:52 -08:00
Benson Wong	3e52144058	ui-svelte: incremental rendering of chat messages in the Playground (#520 ) add incremental rendering to Playground > Chat	2026-02-15 11:00:44 -08:00
Benson Wong	d5e52d7d00	build: disable provenance attestations in container builds (#523 ) ## Summary - Add `--provenance=false` to docker build commands in `build-container.sh` - BuildKit attestation manifests are stored as untagged images in GHCR, and the `delete-untagged-containers` cleanup job deletes them, breaking the manifest list and causing `manifest unknown` errors on pull - ref: https://github.com/actions/delete-package-versions/issues/162	2026-02-14 10:23:08 -08:00
Benson Wong	17e5263a76	.github/workflows: fix expired token in publishing images (#522 ) Fixes: #517	2026-02-14 10:06:05 -08:00
Benson Wong	8d6d949ec3	proxy: support timings for /infill from llama-server (#510 ) fixes: #463	2026-02-07 17:16:27 -08:00
Benson Wong	b5fde8eb6d	proxy,ui-svelte: add request/response capturing (#508 ) Add saving request and response headers and bodies that go through llama-swap in memory. - captureBuffer added to configuration. Captures are enabled by default. - 5MB of memory is allocated for req/response captures in a ring buffer. Setting captureBuffer to 0 will disable captures. - UI elements to view captured data added to Activity page. Includes some QOL features like json formatting and recombining SSE chat streams - capture saving is done at the byte level and has minimal impact on llama-swap performance Fixes #464 Ref #503	2026-02-07 15:40:01 -08:00
Nuno	7eef5defb8	docs: add stable-diffusion.cpp references (#506 ) Signed-off-by: rare-magma <rare-magma@posteo.eu>	2026-02-04 20:20:39 -08:00
				`@@ -0,0 +1 @@`
				`<!-- empty: real Playground is always mounted in App.svelte -->`