feat: PR filter — compare models on the same set of PRs

UI: a repo#pr multi-select (labeled with how many models ran each PR)
scopes the whole table — runs, minutes, findings, points — to the chosen
PRs, so a model with 2 runs can be fairly compared against one with 60.
API: GET /scoreboard accepts ?repo= and ?pr= (repeatable or comma-list).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-02 22:55:43 -04:00
parent 2f003dd132
commit 1af115fdf1
6 changed files with 202 additions and 19 deletions
+6 -1
@@ -106,7 +106,7 @@ against reviews that take minutes.
| `POST /findings/{id}/grade` | `{is_real, severity?, usefulness?, notes?, grader?}` | record a triage grade |
| `GET /export` | — | flat report×finding×run×latest-grade rows — the dashboard feed |
| `GET /runs` | — | list all runs (timing/tokens), oldest first |
-| `GET /scoreboard` | — | points-free per-model rollup |
+| `GET /scoreboard` | `?repo=<repo>` `&pr=<n>` (repeatable or comma-list, e.g. `?pr=10,11`) | points-free per-model rollup, optionally narrowed to specific PRs so models are compared on the same work |
`POST /runs` body: `{run_id, repo, pr, model, provider, lenses, duration_secs, input_tokens?, output_tokens?, cost_usd?}`
(re-posting the same `run_id` updates it).
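
The `GET /scoreboard` row above says `pr` may be repeated or given as a comma list. A minimal Python sketch of how a server or client might normalize both forms is below. The helper names `parse_multi` and `scoreboard_filter` are hypothetical, and treating `repo` as multi-valued in the same way is an assumption; the actual handler in this commit may differ.

```python
from urllib.parse import parse_qs, urlsplit


def parse_multi(query: dict[str, list[str]], name: str) -> list[str]:
    """Collect a parameter given repeatedly (?pr=10&pr=11) or as a comma list (?pr=10,11)."""
    values: list[str] = []
    for raw in query.get(name, []):
        for part in raw.split(","):
            part = part.strip()
            # Drop blanks and duplicates while keeping first-seen order.
            if part and part not in values:
                values.append(part)
    return values


def scoreboard_filter(url: str) -> tuple[list[str], list[int]]:
    """Return the (repos, prs) requested by a /scoreboard URL (hypothetical helper)."""
    query = parse_qs(urlsplit(url).query)
    repos = parse_multi(query, "repo")
    # Assumption: PR numbers are integers; int() raises ValueError on anything else.
    prs = [int(p) for p in parse_multi(query, "pr")]
    return repos, prs


print(scoreboard_filter("/scoreboard?repo=steve/x&pr=10,11&pr=12"))
```

Both spellings collapse to one ordered list, so `?pr=10,11` and `?pr=10&pr=11` select the same PRs.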
@@ -138,6 +138,11 @@ ungraded, points, **points-per-minute**, points-per-run, by-severity — with **
(date range, repo, provider, model, lens, grade/severity), free-text search, and a click-to-scope
findings detail table.
+Comparisons can be scoped to **specific PRs**: a multi-select lists every `repo#pr` with how many
+models ran it (`steve/x#12 · 3/5 models`) — pick the PRs you want and the entire table (runs,
+minutes, findings, points) counts only those, so a model with 2 runs can be compared against one
+with 60 on exactly the work you choose.
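
The scoping described above can be sketched in Python, although the dashboard itself runs in the browser. The run records reuse the `repo`, `pr`, `model`, and `duration_secs` fields from the `POST /runs` body. The function names, the sample data, and the choice of "all models seen in any run" as the denominator of the label are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical sample runs; field names follow the POST /runs body.
runs = [
    {"repo": "steve/x", "pr": 12, "model": "model-a", "duration_secs": 120},
    {"repo": "steve/x", "pr": 12, "model": "model-b", "duration_secs": 300},
    {"repo": "steve/x", "pr": 13, "model": "model-b", "duration_secs": 600},
]


def pr_options(runs):
    """Label each (repo, pr) with how many of the known models ran it."""
    all_models = {r["model"] for r in runs}
    models_by_pr = defaultdict(set)
    for r in runs:
        models_by_pr[(r["repo"], r["pr"])].add(r["model"])
    return {
        key: f"{key[0]}#{key[1]} · {len(models)}/{len(all_models)} models"
        for key, models in sorted(models_by_pr.items())
    }


def rollup(runs, selected=None):
    """Per-model run count and minutes, counting only the selected (repo, pr) pairs.

    An empty or missing selection means no filter, so every run counts.
    """
    totals = defaultdict(lambda: {"runs": 0, "minutes": 0.0})
    for r in runs:
        if selected and (r["repo"], r["pr"]) not in selected:
            continue
        entry = totals[r["model"]]
        entry["runs"] += 1
        entry["minutes"] += r["duration_secs"] / 60
    return dict(totals)


print(pr_options(runs))
print(rollup(runs, selected={("steve/x", 12)}))
```

Selecting only `steve/x#12` leaves each model with one run, so the comparison no longer rewards `model-b` for the extra PR it happened to review.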
True to the store's "no points" rule, **scoring lives in the browser**: the page has an editable
points curve (default `trivial=1, small=3, medium=5, high=8, critical=20`) and computes
`points = Σ weight[severity]·count` and `value/min = points / minutes` on the fly — retune it without