gadfly-reports

Author	SHA1	Message	Date
steve	e381c0ad41	feat(ui): PR picker becomes a persistent excluder, newest-first Invert the PR scope from opt-in to exclusion: untick a PR to drop it from the comparison; the excluded set persists in localStorage and new PRs are included automatically as they arrive. The list is now reverse chronological (last run/report first) with the date shown per PR, the footer states the total count so truncation fears are checkable at a glance, and the scrollable list is pinned with min-height:0 for robustness. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-07-03 05:55:32 -04:00
steve	7fce78a664	feat(ui): searchable popup pickers for PR scope and model visibility Replace the cramped PR multi-select with a modal: every repo#pr as a checkbox (with model coverage), a search box, and all/none that apply to the search results. The model hider moves to the same popup style — the per-row × and the hidden-chips bar are gone; both pickers live as buttons in the filter row showing their current state. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-07-02 23:04:40 -04:00
steve	1af115fdf1	feat: PR filter — compare models on the same set of PRs UI: a repo#pr multi-select (labeled with how many models ran each PR) scopes the whole table — runs, minutes, findings, points — to the chosen PRs, so a model with 2 runs can be fairly compared against one with 60. API: GET /scoreboard accepts ?repo= and ?pr= (repeatable or comma-list). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-07-02 22:56:49 -04:00
Steve Dudenhoeffer	dd8ada479e	feat(ui): hide/exclude models from the dashboard (persisted) Each scoreboard row gets a × to hide that model — for retired ones (m1 etc.) you no longer want in the view. Hidden models drop out of the table, totals, and the findings drill-down; the set persists in localStorage (grt-hidden) across reloads, with a "hidden (N): …" bar of click-to-restore chips + a "show all". Solo-ness is still computed against ALL models (hiding is a view filter, not a rescoring), so hiding one model never fakes another's solo finds. README Dashboard section updated. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-27 20:36:24 -04:00
steve	14cbee8e25	feat: solo-error penalty + fast healthcheck (instant Traefik restart) Dashboard: add an editable 'solo-error penalty ×' (default 1.5) — a false positive only one model made (a unique wrong claim, derived from reporter count) multiplies its FP penalty, mirroring the solo-find bonus. Client-side; store stays point-free. Deploy: speed up the healthcheck (image HEALTHCHECK + compose example: interval 30s->5s, start_period 10s, start_interval 1s). Traefik gates routing on the Docker health status, so the old 30s-to-first-probe meant ~30s of 502s after a restart; the daemon binds the port in ms, so it now goes healthy in ~1s. Data is on the volume; only fire-and-forget emits in the ~1s window are at risk. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-27 12:45:07 -04:00
steve	c15f860853	feat(ui): solo-find bonus — reward a model for catching what others missed Adds an editable 'solo-find bonus ×' (default 1.5). A confirmed finding reported by exactly one model (derived from the global reporter count per content-addressed finding — no grader flagging needed) scores severity × bonus. New 'solo' column counts uniquely-caught confirmed findings. Solo-ness is computed over ALL data so the model filter can't fake it. Client-side only; store stays point-free. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-27 12:24:29 -04:00
steve	0cb6b25f11	feat(ui): false-positive penalty (severity-scaled, default -0.5) Adds an editable 'false-positive penalty ×' to the dashboard. A false positive carries no graded severity, so it's penalized by the severity the model CLAIMED (its lens verdict / raw_severity, mapped onto the curve: Blocking->high, Minor->small). points(net) = confirmed points + Σ penalty×points[claimed], so a model with a few good finds but many false positives nets down — even negative — and sorts to the bottom. Adds an 'fp pen' column; net points/pts-min/pts-run shown red when negative. Client-side only; the store stays point-free. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-27 09:50:18 -04:00
steve	35ebc53561	feat: built-in read-only dashboard at /ui + GET /runs Serves a self-contained vanilla-JS dashboard (embedded via go:embed): a per-model performance table — runs, minutes, findings, confirmed/false-positive/ungraded, points, points-per-minute, points-per-run, by-severity — with drill-down filters (date range, repo, provider, model, lens, grade/severity), free-text search, and a click-to-scope findings detail table. Scoring stays client-side: the page has an editable points curve and computes points + value-per-minute in the browser, so the store remains point-free. Adds GET /runs (lists all runs, incl. zero-finding ones) so minutes/runs are filterable. The /ui shell is public (carries no data); data endpoints stay token-gated and the JS sends the token. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-27 00:22:39 -04:00

8 Commits
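
The scoring rules spread across commits 0cb6b25f11, c15f860853 and 14cbee8e25 can be read together as one formula: points(net) = confirmed points + Σ penalty×points[claimed], with a solo multiplier on either side. The following is a minimal Python sketch of that arithmetic only, not the dashboard's actual client-side JavaScript. The `POINTS` curve values, the `Finding` fields, and the `net_points` name are hypothetical; the log states only the defaults (-0.5, 1.5, 1.5) and that the curve is editable.

```python
from dataclasses import dataclass

# Hypothetical points curve. The log says the curve is editable in the
# dashboard but does not give its values.
POINTS = {"high": 5.0, "medium": 3.0, "small": 1.0}


@dataclass
class Finding:
    grade: str      # "confirmed", "false_positive", or "ungraded"
    severity: str   # graded severity if confirmed; CLAIMED severity if false positive
    reporters: int  # how many models reported this finding, counted over ALL data


def net_points(findings, fp_penalty=-0.5, solo_bonus=1.5, solo_error=1.5):
    """Net points for one model, using the defaults named in the commit log."""
    total = 0.0
    for f in findings:
        solo = f.reporters == 1  # exactly one model reported it
        if f.grade == "confirmed":
            # Confirmed find: curve points, times the solo-find bonus if unique.
            total += POINTS[f.severity] * (solo_bonus if solo else 1.0)
        elif f.grade == "false_positive":
            # False positive: penalty times the points of the claimed severity,
            # times the solo-error penalty if only this model made the claim.
            total += fp_penalty * POINTS[f.severity] * (solo_error if solo else 1.0)
        # Ungraded findings contribute nothing.
    return total


example = [
    Finding("confirmed", "high", reporters=1),       # solo find
    Finding("confirmed", "small", reporters=3),
    Finding("false_positive", "high", reporters=1),  # solo error
    Finding("false_positive", "small", reporters=2),
    Finding("ungraded", "high", reporters=1),
]
print(net_points(example))
```

Because `reporters` is assumed to be counted over all data rather than the filtered view, hiding or filtering models would not change which findings count as solo, which matches the intent stated in the log.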