gadfly-reports

7 Commits 1 Branch 0 Tags

Author	SHA1	Message	Date
steve	14cbee8e25	feat: solo-error penalty + fast healthcheck (instant Traefik restart) [Build & push image / build-and-push (push): successful in 20s; CI / test (push): successful in 10m22s] Dashboard: add an editable 'solo-error penalty ×' (default 1.5). A false positive that only one model made (a unique wrong claim, derived from the reporter count) has its FP penalty multiplied by this factor, mirroring the solo-find bonus. Client-side; store stays point-free. Deploy: speed up the healthcheck (image HEALTHCHECK + compose example: interval 30s->5s, start_period 10s, start_interval 1s). Traefik gates routing on the Docker health status, so the old 30s-to-first-probe meant ~30s of 502s after a restart; the daemon binds the port in ms, so it now goes healthy in ~1s. Data is on the volume; only fire-and-forget emits in the ~1s window are at risk. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-27 12:45:07 -04:00
steve	c15f860853	feat(ui): solo-find bonus: reward a model for catching what others missed [Build & push image / build-and-push (push): successful in 20s; CI / test (push): successful in 10m20s] Adds an editable 'solo-find bonus ×' (default 1.5). A confirmed finding reported by exactly one model scores severity × bonus. Solo status is derived from the global reporter count per content-addressed finding, so no grader flagging is needed. New 'solo' column counts uniquely-caught confirmed findings. Solo-ness is computed over ALL data so the model filter can't fake it. Client-side only; store stays point-free. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-27 12:24:29 -04:00
steve	0cb6b25f11	feat(ui): false-positive penalty (severity-scaled, default -0.5) [Build & push image / build-and-push (push): successful in 20s; CI / test (push): successful in 10m24s] Adds an editable 'false-positive penalty ×' to the dashboard. A false positive carries no graded severity, so it's penalized by the severity the model CLAIMED (its lens verdict / raw_severity, mapped onto the curve: Blocking->high, Minor->small). points(net) = confirmed points + Σ penalty×points[claimed], so a model with a few good finds but many false positives nets down, possibly below zero, and sorts to the bottom. Adds an 'fp pen' column; net points/pts-min/pts-run shown red when negative. Client-side only; the store stays point-free. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-27 09:50:18 -04:00
steve	35ebc53561	feat: built-in read-only dashboard at /ui + GET /runs [Build & push image / build-and-push (push): successful in 26s; CI / test (push): successful in 10m24s] Serves a self-contained vanilla-JS dashboard (embedded via go:embed): a per-model performance table (runs, minutes, findings, confirmed/false-positive/ungraded, points, points-per-minute, points-per-run, by-severity) with drill-down filters (date range, repo, provider, model, lens, grade/severity), free-text search, and a click-to-scope findings detail table. Scoring stays client-side: the page has an editable points curve and computes points + value-per-minute in the browser, so the store remains point-free. Adds GET /runs (lists all runs, incl. zero-finding ones) so minutes/runs are filterable. The /ui shell is public (carries no data); data endpoints stay token-gated and the JS sends the token. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-27 00:22:39 -04:00
steve	9458528b40	docs: add Traefik docker-compose example to expose the store over a domain [CI / test (push): successful in 10m21s] Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-26 23:59:24 -04:00
steve	ddcf42a3ce	feat: gadfly-reports - findings store + scoreboard daemon [Build & push image / build-and-push (push): successful in 1m13s; CI / test (push): successful in 10m39s] SQLite-backed HTTP store for Gadfly review findings, per-review run timings, and human/Claude grades, with a points-free per-model scoreboard. Pure fact store: it computes no points or rankings (the dashboard maps severity->points client-side and retunes without re-scoring). Findings are content-addressed by location so cross-model reports collapse for consensus; one grade per finding, latest wins. Pure-Go SQLite (CGO-free) + Docker image CI + tests. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-26 23:55:24 -04:00
steve	52dce5eb2f	Initial commit	2026-06-27 03:39:49 +00:00
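
The scoring rules described across commits 0cb6b25f11, c15f860853, and 14cbee8e25 can be sketched roughly as follows. This is only an illustration under stated assumptions: the commits say the real scoring runs client-side in the dashboard's JavaScript, so this Go version is a restatement, not the project's code. The type and function names (`Finding`, `Params`, `DefaultParams`, `NetPoints`) and the curve values (high = 8, small = 1) are hypothetical; only the three default multipliers (-0.5, 1.5, 1.5) come from the commit messages.

```go
package main

import "fmt"

// Grade is the verdict attached to a finding (hypothetical type).
type Grade int

const (
	Ungraded Grade = iota
	Confirmed
	FalsePositive
)

// Finding is a hypothetical flattened view of one model's report.
type Finding struct {
	Severity  string // graded severity if Confirmed, claimed severity if FalsePositive
	Grade     Grade
	Reporters int // distinct models that reported this finding, counted over all data
}

// Params holds the editable dashboard knobs described in the commits.
type Params struct {
	Curve            map[string]float64 // severity -> points (values here are placeholders)
	FPPenalty        float64            // false-positive penalty multiplier
	SoloFindBonus    float64            // multiplier for a confirmed finding only one model caught
	SoloErrorPenalty float64            // multiplier for a false positive only one model made
}

// DefaultParams uses the defaults named in the commit messages.
// The curve values are assumed for illustration only.
func DefaultParams() Params {
	return Params{
		Curve:            map[string]float64{"high": 8, "small": 1},
		FPPenalty:        -0.5,
		SoloFindBonus:    1.5,
		SoloErrorPenalty: 1.5,
	}
}

// NetPoints computes confirmed points plus the (negative) sum of
// false-positive penalties. Ungraded findings contribute nothing.
func NetPoints(findings []Finding, p Params) float64 {
	var net float64
	for _, f := range findings {
		base := p.Curve[f.Severity]
		switch f.Grade {
		case Confirmed:
			if f.Reporters == 1 {
				base *= p.SoloFindBonus
			}
			net += base
		case FalsePositive:
			pen := p.FPPenalty * base
			if f.Reporters == 1 {
				pen *= p.SoloErrorPenalty
			}
			net += pen
		}
	}
	return net
}

func main() {
	findings := []Finding{
		{Severity: "high", Grade: Confirmed, Reporters: 1},      // 8 * 1.5 = 12
		{Severity: "small", Grade: Confirmed, Reporters: 3},     // 1
		{Severity: "high", Grade: FalsePositive, Reporters: 1},  // -0.5 * 8 * 1.5 = -6
		{Severity: "small", Grade: FalsePositive, Reporters: 2}, // -0.5 * 1 = -0.5
	}
	fmt.Println(NetPoints(findings, DefaultParams())) // prints 6.5
}
```

Because `Reporters` is assumed to be counted over the full data set rather than the filtered view, narrowing the model filter would not turn a shared finding into a solo one, which matches the intent stated in c15f860853.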