🪰📋 gadfly-reports
A small durable store + scoreboard for Gadfly review findings. Gadfly (and any CI) POST each model's findings and per-review timing here; a human or Claude — via gadfly-mcp — later grades each finding. It's a single Go binary backed by SQLite, speaking a tiny HTTP API.
🤖 Heads up: this is a vibe-coded project
gadfly-reports was built almost entirely by an AI agent (Claude Code) — the design, the code, and these docs. It's small and it's tested, but treat it accordingly: it's a homelab-grade service, not a hardened product, and there may be the occasional AI-flavored rough edge. Issues and PRs welcome.
What it stores — and what it deliberately doesn't
gadfly-reports is a pure fact store:
- runs — one per model's review of a PR: wall-clock duration, lens count, optional token/cost.
- findings — content-addressed by location (repo + pr + lens + file + line), so the same issue raised by several models collapses to one finding with many reports. That collapse is what makes cross-model consensus and per-model precision measurable.
- grades — a triage verdict per finding: is_real, severity (trivial|small|medium|high|critical), optional usefulness (1–5), notes, grader. Grade history is kept; the latest wins.
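To make the collapse concrete, here is a minimal sketch of a location-derived key. The store's actual ID scheme is not described in this README, so the SHA-256 hash, the NUL separator, and the names `Location` and `findingKey` are assumptions for illustration only.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// Location holds the five fields that identify a finding.
type Location struct {
	Repo string
	PR   int
	Lens string
	File string
	Line int
}

// findingKey is a hypothetical helper: it hashes the location fields so that
// reports from different models at the same location share one key.
func findingKey(l Location) string {
	// A NUL separator keeps adjacent fields from running together.
	raw := strings.Join([]string{
		l.Repo, fmt.Sprint(l.PR), l.Lens, l.File, fmt.Sprint(l.Line),
	}, "\x00")
	sum := sha256.Sum256([]byte(raw))
	return hex.EncodeToString(sum[:])
}

func main() {
	// Made-up location; two models report the same spot.
	loc := Location{Repo: "steve/demo", PR: 12, Lens: "security", File: "main.go", Line: 40}
	reports := map[string][]string{}
	for _, model := range []string{"model-a", "model-b"} {
		key := findingKey(loc)
		reports[key] = append(reports[key], model)
	}
	// One map entry (the finding) now carries both reports.
	fmt.Println(len(reports), "finding,", len(reports[findingKey(loc)]), "reports")
}
```

Because the key depends only on the location, a second model reporting the same spot adds a report rather than a new finding.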
It stores no points and computes no rankings. Mapping severity → points and ranking models by "value per minute" (or per token) is a client/dashboard concern, so you can retune the curve any time without migrating or re-scoring stored data.
Run it
# from source
go run gitea.stevedudenhoeffer.com/steve/gadfly-reports@latest serve
# or Docker (image published by CI on every push to main)
docker run -d --name gadfly-reports -p 8090:8090 -v gadfly-reports-data:/data \
-e GADFLY_REPORTS_TOKEN=change-me \
gitea.stevedudenhoeffer.com/steve/gadfly-reports:latest
Deploy behind Traefik (expose over a domain)
# docker-compose.yml — publish gadfly-reports at https://reports.example.com via Traefik.
services:
  gadfly-reports:
    image: gitea.stevedudenhoeffer.com/steve/gadfly-reports:latest
    restart: unless-stopped
    environment:
      # Auth is built in: callers (gadfly emit, gadfly-mcp) send this as a bearer
      # token; /healthz stays open. ADDR and DB default to :8090 and
      # /data/gadfly-reports.db inside the image.
      GADFLY_REPORTS_TOKEN: ${GADFLY_REPORTS_TOKEN:?set GADFLY_REPORTS_TOKEN in .env}
    volumes:
      - gadfly-reports-data:/data
    networks: [traefik]
    healthcheck:
      test: ["CMD", "wget", "-q", "-O", "-", "http://localhost:8090/healthz"]
      interval: 30s
      timeout: 5s
      retries: 3
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.gadfly-reports.rule=Host(`reports.example.com`)"
      - "traefik.http.routers.gadfly-reports.entrypoints=websecure"
      - "traefik.http.routers.gadfly-reports.tls=true"
      - "traefik.http.routers.gadfly-reports.tls.certresolver=letsencrypt"
      - "traefik.http.services.gadfly-reports.loadbalancer.server.port=8090"
volumes:
  gadfly-reports-data:
networks:
  traefik:
    external: true # the network your Traefik instance is attached to
Put GADFLY_REPORTS_TOKEN=<secret> in a .env beside the compose file. Tailor the three
Traefik bits to your setup — the host (reports.example.com), the entrypoint
(websecure) and the certresolver (letsencrypt) must match your Traefik config, and the
traefik network must be the external one Traefik watches. Traefik terminates TLS and forwards
to the container's :8090. Then point gadfly's GADFLY_FINDINGS_URL and gadfly-mcp's
--store at https://reports.example.com (with the same token).
HTTP API (the canonical contract)
| Method & path | Body / query | Purpose |
|---|---|---|
| GET /healthz | — | liveness (open even when a token is set) |
| GET / · GET /ui | — | view-only dashboard: HTML shell, public; its JS fetches the gated endpoints with the token |
| POST /runs | one run object | upsert a model's review of a PR (timing/tokens) |
| POST /reports | JSON array of report objects | record findings + which model reported each |
| POST /findings/{id}/grade | {is_real, severity?, usefulness?, notes?, grader?} | record a triage grade |
| GET /export | — | flat report×finding×run×latest-grade rows (the dashboard feed) |
| GET /runs | — | list all runs (timing/tokens), oldest first |
| GET /scoreboard | — | points-free per-model rollup |
POST /runs body: {run_id, repo, pr, model, provider, lenses, duration_secs, input_tokens?, output_tokens?, cost_usd?}
(re-posting the same run_id updates it).
POST /reports array element: {repo, pr, lens, file, line, title, model, provider, run_id, raw_severity, detail}.
GET /scoreboard element: {model, provider, runs, minutes, input_tokens, output_tokens, findings, confirmed, false_positive, ungraded, by_severity:{severity:count}}.
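As an illustration of consuming the points-free rollup, the sketch below decodes rows in the documented /scoreboard shape and derives a per-model precision client-side. It assumes the endpoint returns a JSON array of such elements and guesses the JSON types (for example, `minutes` as a float); the `precision` helper and the sample data are made up for the example.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ScoreboardRow mirrors the GET /scoreboard element fields listed above.
// The Go types are assumptions; the README only names the fields.
type ScoreboardRow struct {
	Model         string         `json:"model"`
	Provider      string         `json:"provider"`
	Runs          int            `json:"runs"`
	Minutes       float64        `json:"minutes"`
	InputTokens   int64          `json:"input_tokens"`
	OutputTokens  int64          `json:"output_tokens"`
	Findings      int            `json:"findings"`
	Confirmed     int            `json:"confirmed"`
	FalsePositive int            `json:"false_positive"`
	Ungraded      int            `json:"ungraded"`
	BySeverity    map[string]int `json:"by_severity"`
}

// precision is a hypothetical client-side metric: confirmed / graded.
// It returns 0 when nothing has been graded yet.
func precision(r ScoreboardRow) float64 {
	graded := r.Confirmed + r.FalsePositive
	if graded == 0 {
		return 0
	}
	return float64(r.Confirmed) / float64(graded)
}

// sample is made-up data in the documented shape, not real store output.
const sample = `[{"model":"model-a","provider":"example","runs":2,"minutes":3.5,
"input_tokens":1000,"output_tokens":200,"findings":5,"confirmed":3,
"false_positive":1,"ungraded":1,"by_severity":{"high":1,"small":2}}]`

func main() {
	var rows []ScoreboardRow
	if err := json.Unmarshal([]byte(sample), &rows); err != nil {
		panic(err)
	}
	for _, r := range rows {
		fmt.Printf("%s precision=%.2f\n", r.Model, precision(r))
	}
}
```

Ungraded findings are left out of the denominator here; whether that is the right choice depends on how you want to treat findings nobody has triaged yet.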
If GADFLY_REPORTS_TOKEN is set, every route except /healthz and the public view shell (/, /ui)
requires Authorization: Bearer <token>. The /ui shell carries no data itself; its JS sends the
token on each fetch, so the public shell leaks nothing.
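A caller might attach the token roughly as follows. This is a sketch, not the gadfly client: `postRun` and `fakeStore` are hypothetical names, the JSON field types are guesses, and the in-process test server only imitates the bearer check so the example runs offline (the real store's status codes may differ).

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// Run carries the required POST /runs fields; the optional ones are omitted.
type Run struct {
	RunID        string  `json:"run_id"`
	Repo         string  `json:"repo"`
	PR           int     `json:"pr"`
	Model        string  `json:"model"`
	Provider     string  `json:"provider"`
	Lenses       int     `json:"lenses"`
	DurationSecs float64 `json:"duration_secs"`
}

// postRun sends one run as JSON with the bearer token and returns the
// HTTP status code.
func postRun(baseURL, token string, run Run) (int, error) {
	body, err := json.Marshal(run)
	if err != nil {
		return 0, err
	}
	req, err := http.NewRequest(http.MethodPost, baseURL+"/runs", bytes.NewReader(body))
	if err != nil {
		return 0, err
	}
	req.Header.Set("Content-Type", "application/json")
	if token != "" {
		req.Header.Set("Authorization", "Bearer "+token)
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	return resp.StatusCode, nil
}

// fakeStore stands in for gadfly-reports. It only checks the bearer token.
func fakeStore(token string) *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.Header.Get("Authorization") != "Bearer "+token {
			w.WriteHeader(http.StatusUnauthorized)
			return
		}
		w.WriteHeader(http.StatusOK)
	}))
}

func main() {
	srv := fakeStore("change-me")
	defer srv.Close()
	run := Run{RunID: "run-1", Repo: "steve/demo", PR: 12, Model: "model-a",
		Provider: "example", Lenses: 3, DurationSecs: 42.5}
	status, err := postRun(srv.URL, "change-me", run)
	fmt.Println(status, err)
}
```

Against a real deployment, `baseURL` would be the store's address (for example the Traefik host above) and the token the value of GADFLY_REPORTS_TOKEN.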
Configuration
| Env | Default | Meaning |
|---|---|---|
| GADFLY_REPORTS_ADDR | :8090 | listen address |
| GADFLY_REPORTS_DB | gadfly-reports.db (/data/gadfly-reports.db in Docker) | SQLite path |
| GADFLY_REPORTS_TOKEN | (empty) | bearer token callers must present (empty = open) |
CLI flags --addr / --db / --token override the env.
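One common way to get this precedence with Go's standard `flag` package is to seed each flag's default from the environment, so an explicitly passed flag wins. The sketch below shows that pattern; it is not taken from the gadfly-reports source, and `resolve`, `envOr`, and `Config` are hypothetical names.

```go
package main

import (
	"flag"
	"fmt"
)

// Config holds the three documented settings.
type Config struct {
	Addr, DB, Token string
}

// envOr returns the value of key from getenv, or def when it is unset or empty.
func envOr(getenv func(string) string, key, def string) string {
	if v := getenv(key); v != "" {
		return v
	}
	return def
}

// resolve applies the precedence flag > env > built-in default. getenv is
// injected so the sketch can run without touching the real environment;
// a real program would pass os.Getenv.
func resolve(args []string, getenv func(string) string) (Config, error) {
	fs := flag.NewFlagSet("serve", flag.ContinueOnError)
	var c Config
	fs.StringVar(&c.Addr, "addr", envOr(getenv, "GADFLY_REPORTS_ADDR", ":8090"), "listen address")
	fs.StringVar(&c.DB, "db", envOr(getenv, "GADFLY_REPORTS_DB", "gadfly-reports.db"), "SQLite path")
	fs.StringVar(&c.Token, "token", envOr(getenv, "GADFLY_REPORTS_TOKEN", ""), "bearer token")
	err := fs.Parse(args)
	return c, err
}

func main() {
	// Made-up environment: only the address is set.
	env := map[string]string{"GADFLY_REPORTS_ADDR": ":9000"}
	getenv := func(k string) string { return env[k] }
	c, err := resolve([]string{"--db", "/tmp/x.db"}, getenv)
	fmt.Printf("%+v %v\n", c, err)
}
```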
Dashboard
A built-in read-only dashboard ships at /ui (hit the host root and you're redirected
there). It's a single self-contained page that pulls /runs + /export and does everything in your
browser: a per-model performance table — runs, minutes, findings, confirmed / false-positive /
ungraded, points, points-per-minute, points-per-run, by-severity — with drill-down filters
(date range, repo, provider, model, lens, grade/severity), free-text search, and a click-to-scope
findings detail table.
True to the store's "no points" rule, scoring lives in the browser: the page has an editable
points curve (default trivial=1, small=3, medium=5, high=8, critical=20) and computes
points = Σ weight[severity]·count and value/min = points / minutes on the fly — retune it without
touching stored data.
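The dashboard does this arithmetic in browser JavaScript; the same calculation is sketched here in Go so it can be reproduced against /scoreboard or /export data. The function names and the zero-minutes guard are assumptions, and the counts are made up.

```go
package main

import "fmt"

// defaultCurve is the dashboard's default severity-to-points mapping.
var defaultCurve = map[string]float64{
	"trivial": 1, "small": 3, "medium": 5, "high": 8, "critical": 20,
}

// points computes the sum of weight[severity] * count over confirmed findings.
// A severity missing from the curve contributes 0.
func points(curve map[string]float64, bySeverity map[string]int) float64 {
	total := 0.0
	for sev, n := range bySeverity {
		total += curve[sev] * float64(n)
	}
	return total
}

// perMinute returns points / minutes, or 0 when no time was recorded
// (the guard is an assumption, not documented dashboard behavior).
func perMinute(pts, minutes float64) float64 {
	if minutes == 0 {
		return 0
	}
	return pts / minutes
}

func main() {
	confirmed := map[string]int{"small": 2, "high": 1} // made-up counts
	pts := points(defaultCurve, confirmed)             // 2*3 + 1*8 = 14
	fmt.Println(pts, perMinute(pts, 3.5))              // 14 points over 3.5 minutes
}
```

Retuning means swapping in a different curve map; the stored severities and counts stay as they are.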
There's also an editable false-positive penalty × (default -0.5). A false positive has no
graded severity, so it's penalized by the severity the model claimed (its lens verdict —
Blocking→high, Minor→small): penalty × points[claimed]. So a Blocking-claimed FP at -0.5 costs
high(8) × -0.5 = -4, and a model with the odd good find but many false positives nets down —
even negative — instead of coasting on its hits.
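The net score can be worked through in a few lines. This sketch uses the default curve and only the two verdict mappings named above; `netPoints` and `claimedSeverity` are hypothetical names, and how the dashboard treats any other verdict is not specified here.

```go
package main

import "fmt"

// curve is the default severity-to-points mapping.
var curve = map[string]float64{
	"trivial": 1, "small": 3, "medium": 5, "high": 8, "critical": 20,
}

// claimedSeverity maps a lens verdict onto the curve. Only the two mappings
// named in the text are included; anything else is unmapped in this sketch.
var claimedSeverity = map[string]string{"Blocking": "high", "Minor": "small"}

// netPoints adds penalty * points[claimed] for each false positive to the
// confirmed points. An unmapped verdict contributes 0 here.
func netPoints(confirmedPts, penalty float64, fpVerdicts []string) float64 {
	net := confirmedPts
	for _, v := range fpVerdicts {
		net += penalty * curve[claimedSeverity[v]]
	}
	return net
}

func main() {
	// One confirmed small find (3 points) and two Blocking-claimed false
	// positives at the default penalty: 3 + 2*(8 * -0.5) = -5.
	fmt.Println(netPoints(3, -0.5, []string{"Blocking", "Blocking"}))
}
```

Because the penalty multiplier is negative, raising its magnitude punishes overclaimed severity more heavily, while setting it to 0 recovers the plain confirmed-points ranking.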
Auth: the /ui shell is public (it holds no data); paste the store token into its connect box,
or open /ui?token=<token> once (remembered in localStorage). Prefer your own dashboard? Point
Grafana/Metabase/etc. at the SQLite file or the same /export + /scoreboard + /runs JSON.
How it fits together
- gadfly POSTs findings here after each review when GADFLY_FINDINGS_URL points at this store (advisory; off by default).
- gadfly-mcp is the MCP server Claude uses to list findings and record grades against this store.
Build / test
go build ./...
go test ./...
gofmt -l . # must be clean
License
MIT © 2026 Steve Dudenhoeffer.