# 🪰📋 gadfly-reports

A small **durable store + scoreboard** for [Gadfly](https://gitea.stevedudenhoeffer.com/steve/gadfly)
review findings. Gadfly (and any CI) POST each model's findings and per-review timing here; a human
or Claude — via [gadfly-mcp](https://gitea.stevedudenhoeffer.com/steve/gadfly-mcp) — later grades
each finding. It's a single Go binary backed by SQLite, speaking a tiny HTTP API.

> ### 🤖 Heads up: this is a vibe-coded project
> gadfly-reports was built almost entirely by an AI agent (Claude Code) — the design, the code, and
> these docs. It's small and it's tested, but treat it accordingly: it's a homelab-grade service,
> not a hardened product, and there may be the occasional AI-flavored rough edge. Issues and PRs
> welcome.

## What it stores — and what it deliberately doesn't

gadfly-reports is a **pure fact store**:

- **runs** — one per model's review of a PR: wall-clock duration, lens count, optional token/cost.
- **findings** — **content-addressed by location** (`repo + pr + lens + file + line`), so the *same*
  issue raised by several models collapses to one finding with many **reports**. That collapse is
  what makes cross-model **consensus** and per-model **precision** measurable.
- **grades** — a triage verdict per finding: `is_real`, `severity`
  (`trivial|small|medium|high|critical`), optional `usefulness` (1–5), notes, grader. Grade history
  is kept; the latest wins.

It stores **no points and computes no rankings.** Mapping severity → points and ranking models by
"value per minute" (or per token) is a **client/dashboard concern**, so you can retune the curve any
time without migrating or re-scoring stored data.

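The collapse of several reports into one finding can be sketched in Go as follows. This is a minimal illustration, not the store's actual implementation: the `Location` and `Report` types, the `findingID` and `collapse` helpers, the NUL separator, and the use of SHA-256 are all assumptions made for the example. The only thing taken from the description above is that a finding is keyed by `repo + pr + lens + file + line`.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strconv"
	"strings"
)

// Location holds the five fields a finding is addressed by.
// The type name and field types are hypothetical.
type Location struct {
	Repo string
	PR   int
	Lens string
	File string
	Line int
}

// Report is one model's claim about a location (hypothetical shape).
type Report struct {
	Loc   Location
	Model string
}

// findingID derives a stable key from a location. Joining with a NUL byte
// and hashing with SHA-256 are assumptions for this sketch.
func findingID(l Location) string {
	parts := []string{l.Repo, strconv.Itoa(l.PR), l.Lens, l.File, strconv.Itoa(l.Line)}
	sum := sha256.Sum256([]byte(strings.Join(parts, "\x00")))
	return hex.EncodeToString(sum[:8])
}

// collapse groups reporting models by finding ID, so several reports of the
// same location become one finding with many reporters.
func collapse(reports []Report) map[string][]string {
	byFinding := make(map[string][]string)
	for _, r := range reports {
		id := findingID(r.Loc)
		byFinding[id] = append(byFinding[id], r.Model)
	}
	return byFinding
}

func main() {
	shared := Location{Repo: "steve/gadfly", PR: 7, Lens: "security", File: "main.go", Line: 42}
	unique := Location{Repo: "steve/gadfly", PR: 7, Lens: "security", File: "main.go", Line: 99}
	reports := []Report{
		{Loc: shared, Model: "model-a"},
		{Loc: shared, Model: "model-b"},
		{Loc: unique, Model: "model-a"},
	}
	for id, models := range collapse(reports) {
		fmt.Printf("%s reported by %d model(s): %v\n", id, len(models), models)
	}
}
```

The length of each reporter list is the kind of count the consensus and solo metrics rely on: two reporters for the shared location, one for the unique one.
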
## Run it

```sh
# from source
go run gitea.stevedudenhoeffer.com/steve/gadfly-reports@latest serve

# or Docker (image published by CI on every push to main)
docker run -d --name gadfly-reports -p 8090:8090 -v gadfly-reports-data:/data \
  -e GADFLY_REPORTS_TOKEN=change-me \
  gitea.stevedudenhoeffer.com/steve/gadfly-reports:latest
```

### Deploy behind Traefik (expose over a domain)

```yaml
# docker-compose.yml — publish gadfly-reports at https://reports.example.com via Traefik.
services:
  gadfly-reports:
    image: gitea.stevedudenhoeffer.com/steve/gadfly-reports:latest
    restart: unless-stopped
    environment:
      # Auth is built in: callers (gadfly emit, gadfly-mcp) send this as a bearer
      # token; /healthz stays open. ADDR and DB default to :8090 and
      # /data/gadfly-reports.db inside the image.
      GADFLY_REPORTS_TOKEN: ${GADFLY_REPORTS_TOKEN:?set GADFLY_REPORTS_TOKEN in .env}
    volumes:
      - gadfly-reports-data:/data
    networks: [traefik]
    healthcheck:
      test: ["CMD", "wget", "-q", "-O", "-", "http://localhost:8090/healthz"]
      # Fast probe so Traefik resumes routing within ~1s of a restart (the daemon
      # binds the port in milliseconds). Without a fast probe Traefik 502s until the
      # first check — the usual "why is it down for 30s after restart".
      interval: 5s
      timeout: 3s
      retries: 3
      start_period: 10s
      start_interval: 1s # probe every 1s during start_period (needs Docker 25+)
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.gadfly-reports.rule=Host(`reports.example.com`)"
      - "traefik.http.routers.gadfly-reports.entrypoints=websecure"
      - "traefik.http.routers.gadfly-reports.tls=true"
      - "traefik.http.routers.gadfly-reports.tls.certresolver=letsencrypt"
      - "traefik.http.services.gadfly-reports.loadbalancer.server.port=8090"

volumes:
  gadfly-reports-data:

networks:
  traefik:
    external: true # the network your Traefik instance is attached to
```

Put `GADFLY_REPORTS_TOKEN=<secret>` in a `.env` beside the compose file. Tailor the three
Traefik bits to your setup — the **host** (`reports.example.com`), the **entrypoint**
(`websecure`) and the **certresolver** (`letsencrypt`) must match your Traefik config, and the
`traefik` network must be the external one Traefik watches. Traefik terminates TLS and forwards
to the container's `:8090`. Then point `gadfly`'s `GADFLY_FINDINGS_URL` and `gadfly-mcp`'s
`--store` at `https://reports.example.com` (with the same token).

On `docker compose pull && docker compose up -d`, the fast healthcheck lets Traefik resume routing
within ~1s (the daemon starts in milliseconds — Traefik just won't route to a container whose health
probe hasn't passed yet, which is the "down for 30s after restart" gotcha). Your data lives on the
`gadfly-reports-data` volume and survives restarts; the only loss exposure is a review POSTing
findings during that ~1s window, since gadfly's emit is fire-and-forget (no retry) — negligible
against reviews that take minutes.

## HTTP API (the canonical contract)

| Method & path | Body / query | Purpose |
|---|---|---|
| `GET /healthz` | — | liveness (open even when a token is set) |
| `GET /` · `GET /ui` | — | **view-only dashboard** — HTML shell, public; its JS fetches the gated endpoints with the token |
| `POST /runs` | one run object | upsert a model's review of a PR (timing/tokens) |
| `POST /reports` | JSON **array** of report objects | record findings + which model reported each |
| `POST /findings/{id}/grade` | `{is_real, severity?, usefulness?, notes?, grader?}` | record a triage grade |
| `GET /export` | — | flat report×finding×run×latest-grade rows — the dashboard feed |
| `GET /runs` | — | list all runs (timing/tokens), oldest first |
| `GET /scoreboard` | — | points-free per-model rollup |

`POST /runs` body: `{run_id, repo, pr, model, provider, lenses, duration_secs, input_tokens?, output_tokens?, cost_usd?}`
(re-posting the same `run_id` updates it).

`POST /reports` array element: `{repo, pr, lens, file, line, title, model, provider, run_id, raw_severity, detail}`.

`GET /scoreboard` element: `{model, provider, runs, minutes, input_tokens, output_tokens, findings, confirmed, false_positive, ungraded, by_severity:{severity:count}}`.

If `GADFLY_REPORTS_TOKEN` is set, every route except `/healthz` and the public view shell (`/`, `/ui`)
requires `Authorization: Bearer <token>`. The `/ui` shell carries no data itself — its JS sends the
token on each fetch — so the public shell leaks nothing.

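As an illustration of the contract above, here is a minimal Go sketch that builds an authenticated `POST /reports` request. The JSON field names come from the documented array element; everything else is an assumption for the example: the `Report` struct, the Go types chosen for each field, the `newReportsRequest` helper, the `Content-Type` header, and the sample values. The request is only constructed, not sent.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// Report mirrors the documented POST /reports array element. The Go types
// (int for pr and line, string elsewhere) are assumptions.
type Report struct {
	Repo        string `json:"repo"`
	PR          int    `json:"pr"`
	Lens        string `json:"lens"`
	File        string `json:"file"`
	Line        int    `json:"line"`
	Title       string `json:"title"`
	Model       string `json:"model"`
	Provider    string `json:"provider"`
	RunID       string `json:"run_id"`
	RawSeverity string `json:"raw_severity"`
	Detail      string `json:"detail"`
}

// newReportsRequest builds (but does not send) a POST /reports request.
// An empty token leaves the Authorization header off, matching an open store.
func newReportsRequest(baseURL, token string, reports []Report) (*http.Request, error) {
	body, err := json.Marshal(reports) // the endpoint expects a JSON array
	if err != nil {
		return nil, err
	}
	req, err := http.NewRequest(http.MethodPost, baseURL+"/reports", bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json") // assumed, not documented
	if token != "" {
		req.Header.Set("Authorization", "Bearer "+token)
	}
	return req, nil
}

func main() {
	// All values below are illustrative placeholders.
	req, err := newReportsRequest("https://reports.example.com", "change-me", []Report{{
		Repo: "steve/gadfly", PR: 7, Lens: "security", File: "main.go", Line: 42,
		Title: "example finding", Model: "model-a", Provider: "example",
		RunID: "run-1", RawSeverity: "Blocking", Detail: "illustrative only",
	}})
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Method, req.URL.String())
	fmt.Println("Authorization:", req.Header.Get("Authorization"))
	// To actually send it: resp, err := http.DefaultClient.Do(req)
}
```

The same pattern would apply to `POST /runs` and `POST /findings/{id}/grade`, with a single object as the body instead of an array.
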
## Configuration

| Env | Default | Meaning |
|-----|---------|---------|
| `GADFLY_REPORTS_ADDR` | `:8090` | listen address |
| `GADFLY_REPORTS_DB` | `gadfly-reports.db` (`/data/gadfly-reports.db` in Docker) | SQLite path |
| `GADFLY_REPORTS_TOKEN` | *(empty)* | bearer token callers must present (empty = open) |

CLI flags `--addr` / `--db` / `--token` override the env.

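The precedence described above (built-in default, then environment, then CLI flag) can be sketched with Go's standard `flag` package. This is an assumption about how such a lookup might be written, not the binary's actual code: `Config`, `envOr`, and `loadConfig` are hypothetical names, and treating an empty environment variable as unset is a choice made for the example.

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// Config holds the three documented settings (hypothetical struct).
type Config struct {
	Addr  string
	DB    string
	Token string
}

// envOr returns the environment value for key, or fallback when it is empty.
func envOr(getenv func(string) string, key, fallback string) string {
	if v := getenv(key); v != "" {
		return v
	}
	return fallback
}

// loadConfig seeds each flag's default from the environment, so a flag given
// on the command line overrides the env, and the env overrides the default.
func loadConfig(args []string, getenv func(string) string) (Config, error) {
	var c Config
	fs := flag.NewFlagSet("serve", flag.ContinueOnError)
	fs.StringVar(&c.Addr, "addr", envOr(getenv, "GADFLY_REPORTS_ADDR", ":8090"), "listen address")
	fs.StringVar(&c.DB, "db", envOr(getenv, "GADFLY_REPORTS_DB", "gadfly-reports.db"), "SQLite path")
	fs.StringVar(&c.Token, "token", getenv("GADFLY_REPORTS_TOKEN"), "bearer token (empty = open)")
	err := fs.Parse(args)
	return c, err
}

func main() {
	c, err := loadConfig(os.Args[1:], os.Getenv)
	if err != nil {
		os.Exit(2)
	}
	fmt.Printf("addr=%s db=%s auth=%t\n", c.Addr, c.DB, c.Token != "")
}
```

Passing `getenv` as a parameter keeps the lookup easy to exercise without touching the real process environment.
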
## Dashboard

A built-in **read-only dashboard** ships at **`/ui`** (hit the host root and you're redirected
there). It's a single self-contained page that pulls `/runs` + `/export` and does everything in your
browser: a **per-model performance table** — runs, minutes, findings, confirmed / false-positive /
ungraded, points, **points-per-minute**, points-per-run, by-severity — with **drill-down filters**
(date range, repo, provider, model, lens, grade/severity), free-text search, and a click-to-scope
findings detail table.

True to the store's "no points" rule, **scoring lives in the browser**: the page has an editable
points curve (default `trivial=1, small=3, medium=5, high=8, critical=20`) and computes
`points = Σ weight[severity]·count` and `value/min = points / minutes` on the fly — retune it without
touching stored data.

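The two formulas above can be worked through in a few lines of Go. The dashboard itself computes this in the browser, so this is only a sketch of the arithmetic: the function names, the sample counts, and returning `0` when no minutes are recorded are assumptions for the example. The curve values are the documented defaults.

```go
package main

import "fmt"

// defaultCurve is the dashboard's default points curve.
var defaultCurve = map[string]float64{
	"trivial": 1, "small": 3, "medium": 5, "high": 8, "critical": 20,
}

// points computes the sum of weight[severity] * count over confirmed findings.
func points(curve map[string]float64, bySeverity map[string]int) float64 {
	total := 0.0
	for severity, count := range bySeverity {
		total += curve[severity] * float64(count)
	}
	return total
}

// valuePerMinute returns points / minutes. Returning 0 when no time was
// recorded is an assumption of this sketch.
func valuePerMinute(pts, minutes float64) float64 {
	if minutes <= 0 {
		return 0
	}
	return pts / minutes
}

func main() {
	// Hypothetical confirmed counts for one model over 4 minutes of review.
	confirmed := map[string]int{"small": 2, "high": 1, "critical": 1}
	pts := points(defaultCurve, confirmed)
	fmt.Printf("points=%.0f value/min=%.2f\n", pts, valuePerMinute(pts, 4))
	// prints "points=34 value/min=8.50" (2*3 + 1*8 + 1*20 = 34; 34 / 4 = 8.5)
}
```

Because the curve is just a lookup table applied at read time, changing a weight re-scores every model without touching the stored grades.
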
There's also an editable **false-positive penalty ×** (default `-0.5`). A false positive has no
graded severity, so it's penalized by the severity the model **claimed** (its lens verdict —
Blocking→high, Minor→small): `penalty × points[claimed]`. So a Blocking-claimed FP at `-0.5` costs
`high(8) × -0.5 = -4`, and a model with the odd good find but many false positives nets *down* —
even negative — instead of coasting on its hits.

And an editable **solo-find bonus ×** (default `1.5`). Because findings are content-addressed, the
number of models that reported one is known, so a confirmed finding that **only that model** caught
(no other model reported it) scores `severity × bonus` — rewarding catching what the swarm missed.
The `solo` column counts those. This is derived from the data (reporter count); the grader never has
to flag it. Set the bonus to `1` to disable.

Its mirror, **solo-error penalty ×** (default `1.5`), multiplies the FP penalty when a false positive
was made by **only that model** — a unique wrong claim is noisier than a shared mistake. So a
Blocking-claimed solo FP costs `high(8) × -0.5 × 1.5 = -6` vs `-4` for a shared one. Set to `1` to disable.

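The three multipliers above combine into a single per-report score, sketched here in Go. The real logic runs client-side in the dashboard, so the `Weights` and `GradedReport` types and the `score` function are hypothetical. Only the two documented verdict mappings (Blocking→high, Minor→small) are included; giving any other verdict a score of `0` is an assumption of this sketch.

```go
package main

import "fmt"

// Weights bundles the editable dashboard knobs with their documented defaults.
type Weights struct {
	Curve         map[string]float64
	FPPenalty     float64 // false-positive penalty multiplier, default -0.5
	SoloBonus     float64 // solo-find bonus multiplier, default 1.5
	SoloErrorMult float64 // solo-error penalty multiplier, default 1.5
}

// claimedSeverity maps a lens verdict to the severity used to price a false
// positive. Only the two mappings named in the text are included.
var claimedSeverity = map[string]string{"Blocking": "high", "Minor": "small"}

// GradedReport is one model's report of a graded finding (hypothetical shape).
type GradedReport struct {
	IsReal    bool
	Severity  string // graded severity, used when IsReal is true
	Verdict   string // lens verdict the model claimed, used for false positives
	Reporters int    // how many models reported this finding
}

// score prices a single graded report for the model that made it.
func score(w Weights, r GradedReport) float64 {
	solo := r.Reporters == 1
	if r.IsReal {
		pts := w.Curve[r.Severity]
		if solo {
			pts *= w.SoloBonus // only this model caught it
		}
		return pts
	}
	pts := w.Curve[claimedSeverity[r.Verdict]] * w.FPPenalty
	if solo {
		pts *= w.SoloErrorMult // only this model made the wrong claim
	}
	return pts
}

func main() {
	w := Weights{
		Curve:         map[string]float64{"trivial": 1, "small": 3, "medium": 5, "high": 8, "critical": 20},
		FPPenalty:     -0.5,
		SoloBonus:     1.5,
		SoloErrorMult: 1.5,
	}
	fmt.Println(score(w, GradedReport{IsReal: true, Severity: "high", Reporters: 1})) // prints 12
	fmt.Println(score(w, GradedReport{Verdict: "Blocking", Reporters: 3}))           // prints -4
	fmt.Println(score(w, GradedReport{Verdict: "Blocking", Reporters: 1}))           // prints -6
}
```

Setting `SoloBonus` or `SoloErrorMult` to `1` leaves the base score unchanged, which matches the documented way to disable each one.
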
Auth: the `/ui` shell is public (it holds no data); paste the store token into its **connect** box,
or open `/ui?token=<token>` once (remembered in `localStorage`). Prefer your own dashboard? Point
Grafana/Metabase/etc. at the SQLite file or the same `/export` + `/scoreboard` + `/runs` JSON.

## How it fits together

- **[gadfly](https://gitea.stevedudenhoeffer.com/steve/gadfly)** POSTs findings here after each
  review when `GADFLY_FINDINGS_URL` points at this store (advisory; off by default).
- **[gadfly-mcp](https://gitea.stevedudenhoeffer.com/steve/gadfly-mcp)** is the MCP server Claude
  uses to list findings and record grades against this store.

## Build / test

```sh
go build ./...
go test ./...
gofmt -l .   # must be clean
```

## License

MIT © 2026 Steve Dudenhoeffer.