🪰📋 gadfly-reports

A small durable store + scoreboard for Gadfly review findings. Gadfly (and any CI) POST each model's findings and per-review timing here; a human or Claude — via gadfly-mcp — later grades each finding. It's a single Go binary backed by SQLite, speaking a tiny HTTP API.

🤖 Heads up: this is a vibe-coded project

gadfly-reports was built almost entirely by an AI agent (Claude Code) — the design, the code, and these docs. It's small and it's tested, but treat it accordingly: it's a homelab-grade service, not a hardened product, and there may be the occasional AI-flavored rough edge. Issues and PRs welcome.

What it stores — and what it deliberately doesn't

gadfly-reports is a pure fact store:

  • runs — one per model's review of a PR: wall-clock duration, lens count, optional token/cost.
  • findings — content-addressed by location (repo + pr + lens + file + line), so the same issue raised by several models collapses to one finding with many reports. That collapse is what makes cross-model consensus and per-model precision measurable.
  • grades — a triage verdict per finding: is_real, severity (trivial|small|medium|high|critical), optional usefulness (1–5), notes, grader. Grade history is kept; the latest wins.

It stores no points and computes no rankings. Mapping severity → points and ranking models by "value per minute" (or per token) is a client/dashboard concern, so you can retune the curve any time without migrating or re-scoring stored data.
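The content-addressing of findings described above can be sketched in a few lines of Go. This is only an illustration: the `Location` type, the `findingID` function, the SHA-256 hash and the NUL separator are all assumptions, and the store's real ID scheme may differ. The point it shows is that two models reporting the same location land on one finding with two reports.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// Location holds the fields the README says a finding is addressed by.
// Field types are assumptions.
type Location struct {
	Repo string
	PR   int
	Lens string
	File string
	Line int
}

// findingID is a hypothetical derivation: hash the location fields joined by
// a NUL byte and keep the first 8 bytes as hex. The real scheme may differ.
func findingID(l Location) string {
	key := strings.Join([]string{
		l.Repo, fmt.Sprint(l.PR), l.Lens, l.File, fmt.Sprint(l.Line),
	}, "\x00")
	sum := sha256.Sum256([]byte(key))
	return hex.EncodeToString(sum[:8])
}

func main() {
	loc := Location{Repo: "steve/app", PR: 42, Lens: "security", File: "auth.go", Line: 17}

	// Two models report the same location: one finding, two reports.
	reporters := map[string][]string{}
	for _, model := range []string{"model-a", "model-b"} {
		id := findingID(loc)
		reporters[id] = append(reporters[id], model)
	}
	fmt.Println(len(reporters), "finding,", len(reporters[findingID(loc)]), "reports")
	// prints "1 finding, 2 reports"
}
```

Because the title and model are not part of the key in this sketch, differently worded reports of the same line still collapse, which is what makes the reporter count meaningful.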

Run it

# from source
go run gitea.stevedudenhoeffer.com/steve/gadfly-reports@latest serve

# or Docker (image published by CI on every push to main)
docker run -d --name gadfly-reports -p 8090:8090 -v gadfly-reports-data:/data \
  -e GADFLY_REPORTS_TOKEN=change-me \
  gitea.stevedudenhoeffer.com/steve/gadfly-reports:latest

Deploy behind Traefik (expose over a domain)

# docker-compose.yml — publish gadfly-reports at https://reports.example.com via Traefik.
services:
  gadfly-reports:
    image: gitea.stevedudenhoeffer.com/steve/gadfly-reports:latest
    restart: unless-stopped
    environment:
      # Auth is built in: callers (gadfly emit, gadfly-mcp) send this as a bearer
      # token; /healthz stays open. ADDR and DB default to :8090 and
      # /data/gadfly-reports.db inside the image.
      GADFLY_REPORTS_TOKEN: ${GADFLY_REPORTS_TOKEN:?set GADFLY_REPORTS_TOKEN in .env}
    volumes:
      - gadfly-reports-data:/data
    networks: [traefik]
    healthcheck:
      test: ["CMD", "wget", "-q", "-O", "-", "http://localhost:8090/healthz"]
      # Fast probe so Traefik resumes routing within ~1s of a restart (the daemon
      # binds the port in milliseconds). Without a fast probe Traefik 502s until the
      # first check — the usual "why is it down for 30s after restart".
      interval: 5s
      timeout: 3s
      retries: 3
      start_period: 10s
      start_interval: 1s   # probe every 1s during start_period (needs Docker 25+)
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.gadfly-reports.rule=Host(`reports.example.com`)"
      - "traefik.http.routers.gadfly-reports.entrypoints=websecure"
      - "traefik.http.routers.gadfly-reports.tls=true"
      - "traefik.http.routers.gadfly-reports.tls.certresolver=letsencrypt"
      - "traefik.http.services.gadfly-reports.loadbalancer.server.port=8090"

volumes:
  gadfly-reports-data:

networks:
  traefik:
    external: true   # the network your Traefik instance is attached to

Put GADFLY_REPORTS_TOKEN=<secret> in a .env beside the compose file. Tailor the three Traefik bits to your setup — the host (reports.example.com), the entrypoint (websecure) and the certresolver (letsencrypt) must match your Traefik config, and the traefik network must be the external one Traefik watches. Traefik terminates TLS and forwards to the container's :8090. Then point gadfly's GADFLY_FINDINGS_URL and gadfly-mcp's --store at https://reports.example.com (with the same token).

On docker compose pull && docker compose up -d, the fast healthcheck lets Traefik resume routing within ~1s (the daemon starts in milliseconds — Traefik just won't route to a container whose health probe hasn't passed yet, which is the "down for 30s after restart" gotcha). Your data lives on the gadfly-reports-data volume and survives restarts; the only loss exposure is a review POSTing findings during that ~1s window, since gadfly's emit is fire-and-forget (no retry) — negligible against reviews that take minutes.

HTTP API (the canonical contract)

| Method & path | Body / query | Purpose |
|---|---|---|
| GET /healthz | | liveness (open even when a token is set) |
| GET / · GET /ui | | view-only dashboard — HTML shell, public; its JS fetches the gated endpoints with the token |
| POST /runs | one run object | upsert a model's review of a PR (timing/tokens) |
| POST /reports | JSON array of report objects | record findings + which model reported each |
| POST /findings/{id}/grade | {is_real, severity?, usefulness?, notes?, grader?} | record a triage grade |
| GET /export | | flat report×finding×run×latest-grade rows — the dashboard feed |
| GET /runs | | list all runs (timing/tokens), oldest first |
| GET /scoreboard | | points-free per-model rollup |

POST /runs body: {run_id, repo, pr, model, provider, lenses, duration_secs, input_tokens?, output_tokens?, cost_usd?} (re-posting the same run_id updates it).
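As a rough client-side illustration of the POST /runs body, the sketch below builds the request in Go. The field names come from the list above; the Go types (integer `pr` and `lenses`, float `duration_secs`), the `Run` struct and the `newRunRequest` helper are assumptions, not the project's own client code.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// Run mirrors the POST /runs body. Pointer fields are the optional ones
// (marked with ? above) and are omitted from the JSON when nil.
type Run struct {
	RunID        string   `json:"run_id"`
	Repo         string   `json:"repo"`
	PR           int      `json:"pr"`
	Model        string   `json:"model"`
	Provider     string   `json:"provider"`
	Lenses       int      `json:"lenses"`
	DurationSecs float64  `json:"duration_secs"`
	InputTokens  *int     `json:"input_tokens,omitempty"`
	OutputTokens *int     `json:"output_tokens,omitempty"`
	CostUSD      *float64 `json:"cost_usd,omitempty"`
}

// newRunRequest builds (but does not send) the upsert request.
// The bearer header is only added when a token is configured.
func newRunRequest(baseURL, token string, r Run) (*http.Request, error) {
	body, err := json.Marshal(r)
	if err != nil {
		return nil, err
	}
	req, err := http.NewRequest(http.MethodPost, baseURL+"/runs", bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	if token != "" {
		req.Header.Set("Authorization", "Bearer "+token)
	}
	return req, nil
}

func main() {
	req, err := newRunRequest("http://localhost:8090", "change-me", Run{
		RunID: "run-1", Repo: "steve/app", PR: 42, Model: "model-a",
		Provider: "example", Lenses: 5, DurationSecs: 93.5,
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Method, req.URL.String())
	// To send it: resp, err := http.DefaultClient.Do(req)
}
```

Re-sending the same `run_id` with corrected timing would update the stored run rather than create a second one, per the upsert behavior described above.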

POST /reports array element: {repo, pr, lens, file, line, title, model, provider, run_id, raw_severity, detail}.

GET /scoreboard element: {model, provider, runs, minutes, input_tokens, output_tokens, findings, confirmed, false_positive, ungraded, by_severity:{severity:count}}.
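One thing a client might derive from a scoreboard element is per-model precision. The sketch below decodes an element and computes confirmed / (confirmed + false_positive). The sample JSON is invented for illustration, the Go field types are assumptions, and `precision` is a hypothetical helper, not something the store returns.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ScoreRow mirrors one GET /scoreboard element. Numeric types are assumptions.
type ScoreRow struct {
	Model         string         `json:"model"`
	Provider      string         `json:"provider"`
	Runs          int            `json:"runs"`
	Minutes       float64        `json:"minutes"`
	InputTokens   int            `json:"input_tokens"`
	OutputTokens  int            `json:"output_tokens"`
	Findings      int            `json:"findings"`
	Confirmed     int            `json:"confirmed"`
	FalsePositive int            `json:"false_positive"`
	Ungraded      int            `json:"ungraded"`
	BySeverity    map[string]int `json:"by_severity"`
}

// precision ignores ungraded findings and reports ok=false when nothing
// has been graded yet, so callers never divide by zero.
func precision(r ScoreRow) (float64, bool) {
	graded := r.Confirmed + r.FalsePositive
	if graded == 0 {
		return 0, false
	}
	return float64(r.Confirmed) / float64(graded), true
}

func main() {
	// Invented sample data, shaped like the element described above.
	sample := `[{"model":"model-a","provider":"example","runs":3,"minutes":12.5,
		"input_tokens":1000,"output_tokens":200,"findings":10,"confirmed":6,
		"false_positive":2,"ungraded":2,"by_severity":{"high":1,"small":5}}]`

	var rows []ScoreRow
	if err := json.Unmarshal([]byte(sample), &rows); err != nil {
		panic(err)
	}
	for _, r := range rows {
		if p, ok := precision(r); ok {
			fmt.Printf("%s precision %.2f\n", r.Model, p)
		}
	}
	// prints "model-a precision 0.75"
}
```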

If GADFLY_REPORTS_TOKEN is set, every route except /healthz and the public view shell (/, /ui) requires Authorization: Bearer <token>. The /ui shell carries no data itself — its JS sends the token on each fetch — so the public shell leaks nothing.
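The gating rule can be pictured as a small piece of HTTP middleware. The following is a sketch of how such a check could look, not the store's actual code: `requireToken`, `okHandler` and `status` are hypothetical names, and the 401 response for a missing or wrong token is an assumption.

```go
package main

import (
	"crypto/subtle"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// requireToken lets /healthz, / and /ui through unconditionally and demands
// "Authorization: Bearer <token>" everywhere else. An empty token means open.
func requireToken(token string, next http.Handler) http.Handler {
	open := map[string]bool{"/healthz": true, "/": true, "/ui": true}
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if token == "" || open[r.URL.Path] {
			next.ServeHTTP(w, r)
			return
		}
		got := []byte(r.Header.Get("Authorization"))
		want := []byte("Bearer " + token)
		// Constant-time compare avoids leaking the token through timing.
		if subtle.ConstantTimeCompare(got, want) != 1 {
			http.Error(w, "unauthorized", http.StatusUnauthorized) // assumed status
			return
		}
		next.ServeHTTP(w, r)
	})
}

// okHandler stands in for the real routes.
var okHandler = http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
	w.WriteHeader(http.StatusOK)
})

// status runs one GET through the handler and returns the response code.
func status(h http.Handler, path, authHeader string) int {
	req := httptest.NewRequest(http.MethodGet, path, nil)
	if authHeader != "" {
		req.Header.Set("Authorization", authHeader)
	}
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	h := requireToken("change-me", okHandler)
	fmt.Println(
		status(h, "/healthz", ""),
		status(h, "/export", ""),
		status(h, "/export", "Bearer change-me"),
	)
	// prints "200 401 200"
}
```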

Configuration

| Env | Default | Meaning |
|---|---|---|
| GADFLY_REPORTS_ADDR | :8090 | listen address |
| GADFLY_REPORTS_DB | gadfly-reports.db (/data/gadfly-reports.db in Docker) | SQLite path |
| GADFLY_REPORTS_TOKEN | (empty) | bearer token callers must present (empty = open) |

CLI flags --addr / --db / --token override the env.
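The precedence (flag over env over built-in default) is a common pattern with Go's standard `flag` package: use the env value as the flag's default, so an explicit flag wins. The sketch below assumes that approach; `Config`, `envOr` and `loadConfig` are hypothetical names, and the binary may wire this differently.

```go
package main

import (
	"flag"
	"fmt"
)

// Config holds the three settings from the table above.
type Config struct{ Addr, DB, Token string }

// envOr returns the env value when it is non-empty, else the default.
func envOr(getenv func(string) string, key, def string) string {
	if v := getenv(key); v != "" {
		return v
	}
	return def
}

// loadConfig seeds each flag's default from the environment, so a flag
// given on the command line overrides the env, which overrides the default.
// getenv is injected (os.Getenv in real use) to keep the sketch testable.
func loadConfig(args []string, getenv func(string) string) (Config, error) {
	var c Config
	fs := flag.NewFlagSet("serve", flag.ContinueOnError)
	fs.StringVar(&c.Addr, "addr", envOr(getenv, "GADFLY_REPORTS_ADDR", ":8090"), "listen address")
	fs.StringVar(&c.DB, "db", envOr(getenv, "GADFLY_REPORTS_DB", "gadfly-reports.db"), "SQLite path")
	fs.StringVar(&c.Token, "token", envOr(getenv, "GADFLY_REPORTS_TOKEN", ""), "bearer token")
	err := fs.Parse(args)
	return c, err
}

func main() {
	env := map[string]string{"GADFLY_REPORTS_ADDR": ":9000"}
	c, err := loadConfig([]string{"--db", "/tmp/x.db"}, func(k string) string { return env[k] })
	if err != nil {
		panic(err)
	}
	fmt.Println(c.Addr, c.DB)
	// prints ":9000 /tmp/x.db" (addr from env, db from flag)
}
```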

Dashboard

A built-in read-only dashboard ships at /ui (hit the host root and you're redirected there). It's a single self-contained page that pulls /runs + /export and does everything in your browser: a per-model performance table — runs, minutes, findings, confirmed / false-positive / ungraded, points, points-per-minute, points-per-run, by-severity — with drill-down filters (date range, repo, provider, model, lens, grade/severity), free-text search, and a click-to-scope findings detail table.

True to the store's "no points" rule, scoring lives in the browser: the page has an editable points curve (default trivial=1, small=3, medium=5, high=8, critical=20) and computes points = Σ weight[severity]·count and value/min = points / minutes on the fly — retune it without touching stored data.
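The dashboard does this arithmetic in browser JavaScript; the same formula is sketched here in Go purely to make it concrete. The curve values are the defaults quoted above, while `points`, `valuePerMinute` and the sample counts are illustrative assumptions.

```go
package main

import "fmt"

// defaultCurve holds the default severity weights quoted above.
var defaultCurve = map[string]float64{
	"trivial": 1, "small": 3, "medium": 5, "high": 8, "critical": 20,
}

// points computes the sum over severities of weight[severity] * count.
func points(curve map[string]float64, bySeverity map[string]int) float64 {
	total := 0.0
	for sev, n := range bySeverity {
		total += curve[sev] * float64(n)
	}
	return total
}

// valuePerMinute is points / minutes, guarded against zero-duration runs.
func valuePerMinute(pts, minutes float64) float64 {
	if minutes <= 0 {
		return 0
	}
	return pts / minutes
}

func main() {
	// Invented counts: 4 small, 1 high, 1 critical confirmed finding.
	confirmed := map[string]int{"small": 4, "high": 1, "critical": 1}
	pts := points(defaultCurve, confirmed) // 4*3 + 1*8 + 1*20 = 40
	fmt.Println(pts, valuePerMinute(pts, 10))
	// prints "40 4"
}
```

Since only counts per severity are stored, swapping in a different curve map re-scores everything without touching the data.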

There's also an editable false-positive penalty × (default -0.5). A false positive has no graded severity, so it's penalized by the severity the model claimed (its lens verdict — Blocking→high, Minor→small): penalty × points[claimed]. So a Blocking-claimed FP at -0.5 costs high(8) × -0.5 = -4, and a model with the odd good find but many false positives nets down — even negative — instead of coasting on its hits.

And an editable solo-find bonus × (default 1.5). Because findings are content-addressed, the number of models that reported one is known, so a confirmed finding that only that model caught (no other model reported it) scores points[severity] × bonus — rewarding catching what the swarm missed. The solo column counts those. This is derived from the data (reporter count); the grader never has to flag it. Set the bonus to 1 to disable.

Its mirror, solo-error penalty × (default 1.5), multiplies the FP penalty when a false positive was made by only that model — a unique wrong claim is noisier than a shared mistake. So a Blocking-claimed solo FP costs high(8) × -0.5 × 1.5 = -6 vs -4 for a shared one. Set to 1 to disable.
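Putting the three multipliers together, the per-finding score could be sketched as below. The defaults and the worked numbers (-4, -6) come from the text above; the `Weights` and `Graded` types, the `score` function and the `claimedSeverity` map (which only covers the two verdicts named above) are illustrative assumptions, and the dashboard's own JavaScript may structure this differently.

```go
package main

import "fmt"

// Weights bundles the editable knobs with their documented defaults.
type Weights struct {
	Curve           map[string]float64
	FPPenalty       float64 // default -0.5
	SoloBonus       float64 // default 1.5
	SoloErrorFactor float64 // default 1.5
}

// claimedSeverity maps a lens verdict to a curve severity. Only the two
// mappings named in the text are included.
var claimedSeverity = map[string]string{"Blocking": "high", "Minor": "small"}

// Graded is one model's report of a graded finding.
type Graded struct {
	IsReal    bool
	Severity  string // graded severity, used when IsReal
	Verdict   string // the model's claimed lens verdict, used for false positives
	Reporters int    // how many models reported this finding
}

// score returns the points one report contributes to its model.
func score(w Weights, g Graded) float64 {
	solo := g.Reporters == 1
	if g.IsReal {
		p := w.Curve[g.Severity]
		if solo {
			p *= w.SoloBonus
		}
		return p
	}
	// A false positive has no graded severity, so price it by the claim.
	p := w.Curve[claimedSeverity[g.Verdict]] * w.FPPenalty
	if solo {
		p *= w.SoloErrorFactor
	}
	return p
}

func main() {
	w := Weights{
		Curve:     map[string]float64{"trivial": 1, "small": 3, "medium": 5, "high": 8, "critical": 20},
		FPPenalty: -0.5, SoloBonus: 1.5, SoloErrorFactor: 1.5,
	}
	sharedFP := score(w, Graded{Verdict: "Blocking", Reporters: 3})
	soloFP := score(w, Graded{Verdict: "Blocking", Reporters: 1})
	soloHit := score(w, Graded{IsReal: true, Severity: "high", Reporters: 1})
	fmt.Println(sharedFP, soloFP, soloHit)
	// prints "-4 -6 12"
}
```

Setting `SoloBonus` or `SoloErrorFactor` to 1 makes the solo branch a no-op, which matches the "set to 1 to disable" behavior described above.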

Hiding models. Each scoreboard row has a small × to hide that model — handy for retired ones (e.g. m1) you no longer want cluttering the view. Hidden models drop out of the table, the totals, and the findings drill-down (but not from solo-ness, which stays computed against all models — hiding is a view filter, not a rescoring). The hidden set persists in localStorage across reloads; a hidden (N): … bar lists them as click-to-restore chips, with a show all to clear.
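The order of operations matters here: solo-ness is computed over every model first, and the hidden set is applied afterwards. A minimal Go sketch of that ordering follows. It is simplified to raw reports (it ignores grades, whereas the dashboard counts confirmed solo finds), and `Report`, `soloFinds` and `visibleSolo` are hypothetical names.

```go
package main

import "fmt"

// Report links one model to one finding.
type Report struct{ FindingID, Model string }

// soloFinds counts, per model, the findings that only that model reported.
// It always looks at ALL reports, hidden models included.
func soloFinds(reports []Report) map[string]int {
	reporters := map[string]map[string]bool{}
	for _, r := range reports {
		if reporters[r.FindingID] == nil {
			reporters[r.FindingID] = map[string]bool{}
		}
		reporters[r.FindingID][r.Model] = true
	}
	solo := map[string]int{}
	for _, models := range reporters {
		if len(models) == 1 {
			for m := range models {
				solo[m]++
			}
		}
	}
	return solo
}

// visibleSolo applies the hidden set only after solo-ness is known,
// so hiding is a view filter and never changes another model's count.
func visibleSolo(reports []Report, hidden map[string]bool) map[string]int {
	out := map[string]int{}
	for m, n := range soloFinds(reports) {
		if !hidden[m] {
			out[m] = n
		}
	}
	return out
}

func main() {
	reports := []Report{
		{"f1", "m1"}, {"f1", "model-a"}, // shared with the retired m1
		{"f2", "model-a"}, // model-a alone
	}
	view := visibleSolo(reports, map[string]bool{"m1": true})
	fmt.Println(view["model-a"])
	// prints "1": f1 stays shared even though m1 is hidden
}
```

Had the hidden model's reports been dropped before counting, `model-a` would have been credited with two solo finds, which is exactly the faked solo-ness the dashboard avoids.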

Auth: the /ui shell is public (it holds no data); paste the store token into its connect box, or open /ui?token=<token> once (remembered in localStorage). Prefer your own dashboard? Point Grafana/Metabase/etc. at the SQLite file or the same /export + /scoreboard + /runs JSON.

How it fits together

  • gadfly POSTs findings here after each review when GADFLY_FINDINGS_URL points at this store (advisory; off by default).
  • gadfly-mcp is the MCP server Claude uses to list findings and record grades against this store.

Build / test

go build ./...
go test ./...
gofmt -l .   # must be clean

License

MIT © 2026 Steve Dudenhoeffer.
