cnx-network-clan/TODO.md
Berwn 4c7c74836d Add vmalert alerting rules for DNS and host health
vmalert on control evaluates rules (declared in git) against VictoriaMetrics and
remote-writes alert state back, so firing alerts show as the ALERTS series in
Grafana. Covers SOA divergence between ns1/ns2, secondary zone expiry, scrape
target down, and root disk full. No notifier yet (notifier.blackhole). Also adds
TODO.md roadmap.
2026-06-17 14:49:32 +07:00

Infra roadmap

Prioritized backlog for the cnx-network clan. See docs/ for how the current pieces work.

1. Alerting (done — pending deploy)

Rules evaluated by vmalert against VictoriaMetrics on control, declared in modules/monitoring/alerts.nix:

  • SOA serial divergence between ns1 and ns2 (secondary out of sync)
  • Zone-expiry countdown on the secondary approaching zero (transfers failing)
  • Any scrape target down (up == 0)
  • Root filesystem nearly full
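For illustration, the four rules above might look roughly like the following. This is a sketch, not the contents of modules/monitoring/alerts.nix. It assumes the nixpkgs `services.vmalert` module and its `rules` option (the option path can differ between NixOS releases) and node_exporter for the filesystem metrics. The Knot metric names `knot_zone_serial` and `knot_zone_expiration_seconds`, the `instance="ns2"` label, and every threshold and duration are hypothetical placeholders.

```nix
{
  # Sketch only: the real rules live in modules/monitoring/alerts.nix.
  services.vmalert.rules.groups = [
    {
      name = "dns-and-hosts"; # hypothetical group name
      rules = [
        {
          alert = "SoaSerialDivergence";
          # Fires for a zone when the highest and lowest reported serial differ.
          # knot_zone_serial is a placeholder metric name.
          expr = ''max by (zone) (knot_zone_serial) != min by (zone) (knot_zone_serial)'';
          for = "15m";
        }
        {
          alert = "SecondaryZoneExpiringSoon";
          # Placeholder metric and label: seconds until the zone expires on ns2.
          expr = ''knot_zone_expiration_seconds{instance="ns2"} < 86400'';
          for = "10m";
        }
        {
          alert = "ScrapeTargetDown";
          # "up" is set to 0 by the scraper when a target cannot be scraped.
          expr = ''up == 0'';
          for = "5m";
        }
        {
          alert = "RootFilesystemNearlyFull";
          # node_exporter metrics: less than 10% of the root filesystem is available.
          expr = ''node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} < 0.10'';
          for = "15m";
        }
      ];
    }
  ];
}
```

The `for` durations keep a brief blip, such as a serial bump that has not yet transferred to the secondary, from firing an alert immediately.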

Delivery stays minimal for now (notifier.blackhole): vmalert remote-writes alert state back to VictoriaMetrics, so firing alerts show up as the ALERTS series in Grafana, but no notification is sent anywhere. Wiring a real notifier (Matrix) is a later step: drop notifier.blackhole and set settings."notifier.url" to the address of an Alertmanager.
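The delivery setup described here could be sketched as below. It assumes the nixpkgs `services.vmalert.settings` option, which passes each key to vmalert as a command-line flag. The URLs are assumptions: 8428 is VictoriaMetrics' default HTTP port and 9093 is Alertmanager's default port, but the actual addresses on control may differ.

```nix
{
  services.vmalert.settings = {
    # Where rule expressions are evaluated (assumed local VictoriaMetrics).
    "datasource.url" = "http://localhost:8428";
    # Alert state is written back here, which is what makes the
    # ALERTS series queryable from Grafana.
    "remoteWrite.url" = "http://localhost:8428";
    # Current state: evaluate rules but send notifications nowhere.
    "notifier.blackhole" = true;
    # Later step: remove the line above and point at an Alertmanager instead.
    # "notifier.url" = [ "http://localhost:9093" ];
  };
}
```

vmalert does not accept `notifier.blackhole` together with `notifier.url`, so the switch is a replacement of one setting by the other rather than an addition.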

2. Backups of critical state

  • DNSSEC key material on ns1 (KSK/ZSK in Knot's KASP store) — losing it forces an emergency DS rollover at the registrar
  • VictoriaMetrics TSDB on control (optional, retention is 180d)

3. Blackbox DNS probing

  • blackbox_exporter on control doing real DNS + DNSSEC-validation queries against ns1/ns2 — catches outside-in resolution failures the Knot stats miss
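A minimal sketch of such a probe follows. It assumes the nixpkgs `services.prometheus.exporters.blackbox` module. The module name `dns_soa` and the zone `example.org` are placeholders. It only checks that a nameserver answers an SOA query with NOERROR, and does not cover the DNSSEC-validation part.

```nix
{ pkgs, ... }:
{
  services.prometheus.exporters.blackbox = {
    enable = true;
    # JSON is valid YAML, so builtins.toJSON is enough to produce the config file.
    configFile = pkgs.writeText "blackbox.yml" (builtins.toJSON {
      modules.dns_soa = {
        prober = "dns";
        timeout = "5s";
        dns = {
          query_name = "example.org"; # placeholder: a zone served by ns1/ns2
          query_type = "SOA";
          valid_rcodes = [ "NOERROR" ];
          transport_protocol = "udp";
        };
      };
    });
  };
}
```

The scraper would then request `/probe?module=dns_soa&target=<nameserver address>` once per nameserver, and an alert on `probe_success == 0` would flag a server that stops answering from the outside.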

4. Third secondary off Hetzner (resilience)

  • A secondary nameserver on a different provider/network so a single-provider outage doesn't take all authoritative DNS down (architectural — new machine)

5. Centralized logs

  • VictoriaLogs on control to grep journald across all three hosts, pairing with the existing VictoriaMetrics setup
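One possible shape for this, sketched under several assumptions: that the nixpkgs release in use ships the `services.victorialogs` and `services.journald.upload` modules, that VictoriaLogs listens on its default port 9428, and that `control.example` stands in for the real address of control. The attribute names `control` and `everyHost` are only labels for where each fragment would go.

```nix
{
  # Fragment for the control machine: run VictoriaLogs.
  control = {
    services.victorialogs.enable = true;
  };

  # Fragment for each of the three hosts: ship journald entries
  # to VictoriaLogs' journald ingestion endpoint.
  everyHost = {
    services.journald.upload = {
      enable = true;
      settings.Upload.URL = "http://control.example:9428/insert/journald";
    };
  };
}
```

Using systemd-journal-upload avoids running a separate log shipper on each host, at the cost of depending on VictoriaLogs' journald endpoint.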