4c7c74836d
vmalert on control evaluates rules (declared in git) against VictoriaMetrics and remote-writes alert state back, so firing alerts show as the ALERTS series in Grafana. Covers SOA divergence between ns1/ns2, secondary zone expiry, scrape target down, and root disk full. No notifier yet (notifier.blackhole). Also adds TODO.md roadmap.
Infra roadmap
Prioritized backlog for the cnx-network clan. See docs/ for how the current
pieces work.
1. Alerting (done — pending deploy)
Rules evaluated by vmalert against VictoriaMetrics on control, declared in
modules/monitoring/alerts.nix:
- SOA serial divergence between ns1 and ns2 (secondary out of sync)
- Zone-expiry countdown on the secondary approaching zero (transfers failing)
- Any scrape target down (up == 0)
- Root filesystem nearly full
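As an illustration of how these four rules might be declared, here is a minimal sketch using the NixOS services.vmalert.rules option. It is not the actual content of modules/monitoring/alerts.nix. The knot_* metric names, the instance label patterns, the alert names, and every threshold and `for` duration are assumptions; only `up` and the node_exporter filesystem metrics are standard names. Some NixOS releases nest these options under services.vmalert.instances.<name> instead.

```nix
# Sketch only: knot_* metric names, label matchers and thresholds are assumed.
{
  services.vmalert.rules.groups = [
    {
      name = "dns";
      rules = [
        {
          alert = "SoaSerialDivergence";
          # Compares the serial per zone across the two hosts; assumes both
          # series share every label except `instance`.
          expr = ''
            knot_zone_serial{instance=~"ns1.*"}
              != ignoring(instance)
            knot_zone_serial{instance=~"ns2.*"}
          '';
          # Tolerate normal transfer lag before firing.
          for = "15m";
        }
        {
          alert = "SecondaryZoneExpiring";
          # Hypothetical gauge: seconds left before the secondary expires the zone.
          expr = ''knot_zone_expiration_seconds{instance=~"ns2.*"} < 86400'';
          for = "10m";
        }
      ];
    }
    {
      name = "hosts";
      rules = [
        {
          alert = "ScrapeTargetDown";
          expr = "up == 0";
          for = "5m";
        }
        {
          alert = "RootDiskNearlyFull";
          # node_exporter metrics; fires below 10% free space on /.
          expr = ''
            node_filesystem_avail_bytes{mountpoint="/"}
              / node_filesystem_size_bytes{mountpoint="/"} < 0.10
          '';
          for = "15m";
        }
      ];
    }
  ];
}
```

The `for` durations keep short-lived conditions, such as a secondary that is a few minutes behind after a zone update, from showing up as firing alerts.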
Delivery stays minimal for now (notifier.blackhole): vmalert remote-writes
alert state back to VM, so firing alerts show up as the ALERTS series in
Grafana. Wiring a real notifier (Matrix) is a later step: drop notifier.blackhole
and set settings."notifier.url" to an Alertmanager URL.
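The switch described above could look roughly like the following sketch. It assumes the services.vmalert.settings option passes each key through as a vmalert command-line flag, that VictoriaMetrics listens on its default port 8428 on the same host, and that a future Alertmanager would listen on its default port 9093. The URLs are placeholders, not values taken from this repository.

```nix
# Sketch only: URLs are placeholders and the option layout is assumed.
{
  services.vmalert.settings = {
    # Where rules are evaluated.
    "datasource.url" = "http://localhost:8428";
    # Where alert state (the ALERTS series) is written and restored from.
    "remoteWrite.url" = "http://localhost:8428";
    "remoteRead.url" = "http://localhost:8428";

    # Current state: no delivery, alerts are only visible in Grafana.
    "notifier.blackhole" = true;

    # Later step: remove the blackhole line above and point at Alertmanager.
    # "notifier.url" = [ "http://localhost:9093" ];
  };
}
```

vmalert rejects a configuration that sets both notifier.blackhole and notifier.url, so the two lines are alternatives rather than something to enable together.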
2. Backups of critical state
- DNSSEC key material on ns1 (KSK/ZSK in Knot's KASP store) — losing it forces an emergency DS rollover at the registrar
- VictoriaMetrics TSDB on control (optional, retention is 180d)
3. Blackbox DNS probing
- blackbox_exporter on control doing real DNS + DNSSEC-validation queries against ns1/ns2, catching outside-in resolution failures the Knot stats miss
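A possible starting point for the probe, sketched with the NixOS services.prometheus.exporters.blackbox module. The module name dns_soa and the zone example.org are hypothetical, and this covers only a plain SOA query over UDP; the DNSSEC-validation part of the item is not shown here and would need a separate design. The ns1/ns2 targets would be supplied by the scrape configuration, which is also omitted.

```nix
# Sketch only: module name and zone are placeholders; no DNSSEC check here.
{ pkgs, ... }:
{
  services.prometheus.exporters.blackbox = {
    enable = true;
    configFile = (pkgs.formats.yaml { }).generate "blackbox.yml" {
      modules.dns_soa = {
        prober = "dns";
        dns = {
          # Ask the probed nameserver for the zone's SOA record.
          query_name = "example.org";
          query_type = "SOA";
          transport_protocol = "udp";
          # Treat anything other than NOERROR as a failed probe.
          valid_rcodes = [ "NOERROR" ];
        };
      };
    };
  };
}
```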
4. Third secondary off Hetzner (resilience)
- A secondary nameserver on a different provider/network so a single-provider outage doesn't take all authoritative DNS down (architectural — new machine)
5. Centralized logs
- VictoriaLogs on control to grep journald across all three hosts, pairing with the existing VictoriaMetrics setup