Add vmalert alerting rules for DNS and host health
vmalert on control evaluates rules (declared in git) against VictoriaMetrics and remote-writes alert state back, so firing alerts show as the ALERTS series in Grafana. Covers SOA divergence between ns1/ns2, secondary zone expiry, scrape target down, and root disk full. No notifier yet (notifier.blackhole). Also adds TODO.md roadmap.
@@ -0,0 +1,40 @@
# Infra roadmap

Prioritized backlog for the cnx-network clan. See `docs/` for how the current
pieces work.

## 1. Alerting (done, pending deploy)

Rules evaluated by vmalert against VictoriaMetrics on control, declared in
`modules/monitoring/alerts.nix`:

- [x] SOA serial divergence between ns1 and ns2 (secondary out of sync)
- [x] Zone-expiry countdown on the secondary approaching zero (transfers failing)
- [x] Any scrape target down (`up == 0`)
- [x] Root filesystem nearly full

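Each of the four checks above reduces to a simple comparison. The Python sketch below illustrates that logic only; the real rules are vmalert expressions in `modules/monitoring/alerts.nix`, which this sketch does not reproduce. Every name and threshold in it (`Snapshot`, `firing_alerts`, the one-hour expiry margin, the 90% disk limit, the alert names) is hypothetical.

```python
from dataclasses import dataclass

# Hypothetical thresholds; the real values are whatever alerts.nix declares.
EXPIRY_MARGIN_SECONDS = 3600
DISK_USED_LIMIT = 0.90


@dataclass
class Snapshot:
    """One point-in-time view of the values the four rules look at."""
    soa_serial_ns1: int
    soa_serial_ns2: int
    zone_expiry_seconds: float   # countdown reported by the secondary
    up: dict[str, int]           # scrape target -> 1 (up) or 0 (down)
    root_used_fraction: float    # 0.0 .. 1.0


def firing_alerts(s: Snapshot) -> list[str]:
    """Return the names of the conditions that hold for this snapshot."""
    alerts = []
    if s.soa_serial_ns1 != s.soa_serial_ns2:
        alerts.append("SOASerialDivergence")
    if s.zone_expiry_seconds < EXPIRY_MARGIN_SECONDS:
        alerts.append("ZoneExpiryImminent")
    for target, value in sorted(s.up.items()):
        if value == 0:
            alerts.append(f"TargetDown:{target}")
    if s.root_used_fraction > DISK_USED_LIMIT:
        alerts.append("RootFilesystemNearlyFull")
    return alerts
```

A real rule would probably also require the condition to hold for some duration before firing, since the serials on ns1 and ns2 differ briefly during every normal zone transfer.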
Delivery stays minimal for now (`notifier.blackhole`): vmalert remote-writes
alert state back to VM, so firing alerts show up as the `ALERTS` series in
Grafana. Wiring a real notifier (Matrix) is a later step: drop `blackhole` and
point `settings."notifier.url"` at an Alertmanager.

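Until a notifier is wired up, the `ALERTS` series is the only place alert state is visible, so it can help to read it back outside Grafana. The sketch below queries the Prometheus-compatible instant-query endpoint that VictoriaMetrics exposes. It assumes VictoriaMetrics listens on its default port 8428 on localhost; the function names are hypothetical, and the actual listen address on control may differ.

```python
import json
import urllib.parse
import urllib.request

# Assumption: VictoriaMetrics is reachable on its default port on this host.
VM_URL = "http://localhost:8428"


def alerts_query_url(base_url: str = VM_URL) -> str:
    """Build an instant-query URL selecting currently firing alerts."""
    query = 'ALERTS{alertstate="firing"}'
    return f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": query})


def firing_alert_labels(payload: dict) -> list[dict]:
    """Extract the label sets from a Prometheus-style instant-query response."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload}")
    return [series["metric"] for series in payload["data"]["result"]]


def fetch_firing_alerts(base_url: str = VM_URL) -> list[dict]:
    """Query VictoriaMetrics and return one label dict per firing alert."""
    with urllib.request.urlopen(alerts_query_url(base_url), timeout=10) as resp:
        return firing_alert_labels(json.load(resp))
```

Calling `fetch_firing_alerts()` on control returns an empty list when nothing is firing, which makes it usable as a quick check after deploying the rules.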
## 2. Backups of critical state

- [ ] DNSSEC key material on ns1 (KSK/ZSK in Knot's KASP store); losing it forces
  an emergency DS rollover at the registrar
- [ ] VictoriaMetrics TSDB on control (optional, retention is 180d)

## 3. Blackbox DNS probing

- [ ] `blackbox_exporter` on control doing real DNS + DNSSEC-validation queries
  against ns1/ns2, which catches outside-in resolution failures the Knot stats miss

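To make "outside-in" concrete: a probe sends a real query over the network and inspects the reply, rather than trusting the server's own counters. The Python sketch below builds a plain SOA query by hand and reads the response header. It is an illustration of the idea, not a stand-in for `blackbox_exporter`: it performs no DNSSEC validation, handles only non-root ASCII names over UDP, and all function names are hypothetical.

```python
import socket
import struct

QTYPE_SOA = 6
QCLASS_IN = 1


def build_query(name: str, qtype: int = QTYPE_SOA, query_id: int = 0x1234) -> bytes:
    """Encode a single-question DNS query (no recursion desired)."""
    # Header: id, flags (0 = standard query), QDCOUNT=1, other counts 0.
    header = struct.pack("!HHHHHH", query_id, 0x0000, 1, 0, 0, 0)
    # QNAME: length-prefixed labels terminated by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", qtype, QCLASS_IN)


def parse_header(response: bytes) -> dict:
    """Decode the fixed 12-byte DNS header of a response."""
    query_id, flags, _qd, an, _ns, _ar = struct.unpack("!HHHHHH", response[:12])
    return {
        "id": query_id,
        "is_response": bool(flags & 0x8000),    # QR bit
        "authoritative": bool(flags & 0x0400),  # AA bit
        "rcode": flags & 0x000F,                # 0 = NOERROR
        "answers": an,
    }


def probe(server: str, zone: str, timeout: float = 3.0) -> dict:
    """Send one SOA query over UDP to `server` and return the parsed header."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_query(zone), (server, 53))
        data, _ = sock.recvfrom(4096)
    return parse_header(data)
```

A healthy authoritative server should answer with `authoritative` set, `rcode` 0, and at least one answer; a timeout or any other result is the kind of failure that internal statistics alone would not reveal.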
## 4. Third secondary off Hetzner (resilience)

- [ ] A secondary nameserver on a different provider/network so a single-provider
  outage doesn't take all authoritative DNS down (architectural: needs a new machine)

## 5. Centralized logs

- [ ] VictoriaLogs on control to grep journald across all three hosts, pairing
  with the existing VictoriaMetrics setup