# Infra roadmap

Prioritized backlog for the cnx-network clan. See `docs/` for how the current pieces work.

## 1. Alerting (done, pending deploy)

Rules evaluated by vmalert against VictoriaMetrics on control, declared in `modules/monitoring/alerts.nix`:

- [x] SOA serial divergence between ns1 and ns2 (secondary out of sync)
- [x] Zone-expiry countdown on the secondary approaching zero (transfers failing)
- [x] Any scrape target down (`up == 0`)
- [x] Root filesystem nearly full

Delivery stays minimal for now (`notifier.blackhole`): vmalert remote-writes alert state back to VM, so firing alerts show up as the `ALERTS` series in Grafana. Wiring a real notifier (Matrix) is a later step: drop `blackhole` and set `settings."notifier.url"` to an Alertmanager.

## 2. Backups of critical state (DNSSEC done, pending vars + deploy)

clan `borgbackup` instance in `clan.nix`: control is the server (repos under `/var/lib/borgbackup/`), ns1 the client. ns1 declares `clan.core.state.knot.folders = [ "/var/lib/knot" ]`, so the Knot KASP keystore is backed up nightly (01:00) over the mesh with repokey encryption; control never holds plaintext. ns1 maps the `control` machine name to its mesh IP via `networking.hosts` so the `borg@control` repo resolves.

Before deploy: `clan vars generate ns1` (YubiKey) to mint the borgbackup ssh keypair + repokey; control won't evaluate until ns1's public key exists. Then deploy ns1 and control.

- [x] DNSSEC key material on ns1 (KSK/ZSK in Knot's KASP store): losing it forces an emergency DS rollover at the registrar
- [ ] VictoriaMetrics TSDB on control (optional, retention is 180d): deferred; regenerable over time and control is the backup server, so this needs a second client→server pair (e.g. control→ns2) rather than the same topology

## 3. Blackbox DNS probing (done, pending deploy)

`blackbox_exporter` on control (loopback `:9115`), probing each nameserver's public v4+v6 address for every zone: an SOA query (zone served?) and a DNSKEY query (still signed?). Blackbox has no DO-bit option, so signing is checked by asking for DNSKEY directly and asserting the RRset is present. Probe defs live in `modules/monitoring/blackbox-probes.nix`, shared by the exporter (`blackbox.nix`) and the VM scrape jobs (`server.nix`). Verified live against ns1/ns2: SOA + DNSKEY succeed on both servers over v4 and v6.

- [x] `blackbox_exporter` on control doing real DNS + DNSSEC-validation queries against ns1/ns2, which catches outside-in resolution failures the Knot stats miss
- [x] paired with alerts (`DNSResolutionProbeFailed` / `DNSSECProbeFailed` in `alerts.nix`) and a "DNS probes (outside-in)" row on the CNX DNS dashboard

## 4. Third secondary off Hetzner (resilience)

- [ ] A secondary nameserver on a different provider/network so a single-provider outage doesn't take all authoritative DNS down (architectural, new machine)

## 5. Centralized logs (done, pending deploy)

VictoriaLogs on control (`:9428`, 30d retention, mesh-scoped) in `modules/monitoring/server.nix`, plus a VictoriaLogs Grafana datasource. All three hosts ship journald with systemd's own `services.journald.upload` to the `/insert/journald` endpoint (`modules/monitoring/exporters.nix`), with no extra agent. control uploads over loopback; ns1/ns2 over the mesh.

- [x] VictoriaLogs on control to grep journald across all three hosts, pairing with the existing VictoriaMetrics setup
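
For orientation, the journald shipping described in section 5 could look roughly like the fragment below on ns1/ns2. This is a minimal sketch, not the contents of `modules/monitoring/exporters.nix`: it assumes the `control` name resolves to the mesh address on the uploading host (the roadmap only states this mapping for ns1), and that control would use a loopback URL instead.

```nix
# Hypothetical sketch of a journald uploader for ns1/ns2.
# Assumption: "control" resolves to control's mesh IP on this host.
{ ... }:
{
  services.journald.upload = {
    enable = true;
    # systemd-journal-upload appends "/upload" to this URL itself,
    # so only the /insert/journald prefix is given here.
    settings.Upload.URL = "http://control:9428/insert/journald";
  };
}
```

Because the uploader is systemd's own `systemd-journal-upload`, nothing beyond this unit runs on the nameservers.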