cnx-network-clan/TODO.md
Berwn d4a171640b Add VictoriaLogs for centralized journald across all hosts
control runs VictoriaLogs (:9428, 30d, mesh-scoped) with a matching
Grafana datasource. Each host ships journald via systemd's own
journald.upload to the /insert/journald endpoint -- no extra agent.
control uploads over loopback so its logs survive a mesh outage; ns1
and ns2 push over the mesh.
2026-06-17 16:53:52 +07:00


Infra roadmap

Prioritized backlog for the cnx-network clan. See docs/ for how the current pieces work.

1. Alerting (done — pending deploy)

Rules evaluated by vmalert against VictoriaMetrics on control, declared in modules/monitoring/alerts.nix:

  • SOA serial divergence between ns1 and ns2 (secondary out of sync)
  • Zone-expiry countdown on the secondary approaching zero (transfers failing)
  • Any scrape target down (up == 0)
  • Root filesystem nearly full
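As a rough illustration of how two of these rules might be declared, the sketch below uses the `rules` option of the NixOS vmalert module. The group name, alert names, thresholds, and `for` durations are hypothetical, and the filesystem expression assumes node_exporter's standard metrics are being scraped. The real contents of alerts.nix may differ.

```nix
{
  # Sketch only: group/alert names, thresholds and durations are assumptions.
  services.vmalert.rules = {
    groups = [
      {
        name = "host-health"; # hypothetical group name
        rules = [
          {
            # Fires when any scrape target has been unreachable for 5 minutes.
            alert = "ScrapeTargetDown"; # hypothetical alert name
            expr = "up == 0";
            for = "5m";
            labels.severity = "critical";
            annotations.summary = "{{ $labels.job }} on {{ $labels.instance }} is down";
          }
          {
            # Fires when less than 10% of the root filesystem is free.
            # Assumes node_exporter metrics; the 10% threshold is illustrative.
            alert = "RootFilesystemNearlyFull"; # hypothetical alert name
            expr = ''
              node_filesystem_avail_bytes{mountpoint="/"}
                / node_filesystem_size_bytes{mountpoint="/"} < 0.10
            '';
            for = "15m";
            labels.severity = "warning";
            annotations.summary = "Root filesystem on {{ $labels.instance }} is nearly full";
          }
        ];
      }
    ];
  };
}
```

The SOA-serial and zone-expiry rules would follow the same shape, with expressions built from whatever metrics the Knot exporter exposes.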

Delivery stays minimal for now (notifier.blackhole): vmalert remote-writes alert state back to VM, so firing alerts show up as the ALERTS series in Grafana. Wiring up real delivery (Matrix) is a later step: drop notifier.blackhole and point settings."notifier.url" at an Alertmanager.
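A minimal sketch of the delivery settings described above, assuming the NixOS vmalert module's freeform `settings` attribute set (newer nixpkgs releases may nest this under a per-instance option). The keys map to vmalert's `-datasource.url`, `-remoteWrite.url`, `-notifier.blackhole` and `-notifier.url` flags. The loopback addresses and ports are assumptions, not values taken from this repository.

```nix
{
  services.vmalert.settings = {
    # Assumed: VictoriaMetrics listens on loopback at its default port 8428.
    "datasource.url" = "http://127.0.0.1:8428";

    # Write alert state back to VM so firing alerts appear as the ALERTS series.
    "remoteWrite.url" = "http://127.0.0.1:8428";

    # Evaluate rules but send notifications nowhere.
    "notifier.blackhole" = true;

    # Later step: remove notifier.blackhole above and set an Alertmanager URL.
    # "notifier.url" = [ "http://127.0.0.1:9093" ]; # hypothetical address
  };
}
```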

2. Backups of critical state (DNSSEC done — pending vars + deploy)

clan borgbackup instance in clan.nix: control is the server (repos under /var/lib/borgbackup/<client>), ns1 the client. ns1 declares clan.core.state.knot.folders = [ "/var/lib/knot" ], so the Knot KASP keystore is backed up nightly (01:00) over the mesh with repokey encryption — control never holds plaintext. ns1 maps the control machine name to its mesh IP via networking.hosts so the borg@control repo resolves.
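On the ns1 side, the two declarations mentioned above might look roughly like the following. `clan.core.state.knot.folders` is quoted from this document, and `networking.hosts` is the standard NixOS option mapping an address to host names. The address `fd00::1` is a placeholder, not the real mesh IP.

```nix
{
  # Register Knot's state directory (including the KASP keystore) as backup state.
  clan.core.state.knot.folders = [ "/var/lib/knot" ];

  # Resolve the machine name "control" to its mesh address so that the
  # borg@control repository URL is reachable. Placeholder address.
  networking.hosts."fd00::1" = [ "control" ];
}
```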

Before deploy: clan vars generate ns1 (YubiKey) to mint the borgbackup ssh keypair + repokey; control won't evaluate until ns1's public key exists. Then deploy ns1 and control.

  • DNSSEC key material on ns1 (KSK/ZSK in Knot's KASP store) — losing it forces an emergency DS rollover at the registrar
  • VictoriaMetrics TSDB on control (optional, retention is 180d) — deferred; regenerable over time and control is the backup server, so this needs a second client→server pair (e.g. control→ns2) rather than the same topology

3. Blackbox DNS probing (done — pending deploy)

blackbox_exporter on control (loopback :9115), probing each nameserver's public v4+v6 address for every zone: an SOA query (zone served?) and a DNSKEY query (still signed?). Blackbox has no DO-bit option, so signing is checked by asking for DNSKEY directly and asserting the RRset is present. Probe defs live in modules/monitoring/blackbox-probes.nix, shared by the exporter (blackbox.nix) and the VM scrape jobs (server.nix). Verified live against ns1/ns2: SOA + DNSKEY succeed on both servers over v4 and v6.

  • blackbox_exporter on control doing real DNS queries plus a DNSSEC signing check (DNSKEY presence, not full validation) against ns1/ns2; catches outside-in resolution failures the Knot stats miss
  • paired with alerts (DNSResolutionProbeFailed / DNSSECProbeFailed in alerts.nix) and a "DNS probes (outside-in)" row on the CNX DNS dashboard
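The probe modules could be generated along these lines. This is a sketch under assumptions, not the contents of blackbox-probes.nix: the zone name, module names and helper function are hypothetical, while the `dns` prober keys (`query_name`, `query_type`, `preferred_ip_protocol`, `ip_protocol_fallback`, `valid_rcodes`, `validate_answer_rrs`) come from blackbox_exporter's configuration format. Because `query_name` is fixed per module, each zone needs its own set of modules.

```nix
{ pkgs, ... }:
let
  zone = "example.org"; # placeholder zone

  # Hypothetical helper: one DNS module per query type and IP family.
  # The regexp must match the answer records; with it set, an empty
  # answer section also fails the probe.
  dnsModule = queryType: ipProtocol: {
    prober = "dns";
    timeout = "5s";
    dns = {
      query_name = zone;
      query_type = queryType;
      preferred_ip_protocol = ipProtocol;
      ip_protocol_fallback = false; # do not silently fall back to the other family
      valid_rcodes = [ "NOERROR" ];
      validate_answer_rrs.fail_if_not_matches_regexp = [ queryType ];
    };
  };
in
{
  services.prometheus.exporters.blackbox = {
    enable = true;
    listenAddress = "127.0.0.1";
    port = 9115;
    configFile = (pkgs.formats.yaml { }).generate "blackbox.yml" {
      modules = {
        dns_soa_v4 = dnsModule "SOA" "ip4";
        dns_soa_v6 = dnsModule "SOA" "ip6";
        dns_dnskey_v4 = dnsModule "DNSKEY" "ip4";
        dns_dnskey_v6 = dnsModule "DNSKEY" "ip6";
      };
    };
  };
}
```

The scrape job then selects a module and a nameserver address through the `module` and `target` URL parameters of the exporter's `/probe` endpoint.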

4. Third secondary off Hetzner (resilience)

  • A secondary nameserver on a different provider/network so a single-provider outage doesn't take all authoritative DNS down (architectural — new machine)

5. Centralized logs (done — pending deploy)

VictoriaLogs on control (:9428, 30d retention, mesh-scoped) in modules/monitoring/server.nix, plus a VictoriaLogs Grafana datasource. All three hosts ship journald with systemd's own services.journald.upload to the /insert/journald endpoint (modules/monitoring/exporters.nix) — no extra agent. control uploads over loopback; ns1/ns2 over the mesh.

  • VictoriaLogs on control to grep journald across all three hosts, pairing with the existing VictoriaMetrics setup
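A minimal sketch of both halves, assuming the nixpkgs `services.victorialogs` module (with `listenAddress` and `extraOptions`) and the `services.journald.upload` module with its `settings.Upload.URL` key. Option names may vary between nixpkgs releases, and the mesh scoping (firewall or bind address) is not shown.

```nix
{
  # control only: VictoriaLogs with 30 days of retention.
  services.victorialogs = {
    enable = true;
    listenAddress = ":9428";
    extraOptions = [ "-retentionPeriod=30d" ];
  };

  # Every host: ship the journal with systemd-journal-upload, no extra agent.
  services.journald.upload = {
    enable = true;
    # control uploads over loopback; ns1/ns2 would use control's mesh
    # address here instead of 127.0.0.1.
    settings.Upload.URL = "http://127.0.0.1:9428/insert/journald";
  };
}
```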