# Infra roadmap

Prioritized backlog for the cnx-network clan. See `docs/` for how the current
pieces work.

## 1. Alerting (done, pending deploy)

Rules evaluated by vmalert against VictoriaMetrics on control, declared in
`modules/monitoring/alerts.nix`:

- [x] SOA serial divergence between ns1 and ns2 (secondary out of sync)
- [x] Zone-expiry countdown on the secondary approaching zero (transfers failing)
- [x] Any scrape target down (`up == 0`)
- [x] Root filesystem nearly full

Delivery stays minimal for now (`notifier.blackhole`): vmalert remote-writes
alert state back to VictoriaMetrics, so firing alerts show up as the `ALERTS`
series in Grafana. Wiring a real notifier (Matrix) is a later step: drop
`blackhole` and set `settings."notifier.url"` to an Alertmanager endpoint.

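As a rough sketch, two of the simpler rules plus the blackhole delivery might look like this with the NixOS `services.vmalert` module. This is not the contents of `modules/monitoring/alerts.nix`: the URLs and ports, the group and alert names, the `for` durations, the 10% threshold, and the node_exporter metric names are all assumptions, and the exact option paths can differ between nixpkgs versions.

```nix
# Hypothetical excerpt for illustration only.
{
  services.vmalert = {
    enable = true;
    settings = {
      # Assumed: VictoriaMetrics listens locally on its default port.
      "datasource.url" = "http://127.0.0.1:8428";
      # Write alert state back so firing alerts appear as the ALERTS series.
      "remoteWrite.url" = "http://127.0.0.1:8428";
      # No delivery yet. Later step: remove this line and set
      #   "notifier.url" = [ "http://127.0.0.1:9093" ];  # assumed Alertmanager address
      "notifier.blackhole" = true;
    };
    rules.groups = [
      {
        name = "hosts"; # assumed group name
        rules = [
          {
            alert = "ScrapeTargetDown";
            expr = "up == 0";
            for = "5m";
            labels.severity = "critical";
          }
          {
            alert = "RootFilesystemNearlyFull";
            # Assumes node_exporter filesystem metrics are being scraped.
            expr = ''node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} < 0.10'';
            for = "15m";
            labels.severity = "warning";
          }
        ];
      }
    ];
  };
}
```

The SOA-divergence and zone-expiry rules are left out here because their expressions depend on the metric names exposed by whichever Knot exporter is in use.
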
## 2. Backups of critical state (DNSSEC done, pending vars + deploy)

A clan `borgbackup` instance in `clan.nix`: control is the server (repos under
`/var/lib/borgbackup/<client>`), ns1 the client. ns1 declares
`clan.core.state.knot.folders = [ "/var/lib/knot" ]`, so the Knot KASP keystore
is backed up nightly (01:00) over the mesh with repokey encryption, which means
control never holds plaintext. ns1 maps the `control` machine name to control's
mesh IP via `networking.hosts` so the `borg@control` repo resolves.

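The two halves of that setup could be sketched roughly as follows. The inventory shape (`module.name`, `module.input`, and the `server`/`client` role names) is an assumption about the clan-core version in use, and `fd00::1` is a placeholder address, not control's real mesh IP.

```nix
# clan.nix (sketch): control serves the repos, ns1 backs up to it.
{
  inventory.instances.borgbackup = {
    module = {
      name = "borgbackup";
      input = "clan-core"; # assumed flake input name
    };
    roles.server.machines.control = { };
    roles.client.machines.ns1 = { };
  };
}
```

```nix
# ns1 machine configuration (sketch).
{
  # Declares /var/lib/knot as state, so the backup service picks it up.
  clan.core.state.knot.folders = [ "/var/lib/knot" ];

  # Placeholder mesh address; lets the `borg@control` repo resolve.
  networking.hosts."fd00::1" = [ "control" ];
}
```
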
Before deploy: run `clan vars generate ns1` (requires the YubiKey) to mint the
borgbackup SSH keypair and repokey; control won't evaluate until ns1's public
key exists. Then deploy ns1 and control.

- [x] DNSSEC key material on ns1 (KSK/ZSK in Knot's KASP store): losing it forces
  an emergency DS rollover at the registrar
- [ ] VictoriaMetrics TSDB on control (optional, retention is 180d): deferred.
  The data is regenerable over time, and control is itself the backup server, so
  this needs a second client→server pair (e.g. control→ns2) rather than the
  same topology

## 3. Blackbox DNS probing

- [ ] `blackbox_exporter` on control doing real DNS + DNSSEC-validation queries
  against ns1/ns2, to catch outside-in resolution failures the Knot stats miss

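A possible starting point for this item, assuming the NixOS `services.prometheus.exporters.blackbox` module and a single SOA probe. The module name `dns_soa`, the timeout, and the zone `example.org` are placeholders, and the config is written as JSON on the assumption that the exporter's YAML parser accepts it. This sketch only checks that an SOA query returns `NOERROR`; the DNSSEC-validation part of the item is not covered.

```nix
# Hypothetical sketch for the blackbox DNS probe; not deployed configuration.
{ pkgs, ... }:
{
  services.prometheus.exporters.blackbox = {
    enable = true;
    configFile = pkgs.writeText "blackbox.yml" (builtins.toJSON {
      modules.dns_soa = {
        prober = "dns";
        timeout = "5s";
        dns = {
          query_name = "example.org"; # placeholder zone name
          query_type = "SOA";
          transport_protocol = "udp";
          valid_rcodes = [ "NOERROR" ];
        };
      };
    });
  };
}
```

The scrape job would then pass each nameserver (ns1, ns2) as the probe `target` with `module=dns_soa`.
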
## 4. Third secondary off Hetzner (resilience)

- [ ] A secondary nameserver on a different provider/network so a single-provider
  outage doesn't take all authoritative DNS down (architectural: needs a new machine)

## 5. Centralized logs

- [ ] VictoriaLogs on control to grep journald across all three hosts, pairing
  with the existing VictoriaMetrics setup