Add CNX Backups dashboard and document the backup setup

Grafana dashboard (auto-provisioned from the dashboards dir) tracks borgbackup job health, time since last run, and per-job systemd state from the node_exporter systemd collector on the client. New docs page covers the ns1 -> control topology, secrets flow, and restore commands.
2026-06-17 15:13:47 +07:00
parent ed746b58c3
commit 1ea5bda23f
4 changed files with 268 additions and 3 deletions
@@ -4,3 +4,4 @@
 - [ZeroTier mesh](./mesh.md)
 - [DNS](./dns.md)
 - [Monitoring](./monitoring.md)
+- [Backups](./backups.md)
@@ -0,0 +1,61 @@
+# Backups
+
+Encrypted, deduplicating backups via clan's `borgbackup` service, declared in
+`clan.nix`. The only critical, non-regenerable state is the **Knot DNSSEC
+keystore** on `ns1` (the KSK/ZSK private keys under `/var/lib/knot`); losing it
+forces an emergency DS rollover at the registrar.
+
+## Topology
+
+- **control** is the borgbackup **server** — it hosts the repos under
+  `/var/lib/borgbackup/<client>` (so `ns1`'s repo is `/var/lib/borgbackup/ns1`).
+- **ns1** is the **client**. It backs up everything it declares as clan state
+  (`clan.core.state.knot.folders = [ "/var/lib/knot" ]`) once a day at 01:00,
+  over the ZeroTier mesh.
+
+The backup is cross-host so that losing `ns1` is recoverable, and stays
+self-contained (no third-party storage). Encryption is `repokey` with a
+generated passphrase, so `control` only ever stores ciphertext.
+
+Mesh peers have no name resolution, so `ns1` maps the `control` machine name to
+its ZeroTier address via `networking.hosts`; that is how the `borg@control` repo
+URL resolves.
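+
+A minimal sketch of that mapping, assuming the standard NixOS `networking.hosts`
+option (an attribute set from IP address to a list of host names). The address
+below is a placeholder, not the real ZeroTier address of `control`:
+
+```nix
+{
+  # Hypothetical address: substitute control's actual ZeroTier IP.
+  # Adds a hosts-file entry on ns1 so the name "control" resolves over the mesh.
+  networking.hosts."fd00::1" = [ "control" ];
+}
+```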
+
+## Secrets
+
+The borgbackup SSH keypair and repokey passphrase are clan vars, generated once
+(generation needs the YubiKey). `control` will not evaluate until `ns1`'s public
+key exists, so generate the vars before the first deploy:
+
+```
+clan vars generate ns1
+clan machines update ns1
+clan machines update control
+```
+
+## Operating
+
+Backups are driven by systemd on `ns1` (`borgbackup-job-control.timer`).
+
+```
+# trigger a backup now (on ns1)
+borgbackup-create
+
+# list archives (on ns1)
+borgbackup-list
+
+# restore selected folders from an archive (on ns1)
+NAME='<archive-name>' FOLDERS=/var/lib/knot borgbackup-restore
+```
+
+Old archives are pruned automatically: every archive from the last day is kept,
+then 7 daily and 4 weekly archives.
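+
+For orientation, this policy is roughly what the following fragment would
+express, assuming clan's service renders to the stock NixOS borgbackup module.
+The job name `control` is inferred from the unit name above; the actual values
+are set by clan and may be declared differently:
+
+```nix
+{
+  # Assumption: equivalent NixOS-level settings, not copied from clan.nix.
+  services.borgbackup.jobs.control.prune.keep = {
+    within = "1d"; # keep every archive made within the last day
+    daily = 7;     # then one archive per day for 7 days
+    weekly = 4;    # then one archive per week for 4 weeks
+  };
+}
+```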
+
+## Monitoring
+
+The **CNX Backups** Grafana dashboard
+(`modules/monitoring/dashboards/backups.json`) tracks job health, time since the
+last run, and per-job state, all from the node_exporter systemd
+collector on the client. There is no dedicated borg metrics exporter; the unit
+state and the timer's last-trigger timestamp are enough to catch a backup that
+stops running or fails.
@@ -33,6 +33,10 @@ admin password is a clan var:
 clan vars get control grafana-admin/password
 ```

-The provisioned **CNX DNS** dashboard (`modules/monitoring/dashboards/dns.json`)
-shows per-nameserver SOA serials, zone expiry countdowns, query/response rates,
-and host CPU/memory/disk/load.
+Dashboards are provisioned from `modules/monitoring/dashboards/` (any JSON file
+there is picked up):
+
+- **CNX DNS** (`dns.json`) — firing alerts, per-nameserver SOA serials, zone
+  expiry countdowns, query/response rates, and host CPU/memory/disk/load.
+- **CNX Backups** (`backups.json`) — borgbackup job health, time since the last
+  run, and per-job state. See [Backups](./backups.md).