Add CNX Backups dashboard and document the backup setup

Grafana dashboard (auto-provisioned from the dashboards dir) tracks
borgbackup job health, time since last run, and per-job systemd state
from the node_exporter systemd collector on the client. New docs page
covers the ns1 -> control topology, secrets flow, and restore commands.
Berwn
2026-06-17 15:13:47 +07:00
parent ed746b58c3
commit 1ea5bda23f
4 changed files with 268 additions and 3 deletions
+1
@@ -4,3 +4,4 @@
- [ZeroTier mesh](./mesh.md)
- [DNS](./dns.md)
- [Monitoring](./monitoring.md)
- [Backups](./backups.md)
+61
@@ -0,0 +1,61 @@
# Backups
Encrypted, deduplicating backups via clan's `borgbackup` service, declared in
`clan.nix`. The only critical, non-regenerable state is the **Knot DNSSEC
keystore** on `ns1` (the KSK/ZSK private keys under `/var/lib/knot`); losing it
forces an emergency DS rollover at the registrar.
## Topology
- **control** is the borgbackup **server** — it hosts the repos under
`/var/lib/borgbackup/<client>` (so `ns1`'s repo is `/var/lib/borgbackup/ns1`).
- **ns1** is the **client**. It backs up everything it declares as clan state
(`clan.core.state.knot.folders = [ "/var/lib/knot" ]`) once a day at 01:00,
over the ZeroTier mesh.
The backup is cross-host so that losing `ns1` is recoverable, and stays
self-contained (no third-party storage). Encryption is `repokey` with a
generated passphrase, so `control` only ever stores ciphertext.
Mesh peers have no name resolution, so `ns1` maps the `control` machine name to
its ZeroTier address via `networking.hosts`; that is how the `borg@control` repo
URL resolves.
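To confirm the mapping is in place on `ns1`, a minimal sketch can look the name up in the hosts file. It assumes the `networking.hosts` entries are rendered into `/etc/hosts` (the usual NixOS behaviour); `hosts_lookup` is a hypothetical helper, not part of clan.

```shell
# hosts_lookup NAME [FILE]
# Print the address that a hosts-format file maps NAME to (first match wins).
# FILE defaults to /etc/hosts; full-line comments are skipped.
hosts_lookup() {
  awk -v n="$1" '
    /^[[:space:]]*#/ { next }
    { for (i = 2; i <= NF; i++) if ($i == n) { print $1; exit } }
  ' "${2:-/etc/hosts}"
}
```

Running `hosts_lookup control` on `ns1` should print the ZeroTier address of `control`; empty output suggests the entry is missing.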
## Secrets
The borgbackup SSH keypair and repokey passphrase are clan vars. They are
generated once, and generating them requires the YubiKey. `control` will not
evaluate until `ns1`'s public key exists, so generate the vars before the first
deploy:
```
clan vars generate ns1
clan machines update ns1
clan machines update control
```
## Operating
Backups are driven by systemd on `ns1` (`borgbackup-job-control.timer`).
```
# trigger a backup now (on ns1)
borgbackup-create
# list archives (on ns1)
borgbackup-list
# restore selected folders from an archive (on ns1)
NAME='<archive-name>' FOLDERS=/var/lib/knot borgbackup-restore
```
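After triggering a run, the result can be read back from systemd. The sketch below assumes the job's service unit is `borgbackup-job-control.service`, inferred from the timer name; `last_backup_ok` is a hypothetical helper.

```shell
# last_backup_ok UNIT
# Succeeds (exit 0) when systemd reports Result=success for UNIT.
# Note: a unit that has never run also reports "success", so use this
# after a run has actually happened.
last_backup_ok() {
  [ "$(systemctl show "$1" --property=Result --value)" = "success" ]
}

# Example (on ns1), with the unit name assumed from the timer name:
#   last_backup_ok borgbackup-job-control.service || echo "last backup failed"
```

If the check fails, `journalctl -u borgbackup-job-control.service` is the natural next place to look.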
Old archives are pruned automatically. Retention keeps every archive from the
last day, then 7 daily and 4 weekly archives.
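To preview what that policy would delete, the same rules can be passed to `borg prune` as a dry run. This is a sketch assuming borg 1.x flag syntax and that the retention maps to `--keep-within 1d --keep-daily 7 --keep-weekly 4`. `preview_prune` is a hypothetical wrapper, and the repository location, passphrase and SSH key must already be available in the environment.

```shell
# preview_prune REPO
# Show which archives the retention policy would keep or prune,
# without deleting anything (--dry-run).
preview_prune() {
  borg prune --dry-run --list \
    --keep-within 1d --keep-daily 7 --keep-weekly 4 \
    "$1"
}
```

The real pruning is handled by the service itself, so this is only useful for inspecting the policy's effect.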
## Monitoring
The **CNX Backups** Grafana dashboard
(`modules/monitoring/dashboards/backups.json`) tracks job health, time since the
last successful run, and per-job state — all from the node_exporter systemd
collector on the client. There is no dedicated borg metrics exporter; the unit
state and the timer's last-trigger timestamp are enough to catch a backup that
stops running or fails.
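The "stops running" case comes down to comparing the timer's last trigger time against the current time. A minimal sketch of that comparison follows; `backup_is_stale` is a hypothetical helper, and the 26-hour default is an assumed threshold (one daily run plus two hours of slack), not a value taken from the dashboard.

```shell
# backup_is_stale LAST_EPOCH NOW_EPOCH [MAX_AGE_SECONDS]
# Succeeds (exit 0) when the last trigger is older than the allowed age.
# Default allowed age: 93600 s = 26 h (assumed threshold).
backup_is_stale() {
  age=$(( $2 - $1 ))
  [ "$age" -gt "${3:-93600}" ]
}

# Example: a trigger 30 hours ago counts as stale.
#   now=$(date +%s)
#   backup_is_stale "$(( now - 108000 ))" "$now" && echo "backup is overdue"
```

On `ns1`, `systemctl show borgbackup-job-control.timer --property=LastTriggerUSec` prints the last trigger time in human-readable form for a manual comparison.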
+7 -3
@@ -33,6 +33,10 @@ admin password is a clan var:
clan vars get control grafana-admin/password
```
The provisioned **CNX DNS** dashboard (`modules/monitoring/dashboards/dns.json`)
shows per-nameserver SOA serials, zone expiry countdowns, query/response rates,
and host CPU/memory/disk/load.
Dashboards are provisioned from `modules/monitoring/dashboards/` (any JSON file
there is picked up):
- **CNX DNS** (`dns.json`) — firing alerts, per-nameserver SOA serials, zone
expiry countdowns, query/response rates, and host CPU/memory/disk/load.
- **CNX Backups** (`backups.json`) — borgbackup job health, time since the last
run, and per-job state. See [Backups](./backups.md).