Add CNX Backups dashboard and document the backup setup
Grafana dashboard (auto-provisioned from the dashboards dir) tracks borgbackup job health, time since last run, and per-job systemd state from the node_exporter systemd collector on the client. New docs page covers the ns1 -> control topology, secrets flow, and restore commands.
@@ -4,3 +4,4 @@
 - [ZeroTier mesh](./mesh.md)
 - [DNS](./dns.md)
 - [Monitoring](./monitoring.md)
+- [Backups](./backups.md)
@@ -0,0 +1,61 @@
# Backups

Encrypted, deduplicating backups via clan's `borgbackup` service, declared in
`clan.nix`. The only critical, non-regenerable state is the **Knot DNSSEC
keystore** on `ns1` (the KSK/ZSK private keys under `/var/lib/knot`); losing it
forces an emergency DS rollover at the registrar.

## Topology

- **control** is the borgbackup **server**: it hosts the repos under
  `/var/lib/borgbackup/<client>` (so `ns1`'s repo is `/var/lib/borgbackup/ns1`).
- **ns1** is the **client**. It backs up everything it declares as clan state
  (`clan.core.state.knot.folders = [ "/var/lib/knot" ]`) once a day at 01:00,
  over the ZeroTier mesh.

The backup is cross-host so that losing `ns1` is recoverable, and stays
self-contained (no third-party storage). Encryption is `repokey` with a
generated passphrase, so `control` only ever stores ciphertext.

Mesh peers have no name resolution, so `ns1` maps the `control` machine name to
its ZeroTier address via `networking.hosts`; that is how the `borg@control` repo
URL resolves.
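
One way to confirm that mapping on `ns1` is to look the name up through the system resolver, which consults the `/etc/hosts` entries that `networking.hosts` generates. The sketch below assumes `getent` is available on the host; `check_host` is a hypothetical helper name, not part of clan or borg.

```shell
# Hypothetical helper: succeed if NAME resolves through the system resolver
# (which includes /etc/hosts) and print the entry it resolved to.
check_host() {
  if addr="$(getent hosts "$1")"; then
    echo "$1 resolves: $addr"
  else
    echo "$1 does not resolve" >&2
    return 1
  fi
}
```

Running `check_host control` on `ns1` should then print the ZeroTier address of `control`; a failure points at the hosts mapping rather than at borg itself.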

## Secrets

The borgbackup SSH keypair and repokey passphrase are clan vars, generated once
(needs the YubiKey). `control` will not evaluate until `ns1`'s public key
exists, so generate the vars before the first deploy:

```
clan vars generate ns1
clan machines update ns1
clan machines update control
```

## Operating

Backups are driven by systemd on `ns1` (`borgbackup-job-control.timer`).

```
# trigger a backup now (on ns1)
borgbackup-create

# list archives (on ns1)
borgbackup-list

# restore selected folders from an archive (on ns1)
NAME='<archive-name>' FOLDERS=/var/lib/knot borgbackup-restore
```

Old archives are pruned automatically. Retention keeps all archives from the
last day, then 7 daily and 4 weekly.
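
In plain borg terms, that policy would correspond to the `--keep-within`, `--keep-daily`, and `--keep-weekly` options of `borg prune`. The sketch below only prints a dry-run command for review. The mapping onto these exact flags is an assumption about how the clan module configures pruning, and `<repo-url>` is a placeholder.

```shell
# Assumed mapping of the retention policy above onto borg prune options.
KEEP_FLAGS="--keep-within 1d --keep-daily 7 --keep-weekly 4"

# Print (do not run) a dry-run prune command; <repo-url> is a placeholder.
echo "borg prune --dry-run --list $KEEP_FLAGS <repo-url>"
```

Printing the command instead of executing it keeps the example safe to paste: nothing touches the repo until the placeholder is filled in and the line is run deliberately.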

## Monitoring

The **CNX Backups** Grafana dashboard
(`modules/monitoring/dashboards/backups.json`) tracks job health, time since the
last run, and per-job state, all from the node_exporter systemd
collector on the client. There is no dedicated borg metrics exporter; the unit
state and the timer's last-trigger timestamp are enough to catch a backup that
stops running or fails.
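
The "time since last run" signal reduces to comparing the timer's last-trigger timestamp against the current time. The sketch below shows that comparison, assuming the timestamp comes from the systemd collector's `node_systemd_timer_last_trigger_seconds` metric for `borgbackup-job-control.timer`. The function name and the 26-hour default threshold are illustrative assumptions, not values taken from the dashboard.

```shell
# Sketch only: classify a backup as fresh or stale from two epoch timestamps.
# The default max age of 93600 s (26 h) is an assumed threshold: one daily
# run plus some slack.
backup_freshness() {
  last_trigger="$1"      # epoch seconds of the timer's last trigger
  now="$2"               # current epoch seconds, e.g. $(date +%s)
  max_age="${3:-93600}"  # optional threshold in seconds
  if [ $(( now - last_trigger )) -gt "$max_age" ]; then
    echo stale
  else
    echo fresh
  fi
}
```

For example, `backup_freshness "$last" "$(date +%s)"` prints `stale` once more than 26 hours have passed since the timer last fired. A run that starts but fails still has to be caught through the unit state.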
@@ -33,6 +33,10 @@ admin password is a clan var:
 clan vars get control grafana-admin/password
 ```

-The provisioned **CNX DNS** dashboard (`modules/monitoring/dashboards/dns.json`)
-shows per-nameserver SOA serials, zone expiry countdowns, query/response rates,
-and host CPU/memory/disk/load.
+Dashboards are provisioned from `modules/monitoring/dashboards/` (any JSON file
+there is picked up):
+
+- **CNX DNS** (`dns.json`): firing alerts, per-nameserver SOA serials, zone
+  expiry countdowns, query/response rates, and host CPU/memory/disk/load.
+- **CNX Backups** (`backups.json`): borgbackup job health, time since the last
+  run, and per-job state. See [Backups](./backups.md).