Compare commits
2 Commits

| Author | SHA1 | Date |
|---|---|---|
| | 1ea5bda23f | |
| | ed746b58c3 | |
@@ -4,3 +4,4 @@
 - [ZeroTier mesh](./mesh.md)
 - [DNS](./dns.md)
 - [Monitoring](./monitoring.md)
+- [Backups](./backups.md)
@@ -0,0 +1,61 @@
# Backups

Encrypted, deduplicating backups via clan's `borgbackup` service, declared in
`clan.nix`. The only critical, non-regenerable state is the **Knot DNSSEC
keystore** on `ns1` (the KSK/ZSK private keys under `/var/lib/knot`); losing it
forces an emergency DS rollover at the registrar.

## Topology

- **control** is the borgbackup **server**: it hosts the repos under
  `/var/lib/borgbackup/<client>` (so `ns1`'s repo is `/var/lib/borgbackup/ns1`).
- **ns1** is the **client**. It backs up everything it declares as clan state
  (`clan.core.state.knot.folders = [ "/var/lib/knot" ]`) once a day at 01:00,
  over the ZeroTier mesh.

The backup is cross-host so that losing `ns1` is recoverable, and stays
self-contained (no third-party storage). Encryption uses borg's `repokey` mode
with a generated passphrase: the key is stored in the repo but is itself
encrypted with that passphrase, so `control` only ever stores ciphertext.

Mesh peers have no name resolution, so `ns1` maps the `control` machine name to
its ZeroTier address via `networking.hosts`; that is how the `borg@control` repo
URL resolves.
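
As a quick way to check that mapping, the sketch below looks a name up in a hosts-format file, which is what `networking.hosts` renders into `/etc/hosts` on NixOS. The helper name `hosts_lookup` is hypothetical and not part of clan or this repo.

```shell
# Hypothetical helper (not part of clan): print the first address that a
# hosts-format file maps to the given name.
hosts_lookup() {
  # $1 = host name to look up, $2 = hosts file (defaults to /etc/hosts)
  awk -v name="$1" '
    /^[[:space:]]*#/ { next }                  # skip comment lines
    {
      for (i = 2; i <= NF; i++) {
        if (substr($i, 1, 1) == "#") break     # ignore a trailing comment
        if ($i == name) { print $1; exit }     # first match wins
      }
    }
  ' "${2:-/etc/hosts}"
}
```

On `ns1`, `hosts_lookup control` should print the ZeroTier address of `control`; an empty result means the entry is missing. `getent hosts control` answers the same question through the system resolver.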

## Secrets

The borgbackup SSH keypair and repokey passphrase are clan vars, generated once
(generation needs the YubiKey). `control` will not evaluate until `ns1`'s public
key exists, so generate before the first deploy:

```
clan vars generate ns1
clan machines update ns1
clan machines update control
```

## Operating

Backups are driven by systemd on `ns1` (`borgbackup-job-control.timer`).

```
# trigger a backup now (on ns1)
borgbackup-create

# list archives (on ns1)
borgbackup-list

# restore selected folders from an archive (on ns1)
NAME='<archive-name>' FOLDERS=/var/lib/knot borgbackup-restore
```

Old archives are pruned automatically. Retention keeps all archives from the
last day, then 7 daily and 4 weekly.
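
For reference, that policy can be expressed with borg's own prune flags. The sketch below only prints a dry-run `borg prune` command, so nothing is deleted. Mapping "all archives from the last day" to `--keep-within 1d` is an assumption about how the clan service is configured, and `print_prune_cmd` is a hypothetical helper; the service runs the real prune for you.

```shell
# Assumed mapping of the retention policy onto borg prune flags.
KEEP_FLAGS="--keep-within 1d --keep-daily 7 --keep-weekly 4"

# Hypothetical helper: print (not run) a dry-run prune command for a repo.
print_prune_cmd() {
  # $1 = borg repository location
  printf 'borg prune --dry-run --list %s %s\n' "$KEEP_FLAGS" "$1"
}
```

For example, `print_prune_cmd /var/lib/borgbackup/ns1` prints the command for `ns1`'s repo path on `control`. Running the printed command would still need the repo passphrase.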

## Monitoring

The **CNX Backups** Grafana dashboard
(`modules/monitoring/dashboards/backups.json`) tracks job health, time since the
backup timer last fired, and per-job state, all from the node_exporter systemd
collector on the client. There is no dedicated borg metrics exporter; the unit
state and the timer's last-trigger timestamp are enough to catch a backup that
stops running or fails.
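
To make the staleness rule concrete, the sketch below applies the same arithmetic as the dashboard's "Time since last backup" panel: age is the current time minus the timer's last trigger, and 90000 s (25 h) or more counts as stale. `backup_age_status` is a hypothetical helper, not part of the repo, and both timestamps are passed in as epoch seconds.

```shell
STALE_AFTER_SECONDS=90000   # 25 h: one daily run plus an hour of slack

# Hypothetical helper: classify a backup by the age of its last timer trigger.
backup_age_status() {
  # $1 = last trigger (epoch seconds), $2 = current time (epoch seconds)
  age=$(( $2 - $1 ))
  if [ "$age" -ge "$STALE_AFTER_SECONDS" ]; then
    echo "STALE ${age}s"
  else
    echo "OK ${age}s"
  fi
}
```

In practice the second argument would be `$(date +%s)`. Note that the timer's last trigger says when a run started, not that it succeeded, which is why the dashboard pairs it with the failed-state panel.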
@@ -33,6 +33,10 @@ admin password is a clan var:
 clan vars get control grafana-admin/password
 ```
 
-The provisioned **CNX DNS** dashboard (`modules/monitoring/dashboards/dns.json`)
-shows per-nameserver SOA serials, zone expiry countdowns, query/response rates,
-and host CPU/memory/disk/load.
+Dashboards are provisioned from `modules/monitoring/dashboards/` (any JSON file
+there is picked up):
+
+- **CNX DNS** (`dns.json`): firing alerts, per-nameserver SOA serials, zone
+  expiry countdowns, query/response rates, and host CPU/memory/disk/load.
+- **CNX Backups** (`backups.json`): borgbackup job health, time since the last
+  run, and per-job state. See [Backups](./backups.md).
@@ -0,0 +1,199 @@
{
  "uid": "cnx-backups",
  "title": "CNX Backups",
  "tags": ["backup", "borg", "cnx"],
  "timezone": "browser",
  "schemaVersion": 39,
  "version": 1,
  "refresh": "1m",
  "time": { "from": "now-7d", "to": "now" },
  "templating": { "list": [] },
  "annotations": { "list": [] },
  "panels": [
    {
      "type": "row",
      "title": "Backups",
      "id": 1,
      "gridPos": { "h": 1, "w": 24, "x": 0, "y": 0 }
    },
    {
      "type": "stat",
      "title": "Backup health",
      "description": "1 if any borgbackup job is in the failed state, 0 otherwise. A successful run leaves the oneshot unit inactive (still OK); only a real failure shows FAILED. Derived from the node_exporter systemd collector on the backup client (ns1).",
      "id": 2,
      "datasource": { "type": "prometheus", "uid": "victoriametrics" },
      "gridPos": { "h": 5, "w": 8, "x": 0, "y": 1 },
      "fieldConfig": {
        "defaults": {
          "color": { "mode": "thresholds" },
          "thresholds": { "mode": "absolute", "steps": [{ "color": "green", "value": null }] },
          "noValue": "no data",
          "mappings": [
            {
              "type": "value",
              "options": {
                "0": { "text": "OK", "color": "green", "index": 0 },
                "1": { "text": "FAILED", "color": "red", "index": 1 }
              }
            }
          ]
        },
        "overrides": []
      },
      "options": {
        "reduceOptions": { "calcs": ["lastNotNull"], "fields": "", "values": false },
        "colorMode": "background",
        "graphMode": "none",
        "textMode": "auto",
        "orientation": "auto"
      },
      "targets": [
        {
          "refId": "A",
          "datasource": { "type": "prometheus", "uid": "victoriametrics" },
          "expr": "max(node_systemd_unit_state{name=~\"borgbackup-job-.+\\\\.service\",state=\"failed\"})",
          "instant": true
        }
      ]
    },
    {
      "type": "stat",
      "title": "Last backup run",
      "description": "When the most recent backup timer last fired (the daily borgbackup job). Shows 'never' before the first run.",
      "id": 3,
      "datasource": { "type": "prometheus", "uid": "victoriametrics" },
      "gridPos": { "h": 5, "w": 8, "x": 8, "y": 1 },
      "fieldConfig": {
        "defaults": {
          "unit": "dateTimeFromNow",
          "color": { "mode": "fixed", "fixedColor": "text" },
          "noValue": "never"
        },
        "overrides": []
      },
      "options": {
        "reduceOptions": { "calcs": ["lastNotNull"], "fields": "", "values": false },
        "colorMode": "none",
        "graphMode": "none",
        "textMode": "auto",
        "orientation": "auto"
      },
      "targets": [
        {
          "refId": "A",
          "datasource": { "type": "prometheus", "uid": "victoriametrics" },
          "expr": "max(node_systemd_timer_last_trigger_seconds{name=~\"borgbackup-job-.+\\\\.timer\"}) * 1000",
          "instant": true
        }
      ]
    },
    {
      "type": "stat",
      "title": "Time since last backup",
      "description": "Age of the most recent backup. Backups run daily, so anything past ~25h means a run was missed. Red over 25h.",
      "id": 4,
      "datasource": { "type": "prometheus", "uid": "victoriametrics" },
      "gridPos": { "h": 5, "w": 8, "x": 16, "y": 1 },
      "fieldConfig": {
        "defaults": {
          "unit": "s",
          "color": { "mode": "thresholds" },
          "thresholds": {
            "mode": "absolute",
            "steps": [
              { "color": "green", "value": null },
              { "color": "red", "value": 90000 }
            ]
          },
          "noValue": "never"
        },
        "overrides": []
      },
      "options": {
        "reduceOptions": { "calcs": ["lastNotNull"], "fields": "", "values": false },
        "colorMode": "background",
        "graphMode": "none",
        "textMode": "auto",
        "orientation": "auto"
      },
      "targets": [
        {
          "refId": "A",
          "datasource": { "type": "prometheus", "uid": "victoriametrics" },
          "expr": "time() - max(node_systemd_timer_last_trigger_seconds{name=~\"borgbackup-job-.+\\\\.timer\"})",
          "instant": true
        }
      ]
    },
    {
      "type": "table",
      "title": "Backup jobs (current state)",
      "description": "Every borgbackup job and the systemd unit state it is currently in, per client. 'inactive' is the normal resting state of a oneshot job between runs.",
      "id": 5,
      "datasource": { "type": "prometheus", "uid": "victoriametrics" },
      "gridPos": { "h": 8, "w": 12, "x": 0, "y": 6 },
      "options": { "showHeader": true },
      "fieldConfig": {
        "defaults": { "custom": { "align": "auto" } },
        "overrides": []
      },
      "targets": [
        {
          "refId": "A",
          "datasource": { "type": "prometheus", "uid": "victoriametrics" },
          "expr": "node_systemd_unit_state{name=~\"borgbackup-job-.+\\\\.service\"} == 1",
          "format": "table",
          "instant": true
        }
      ],
      "transformations": [
        {
          "id": "organize",
          "options": {
            "excludeByName": {
              "Time": true,
              "Value": true,
              "__name__": true,
              "job": true,
              "type": true
            }
          }
        }
      ]
    },
    {
      "type": "timeseries",
      "title": "Failed state over time",
      "description": "1 while a backup job is in the failed state. A spike here is a backup that did not complete and was not retried before the next scrape.",
      "id": 6,
      "datasource": { "type": "prometheus", "uid": "victoriametrics" },
      "gridPos": { "h": 8, "w": 12, "x": 12, "y": 6 },
      "fieldConfig": { "defaults": { "unit": "short", "min": 0, "max": 1 }, "overrides": [] },
      "targets": [
        {
          "refId": "A",
          "datasource": { "type": "prometheus", "uid": "victoriametrics" },
          "expr": "node_systemd_unit_state{name=~\"borgbackup-job-.+\\\\.service\",state=\"failed\"}",
          "legendFormat": "{{instance}} {{name}}"
        }
      ]
    },
    {
      "type": "timeseries",
      "title": "Time since last backup (history)",
      "description": "Age of the latest backup over time. The sawtooth should reset to near zero once a day; a steady climb without a reset means backups stopped running.",
      "id": 7,
      "datasource": { "type": "prometheus", "uid": "victoriametrics" },
      "gridPos": { "h": 8, "w": 24, "x": 0, "y": 14 },
      "fieldConfig": { "defaults": { "unit": "s" }, "overrides": [] },
      "targets": [
        {
          "refId": "A",
          "datasource": { "type": "prometheus", "uid": "victoriametrics" },
          "expr": "time() - node_systemd_timer_last_trigger_seconds{name=~\"borgbackup-job-.+\\\\.timer\"}",
          "legendFormat": "{{instance}} {{name}}"
        }
      ]
    }
  ]
}
@@ -0,0 +1 @@
../../../../../../sops/groups/admins
@@ -0,0 +1 @@
../../../../../../sops/machines/ns1
@@ -0,0 +1,22 @@
{
"data": "ENC[AES256_GCM,data:vyvZucue+ciufz/bP77IImR2lGaJZRbO6iaSkuWzzda2aL3l65M=,iv:jjtXLphn3dkd3t9cBpOfX6gra2QTmzf7OYcEqW77+1M=,tag:iggC2dJx4zvJmna6QQUcnw==,type:str]",
"sops": {
"age": [
{
"recipient": "age1fanu282vm7njjweqhrpcfcwpttuhce8js4tsyfry98l0neaqpewqs5s7nt",
"enc": "-----BEGIN AGE ENCRYPTED FILE-----\nYWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSBBTTJDQzh6NGtOTWo1S0Jl\nQlMvNzZsTE9NdWJqY21rRHhiTFlmaWpja2hBCnFQdWoyeDBxV3Q4YVFWU1ZyZkdW\neEdyRndrbUQ1VkFIbkpZUFlTYVdMQTAKLS0tIEVSYWlYbTJJSUNPbWtlME9vUmo1\ncWc2dXBVUXhScTNadlNZNWZVR1MzbUkKZoZq6Kwq8P97PQqKjQXrpUSJijLhvym8\n8r2ytkxZXnOjYNSRpAUCKdQnZWBqPp0sEP5Ry4nPB0h9GDP2D8VBoQ==\n-----END AGE ENCRYPTED FILE-----\n"
},
{
"recipient": "age1hlzrpqqgndcthq5m5yj9egfgyet2fzrxwa6ynjzwx2r22uy6m3hqr3rd06",
"enc": "-----BEGIN AGE ENCRYPTED FILE-----\nYWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSB6OWowU0MyNlBpamQxOVB1\nZitFNjRCZnZ4a2VzditsRDFYRmx6WGJEYkdNCm9HTFdBdXpjU3hMS2Jsa1BMZCtp\nSTd5NEtPL2NsZXN4anRyT080RndsOGcKLS0tIFF4QTdsUFdPeW1aNm42SXZ0b09h\nQjA4MzZ4UXBxWmFySG5CYkkzZnpTUVkKSR7ZN9fCVzSTfzEC0HrhRM7NcVTb93N/\nioq2auI+l+BJovqzp1gr8STrW+qtn6uwtToo8+9Mz3sfF9AN4rBNgw==\n-----END AGE ENCRYPTED FILE-----\n"
},
{
"recipient": "age1yubikey1qd859y9ehz2ya8j2cftwrtmdeqhuk7r7yc52zp64wpff6068gwrac3q6nsa",
"enc": "-----BEGIN AGE ENCRYPTED FILE-----\nYWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IHBpdi1wMjU2IHFWcTVydyBBaVR4WUFZ\nbDJmMmpLVGRHUGtHWkc0TlZ0aU1tdFBzQ0dzVml0eHVCbE9lNgoyWTJHaTNKaGFw\nU0hzclZZY0ZRSHQ4SVRtU0hTY2pNUEc4VW11dzNjRkV3Ci0tLSBHS3lFUFhTVFo5\ndm5xdFJYWmZuL3BnY1hZS3VQL2RFZ2pQMStkTHMrbWtBCk3lMHBj2/GV4fz8dVTY\ni6ybl4FrAiD6QgavX9fos4ruOtLqpPGLdZt0pBpe9I9FEKyQSPXFWuSTz8UxbRGt\n9yU=\n-----END AGE ENCRYPTED FILE-----\n"
}
],
"lastmodified": "2026-06-17T08:07:13Z",
"mac": "ENC[AES256_GCM,data:kqgU9TV7vGfEWTVGb6QtHLdBPXoRdx0O/Z9SN09VqjZVlhS9J8ROzf4qQWa9JE6fjiGJBQLWDuJlCYcOBx8rtT5cS/hxsBhIS9isCVgkVwjVGpaaf6WuD58eKZB8Z7cuqLfw3Zwlo2tTBGjbIjGHTZcZbEAeT3Jz9rWVW7Q9XIo=,iv:Sy4Lou8jFqxPB701ElaeqAN3EHRYoZmf9bdxDJ/dd90=,tag:M9GnkO0Sqh1yKey/9YB6yQ==,type:str]",
"version": "3.12.1"
}
}
@@ -0,0 +1 @@
../../../../../../sops/users/berwn
@@ -0,0 +1 @@
ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIH12h7xNDiJB5JXsKVKbsUYiAxPFqo3Klr2K7LDY5ZWG
@@ -0,0 +1 @@
../../../../../../sops/groups/admins
@@ -0,0 +1 @@
../../../../../../sops/machines/ns1
@@ -0,0 +1,22 @@
{
"data": "ENC[AES256_GCM,data:TI5IZHMlZwRL622yWmG3XMjuIJ8QkB/0D8FFkSKwZaO1PxuLvIZNeoo4v6CM/e3+Bzzur7fBiJZO2o3r50jxY3Hnbel4hGqXf2KY9IJp51MEYn/OjkfSR0XNyci5LZx5Pqj9wppyVyeqW1W6Hk7UG0e23HCNUHUZsOikIa65UTXPu367iF6VDeNAE7XpPd6VVCiUe2s68cdCVE0cwrlEgo06EVGreJOQ25L9Y3Xh7EOl/CqoVt5T8IfPj4zEkemGjSnuUPR4OKTtlF7UgRSKnvHTtIcj35I/uS07n9238HT218tcWsEjRKD0tYJ5NHx+wCvoZn4HSTfUKVG5NpYOgMdIufcQrz0O+nj89h6cU3dkKd26TF6wHeJW/NRQGVCZOg+Bd6SJDRcvfnPG2PukGeBpf5Y+3XM8RuPfj1k42e55kC/ADEUw7HhlPrYLLv8KB9UXDSNICGNptwSdfTY3xlbrKYCdK2kh5YsY2ykKhuJ6fuq0e+5mXUTGs51Bjm7Fq51b,iv:ONXGx48Ha9/y/nqXIoXtXY3/knjj66jnoolTqz6AMxI=,tag:bx2pi9axH7dUm22CFWM3uw==,type:str]",
"sops": {
"age": [
{
"recipient": "age1fanu282vm7njjweqhrpcfcwpttuhce8js4tsyfry98l0neaqpewqs5s7nt",
"enc": "-----BEGIN AGE ENCRYPTED FILE-----\nYWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSBZREtVYlQ4cnRXeEZCYWhJ\ndTVPT1FDeWNMWnR5eXlzUGtaYnhJcW4zaFVNCk8vZ2J1aXUrazlGNVc3amhCOWYr\nMWxoblZtM0M0TnZick1PSGVzazZKOFkKLS0tIENmMmFxNWxJemIzWHVjZXg5NEtM\nK3Q0M0xmQVZjaWVJQ2FFSjZmSGFhcncKbFD7tC1EBkwxa09ICPMMI9rfsEltPhLA\nGVjLNt+08/GIXY8GOBCmGLsN6sQY42fb9kQuEqkWxRgb4/Sifoxbbg==\n-----END AGE ENCRYPTED FILE-----\n"
},
{
"recipient": "age1hlzrpqqgndcthq5m5yj9egfgyet2fzrxwa6ynjzwx2r22uy6m3hqr3rd06",
"enc": "-----BEGIN AGE ENCRYPTED FILE-----\nYWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IFgyNTUxOSB2aWJRNWxIT25HZWRvOFJo\nWkNjWExTNTZzR213bEpaOG85SXRDNUMyRWtjCk04WE5JNzIyTktPb2MzdkxUVVBP\nUmR3UncxZDFvcVBaZGUwUHhDeVR5UTAKLS0tIHJOc0tPOFBhcjdhTUEvUloxQVhx\ncmJqQWVWa0E4UG9MaUtGaFV4Wm9TUUkKi8QrZbQA+G6olFEoSlIPcyewcuEVw3Bl\n8QNbpSSqupRZvx58p2dEOoG1LXX/8Z7mR2iYSsYtrtMR1Y3CFvtqiQ==\n-----END AGE ENCRYPTED FILE-----\n"
},
{
"recipient": "age1yubikey1qd859y9ehz2ya8j2cftwrtmdeqhuk7r7yc52zp64wpff6068gwrac3q6nsa",
"enc": "-----BEGIN AGE ENCRYPTED FILE-----\nYWdlLWVuY3J5cHRpb24ub3JnL3YxCi0+IHBpdi1wMjU2IHFWcTVydyBBcGY2ZXN3\nMzRDQjcwcGd6UTJOVTdTc00wUlpsbnFmTzVZRzlORjlabGJOKwpJaE1DbWtGejhi\nN3phMENzNEpvMkVyQ2VpVDZHbnpMNmJ5VjdoQ2g3WHlVCi0tLSAvVWpyQXlqNjZl\na0ttRVJHTFFXNDJ0cWV2VVhoUFp2dU9qcFN5V0paR2hvCgWYzW/nsvD9RcVXP0Z+\nPtV1wBMWT+/rHmx8i9RIbn8eNf01immQozFzz+F+/FkdAfKVJ722UGR6MPSdzy4d\nA/k=\n-----END AGE ENCRYPTED FILE-----\n"
}
],
"lastmodified": "2026-06-17T08:07:13Z",
"mac": "ENC[AES256_GCM,data:cpipxsIxRvhm8HukKx24qv5g9zEgjuVllNgfQcAuSJxGg+GpAY95JgjGhbG+JekKi7O2B1/Tsb3w1UwXZToAeM6T7XOzG+h5790ufQLcWuKhONtNxTQKKrmTxR6e8pOsCOROKd4uKjQWPhrvhpnmLLmsetjEhQc8igrfJRScqlw=,iv:Kb5ZpWBI/F4dqzJEY2m+bycyLi2mA/VZyA3pPwY9g5c=,tag:4b42mtSzonaHwZYySKCrtw==,type:str]",
"version": "3.12.1"
}
}
@@ -0,0 +1 @@
../../../../../../sops/users/berwn