Add mdBook infra runbook served by Caddy on control

Docs live in docs/ (DNS, ZeroTier mesh, monitoring), built at Nix-build time and
served as static files over the ZeroTier mesh on control:8080. Commit-to-edit:
change the markdown and redeploy to publish.
This commit is contained in:
Berwn
2026-06-17 14:26:21 +07:00
parent 3a8fe660a5
commit a7d4c0e567
8 changed files with 221 additions and 0 deletions
+12
View File
@@ -0,0 +1,12 @@
[book]
title = "CNX Infra Runbook"
description = "Operational docs for the cnx-network clan: DNS, ZeroTier mesh, monitoring."
authors = ["B4L"]
src = "src"
language = "en"
[output.html]
default-theme = "navy"
preferred-dark-theme = "navy"
git-repository-url = "https://git.b4l.co.th/B4L/cnx-network-clan"
edit-url-template = "https://git.b4l.co.th/B4L/cnx-network-clan/_edit/main/docs/{path}"
+6
View File
@@ -0,0 +1,6 @@
# Summary
- [Overview](./overview.md)
- [ZeroTier mesh](./mesh.md)
- [DNS](./dns.md)
- [Monitoring](./monitoring.md)
+59
View File
@@ -0,0 +1,59 @@
# DNS
Authoritative DNS for three zones, served by Knot:
- `cnx.network`
- `buildfor.life`
- `cnx.email`
Add a zone in `modules/dns/domains.nix` **and** drop a matching `<domain>.zone`
file in `modules/dns/zones/`.
## Primary / secondary
- **`ns1` = primary (master).** Loads each zone from its file, signs it, and
notifies `ns2`. Config in `machines/ns1/configuration.nix`.
- **`ns2` = secondary (slave).** Pulls every zone from `ns1` (AXFR/IXFR) and
accepts its NOTIFY. Config in `machines/ns2/configuration.nix`.
Zone transfers run **over the ZeroTier mesh**, authenticated with a shared TSIG
key (`dns-tsig`, a clan var copied to both machines).
## Serial handling
`ns1` uses `zonefile-load = difference-no-serial` with `serial-policy = unixtime`:
edit records without touching the SOA serial — Knot diffs the file, assigns a
strictly-monotonic unixtime serial, signs, and transfers. `journal-content = all`
holds the live signed zone (required by `difference-no-serial`).
## DNSSEC
Automatic signing on `ns1` only, policy `cnx`: ECDSA P-256/SHA-256. The ZSK
auto-rolls; the KSK is kept stable, so the DS at the registrar only changes on a
manual KSK rollover.
> **Pending (manual):** submit DS records for `buildfor.life` and `cnx.email`
> once they're at a DNSSEC-capable registrar.
## ACME DNS-01
A dedicated TSIG key (`acme_ddns`), scoped by `acl_acme` to `TXT` updates at or
under `_acme-challenge.<zone>` on `ns1` only. Knot signs the record and transfers
it to `ns2`, which never needs this key. Retrieve the client config with:
```
clan vars get ns1 dns-acme-tsig/acme.conf
```
## Runbook: stale secondary
If `ns2` serves stale records while SOA serials match (e.g. after a manual zone
edit that didn't bump the serial as expected), force a fresh transfer on `ns2`:
```
knotc zone-retransfer <zone>
```
Watch the **CNX DNS** Grafana dashboard: the per-nameserver SOA serial table
should agree across `ns1`/`ns2`, and "seconds until zone expiry" on the secondary
should reset on each successful transfer rather than counting toward zero.
+39
View File
@@ -0,0 +1,39 @@
# ZeroTier mesh
A private IPv6 overlay that every machine (and admin laptops) shares. DNS zone
transfers and metrics scraping ride this mesh, never the public net.
- **Controller:** `control` (the `zerotier` instance in `clan.nix`).
- **Peers:** every machine (`roles.peer.tags.all`).
- **Prefix:** `fd06:1bad:ece2:92ad:ba99:9300::/88` (RFC 4193: `fd` + network id + `0x9993`).
## The mesh map
`modules/mesh-hosts.nix` does **not** hardcode addresses. It reads each machine's
IP from the public clan vars that clan-core's zerotier generator already writes
(`vars/per-machine/<m>/zerotier/zerotier-ip/value`) and derives the `/88` subnet
from `control`'s `zerotier-network-id`. Regenerate or re-key a node and the map
follows automatically.
Consumers: `modules/dns/authoritative.nix` (transfer ACLs), `modules/monitoring/*`
(scrape targets and firewall scoping).
## Admitting external members
Inventory machines are auto-accepted. External devices (admin laptops) are listed
in `clan.nix` under the controller's `allowedIps`. Because this clan-core pins the
`allowedIps` interface (admit by network IPv6), we keep a **node-id** list and a
`ztMemberIp` helper derives each device's IP on this network:
```nix
roles.controller.settings.allowedIps = map ztMemberIp [
"8802c8d7e0" # alex-nixos
"2bd36db8cc" # kurogeek-thinkpad
];
```
A device's 10-char node id comes from `zerotier-cli info` on that device. After
editing, deploy `control`; the controller admits the new member on its next run.
> A newer clan-core exposes `allowedIds` (admit by node id directly), but adopting
> it means a zerotier vars-schema migration, so we stay on the IP-derivation path.
+38
View File
@@ -0,0 +1,38 @@
# Monitoring
Metrics and dashboards live on `control`, reachable only over the ZeroTier mesh.
## Collection
- **node_exporter** (`:9100`) on every machine — CPU, memory, disk, systemd units.
Binds all interfaces; the scrape ports are firewall-scoped to the mesh subnet
(`modules/monitoring/exporters.nix`).
- **knot-exporter** (`:9433`) on `ns1`/`ns2` only — reads Knot's control socket,
fed by the `mod-stats` module (query/response counters per zone).
## Storage & scraping
**VictoriaMetrics** on `control`, bound to `127.0.0.1:8428`, 180-day retention
(`modules/monitoring/server.nix`). It scrapes `control` over loopback and `ns1`/
`ns2` over the mesh.
> The scraper dials IPv4-only by default, so mesh (IPv6) targets need
> `extraOptions = [ "-enableTCP6" ]`. Without it, ns1/ns2 are dropped with
> "no suitable address found". Check live target health on `control`:
>
> ```
> curl -s http://127.0.0.1:8428/api/v1/targets | jq '.data.activeTargets[] | {i:.labels.instance, h:.health, e:.lastError}'
> ```
## Dashboards
**Grafana** on `control` (`:3000`), mesh-only, anonymous access disabled. The
admin password is a clan var:
```
clan vars get control grafana-admin/password
```
The provisioned **CNX DNS** dashboard (`modules/monitoring/dashboards/dns.json`)
shows per-nameserver SOA serials, zone expiry countdowns, query/response rates,
and host CPU/memory/disk/load.
+26
View File
@@ -0,0 +1,26 @@
# Overview
This is the operational runbook for the **cnx-network** clan. Everything here is
managed declaratively from the [clan repo](https://git.b4l.co.th/B4L/cnx-network-clan);
this book is built from `docs/` and served on `control` over the ZeroTier mesh.
## Machines
| Machine | Role | Public IPv4 | Public IPv6 |
| --------- | -------------------------------------- | ---------------- | --------------------------- |
| `control` | ZeroTier controller, monitoring, docs | `77.42.68.181` | `2a01:4f9:c013:e6d0::1` |
| `ns1` | Knot DNS **primary** (master) | `46.224.170.206` | `2a01:4f8:c014:b5c5::1` |
| `ns2` | Knot DNS **secondary** (slave) | `157.180.70.82` | `2a01:4f9:c014:6d87::1` |
## Access
- Admin SSH and all internal services ride the **ZeroTier mesh**, not the public
net. Public SSH (22) is intentionally closed at the Hetzner cloud firewall.
- clan reaches machines by their public IPs first (the `internet` instance), with
the mesh and Tor as automatic fallbacks.
## Editing these docs
Commit-to-edit: change the markdown under `docs/src/`, commit, and redeploy
`control`. There is no in-browser editor by design — the docs are versioned and
reviewed alongside the config that they describe.