cnx-network-clan

Author	SHA1	Message	Date
Berwn	48bf7fb250	Add web01 public reverse proxy with DNS-01 wildcard TLS. web01 terminates TLS for grafana.cnx.network and proxies to Grafana on control over the mesh. Caddy serves a *.cnx.network wildcard cert obtained via ACME DNS-01, using a dedicated acme_web01 TSIG key scoped on ns1 to _acme-challenge on the cnx.network zone only. Ports 80/443 are the only public exposure (80 just redirects); admin and the backend ride ZeroTier. Also reload Caddy on cert renewal for both web01 and mx1, since both reference the cert via explicit tls file paths and would otherwise keep serving a stale cert after a silent renewal.	2026-06-21 03:05:54 +07:00
Berwn	f42569e992	Add provisioned Grafana uptime dashboard for all hosts	2026-06-21 01:57:08 +07:00
Berwn	1dd3aadb97	Add mail.cnx.email client alias as a cert SAN. A mail.cnx.email CNAME (-> mx1.cnx.email) lets clients (Thunderbird etc.) use a friendly hostname for submission/IMAP. To avoid a TLS name mismatch the cert now carries mail.cnx.email as a SAN, so the acme_mx1 key is authorized to write _acme-challenge.mail too. The MX still points at mx1.cnx.email and --reuse-key keeps the DANE TLSA digest valid across the re-issue.	2026-06-18 15:01:03 +07:00
Berwn	dc21348727	Format drifted files to satisfy the treefmt flake-check gate. Pure formatting (nixfmt/prettier/yamlfmt); no behavior change. These files predate the current treefmt config and were failing nix flake check; reformatting them makes the gate green again.	2026-06-18 14:49:48 +07:00
Berwn	1cb6f39ea2	Add declarative SNM mail stack on mx1 with DNS-01, DANE, MTA-STS. mx1 runs Simple NixOS Mailserver (Postfix/Dovecot/Rspamd/OpenDKIM) for cnx.email. The TLS cert is obtained via ACME DNS-01 using a dedicated, scoped TSIG key (acme_mx1) that ns1 authorizes for only _acme-challenge.mx1 and _acme-challenge.mta-sts on the cnx.email zone, so the credential can write nothing else. Mailbox passwords are auto-minted by a clan vars generator (four-word passphrase + number). DANE TLSA (3 1 1) is published for _25._tcp.mx1; --reuse-key keeps the key digest stable across renewals. MTA-STS is enforced via a Caddy vhost serving the policy on :443 from the same cert (mta-sts SAN). Firewall opens 25/587/465/143/993/443; 80 stays closed.	2026-06-18 14:47:20 +07:00
Berwn	d1b24017aa	Use no-store for docs: epoch mtimes make revalidation serve stale	2026-06-18 12:24:38 +07:00
Berwn	77a18df257	Stop browsers serving stale docs by forcing revalidation	2026-06-18 12:19:42 +07:00
Berwn	6e4178df04	Onboard mx1 mail host and factor out per-host public IPs. Register mx1 in the inventory and as a direct-SSH `internet` host; give it a static public IPv6 (2a01:4ff:2f0:1963::1). Point the cnx.email MX (plus SPF/DMARC) at mx1 and add its A record. Bring mx1 into monitoring: import exporters, add it to the mesh map and the node scrape job so its host metrics and journald reach control. Add a clan-mx1 Hetzner firewall: inbound SMTP + ZeroTier + ICMP, no public SSH (admin rides the mesh like the other hosts). 587/465/993 held for now. Extract per-host public IPv4/IPv6 into modules/hosts.nix, consumed by clan.nix's internet hosts and each machine's cnx.staticIPv6, so each address is declared once instead of being duplicated across configs. docs: add mx1 to the machines table.	2026-06-18 11:53:14 +07:00
Berwn	9c8a2abf3f	Bind VictoriaLogs on IPv6 so the mesh can ship journald to it. VictoriaLogs, like the VM scraper, is IPv4-only by default: ":9428" binds 0.0.0.0 only, so ns1/ns2 pushing journald over the IPv6 mesh got "connection refused" while control's own loopback (v4) upload worked. Add -enableTCP6 so it binds [::] (dual-stack), matching the flag already used for the scraper. Also simplify the systemd-journal-upload override to just startLimitIntervalSec=0 (retry forever / self-heal) and drop the SuccessExitStatus masking: a persistent sink failure should stay loud rather than be hidden behind a green deploy.	2026-06-17 17:27:56 +07:00
Berwn	0eb883061b	Keep systemd-journal-upload retrying instead of failing a deploy. The uploader exits when VictoriaLogs is unreachable. Upstream already sets Restart=always/RestartSec=3sec, but the default start-rate limit lets the unit give up permanently and trip switch-to-configuration when the sink is briefly down. Disable the limit (startLimitIntervalSec=0) so logging stays best-effort and never wedges a host or a deploy.	2026-06-17 17:09:30 +07:00
Berwn	d4a171640b	Add VictoriaLogs for centralized journald across all hosts. control runs VictoriaLogs (:9428, 30d, mesh-scoped) with a matching Grafana datasource. Each host ships journald via systemd's own journald.upload to the /insert/journald endpoint, with no extra agent. control uploads over loopback so its logs survive a mesh outage; ns1 and ns2 push over the mesh.	2026-06-17 16:53:52 +07:00
Berwn	c7b0f206c8	Alert on and chart blackbox DNS probe failures. DNSResolutionProbeFailed and DNSSECProbeFailed fire when an SOA or DNSKEY probe to a public nameserver address stays down for 5m. The CNX DNS dashboard gains a "DNS probes (outside-in)" row: per-zone/server status table, probe success, and probe latency.	2026-06-17 15:42:13 +07:00
Berwn	54f607d063	Add blackbox exporter for outside-in DNS probes. control runs blackbox_exporter on loopback, probing each nameserver's public v4+v6 address for every zone: SOA (zone served) and DNSKEY (still signed, since blackbox has no DO-bit option). Probe definitions are shared between the exporter config and the VictoriaMetrics scrape jobs so they can't drift. Verified live against ns1/ns2 over v4 and v6.	2026-06-17 15:37:45 +07:00
Berwn	0544bf95e5	Add vmalert rules for failed and stale backups. BackupJobFailed fires when a borgbackup job enters the systemd failed state; BackupStale fires when the daily timer has not run in over 26h (or has never run). Both read the node_exporter systemd collector on the backup client, matching the CNX Backups dashboard.	2026-06-17 15:17:12 +07:00
Berwn	1ea5bda23f	Add CNX Backups dashboard and document the backup setup. Grafana dashboard (auto-provisioned from the dashboards dir) tracks borgbackup job health, time since last run, and per-job systemd state from the node_exporter systemd collector on the client. New docs page covers the ns1 -> control topology, secrets flow, and restore commands.	2026-06-17 15:13:47 +07:00
Berwn	7ae3221b83	Add Active alerts panel to the top of the CNX DNS dashboard. Surfaces vmalert's firing ALERTS series as a table at the top of the dashboard, so the minimal-delivery alerts are visible at a glance. Existing panels shift down by one row.	2026-06-17 14:51:33 +07:00
Berwn	4c7c74836d	Add vmalert alerting rules for DNS and host health. vmalert on control evaluates rules (declared in git) against VictoriaMetrics and remote-writes alert state back, so firing alerts show as the ALERTS series in Grafana. Covers SOA divergence between ns1/ns2, secondary zone expiry, scrape target down, and root disk full. No notifier yet (notifier.blackhole). Also adds TODO.md roadmap.	2026-06-17 14:49:32 +07:00
Berwn	a7d4c0e567	Add mdBook infra runbook served by Caddy on control. Docs live in docs/ (DNS, ZeroTier mesh, monitoring), built at Nix-build time and served as static files over the ZeroTier mesh on control:8080. Commit-to-edit: change the markdown and redeploy to publish.	2026-06-17 14:26:21 +07:00
Berwn	848c4ec47d	Read mesh host map from clan zerotier vars instead of hardcoding. The control/ns1/ns2 mesh IPs and the /88 subnet were duplicated literals in mesh-hosts.nix. clan-core's zerotier generator already writes each machine's IP as a public var (vars/per-machine/<m>/zerotier/zerotier-ip), so read from there and derive the subnet from zerotier-network-id. Pure refactor: the rendered values are identical and the system derivation hash is unchanged.	2026-06-17 11:53:56 +07:00
Berwn	8ac96b2d10	Enable IPv6 dialing for VictoriaMetrics scrapes. The scraper defaults to IPv4-only, so the ns1/ns2 mesh ULA targets were dropped with 'no suitable address found'. -enableTCP6 lets VM scrape them.	2026-06-17 10:51:31 +07:00
Berwn	33ac7e106b	Add VictoriaMetrics + Grafana DNS monitoring over the mesh. control runs VictoriaMetrics (loopback) and Grafana; every machine exports node metrics and the nameservers export Knot stats (mod-stats + knot-exporter). Scraping and the Grafana UI ride the ZeroTier mesh only, scoped by nftables to the mesh /88; the public side stays closed by the Hetzner cloud firewall. The provisioned DNS dashboard includes a per-zone SOA serial table to catch primary/secondary drift. ZeroTier ULAs are centralised in mesh-hosts.nix.	2026-06-17 10:17:27 +07:00
Berwn	63446173bc	monitor.cnx.network DNS test	2026-06-16 19:03:49 +07:00
Berwn	e795960dcf	Configure static public IPv6 on control, ns1, ns2	2026-06-16 18:04:33 +07:00
Berwn	de7d950596	Format tree with treefmt	2026-06-16 16:53:00 +07:00
Berwn	dc51cfbdb5	Enable DNSSEC and automatic SOA serials on the DNS zones. ns1 (primary) now signs every zone with an ECDSA P-256/SHA-256 policy and manages the SOA serial itself: zonefile-load = difference-no-serial (with journal-content = all) plus serial-policy = dateserial let records be edited without bumping the serial by hand. ns2 needs no change; it transfers the already-signed zone. Also point the ns1/ns2 AAAA glue at the public Hetzner IPv6 addresses; they previously pointed at unroutable ZeroTier mesh ULAs.	2026-06-14 16:27:30 +07:00
Berwn	5864054b00	Move Hetzner firewall rules into a separate data file. Extract the per-firewall rule data out of control's configuration into modules/hetzner-firewall-rules.nix, imported like the DNS domains list. The evaluated rules are unchanged.	2026-06-14 15:49:00 +07:00
Berwn	344f432640	Add Hetzner Cloud firewall auto-sync from clan config. control runs a oneshot on each deploy that creates each firewall if missing and replaces its rules via the Hetzner API set_rules action, using a Read/Write token stored as a clan secret. Public SSH is not exposed; admin access rides the ZeroTier mesh, with emergency-access as the console fallback.	2026-06-14 15:40:05 +07:00
Berwn	56f0af3153	Fix knot startup on ns1/ns2: TSIG key perms and port 53 conflict. knotd runs as the "knot" user, so the shared TSIG key file needs owner/group knot; it was root-only and knot couldn't read it. systemd-resolved's stub listener was holding port 53, so knot's 0.0.0.0@53 / ::@53 TCP bind failed. Disable the stub (resolution still works via nss-resolve) to free the port.	2026-06-14 14:49:10 +07:00
Berwn	807785cdab	Add authoritative DNS on ns1/ns2 and finalize clan config. Knot authoritative DNS: ns1 primary, ns2 secondary serving cnx.network, buildfor.life and cnx.email over TSIG-secured zone transfer (modules/dns). Knot listens publicly + over ZeroTier; firewall opens port 53. Complete clan inventory: name/domain, admin SSH key, control as the zerotier controller, tor on all nixos machines. Enable age yubikey/fido2-hmac secret plugins.	2026-06-14 13:24:23 +07:00
Berwn	0faa5884f2	Initial commit	2026-06-14 12:11:16 +07:00
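
Commit dc51cfbdb5 relies on serial-policy = dateserial so that ns1 bumps the SOA serial itself. The idea behind a date-based serial (YYYYMMDDnn) can be sketched as below. This is an illustrative Python sketch only, not Knot's implementation; the function name `next_dateserial` is hypothetical, and RFC 1982 wraparound and an exhausted two-digit revision counter are deliberately ignored.

```python
from datetime import date


def next_dateserial(current: int, today: date) -> int:
    """Return the next SOA serial in YYYYMMDDnn form (illustrative only)."""
    # Today's first serial: the date followed by revision 00.
    base = int(today.strftime("%Y%m%d")) * 100
    # An older serial jumps forward to today's base; a serial already at or
    # past today's base is simply incremented so it never moves backwards.
    return base if current < base else current + 1


if __name__ == "__main__":
    print(next_dateserial(2026061305, date(2026, 6, 14)))
    print(next_dateserial(2026061400, date(2026, 6, 14)))
```

Because the result always increases, a secondary such as ns2 sees every edit as a newer zone version without anyone editing the serial by hand.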

30 Commits
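
Commits 1cb6f39ea2 and 1dd3aadb97 note that --reuse-key keeps the DANE TLSA (3 1 1) digest valid across renewals. A minimal Python sketch of why: a 3 1 1 record (DANE-EE, SubjectPublicKeyInfo selector, SHA-256 matching) hashes only the public key, so a re-issued cert with the same key yields the same record. The helper name `tlsa_311_rdata` and the placeholder key bytes are assumptions for illustration; real input would be the DER-encoded SubjectPublicKeyInfo extracted from the certificate.

```python
import hashlib


def tlsa_311_rdata(spki_der: bytes) -> str:
    """Build TLSA rdata for usage 3, selector 1, matching type 1."""
    # Selector 1 covers only the SubjectPublicKeyInfo, not the whole cert,
    # so validity dates, serial numbers and SANs do not affect the digest.
    digest = hashlib.sha256(spki_der).hexdigest()
    return f"3 1 1 {digest}"


if __name__ == "__main__":
    # Placeholder bytes standing in for a real DER-encoded public key.
    same_key = b"placeholder-spki-bytes"
    rotated_key = b"another-placeholder-key"
    # Same key across a renewal: the published record stays valid.
    print(tlsa_311_rdata(same_key) == tlsa_311_rdata(same_key))
    # A rotated key would require publishing a new TLSA record first.
    print(tlsa_311_rdata(same_key) == tlsa_311_rdata(rotated_key))
```

This is also why adding mail.cnx.email as a SAN did not require touching the _25._tcp.mx1 record: the certificate changed, the key did not.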