cnx-network-clan

Author	SHA1	Message	Date
Berwn	b65f526ea2	Update vars via generator emergency-access for machine mx1	2026-06-18 11:33:03 +07:00
Berwn	3a0bc2dba4	Add machine mx1 to secrets	2026-06-18 11:33:03 +07:00
Berwn	6098fe9a3b	Update secret mx1-age.key	2026-06-18 11:33:03 +07:00
Berwn	8d9981ee5a	Set disk schema of machine: mx1 to single-disk	2026-06-18 11:32:33 +07:00
Berwn	afc2e997c0	machines/mx1/facter.json: update hardware configuration	2026-06-18 11:32:22 +07:00
Berwn	faaa7b66c0	Add machine mx1	2026-06-18 11:21:27 +07:00
Berwn	9c8a2abf3f	Bind VictoriaLogs on IPv6 so the mesh can ship journald to it. VictoriaLogs, like the VM scraper, is IPv4-only by default: ":9428" binds 0.0.0.0 only, so ns1/ns2 pushing journald over the IPv6 mesh got "connection refused" while control's own loopback (v4) upload worked. Add -enableTCP6 so it binds [::] (dual-stack), matching the flag already used for the scraper. Also simplify the systemd-journal-upload override to just startLimitIntervalSec=0 (retry forever / self-heal) and drop the SuccessExitStatus masking: a persistent sink failure should stay loud rather than be hidden behind a green deploy.	2026-06-17 17:27:56 +07:00
Berwn	0eb883061b	Keep systemd-journal-upload retrying instead of failing a deploy. The uploader exits when VictoriaLogs is unreachable. Upstream already sets Restart=always/RestartSec=3sec, but the default start-rate limit lets the unit give up permanently and trip switch-to-configuration when the sink is briefly down. Disable the limit (startLimitIntervalSec=0) so logging stays best-effort and never wedges a host or a deploy.	2026-06-17 17:09:30 +07:00
Berwn	d4a171640b	Add VictoriaLogs for centralized journald across all hosts. control runs VictoriaLogs (:9428, 30d, mesh-scoped) with a matching Grafana datasource. Each host ships journald via systemd's own journald.upload to the /insert/journald endpoint -- no extra agent. control uploads over loopback so its logs survive a mesh outage; ns1 and ns2 push over the mesh.	2026-06-17 16:53:52 +07:00
Berwn	c7b0f206c8	Alert on and chart blackbox DNS probe failures. DNSResolutionProbeFailed and DNSSECProbeFailed fire when an SOA or DNSKEY probe to a public nameserver address stays down for 5m. The CNX DNS dashboard gains a "DNS probes (outside-in)" row: per-zone/server status table, probe success, and probe latency.	2026-06-17 15:42:13 +07:00
Berwn	54f607d063	Add blackbox exporter for outside-in DNS probes. control runs blackbox_exporter on loopback, probing each nameserver's public v4+v6 address for every zone: SOA (zone served) and DNSKEY (still signed, since blackbox has no DO-bit option). Probe definitions are shared between the exporter config and the VictoriaMetrics scrape jobs so they can't drift. Verified live against ns1/ns2 over v4 and v6.	2026-06-17 15:37:45 +07:00
Berwn	0544bf95e5	Add vmalert rules for failed and stale backups. BackupJobFailed fires when a borgbackup job enters the systemd failed state; BackupStale fires when the daily timer has not run in over 26h (or has never run). Both read the node_exporter systemd collector on the backup client, matching the CNX Backups dashboard.	2026-06-17 15:17:12 +07:00
Berwn	1ea5bda23f	Add CNX Backups dashboard and document the backup setup. Grafana dashboard (auto-provisioned from the dashboards dir) tracks borgbackup job health, time since last run, and per-job systemd state from the node_exporter systemd collector on the client. New docs page covers the ns1 -> control topology, secrets flow, and restore commands.	2026-06-17 15:13:47 +07:00
Berwn	ed746b58c3	Update vars via generator borgbackup for machine ns1	2026-06-17 15:07:13 +07:00
Berwn	044891927b	Back up Knot DNSSEC keystore from ns1 to control via borgbackup. clan borgbackup instance: control serves repos, ns1 backs up its clan.core.state (the KASP keystore at /var/lib/knot) nightly over the mesh with repokey encryption. ns1 maps the control machine name to its ZeroTier address so the borg@control repo resolves. Run `clan vars generate ns1` before deploy to mint the borg keypair.	2026-06-17 15:06:58 +07:00
Berwn	7ae3221b83	Add Active alerts panel to the top of the CNX DNS dashboard. Surfaces vmalert's firing ALERTS series as a table at the top of the dashboard, so the minimal-delivery alerts are visible at a glance. Existing panels shift down by one row.	2026-06-17 14:51:33 +07:00
Berwn	4c7c74836d	Add vmalert alerting rules for DNS and host health. vmalert on control evaluates rules (declared in git) against VictoriaMetrics and remote-writes alert state back, so firing alerts show as the ALERTS series in Grafana. Covers SOA divergence between ns1/ns2, secondary zone expiry, scrape target down, and root disk full. No notifier yet (notifier.blackhole). Also adds TODO.md roadmap.	2026-06-17 14:49:32 +07:00
Berwn	a7d4c0e567	Add mdBook infra runbook served by Caddy on control. Docs live in docs/ (DNS, ZeroTier mesh, monitoring), built at Nix-build time and served as static files over the ZeroTier mesh on control:8080. Commit-to-edit: change the markdown and redeploy to publish.	2026-06-17 14:26:21 +07:00
Berwn	3a8fe660a5	Swap ZeroTier external members: drop Alex/Alex-gateway, add alex-nixos	2026-06-17 12:15:26 +07:00
Berwn	9aa83d70a2	Admit external ZeroTier members to the mesh by node id. clan.nix gains an allowedIps list for the zerotier controller, fed via a ztMemberIp helper that derives each member's IPv6 on this network from its 10-char node id + the zerotier-network-id var. Lets us list external devices (admin laptops) by their stable node id, which this clan-core's allowedIps interface consumes as --member-ip on control.	2026-06-17 12:13:47 +07:00
Berwn	848c4ec47d	Read mesh host map from clan zerotier vars instead of hardcoding. The control/ns1/ns2 mesh IPs and the /88 subnet were duplicated literals in mesh-hosts.nix. clan-core's zerotier generator already writes each machine's IP as a public var (vars/per-machine/<m>/zerotier/zerotier-ip), so read from there and derive the subnet from zerotier-network-id. Pure refactor: the rendered values are identical and the system derivation hash is unchanged.	2026-06-17 11:53:56 +07:00
Berwn	8ac96b2d10	Enable IPv6 dialing for VictoriaMetrics scrapes. The scraper defaults to IPv4-only, so the ns1/ns2 mesh ULA targets were dropped with 'no suitable address found'. -enableTCP6 lets VM scrape them.	2026-06-17 10:51:31 +07:00
Berwn	1405605eac	Remove key(s) for user berwn from secrets	2026-06-17 10:29:23 +07:00
Berwn	ad0c47e046	Add key(s) for user berwn to secrets	2026-06-17 10:26:55 +07:00
Berwn	fb7b269f68	Update vars via generator grafana-admin for machine control	2026-06-17 10:17:45 +07:00
Berwn	33ac7e106b	Add VictoriaMetrics + Grafana DNS monitoring over the mesh. control runs VictoriaMetrics (loopback) and Grafana; every machine exports node metrics and the nameservers export Knot stats (mod-stats + knot-exporter). Scraping and the Grafana UI ride the ZeroTier mesh only, scoped by nftables to the mesh /88; the public side stays closed by the Hetzner cloud firewall. The provisioned DNS dashboard includes a per-zone SOA serial table to catch primary/secondary drift. ZeroTier ULAs are centralised in mesh-hosts.nix.	2026-06-17 10:17:27 +07:00
Berwn	63446173bc	monitor.cnx.network DNS test	2026-06-16 19:03:49 +07:00
Berwn	aa604bda9a	Switch ns1 zone serial-policy to unixtime. dateserial (YYYYMMDDnn) only has a 2-digit same-day counter held in Knot's journal; a journal reset restarted the counter and let ns1 mint a serial ns2 had already seen with older content, so ns2 never retransferred. unixtime is strictly monotonic per reload, eliminating the shared-serial collision.	2026-06-16 18:59:45 +07:00
Berwn	e795960dcf	Configure static public IPv6 on control, ns1, ns2	2026-06-16 18:04:33 +07:00
Berwn	6783ad7c17	Add internet networking service for direct SSH to public IPs	2026-06-16 18:04:29 +07:00
Berwn	a49aea3c7a	vars fix	2026-06-16 16:59:54 +07:00
Berwn	de7d950596	Format tree with treefmt	2026-06-16 16:53:00 +07:00
Berwn	cf0d796bee	Add treefmt formatter (nix fmt + flake check gate)	2026-06-16 16:53:00 +07:00
kurogeek	3302b70485	clan.core.sops.defaultGroups to all machines	2026-06-16 16:46:55 +07:00
kurogeek	c85da6b8fc	Add user berwn to group admins	2026-06-16 16:44:32 +07:00
kurogeek	d50603743e	Add user kurogeek to group admins	2026-06-16 16:44:25 +07:00
Berwn	95b9375324	Grant kurogeek admin SSH access on all machines	2026-06-16 16:30:18 +07:00
Berwn	70cbfe84b1	Add user kurogeek to secrets	2026-06-16 16:24:23 +07:00
Berwn	a3482face5	Allow ACME DNS-01 dynamic updates on ns1. Add a dedicated acme_ddns TSIG key (scoped to ns1 only) and an acl_acme rule that limits it to TXT updates at or under _acme-challenge.<zone>. An external ACME client can now write challenge records via RFC 2136; Knot signs them and transfers to ns2, which never holds the key.	2026-06-14 17:12:17 +07:00
Berwn	8330eaa8ce	Update vars via generator dns-acme-tsig for machine ns1	2026-06-14 17:07:17 +07:00
Berwn	dc51cfbdb5	Enable DNSSEC and automatic SOA serials on the DNS zones. ns1 (primary) now signs every zone with an ECDSA P-256/SHA-256 policy and manages the SOA serial itself: zonefile-load = difference-no-serial (with journal-content = all) plus serial-policy = dateserial let records be edited without bumping the serial by hand. ns2 needs no change; it transfers the already-signed zone. Also point the ns1/ns2 AAAA glue at the public Hetzner IPv6 addresses; they previously pointed at unroutable ZeroTier mesh ULAs.	2026-06-14 16:27:30 +07:00
Berwn	5864054b00	Move Hetzner firewall rules into a separate data file. Extract the per-firewall rule data out of control's configuration into modules/hetzner-firewall-rules.nix, imported like the DNS domains list. The evaluated rules are unchanged.	2026-06-14 15:49:00 +07:00
Berwn	344f432640	Add Hetzner Cloud firewall auto-sync from clan config. control runs a oneshot on each deploy that creates each firewall if missing and replaces its rules via the Hetzner API set_rules action, using a Read/Write token stored as a clan secret. Public SSH is not exposed; admin access rides the ZeroTier mesh, with emergency-access as the console fallback.	2026-06-14 15:40:05 +07:00
Berwn	dbb67dbd9c	Update vars via generator hetzner-firewall for machine control	2026-06-14 15:37:25 +07:00
Berwn	2506b21ffa	Enable emergency-access recovery service. Add the clan-core emergency-access service on all nixos machines; it sets a per-machine recovery root password for console login when a machine fails to boot.	2026-06-14 15:02:34 +07:00
Berwn	306a2cf61e	Set per-machine timezones and enable NTP. control and ns2 use UTC+3 (Etc/GMT-3), ns1 uses UTC+1 (Etc/GMT-1) -- fixed offsets, no DST. Make systemd-timesyncd explicit on all three.	2026-06-14 15:02:34 +07:00
Berwn	91578a2b43	Update vars via generator emergency-access for machine ns2	2026-06-14 15:00:25 +07:00
Berwn	ab8288aef9	Update vars via generator emergency-access for machine ns1	2026-06-14 15:00:24 +07:00
Berwn	7b292b8279	Update vars via generator emergency-access for machine control	2026-06-14 15:00:24 +07:00
Berwn	56f0af3153	Fix knot startup on ns1/ns2: TSIG key perms and port 53 conflict. knotd runs as the "knot" user, so the shared TSIG key file needs owner/group knot -- it was root-only and knot couldn't read it. systemd-resolved's stub listener was holding port 53, so knot's 0.0.0.0@53 / ::@53 TCP bind failed. Disable the stub (resolution still works via nss-resolve) to free the port.	2026-06-14 14:49:10 +07:00
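
Commits 9aa83d70a2 and 848c4ec47d describe deriving each member's mesh IPv6 and the /88 subnet from the zerotier-network-id and a 10-char node id. The repository's ztMemberIp helper is written in Nix and is not shown in this log, so the following is only a rough Python sketch of that kind of derivation. It assumes the network uses ZeroTier's RFC 4193 assignment mode (fd, then the 64-bit network id, then 0x9993, then the 40-bit node id), which is consistent with the /88 prefix mentioned above. The function names and the sample ids are hypothetical.

```python
import ipaddress


def zt_member_ip(network_id: str, node_id: str) -> ipaddress.IPv6Address:
    """Build a member address as fd | network id (8 bytes) | 99 93 | node id (5 bytes).

    Assumption: RFC 4193 assignment mode; not taken from the repository's Nix code.
    """
    if len(network_id) != 16:
        raise ValueError("network id must be 16 hex characters")
    if len(node_id) != 10:
        raise ValueError("node id must be 10 hex characters")
    raw = (
        bytes([0xFD])
        + bytes.fromhex(network_id)
        + bytes([0x99, 0x93])
        + bytes.fromhex(node_id)
    )
    return ipaddress.IPv6Address(raw)


def zt_subnet(network_id: str) -> ipaddress.IPv6Network:
    """Return the /88 shared by all members: 8 + 64 + 16 fixed prefix bits."""
    if len(network_id) != 16:
        raise ValueError("network id must be 16 hex characters")
    raw = bytes([0xFD]) + bytes.fromhex(network_id) + bytes([0x99, 0x93]) + bytes(5)
    return ipaddress.IPv6Network((raw, 88))


if __name__ == "__main__":
    # Hypothetical ids, for illustration only.
    nwid, node = "8056c2e21c000001", "efcc1b0947"
    print(zt_member_ip(nwid, node), "in", zt_subnet(nwid))
```

Because the prefix depends only on the network id, a firewall rule scoped to the /88 covers every admitted member, while an individual member is pinned by its node id alone.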

88 Commits