cnx-network-clan

Author	SHA1	Message	Date
Berwn	60db8c60b0	Add parsedmarc DMARC report analyzer on control. Deliver cnx.email DMARC aggregate/forensic reports to a dedicated dmarc@cnx.email mailbox on mx1 and analyze them with parsedmarc on control, storing parsed reports in a local loopback Elasticsearch and visualizing via the auto-provisioned Grafana dashboard. parsedmarc fetches the mailbox over IMAPS across the mesh (mx1.cnx.email pinned to its mesh address so TLS still validates), using a shared mail-dmarc-cred clan var so mx1's mailserver and control see the same password.	2026-06-21 03:27:23 +07:00
Berwn	415a050f6a	Scrape web01 node_exporter into VictoriaMetrics. Add web01 to the mesh map and the node scrape job so it appears on the uptime/host dashboards alongside the other hosts.	2026-06-21 03:08:56 +07:00
Berwn	f42569e992	Add provisioned Grafana uptime dashboard for all hosts	2026-06-21 01:57:08 +07:00
Berwn	dc21348727	Format drifted files to satisfy the treefmt flake-check gate. Pure formatting (nixfmt/prettier/yamlfmt); no behavior change. These files predate the current treefmt config and were failing nix flake check; reformatting them makes the gate green again.	2026-06-18 14:49:48 +07:00
Berwn	6e4178df04	Onboard mx1 mail host and factor out per-host public IPs - Register mx1 in the inventory and as a direct-SSH `internet` host; give it a static public IPv6 (2a01:4ff:2f0:1963::1). - Point the cnx.email MX (plus SPF/DMARC) at mx1 and add its A record. - Bring mx1 into monitoring: import exporters, add it to the mesh map and the node scrape job so its host metrics and journald reach control. - Add a clan-mx1 Hetzner firewall: inbound SMTP + ZeroTier + ICMP, no public SSH (admin rides the mesh like the other hosts). 587/465/993 held for now. - Extract per-host public IPv4/IPv6 into modules/hosts.nix, consumed by clan.nix's internet hosts and each machine's cnx.staticIPv6, so each address is declared once instead of being duplicated across configs. - docs: add mx1 to the machines table.	2026-06-18 11:53:14 +07:00
Berwn	9c8a2abf3f	Bind VictoriaLogs on IPv6 so the mesh can ship journald to it. VictoriaLogs, like the VM scraper, is IPv4-only by default: ":9428" binds 0.0.0.0 only, so ns1/ns2 pushing journald over the IPv6 mesh got "connection refused" while control's own loopback (v4) upload worked. Add -enableTCP6 so it binds [::] (dual-stack), matching the flag already used for the scraper. Also simplify the systemd-journal-upload override to just startLimitIntervalSec=0 (retry forever / self-heal) and drop the SuccessExitStatus masking: a persistent sink failure should stay loud rather than be hidden behind a green deploy.	2026-06-17 17:27:56 +07:00
Berwn	0eb883061b	Keep systemd-journal-upload retrying instead of failing a deploy. The uploader exits when VictoriaLogs is unreachable. Upstream already sets Restart=always/RestartSec=3sec, but the default start-rate limit lets the unit give up permanently and trip switch-to-configuration when the sink is briefly down. Disable the limit (startLimitIntervalSec=0) so logging stays best-effort and never wedges a host or a deploy.	2026-06-17 17:09:30 +07:00
Berwn	d4a171640b	Add VictoriaLogs for centralized journald across all hosts. control runs VictoriaLogs (:9428, 30d, mesh-scoped) with a matching Grafana datasource. Each host ships journald via systemd's own journald.upload to the /insert/journald endpoint -- no extra agent. control uploads over loopback so its logs survive a mesh outage; ns1 and ns2 push over the mesh.	2026-06-17 16:53:52 +07:00
Berwn	c7b0f206c8	Alert on and chart blackbox DNS probe failures. DNSResolutionProbeFailed and DNSSECProbeFailed fire when an SOA or DNSKEY probe to a public nameserver address stays down for 5m. The CNX DNS dashboard gains a "DNS probes (outside-in)" row: per-zone/server status table, probe success, and probe latency.	2026-06-17 15:42:13 +07:00
Berwn	54f607d063	Add blackbox exporter for outside-in DNS probes. control runs blackbox_exporter on loopback, probing each nameserver's public v4+v6 address for every zone: SOA (zone served) and DNSKEY (still signed, since blackbox has no DO-bit option). Probe definitions are shared between the exporter config and the VictoriaMetrics scrape jobs so they can't drift. Verified live against ns1/ns2 over v4 and v6.	2026-06-17 15:37:45 +07:00
Berwn	0544bf95e5	Add vmalert rules for failed and stale backups. BackupJobFailed fires when a borgbackup job enters the systemd failed state; BackupStale fires when the daily timer has not run in over 26h (or has never run). Both read the node_exporter systemd collector on the backup client, matching the CNX Backups dashboard.	2026-06-17 15:17:12 +07:00
Berwn	1ea5bda23f	Add CNX Backups dashboard and document the backup setup. Grafana dashboard (auto-provisioned from the dashboards dir) tracks borgbackup job health, time since last run, and per-job systemd state from the node_exporter systemd collector on the client. New docs page covers the ns1 -> control topology, secrets flow, and restore commands.	2026-06-17 15:13:47 +07:00
Berwn	7ae3221b83	Add Active alerts panel to the top of the CNX DNS dashboard. Surfaces vmalert's firing ALERTS series as a table at the top of the dashboard, so the minimal-delivery alerts are visible at a glance. Existing panels shift down by one row.	2026-06-17 14:51:33 +07:00
Berwn	4c7c74836d	Add vmalert alerting rules for DNS and host health. vmalert on control evaluates rules (declared in git) against VictoriaMetrics and remote-writes alert state back, so firing alerts show as the ALERTS series in Grafana. Covers SOA divergence between ns1/ns2, secondary zone expiry, scrape target down, and root disk full. No notifier yet (notifier.blackhole). Also adds TODO.md roadmap.	2026-06-17 14:49:32 +07:00
Berwn	848c4ec47d	Read mesh host map from clan zerotier vars instead of hardcoding. The control/ns1/ns2 mesh IPs and the /88 subnet were duplicated literals in mesh-hosts.nix. clan-core's zerotier generator already writes each machine's IP as a public var (vars/per-machine/<m>/zerotier/zerotier-ip), so read from there and derive the subnet from zerotier-network-id. Pure refactor: the rendered values are identical and the system derivation hash is unchanged.	2026-06-17 11:53:56 +07:00
Berwn	8ac96b2d10	Enable IPv6 dialing for VictoriaMetrics scrapes. The scraper defaults to IPv4-only, so the ns1/ns2 mesh ULA targets were dropped with 'no suitable address found'. -enableTCP6 lets VM scrape them.	2026-06-17 10:51:31 +07:00
Berwn	33ac7e106b	Add VictoriaMetrics + Grafana DNS monitoring over the mesh. control runs VictoriaMetrics (loopback) and Grafana; every machine exports node metrics and the nameservers export Knot stats (mod-stats + knot-exporter). Scraping and the Grafana UI ride the ZeroTier mesh only, scoped by nftables to the mesh /88; the public side stays closed by the Hetzner cloud firewall. The provisioned DNS dashboard includes a per-zone SOA serial table to catch primary/secondary drift. ZeroTier ULAs are centralised in mesh-hosts.nix.	2026-06-17 10:17:27 +07:00

17 Commits
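
Commit 54f607d063 notes that the DNS probe definitions are shared between the blackbox exporter config and the VictoriaMetrics scrape jobs so the two cannot drift. The repository does this in Nix; the Python below is only a rough sketch of the same idea, not the repository's code. The `Probe` class, the function names, the module naming scheme, the zone `example.org`, the documentation-range addresses, and the exporter address `127.0.0.1:9115` are all illustrative assumptions. The rendered dictionaries follow the usual blackbox `dns` prober and Prometheus-style scrape config layout.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Probe:
    """One DNS probe definition (hypothetical shape, for illustration)."""
    zone: str         # zone to query, e.g. "example.org"
    query_type: str   # "SOA" (zone served) or "DNSKEY" (still signed)
    ip_protocol: str  # "ip4" or "ip6"

    @property
    def module(self) -> str:
        # Module name derived from the definition, so both configs agree on it.
        zone_slug = self.zone.replace(".", "_")
        return f"dns_{self.query_type.lower()}_{zone_slug}_{self.ip_protocol}"


def build_probes(zones):
    """Single source of truth: every zone x query type x address family."""
    return [
        Probe(zone, qtype, ip)
        for zone in zones
        for qtype in ("SOA", "DNSKEY")
        for ip in ("ip4", "ip6")
    ]


def exporter_modules(probes):
    """Render the `modules` section of a blackbox exporter config."""
    return {
        p.module: {
            "prober": "dns",
            "dns": {
                "query_name": p.zone,
                "query_type": p.query_type,
                "preferred_ip_protocol": p.ip_protocol,
            },
        }
        for p in probes
    }


def scrape_jobs(probes, servers, exporter="127.0.0.1:9115"):
    """Render one scrape job per probe, pointing at the loopback exporter."""
    jobs = []
    for p in probes:
        jobs.append({
            "job_name": f"blackbox_{p.module}",
            "metrics_path": "/probe",
            "params": {"module": [p.module]},
            "static_configs": [
                {"targets": [addrs[p.ip_protocol] for addrs in servers.values()]}
            ],
            "relabel_configs": [
                {"source_labels": ["__address__"], "target_label": "__param_target"},
                {"source_labels": ["__param_target"], "target_label": "instance"},
                {"target_label": "__address__", "replacement": exporter},
            ],
        })
    return jobs


# Placeholder inputs: documentation-range addresses, not the real nameservers.
zones = ["example.org"]
servers = {
    "ns1": {"ip4": "192.0.2.1", "ip6": "2001:db8::1"},
    "ns2": {"ip4": "192.0.2.2", "ip6": "2001:db8::2"},
}

probes = build_probes(zones)
modules = exporter_modules(probes)
jobs = scrape_jobs(probes, servers)

# Every scrape job references a module that the exporter config defines.
assert {job["params"]["module"][0] for job in jobs} == set(modules)
```

Because both outputs are derived from the same `probes` list, adding a zone or a query type changes the exporter modules and the scrape jobs together, which is the property the commit message describes.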