Commit Graph

11 Commits

Author SHA1 Message Date
Berwn 0eb883061b Keep systemd-journal-upload retrying instead of failing a deploy
The uploader exits when VictoriaLogs is unreachable. Upstream already sets
Restart=always/RestartSec=3sec, but the default start-rate limit lets the unit
give up permanently and trip switch-to-configuration when the sink is briefly
down. Disable the limit (startLimitIntervalSec=0) so logging stays best-effort
and never wedges a host or a deploy.
2026-06-17 17:09:30 +07:00
Berwn d4a171640b Add VictoriaLogs for centralized journald across all hosts
control runs VictoriaLogs (:9428, 30d, mesh-scoped) with a matching
Grafana datasource. Each host ships journald via systemd's own
journald.upload to the /insert/journald endpoint -- no extra agent.
control uploads over loopback so its logs survive a mesh outage; ns1
and ns2 push over the mesh.
2026-06-17 16:53:52 +07:00
Berwn c7b0f206c8 Alert on and chart blackbox DNS probe failures
DNSResolutionProbeFailed and DNSSECProbeFailed fire when an SOA or
DNSKEY probe to a public nameserver address stays down for 5m. The CNX
DNS dashboard gains a "DNS probes (outside-in)" row: per-zone/server
status table, probe success, and probe latency.
2026-06-17 15:42:13 +07:00
Berwn 54f607d063 Add blackbox exporter for outside-in DNS probes
control runs blackbox_exporter on loopback, probing each nameserver's
public v4+v6 address for every zone: SOA (zone served) and DNSKEY (still
signed, since blackbox has no DO-bit option). Probe definitions are
shared between the exporter config and the VictoriaMetrics scrape jobs
so they can't drift. Verified live against ns1/ns2 over v4 and v6.
2026-06-17 15:37:45 +07:00
Berwn 0544bf95e5 Add vmalert rules for failed and stale backups
BackupJobFailed fires when a borgbackup job enters the systemd failed
state; BackupStale fires when the daily timer has not run in over 26h
(or has never run). Both read the node_exporter systemd collector on
the backup client, matching the CNX Backups dashboard.
2026-06-17 15:17:12 +07:00
Berwn 1ea5bda23f Add CNX Backups dashboard and document the backup setup
Grafana dashboard (auto-provisioned from the dashboards dir) tracks
borgbackup job health, time since last run, and per-job systemd state
from the node_exporter systemd collector on the client. New docs page
covers the ns1 -> control topology, secrets flow, and restore commands.
2026-06-17 15:13:47 +07:00
Berwn 7ae3221b83 Add Active alerts panel to the top of the CNX DNS dashboard
Surfaces vmalert's firing ALERTS series as a table at the top of the dashboard,
so the minimal-delivery alerts are visible at a glance. Existing panels shift
down by one row.
2026-06-17 14:51:33 +07:00
Berwn 4c7c74836d Add vmalert alerting rules for DNS and host health
vmalert on control evaluates rules (declared in git) against VictoriaMetrics and
remote-writes alert state back, so firing alerts show as the ALERTS series in
Grafana. Covers SOA divergence between ns1/ns2, secondary zone expiry, scrape
target down, and root disk full. No notifier yet (notifier.blackhole). Also adds
TODO.md roadmap.
2026-06-17 14:49:32 +07:00
Berwn 848c4ec47d Read mesh host map from clan zerotier vars instead of hardcoding
The control/ns1/ns2 mesh IPs and the /88 subnet were duplicated literals in
mesh-hosts.nix. clan-core's zerotier generator already writes each machine's IP
as a public var (vars/per-machine/<m>/zerotier/zerotier-ip), so read from there
and derive the subnet from zerotier-network-id. Pure refactor: the rendered
values are identical and the system derivation hash is unchanged.
2026-06-17 11:53:56 +07:00
Berwn 8ac96b2d10 Enable IPv6 dialing for VictoriaMetrics scrapes
The scraper defaults to IPv4-only, so the ns1/ns2 mesh ULA targets were
dropped with 'no suitable address found'. -enableTCP6 lets VM scrape them.
2026-06-17 10:51:31 +07:00
Berwn 33ac7e106b Add VictoriaMetrics + Grafana DNS monitoring over the mesh
control runs VictoriaMetrics (loopback) and Grafana; every machine exports
node metrics and the nameservers export Knot stats (mod-stats + knot-exporter).
Scraping and the Grafana UI ride the ZeroTier mesh only, scoped by nftables to
the mesh /88; the public side stays closed by the Hetzner cloud firewall. The
provisioned DNS dashboard includes a per-zone SOA serial table to catch
primary/secondary drift. ZeroTier ULAs are centralised in mesh-hosts.nix.
2026-06-17 10:17:27 +07:00