VictoriaLogs, like the VM scraper, is IPv4-only by default: ":9428" binds
0.0.0.0 only, so ns1/ns2 pushing journald over the IPv6 mesh got "connection
refused" while control's own loopback (v4) upload worked. Add -enableTCP6 so it
binds [::] (dual-stack), matching the flag already used for the scraper.
Also simplify the systemd-journal-upload override to just startLimitIntervalSec=0
(retry forever / self-heal) and drop the SuccessExitStatus masking: a persistent
sink failure should stay loud rather than be hidden behind a green deploy.
The uploader exits when VictoriaLogs is unreachable. Upstream already sets
Restart=always/RestartSec=3sec, but the default start-rate limit lets the unit
give up permanently and trip switch-to-configuration when the sink is briefly
down. Disable the limit (startLimitIntervalSec=0) so logging stays best-effort
and never wedges a host or a deploy.
control runs VictoriaLogs (:9428, 30d retention, mesh-scoped) with a matching
Grafana datasource. Each host ships journald logs via systemd's own
systemd-journal-upload to the /insert/journald endpoint -- no extra agent.
control uploads over loopback so its logs survive a mesh outage; ns1
and ns2 push over the mesh.
DNSResolutionProbeFailed and DNSSECProbeFailed fire when an SOA or
DNSKEY probe to a public nameserver address stays down for 5m. The CNX
DNS dashboard gains a "DNS probes (outside-in)" row: per-zone/server
status table, probe success, and probe latency.
control runs blackbox_exporter on loopback, probing each nameserver's
public v4+v6 address for every zone: SOA (zone served) and DNSKEY (still
signed, since blackbox has no DO-bit option). Probe definitions are
shared between the exporter config and the VictoriaMetrics scrape jobs
so they can't drift. Verified live against ns1/ns2 over v4 and v6.
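
As a rough illustration of the "shared definitions" idea (the real definitions
live in Nix and are not shown here), the sketch below derives both the blackbox
module map and the scrape jobs from one probe list. The zone, the addresses,
the module naming scheme, and the helper names are all hypothetical; only the
blackbox `prober: dns` / `query_name` / `query_type` keys and the `/probe`
scrape path are taken from the tools' documented config shape.

```python
from itertools import product

# Hypothetical inputs: documentation-range addresses and a placeholder zone.
ZONES = ["example.org"]
SERVERS = {
    "ns1": ["192.0.2.1", "2001:db8::1"],
    "ns2": ["192.0.2.2", "2001:db8::2"],
}
QUERY_TYPES = ["SOA", "DNSKEY"]


def probes():
    """Single source of truth: one entry per (zone, query type)."""
    return [
        {
            "module": "dns_{}_{}".format(qtype.lower(), zone.replace(".", "_")),
            "zone": zone,
            "qtype": qtype,
        }
        for zone, qtype in product(ZONES, QUERY_TYPES)
    ]


def blackbox_modules():
    """Module map for the exporter config, keyed by module name."""
    return {
        p["module"]: {
            "prober": "dns",
            "dns": {"query_name": p["zone"], "query_type": p["qtype"]},
        }
        for p in probes()
    }


def scrape_jobs():
    """One scrape job per module, targeting every public address."""
    targets = [addr for addrs in SERVERS.values() for addr in addrs]
    return [
        {
            "job_name": p["module"],
            "metrics_path": "/probe",
            "params": {"module": [p["module"]]},
            "static_configs": [{"targets": targets}],
        }
        for p in probes()
    ]
```

Because both functions iterate over `probes()`, a module cannot exist without a
matching scrape job or the other way round.
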
BackupJobFailed fires when a borgbackup job enters the systemd failed
state; BackupStale fires when the daily timer has not run in over 26h
(or has never run). Both read the node_exporter systemd collector on
the backup client, matching the CNX Backups dashboard.
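
The BackupStale condition can be read as the predicate below. This is a sketch
of the logic only, not the vmalert rule itself; the function name and the use
of `None` for "never run" are assumptions made for illustration. The 26h
threshold is the one stated above and leaves a daily timer some slack.

```python
from typing import Optional

STALE_AFTER_SECONDS = 26 * 3600  # 26h, as stated in the rule description


def backup_stale(now: float, last_trigger: Optional[float]) -> bool:
    """True when the timer never ran or last ran more than 26h ago.

    `last_trigger` is the unix time of the last timer run, or None if the
    timer has never fired (hypothetical representation).
    """
    if last_trigger is None:
        return True
    return now - last_trigger > STALE_AFTER_SECONDS
```
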
Grafana dashboard (auto-provisioned from the dashboards dir) tracks
borgbackup job health, time since last run, and per-job systemd state
from the node_exporter systemd collector on the client. New docs page
covers the ns1 -> control topology, secrets flow, and restore commands.
clan borgbackup instance: control serves repos, ns1 backs up its
clan.core.state (the KASP keystore at /var/lib/knot) nightly over the
mesh with repokey encryption. ns1 maps the control machine name to its
ZeroTier address so the borg@control repo resolves.
Run `clan vars generate ns1` before deploy to mint the borg keypair.
Surfaces vmalert's firing ALERTS series as a table at the top of the dashboard,
so the minimal-delivery alerts are visible at a glance. Existing panels shift
down by one row.
vmalert on control evaluates rules (declared in git) against VictoriaMetrics and
remote-writes alert state back, so firing alerts show as the ALERTS series in
Grafana. Covers SOA divergence between ns1/ns2, secondary zone expiry, scrape
target down, and root disk full. No notifier yet (notifier.blackhole). Also adds
TODO.md roadmap.
Docs live in docs/ (DNS, ZeroTier mesh, monitoring), built at Nix-build time and
served as static files over the ZeroTier mesh on control:8080. Commit-to-edit:
change the markdown and redeploy to publish.
clan.nix gains an allowedIps list for the zerotier controller, fed via a
ztMemberIp helper that derives each member's IPv6 on this network from its
10-char node id + the zerotier-network-id var. This lets us list external
devices (admin laptops) by their stable node id; this clan-core's allowedIps
interface consumes the derived addresses as --member-ip on control.
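
For reference, a minimal sketch of the derivation, assuming the network uses
ZeroTier's RFC 4193 scheme (fd + 64-bit network id + 0x9993 + 40-bit node id).
The real ztMemberIp helper is written in Nix and is not reproduced here; the
function names and the ids below are made up for illustration.

```python
import ipaddress


def zt_member_ip(network_id: str, node_id: str) -> ipaddress.IPv6Address:
    """fd | 16-hex-char network id | 9993 | 10-hex-char node id."""
    if len(network_id) != 16 or len(node_id) != 10:
        raise ValueError("expected 16-char network id and 10-char node id")
    return ipaddress.IPv6Address(
        bytes.fromhex("fd" + network_id + "9993" + node_id)
    )


def zt_subnet(network_id: str) -> ipaddress.IPv6Network:
    """The shared prefix is 8 + 64 + 16 = 88 bits, hence a /88."""
    if len(network_id) != 16:
        raise ValueError("expected 16-char network id")
    prefix = bytes.fromhex("fd" + network_id + "9993") + bytes(5)
    return ipaddress.IPv6Network((ipaddress.IPv6Address(prefix), 88))


# Hypothetical ids, not taken from this network.
addr = zt_member_ip("0123456789abcdef", "0a1b2c3d4e")
net = zt_subnet("0123456789abcdef")
```

Here `addr` is fd01:2345:6789:abcd:ef99:930a:1b2c:3d4e and `net` is
fd01:2345:6789:abcd:ef99:9300::/88, so every member address falls inside the
subnet derived from the network id alone.
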
The control/ns1/ns2 mesh IPs and the /88 subnet were duplicated literals in
mesh-hosts.nix. clan-core's zerotier generator already writes each machine's IP
as a public var (vars/per-machine/<m>/zerotier/zerotier-ip), so read from there
and derive the subnet from zerotier-network-id. Pure refactor: the rendered
values are identical and the system derivation hash is unchanged.
control runs VictoriaMetrics (loopback) and Grafana; every machine exports
node metrics and the nameservers export Knot stats (mod-stats + knot-exporter).
Scraping and the Grafana UI ride the ZeroTier mesh only, scoped by nftables to
the mesh /88; the public side stays closed by the Hetzner cloud firewall. The
provisioned DNS dashboard includes a per-zone SOA serial table to catch
primary/secondary drift. ZeroTier ULAs are centralised in mesh-hosts.nix.
dateserial (YYYYMMDDnn) only has a 2-digit same-day counter held in Knot's
journal; a journal reset restarted the counter and let ns1 mint a serial ns2
had already seen with older content, so ns2 never retransferred. unixtime is
strictly monotonic per reload, eliminating the shared-serial collision.
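
The collision can be reproduced with a simplified model. This is not Knot's
implementation: the functions below are hypothetical, ignore counter overflow
past 99 and RFC 1982 serial wraparound, and assume the secondary transfers only
when the primary's serial is greater than its own.

```python
from datetime import date
from typing import Optional


def next_dateserial(today: date, current: Optional[int]) -> int:
    """YYYYMMDDnn: bump the counter on the same day, else start at nn=00."""
    base = int(today.strftime("%Y%m%d")) * 100
    if current is not None and current >= base:
        return current + 1
    return base


def next_unixtime(now: int, current: Optional[int]) -> int:
    """Wall-clock serial; never goes backwards while the clock moves forward."""
    if current is not None and current >= now:
        return current + 1
    return now


def needs_transfer(primary_serial: int, secondary_serial: int) -> bool:
    """Simplified refresh check on the secondary."""
    return primary_serial > secondary_serial


day = date(2024, 1, 15)

# Two edits before the reset: ns2 ends up holding 2024011501.
seen_by_ns2 = next_dateserial(day, next_dateserial(day, None))

# Journal reset: ns1 forgets the counter (current=None) and re-mints the
# same two serials for new content.
reissued = next_dateserial(day, next_dateserial(day, None))
```

In this model `reissued == seen_by_ns2`, so `needs_transfer` is false and ns2
keeps the older content. A unixtime serial minted after the reset is larger
than any serial minted before it, even though the previous value was lost.
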
Add a dedicated acme_ddns TSIG key (scoped to ns1 only) and an acl_acme rule
that limits it to TXT updates at or under _acme-challenge.<zone>. An external
ACME client can now write challenge records via RFC 2136; Knot signs them and
transfers to ns2, which never holds the key.
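
The intent of the acl_acme restriction can be expressed as the predicate
below. It is an illustration of the matching rule only, not Knot's ACL
evaluation; the function name and the example zone are hypothetical.

```python
def acme_update_allowed(owner: str, rtype: str, zone: str) -> bool:
    """Allow only TXT updates at or under _acme-challenge.<zone>."""
    owner = owner.rstrip(".").lower()
    base = "_acme-challenge." + zone.rstrip(".").lower()
    at_or_under = owner == base or owner.endswith("." + base)
    return rtype.upper() == "TXT" and at_or_under
```

Under this reading, `_acme-challenge.example.org` TXT is accepted, while an A
record at the same name, or a TXT at `www.example.org`, is rejected. Note that
`_acme-challenge.www.example.org` is not under `_acme-challenge.example.org`,
so it would also be rejected by this predicate.
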
ns1 (primary) now signs every zone with an ECDSA P-256/SHA-256 policy and
manages the SOA serial itself: zonefile-load = difference-no-serial (with
journal-content = all) plus serial-policy = dateserial let records be edited
without bumping the serial by hand. ns2 needs no change; it transfers the
already-signed zone.
Also point the ns1/ns2 AAAA glue at the public Hetzner IPv6 addresses; they
previously pointed at unroutable ZeroTier mesh ULAs.
Extract the per-firewall rule data out of control's configuration into
modules/hetzner-firewall-rules.nix, imported like the DNS domains list.
The evaluated rules are unchanged.
control runs a oneshot on each deploy that creates each firewall if
missing and replaces its rules via the Hetzner API set_rules action,
using a Read/Write token stored as a clan secret. Public SSH is not
exposed; admin access rides the ZeroTier mesh, with emergency-access as
the console fallback.
Add the clan-core emergency-access service on all nixos machines; it
sets a per-machine recovery root password for console login when a
machine fails to boot.
knotd runs as the "knot" user, so the shared TSIG key file needs
owner/group knot; it was root-only, so knotd couldn't read it.
systemd-resolved's stub listener was holding port 53, so knot's
0.0.0.0@53 / ::@53 TCP bind failed. Disable the stub (resolution
still works via nss-resolve) to free the port.