125 Commits

Author SHA1 Message Date
Berwn dfdeb84ab8 Set time.timeZone on mx1 and web01
Both had NTP (timesyncd) enabled but no timezone, unlike control/ns1/ns2.
Default to Etc/GMT-3 (UTC+3; POSIX-style Etc zone names invert the sign) to
match the majority of hosts.
2026-06-21 03:07:31 +07:00
Berwn 48bf7fb250 Add web01 public reverse proxy with DNS-01 wildcard TLS
web01 terminates TLS for grafana.cnx.network and proxies to Grafana on
control over the mesh. Caddy serves a *.cnx.network wildcard cert obtained
via ACME DNS-01, using a dedicated acme_web01 TSIG key scoped on ns1 to
_acme-challenge on the cnx.network zone only. Ports 80/443 are the only
public exposure (80 just redirects); admin and the backend ride ZeroTier.

Also reload Caddy on cert renewal for both web01 and mx1, since both
reference the cert via explicit tls file paths and would otherwise keep
serving a stale cert after a silent renewal.
2026-06-21 03:05:54 +07:00
Berwn 86a2928825 update(inventory.json): Installed web01 2026-06-21 02:28:43 +07:00
Berwn f6da01ba18 Add web01 to secret vars/shared/dns-acme-web01-secret/secret 2026-06-21 02:26:44 +07:00
Berwn eeed40bcb5 Update vars via generator dns-acme-web01-rfc2136 for machine web01 2026-06-21 02:26:44 +07:00
Berwn aac8f9d8e6 Update vars via generator dns-acme-web01-knot for machine ns1 2026-06-21 02:26:43 +07:00
Berwn f5874bc337 Update vars via generator zerotier for machine web01 2026-06-21 02:26:33 +07:00
Berwn 2481d4bf92 Update vars via generator tor_tor for machine web01 2026-06-21 02:26:32 +07:00
Berwn 2d8096ee57 Update vars via generator state-version for machine web01 2026-06-21 02:26:30 +07:00
Berwn 1a4a749d78 Update vars via generator root-password for machine web01 2026-06-21 02:26:30 +07:00
Berwn 1c779d8013 Update vars via generator openssh for machine web01 2026-06-21 02:26:30 +07:00
Berwn 9c4e036b09 Update vars via generator emergency-access for machine web01 2026-06-21 02:26:30 +07:00
Berwn 8139b91fbc Add machine web01 to secrets 2026-06-21 02:26:30 +07:00
Berwn c436389619 Update secret web01-age.key 2026-06-21 02:26:29 +07:00
Berwn 9fc97e65b2 Update vars via generator dns-acme-web01-secret for machine ns1 2026-06-21 02:26:29 +07:00
Berwn bd84bf7c85 Set disk schema of machine: web01 to single-disk 2026-06-21 02:25:24 +07:00
Berwn 848dc0dff7 machines/web01/facter.json: update hardware configuration 2026-06-21 02:23:00 +07:00
Berwn 95aff44f86 Add machine web01 2026-06-21 01:58:59 +07:00
Berwn f42569e992 Add provisioned Grafana uptime dashboard for all hosts 2026-06-21 01:57:08 +07:00
Berwn 1dd3aadb97 Add mail.cnx.email client alias as a cert SAN
A mail.cnx.email CNAME (-> mx1.cnx.email) lets clients (Thunderbird etc.)
use a friendly hostname for submission/IMAP. To avoid a TLS name
mismatch the cert now carries mail.cnx.email as a SAN, so the acme_mx1
key is authorized to write _acme-challenge.mail too. The MX still points
at mx1.cnx.email and --reuse-key keeps the DANE TLSA digest valid across
the re-issue.
2026-06-18 15:01:03 +07:00
Berwn dc21348727 Format drifted files to satisfy the treefmt flake-check gate
Pure formatting (nixfmt/prettier/yamlfmt); no behavior change. These
files predate the current treefmt config and were failing nix flake
check; reformatting them makes the gate green again.
2026-06-18 14:49:48 +07:00
Berwn 1cb6f39ea2 Add declarative SNM mail stack on mx1 with DNS-01, DANE, MTA-STS
mx1 runs Simple NixOS Mailserver (Postfix/Dovecot/Rspamd/OpenDKIM) for
cnx.email. The TLS cert is obtained via ACME DNS-01 using a dedicated,
scoped TSIG key (acme_mx1) that ns1 authorizes for only
_acme-challenge.mx1 and _acme-challenge.mta-sts on the cnx.email zone, so
the credential can write nothing else. Mailbox passwords are auto-minted
by a clan vars generator (four-word passphrase + number).

DANE TLSA (3 1 1) is published for _25._tcp.mx1; --reuse-key keeps the
key digest stable across renewals. MTA-STS is enforced via a Caddy vhost
serving the policy on :443 from the same cert (mta-sts SAN). Firewall
opens 25/587/465/143/993/443; 80 stays closed.
2026-06-18 14:47:20 +07:00
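Why --reuse-key matters for DANE can be shown with a minimal sketch. A TLSA 3 1 1 record carries the SHA-256 digest of the certificate's DER-encoded SubjectPublicKeyInfo, so the record depends only on the key, not on the certificate around it. The function name and the placeholder key bytes below are illustrative assumptions, not part of this repository.

```python
import hashlib


def tlsa_3_1_1(spki_der: bytes) -> str:
    """Return TLSA RDATA (usage 3, selector 1, matching type 1) for a
    DER-encoded SubjectPublicKeyInfo: the SHA-256 digest of the key."""
    return f"3 1 1 {hashlib.sha256(spki_der).hexdigest()}"


# Placeholder bytes standing in for a real DER-encoded public key (hypothetical).
key = b"example-spki-bytes"

first_issue = tlsa_3_1_1(key)
after_renewal = tlsa_3_1_1(key)             # same key reused: record unchanged
after_rotation = tlsa_3_1_1(b"another-key")  # new key: record must be republished
```

Because a renewal that reuses the key yields an identical digest, the published _25._tcp.mx1 record stays valid without a DNS update; only a key rotation would require a new TLSA record.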
Berwn 026a26dd53 Add ns1 to secret vars/shared/dns-acme-mx1-secret/secret 2026-06-18 14:11:40 +07:00
Berwn 7e5d50b260 Update vars via generator dns-acme-mx1-knot for machine ns1 2026-06-18 14:11:40 +07:00
Berwn 312de984c1 Update vars via generator dns-acme-rfc2136 for machine mx1 2026-06-18 14:11:40 +07:00
Berwn d76aa8cc8d Update vars via generator mail-passwd-postmaster-at-cnx-email for machine mx1 2026-06-18 14:11:36 +07:00
Berwn 0a78cad06e Update vars via generator dns-acme-mx1-secret for machine mx1 2026-06-18 14:11:36 +07:00
Berwn d1b24017aa Use no-store for docs: epoch mtimes make revalidation serve stale content 2026-06-18 12:24:38 +07:00
Berwn 77a18df257 Stop browsers serving stale docs by forcing revalidation 2026-06-18 12:19:42 +07:00
Berwn a4fe2a7b3a Document how to pull registrar DS records from Knot on ns1
Explain that key material is auto-managed in the KASP keystore under
/var/lib/knot, and that the registrar DS is generated per zone with
`sudo -u knot keymgr <zone> ds`.
2026-06-18 12:12:10 +07:00
Berwn 6e4178df04 Onboard mx1 mail host and factor out per-host public IPs
- Register mx1 in the inventory and as a direct-SSH `internet` host; give it
  a static public IPv6 (2a01:4ff:2f0:1963::1).
- Point the cnx.email MX (plus SPF/DMARC) at mx1 and add its A record.
- Bring mx1 into monitoring: import exporters, add it to the mesh map and the
  node scrape job so its host metrics and journald reach control.
- Add a clan-mx1 Hetzner firewall: inbound SMTP + ZeroTier + ICMP, no public
  SSH (admin rides the mesh like the other hosts). 587/465/993 held for now.
- Extract per-host public IPv4/IPv6 into modules/hosts.nix, consumed by
  clan.nix's internet hosts and each machine's cnx.staticIPv6, so each address
  is declared once instead of being duplicated across configs.
- docs: add mx1 to the machines table.
2026-06-18 11:53:14 +07:00
Berwn 2c89ab913c update(inventory.json): Installed mx1 2026-06-18 11:35:22 +07:00
Berwn 84c3eece58 Update vars via generator zerotier for machine mx1 2026-06-18 11:33:06 +07:00
Berwn 7f5227d2e2 Update vars via generator tor_tor for machine mx1 2026-06-18 11:33:06 +07:00
Berwn ebf4efe5c9 Update vars via generator state-version for machine mx1 2026-06-18 11:33:04 +07:00
Berwn 64b7eb1934 Update vars via generator root-password for machine mx1 2026-06-18 11:33:04 +07:00
Berwn e763d76ae9 Update vars via generator openssh for machine mx1 2026-06-18 11:33:03 +07:00
Berwn b65f526ea2 Update vars via generator emergency-access for machine mx1 2026-06-18 11:33:03 +07:00
Berwn 3a0bc2dba4 Add machine mx1 to secrets 2026-06-18 11:33:03 +07:00
Berwn 6098fe9a3b Update secret mx1-age.key 2026-06-18 11:33:03 +07:00
Berwn 8d9981ee5a Set disk schema of machine: mx1 to single-disk 2026-06-18 11:32:33 +07:00
Berwn afc2e997c0 machines/mx1/facter.json: update hardware configuration 2026-06-18 11:32:22 +07:00
Berwn faaa7b66c0 Add machine mx1 2026-06-18 11:21:27 +07:00
Berwn 9c8a2abf3f Bind VictoriaLogs on IPv6 so the mesh can ship journald to it
VictoriaLogs, like the VM scraper, is IPv4-only by default: ":9428" binds
0.0.0.0 only, so ns1/ns2 pushing journald over the IPv6 mesh got "connection
refused" while control's own loopback (v4) upload worked. Add -enableTCP6 so it
binds [::] (dual-stack), matching the flag already used for the scraper.

Also simplify the systemd-journal-upload override to just startLimitIntervalSec=0
(retry forever / self-heal) and drop the SuccessExitStatus masking: a persistent
sink failure should stay loud rather than be hidden behind a green deploy.
2026-06-17 17:27:56 +07:00
Berwn 0eb883061b Keep systemd-journal-upload retrying instead of failing a deploy
The uploader exits when VictoriaLogs is unreachable. Upstream already sets
Restart=always/RestartSec=3sec, but the default start-rate limit lets the unit
give up permanently and trip switch-to-configuration when the sink is briefly
down. Disable the limit (startLimitIntervalSec=0) so logging stays best-effort
and never wedges a host or a deploy.
2026-06-17 17:09:30 +07:00
Berwn d4a171640b Add VictoriaLogs for centralized journald across all hosts
control runs VictoriaLogs (:9428, 30d, mesh-scoped) with a matching
Grafana datasource. Each host ships journald via systemd's own
systemd-journal-upload to the /insert/journald endpoint -- no extra agent.
control uploads over loopback so its logs survive a mesh outage; ns1
and ns2 push over the mesh.
2026-06-17 16:53:52 +07:00
Berwn c7b0f206c8 Alert on and chart blackbox DNS probe failures
DNSResolutionProbeFailed and DNSSECProbeFailed fire when an SOA or
DNSKEY probe to a public nameserver address stays down for 5m. The CNX
DNS dashboard gains a "DNS probes (outside-in)" row: per-zone/server
status table, probe success, and probe latency.
2026-06-17 15:42:13 +07:00
Berwn 54f607d063 Add blackbox exporter for outside-in DNS probes
control runs blackbox_exporter on loopback, probing each nameserver's
public v4+v6 address for every zone: SOA (zone served) and DNSKEY (still
signed, since blackbox has no DO-bit option). Probe definitions are
shared between the exporter config and the VictoriaMetrics scrape jobs
so they can't drift. Verified live against ns1/ns2 over v4 and v6.
2026-06-17 15:37:45 +07:00
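The "shared so they can't drift" idea above can be sketched as one list of probe definitions from which both the exporter modules and the scrape jobs are derived. This is a simplified illustration, not the repository's Nix code: the module naming scheme, the helper names, and the nameserver addresses (RFC 5737 / RFC 3849 documentation ranges) are all assumptions.

```python
from itertools import product

ZONES = ["cnx.network", "cnx.email"]
QUERY_TYPES = ["SOA", "DNSKEY"]  # SOA: zone served; DNSKEY: zone still signed

# Hypothetical public addresses (documentation ranges), one v4 and one v6 each.
NAMESERVERS = {
    "ns1": ["192.0.2.1", "2001:db8::1"],
    "ns2": ["192.0.2.2", "2001:db8::2"],
}


def probe_definitions():
    """Single source of truth: one entry per (zone, query type)."""
    return [
        {
            "module": f"dns_{qtype.lower()}_{zone.replace('.', '_')}",
            "zone": zone,
            "query_type": qtype,
        }
        for zone, qtype in product(ZONES, QUERY_TYPES)
    ]


def exporter_modules(probes):
    """Exporter side: one DNS prober module per probe definition."""
    return {
        p["module"]: {
            "prober": "dns",
            "dns": {"query_name": p["zone"], "query_type": p["query_type"]},
        }
        for p in probes
    }


def scrape_jobs(probes):
    """Scrape side: one job per module, targeting every nameserver address."""
    targets = [addr for addrs in NAMESERVERS.values() for addr in addrs]
    return [
        {
            "job_name": p["module"],
            "metrics_path": "/probe",
            "params": {"module": [p["module"]]},
            "static_configs": [{"targets": targets}],
        }
        for p in probes
    ]


PROBES = probe_definitions()
MODULES = exporter_modules(PROBES)
JOBS = scrape_jobs(PROBES)
```

Since both outputs are computed from PROBES, adding a zone or query type updates the exporter config and the scrape jobs together; a job can never reference a module that does not exist.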
Berwn 0544bf95e5 Add vmalert rules for failed and stale backups
BackupJobFailed fires when a borgbackup job enters the systemd failed
state; BackupStale fires when the daily timer has not run in over 26h
(or has never run). Both read the node_exporter systemd collector on
the backup client, matching the CNX Backups dashboard.
2026-06-17 15:17:12 +07:00
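The BackupStale condition described above reduces to a small predicate: stale when the last run is more than 26h old, or when there has never been a run. The sketch below only illustrates that logic in Python; the actual rule is a vmalert expression over node_exporter metrics, and the function and constant names here are hypothetical.

```python
from typing import Optional

STALE_AFTER_SECONDS = 26 * 3600  # 26h: a daily timer plus some slack


def backup_stale(last_run: Optional[float], now: float) -> bool:
    """Return True if the backup should alert as stale.

    last_run is the Unix timestamp of the most recent timer run,
    or None if the timer has never run.
    """
    if last_run is None:
        return True  # never ran counts as stale
    return now - last_run > STALE_AFTER_SECONDS


now = 1_000_000.0
ran_yesterday = backup_stale(now - 24 * 3600, now)   # within the window
missed_a_day = backup_stale(now - 27 * 3600, now)    # past the threshold
never_ran = backup_stale(None, now)
```

The 26h threshold leaves a two-hour margin over the daily schedule, so a run that starts slightly late does not trigger the alert.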
Berwn 1ea5bda23f Add CNX Backups dashboard and document the backup setup
Grafana dashboard (auto-provisioned from the dashboards dir) tracks
borgbackup job health, time since last run, and per-job systemd state
from the node_exporter systemd collector on the client. New docs page
covers the ns1 -> control topology, secrets flow, and restore commands.
2026-06-17 15:13:47 +07:00