From 9c8a2abf3f90029bdf97270a78b85d5c45503d36 Mon Sep 17 00:00:00 2001
From: Berwn
Date: Wed, 17 Jun 2026 17:27:56 +0700
Subject: [PATCH] Bind VictoriaLogs on IPv6 so the mesh can ship journald to it

VictoriaLogs, like the VM scraper, is IPv4-only by default: ":9428" binds
0.0.0.0 only, so ns1/ns2 pushing journald over the IPv6 mesh got
"connection refused" while control's own loopback (v4) upload worked.
Add -enableTCP6 so it binds [::] (dual-stack), matching the flag already
used for the scraper.

Also simplify the systemd-journal-upload override to just
startLimitIntervalSec=0 (retry forever / self-heal) and drop the
SuccessExitStatus masking: a persistent sink failure should stay loud
rather than be hidden behind a green deploy.
---
 docs/src/monitoring.md           |  8 ++++++++
 modules/monitoring/exporters.nix | 11 ++++++-----
 modules/monitoring/server.nix    |  9 ++++++++-
 3 files changed, 22 insertions(+), 6 deletions(-)

diff --git a/docs/src/monitoring.md b/docs/src/monitoring.md
index 675974a..85873e9 100644
--- a/docs/src/monitoring.md
+++ b/docs/src/monitoring.md
@@ -56,6 +56,14 @@ systemd's own `services.journald.upload` → the `/insert/journald` endpoint
 loopback so its logs survive a mesh outage, `ns1`/`ns2` push over the mesh, and
 9428 is firewall-scoped to the mesh like everything else.
 
+> Same IPv4-only default as the scraper: VictoriaLogs binds `0.0.0.0:9428` for a
+> bare `:9428`, so mesh (IPv6) pushes from ns1/ns2 are refused until you pass
+> `extraOptions = [ "-enableTCP6" ]` (binds `[::]`). Verify the bind on `control`:
+>
+> ```
+> ss -tlnp | grep 9428 # want [::]:9428, not 0.0.0.0:9428
+> ```
+
 Query logs from Grafana via the provisioned **VictoriaLogs** datasource (Explore
 view, LogsQL), or directly in the built-in UI at `http://[control]:9428/select/vmui`.
 Logs are tagged with `_HOSTNAME` and `_SYSTEMD_UNIT`, so to follow one service
diff --git a/modules/monitoring/exporters.nix b/modules/monitoring/exporters.nix
index c6a5434..eaf4361 100644
--- a/modules/monitoring/exporters.nix
+++ b/modules/monitoring/exporters.nix
@@ -103,11 +103,12 @@ in
       "http://${dest}/insert/journald";
   };
 
-  # systemd-journal-upload exits if the sink is unreachable. The upstream module
-  # already sets Restart=always/RestartSec=3sec, but the default start-rate limit
-  # (5 tries / 10s) still lets the unit give up permanently and fail a deploy when
-  # VictoriaLogs is briefly down. Logging is best-effort: disable the limit so it
-  # retries forever instead of wedging the host (or switch-to-configuration).
+  # systemd-journal-upload exits if the sink is unreachable. Upstream already
+  # restarts it (Restart=always/RestartSec=3sec), but the default start-rate limit
+  # (5 tries / 10s) lets it give up permanently — so a transient VictoriaLogs
+  # outage leaves the uploader dead until the next deploy. Disable the limit so it
+  # retries forever and self-heals once the sink returns. (A persistent failure
+  # still surfaces loudly in a deploy, which is what we want.)
   systemd.services.systemd-journal-upload.startLimitIntervalSec = 0;
 
   # Scrape ports reachable only from the ZeroTier mesh.
diff --git a/modules/monitoring/server.nix b/modules/monitoring/server.nix
index 3eae4e0..aeeda91 100644
--- a/modules/monitoring/server.nix
+++ b/modules/monitoring/server.nix
@@ -69,7 +69,14 @@ in
   services.victorialogs = {
     enable = true;
     listenAddress = ":${toString logsPort}";
-    extraOptions = [ "-retentionPeriod=30d" ];
+    # -enableTCP6: like the scraper above, VictoriaLogs is IPv4-only by default
+    # for *listening* too — ":9428" binds 0.0.0.0 only, so ns1/ns2 pushing over
+    # the IPv6 mesh get "connection refused". This makes it bind [::] (dual-stack)
+    # so the mesh can reach it. Retention has no dedicated NixOS option.
+    extraOptions = [
+      "-retentionPeriod=30d"
+      "-enableTCP6"
+    ];
   };
 
   # Admin password generated once and stored as a clan secret. Retrieve with: