Bind VictoriaLogs on IPv6 so the mesh can ship journald to it

VictoriaLogs, like the VM scraper, is IPv4-only by default: a bare ":9428" binds 0.0.0.0 only, so ns1/ns2 pushing journald over the IPv6 mesh got "connection refused", while control's own upload over IPv4 loopback worked. Add -enableTCP6 so it binds [::] (dual-stack), matching the flag already used for the scraper.

Also simplify the systemd-journal-upload override to just startLimitIntervalSec=0 (retry forever and self-heal) and drop the SuccessExitStatus masking: a persistent sink failure should stay loud rather than be hidden behind a green deploy.
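The following sketch shows why the `[::]` bind does not break `control`'s own IPv4 loopback upload. It is a minimal Python example, not part of this change, and uses only the standard `socket` module. It opens a dual-stack listener on `::` with `IPV6_V6ONLY` explicitly cleared, then connects a plain IPv4 client to it. It assumes the kernel has IPv6 support, uses a kernel-chosen port rather than 9428, and assumes `-enableTCP6` yields an equivalent dual-stack socket, as described above.

```python
import socket


def dual_stack_accepts_ipv4() -> bool:
    """Return True if an IPv4 client can reach a TCP listener bound to [::]."""
    srv = socket.socket(socket.AF_INET6, socket.SOCK_STREAM)
    try:
        # Clear IPV6_V6ONLY so the [::] socket also accepts IPv4 peers
        # (as v4-mapped addresses) regardless of net.ipv6.bindv6only.
        srv.setsockopt(socket.IPPROTO_IPV6, socket.IPV6_V6ONLY, 0)
        srv.settimeout(2)
        srv.bind(("::", 0))  # port 0: let the kernel pick a free port
        srv.listen(1)
        port = srv.getsockname()[1]

        # A plain IPv4 client, like control's loopback upload.
        with socket.create_connection(("127.0.0.1", port), timeout=2):
            conn, peer = srv.accept()
            conn.close()

        # IPv4 peers show up on an AF_INET6 socket as ::ffff:a.b.c.d.
        return peer[0] == "::ffff:127.0.0.1"
    finally:
        srv.close()


if __name__ == "__main__":
    print(dual_stack_accepts_ipv4())
```

The IPv4 peer is reported as the v4-mapped address `::ffff:127.0.0.1`. Clearing `IPV6_V6ONLY` explicitly avoids depending on the host's `net.ipv6.bindv6only` sysctl. With the option set, the same `[::]` listener would accept IPv6 clients only.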
2026-06-17 17:27:56 +07:00
parent 0eb883061b
commit 9c8a2abf3f
3 changed files with 22 additions and 6 deletions
@@ -56,6 +56,14 @@ systemd's own `services.journald.upload` → the `/insert/journald` endpoint
 loopback so its logs survive a mesh outage, `ns1`/`ns2` push over the mesh, and
 9428 is firewall-scoped to the mesh like everything else.
+> Same IPv4-only default as the scraper: VictoriaLogs binds `0.0.0.0:9428` for a
+> bare `:9428`, so mesh (IPv6) pushes from ns1/ns2 are refused until you pass
+> `extraOptions = [ "-enableTCP6" ]` (binds `[::]`). Verify the bind on `control`:
+>
+> ```
+> ss -tlnp | grep 9428   # want [::]:9428, not 0.0.0.0:9428
+> ```
 Query logs from Grafana via the provisioned **VictoriaLogs** datasource (Explore
 view, LogsQL), or directly in the built-in UI at `http://[control]:9428/select/vmui`.
 Logs are tagged with `_HOSTNAME` and `_SYSTEMD_UNIT`, so to follow one service
@@ -103,11 +103,12 @@ in
      "http://${dest}/insert/journald";
  };
-  # systemd-journal-upload exits if the sink is unreachable. The upstream module
-  # already sets Restart=always/RestartSec=3sec, but the default start-rate limit
-  # (5 tries / 10s) still lets the unit give up permanently and fail a deploy when
-  # VictoriaLogs is briefly down. Logging is best-effort: disable the limit so it
-  # retries forever instead of wedging the host (or switch-to-configuration).
+  # systemd-journal-upload exits if the sink is unreachable. Upstream already
+  # restarts it (Restart=always/RestartSec=3sec), but the default start-rate limit
+  # (5 tries / 10s) lets it give up permanently — so a transient VictoriaLogs
+  # outage leaves the uploader dead until the next deploy. Disable the limit so it
+  # retries forever and self-heals once the sink returns. (A persistent failure
+  # still surfaces loudly in a deploy, which is what we want.)
  systemd.services.systemd-journal-upload.startLimitIntervalSec = 0;
  # Scrape ports reachable only from the ZeroTier mesh.
@@ -69,7 +69,14 @@ in
  services.victorialogs = {
    enable = true;
    listenAddress = ":${toString logsPort}";
-    extraOptions = [ "-retentionPeriod=30d" ];
+    # -enableTCP6: like the scraper above, VictoriaLogs is IPv4-only by default
+    # for *listening* too — ":9428" binds 0.0.0.0 only, so ns1/ns2 pushing over
+    # the IPv6 mesh get "connection refused". This makes it bind [::] (dual-stack)
+    # so the mesh can reach it. Retention has no dedicated NixOS option.
+    extraOptions = [
+      "-retentionPeriod=30d"
+      "-enableTCP6"
+    ];
  };
  # Admin password generated once and stored as a clan secret. Retrieve with: