Thomas Lamprecht <t.lampre...@proxmox.com> writes:
> On 19.05.25 at 15:09, Maximiliano Sandoval wrote:
>> One sync comes after warning that the watchdog is about to expire, and a
>> second right after the watchdog expires.
>>
>> To maximize the chances the log will contain entries relevant to a fence
>> event. This would be extremely useful for detecting whether a node
>> fenced.
>>
>> Signed-off-by: Maximiliano Sandoval <m.sando...@proxmox.com>
>> ---
>>  src/watchdog-mux.c | 2 ++
>>  1 file changed, 2 insertions(+)
>>
>> diff --git a/src/watchdog-mux.c b/src/watchdog-mux.c
>> index e14c768..8669b10 100644
>> --- a/src/watchdog-mux.c
>> +++ b/src/watchdog-mux.c
>> @@ -268,11 +268,13 @@ main(void)
>>                  ) {
>>                      client_list[i].warning_state = WARNING_ISSUED;
>>                      fprintf(stderr, "client watchdog is about to expire\n");
>> +                    sync_journal_unsafe();
>
> The "unsafe" is there for a reason, on a loaded machine doing above
> might trigger a few times and create a zombie left over process for
> each of those.
>
> Simplest fix might be doing a double fork there so that the parent
> process does not exist anymore, in which case systemd collects the
> child process exit status, albeit that wouldn't be the most efficient
> solution.
>
>>                  }
>>
>>                  if ((ctime - client_list[i].time) > client_watchdog_timeout) {
>>                      update_watchdog = 0;
>>                      fprintf(stderr, "client watchdog expired - disable watchdog updates\n");
>> +                    sync_journal_unsafe();
>
> This is basically useless compared to the status quo, there is already
> such a call a few (compiled) instructions after that branch hits anyway
> as we break the main loop then.

We do not (always) break out of the loop.

```c
for (;;) {
    nfds = epoll_wait(epollfd, events, MAX_EVENTS, 1000);
    if (nfds == -1) {
        ...
    }

    if (nfds == 0) { // timeout
        // check for timeouts
        if (update_watchdog) {
            ...
        }
        if (update_watchdog) {
            ...
        }
        continue;
    }

    if (!update_watchdog) {
        break;
    }
```

If epoll_wait keeps timing out, then nfds is 0 and we `continue` before
reaching the break.
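To make the control flow easier to check in isolation, here is a minimal
stand-alone sketch. It is not code from watchdog-mux.c: `struct loop_result`,
`simulate_main_loop()`, `nfds_seq` and `expire_at` are hypothetical names made
up for illustration, and the array of `nfds` values only stands in for what
epoll_wait would return on successive iterations.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical result of one simulated run; not part of watchdog-mux. */
struct loop_result {
    int iterations;     /* number of loop iterations executed */
    bool reached_break; /* true if the `!update_watchdog` break was hit */
};

/*
 * Simulate only the control flow of the main loop.
 * nfds_seq:  stand-in for successive epoll_wait() return values
 * expire_at: iteration index from which the client watchdog counts as expired
 */
static struct loop_result
simulate_main_loop(const int *nfds_seq, size_t len, size_t expire_at)
{
    struct loop_result res = { 0, false };
    int update_watchdog = 1;

    for (size_t i = 0; i < len; i++) {
        int nfds = nfds_seq[i];
        res.iterations++;

        if (nfds == 0) { /* timeout path */
            if (update_watchdog && i >= expire_at) {
                update_watchdog = 0; /* "client watchdog expired" */
            }
            continue; /* skips the check below */
        }

        if (!update_watchdog) {
            res.reached_break = true;
            break;
        }
    }

    return res;
}
```

Feeding it only zeros (every wait times out) with `expire_at = 1` runs through
all iterations without setting `reached_break`, while a sequence such as
`{0, 0, 1, 0, 0}` breaks on the third iteration. As long as the timeout path
is taken on every iteration, the loop never reaches the break.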
This is what I observe whenever I test a fence on my local cluster by
disconnecting all corosync NICs on a host hosting an HA resource.

_______________________________________________
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel