tested it by applying this series to a node with HA guests and then disabling the corosync network completely or, to test the "averted" log, sleeping for 45 seconds before bringing the corosync network back up.

So far, it seems that the "about to expire" warning did make it into the journal in my tests.

We will see in the future how well that works on production systems, depending on the underlying storage layer.


Some smaller remarks on patch 2/3.

Consider this series:
Tested-By: Aaron Lauterer <a.laute...@proxmox.com>
Reviewed-By: Aaron Lauterer <a.laute...@proxmox.com>



On 2025-05-19 15:09, Maximiliano Sandoval wrote:
It is very hard to give a definitive answer as to whether a host was fenced or
not. In some cases the journal on disk can be missing up to 2 minutes between
its last logged entry and the time at which another node detects that the
corosync link is down. With such a gap, the fenced node would not even record
that it lost connection, and it is not possible to fully determine whether the
node was fenced or not.

This series:
  - adds a second warning 10 seconds before the watchdog expires
  - syncs the journal to disk after the warning was issued
  - syncs the journal to disk after the watchdog expires
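
As a rough illustration of the three points above (not the code from the
patches), the decision logic could be sketched as follows. Only the 10-second
warning lead comes from this cover letter. The 60-second client timeout, the
enum, and both function names are assumptions made up for this sketch, and the
actual journal sync (for instance by invoking `journalctl --sync`) is left as a
comment because how the series implements it is not shown here.

```c
#include <time.h>

/* Assumed for illustration; only the 10-second lead is taken from the text. */
#define CLIENT_TIMEOUT_SECS 60
#define WARNING_LEAD_SECS   10

/* Hypothetical states of a watchdog client. */
enum wd_state { WD_OK, WD_WARN, WD_EXPIRED };

/* Classify a client by the time elapsed since its last update. */
static enum wd_state classify_client(time_t now, time_t last_update)
{
    time_t elapsed = now - last_update;

    if (elapsed >= CLIENT_TIMEOUT_SECS)
        return WD_EXPIRED;
    if (elapsed >= CLIENT_TIMEOUT_SECS - WARNING_LEAD_SECS)
        return WD_WARN;
    return WD_OK;
}

/*
 * Return 1 when the state change should be followed by a journal sync:
 * on first entering the warning window, and on expiry. Staying in the
 * same state or recovering to WD_OK does not request a sync in this sketch.
 */
static int needs_journal_sync(enum wd_state prev, enum wd_state cur)
{
    if (cur == prev)
        return 0;
    /* A real implementation would log the message first, then sync. */
    return cur == WD_WARN || cur == WD_EXPIRED;
}
```

Keying the sync on the state transition rather than on the state itself keeps
the warning and the sync from being repeated on every loop iteration while the
client remains inside the warning window.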

The variable names in the second commit could use some feedback. The warning
timeout was chosen arbitrarily (10 seconds before the fence).

Maximiliano Sandoval (3):
   watchdog: separate if in two parts
   watchdog: warn when about to expire
   watchdog: sync journal after sending expiration related messages

  src/watchdog-mux.c | 40 +++++++++++++++++++++++++++++++++-------
  1 file changed, 33 insertions(+), 7 deletions(-)




_______________________________________________
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel
