> Ivaylo Markov via pve-devel <pve-devel@lists.proxmox.com> hat am 04.02.2025 
> 13:44 CET geschrieben:
> Greetings,
> 
> I was pointed here to discuss the StorPool storage plugin[0] with the 
> dev team.
> If I understand correctly, there is a concern with our HA watchdog 
> daemon, and I'd like to explain the why and how.

Hi!

I am not sure whether there were previous discussions on some other channel; if 
there were, it might be helpful to include pointers to them! Thanks for reaching 
out to our devel list. IMHO it's always best to reach a common understanding 
and hopefully find a solution together, instead of each on our own :)

> As a distributed storage system, StorPool has its own internal 
> clustering mechanisms; it can run
> on networks that are independent from the PVE cluster one, and thus 
> remain unaffected by network
> partitions or other problems that would cause the standard PVE watchdog 
> to reboot a node.
> In the case of HCI (compute + storage) nodes, this reboot can interrupt 
> the normal operation of the
> StorPool cluster, causing reduced performance or downtime, which could 
> be avoided if the host is not restarted.
> This is why we do our best to avoid such behavior across the different 
> cloud management platforms.

This is similar to other storage providers like Ceph, which come with their own 
quorum/clustering/.. mechanisms. In general, co-hosting two such systems will 
not increase overall availability or reliability unless you can make them 
cooperate with each other, which is usually quite hard.

E.g., in the case of Ceph+PVE (which I am obviously much more familiar with 
than your approach/solution):
- PVE clustering uses corosync+pmxcfs+PVE's HA stack, with HA enabled this 
entails fencing, otherwise the cluster mostly goes read-only
- Ceph will use its own monitors to determine quorum, and go read-only or 
inaccessible depending on how much of the cluster is up and how it is configured

Since the quorum mechanisms are mostly independent (which doesn't mean they 
can't go down at the same time for the same or unrelated reasons), you can have 
partial failure scenarios:
- Ceph could go read-only or down while PVE itself is fine, but guests using 
Ceph still experience I/O errors
- PVE could go read-only, but already running guests can still write to the 
Ceph storage
- PVE could fence a node which only hosts OSDs, and the remaining cluster can 
take over with just a short downtime of HA guests which were running on the 
fenced node
- PVE could fence all nodes running Ceph monitors, Ceph goes down hard, but PVE 
itself remains operable with the remaining majority of nodes
- ...
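To make the independence of the two quorum mechanisms concrete, here is a toy 
sketch (node and vote counts are made up for illustration; this is not how 
either corosync or Ceph actually tallies votes) that enumerates the partial 
failure states where exactly one of the two systems is quorate:

```python
# Toy model: two independent majority-quorum systems on the same 5 nodes.
# Any state where one is quorate and the other is not is a partial failure.
from itertools import product

def quorate(up: int, total: int) -> bool:
    """A partition is quorate if a strict majority of voters is up."""
    return up > total // 2

NODES = 5  # hypothetical: each node carries a corosync vote and a Ceph mon

for pve_up, ceph_up in product(range(NODES + 1), repeat=2):
    pve_ok, ceph_ok = quorate(pve_up, NODES), quorate(ceph_up, NODES)
    if pve_ok != ceph_ok:
        print(f"partial failure: PVE quorate={pve_ok}, Ceph quorate={ceph_ok} "
              f"({pve_up}/{NODES} corosync votes, {ceph_up}/{NODES} mons up)")
```

The point is simply that unless the two vote counts are coupled, the 
cross-product of their states contains mixed outcomes that neither system can 
resolve on its own.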

If you want to reduce this interference, HCI is not the way to go; instead, 
separate compute and storage into entirely independent parts of your 
environment (you probably already know this ;) and we both know it can be a 
hard sell, as it's the more expensive approach for small to medium setups).

> Currently, when our daemon detects an unexpected exit of a resource 
> manager, it will SIGKILL PVE
> HA services and running VMs on the node, which should prevent 2 
> instances of the same VM running at
> the same time. PVE services and our block storage client daemon are 
> restarted as well.
> 
> We're open to discussion and suggestions for our approach and 
> implementation.

I just took a very quick peek, so maybe I misunderstood something (please 
correct me if I did!). As far as I can tell, your watchdog implementation 
replaces ours, which means there would be no more fencing when an HA-enabled 
node leaves the quorate partition of the corosync cluster (this seems to be 
the whole point of your watchdog takeover - to avoid fencing)? Even if you 
kill all HA resources/guests and the HA services, this is still dangerous, as 
the other nodes in the cluster will assume the node has fenced itself once the 
grace period is over. This self-fencing property is a hard requirement for our 
HA stack; if it is undesirable for your use case, you'd need to disallow HA in 
the first place (in which case you also don't need to take over the watchdog, 
since it won't be armed). Note that while running guests and tasks are the 
most "high risk" parts, you simply cannot know which other processes on the 
failing node might still be accessing (writing to) state such as VM disks on 
shared storage(s), and thus can cause corruption if the node is not fully 
fenced by the time another node takes over.
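The self-fencing contract can be sketched as follows. This is a simplified toy 
model, not PVE's actual watchdog-mux implementation, and the timeout value is 
purely illustrative: the node only "pets" the watchdog while it is quorate, so 
once the timer expires, the rest of the cluster may safely assume the node has 
reset itself and take over its HA resources.

```python
import time

WATCHDOG_TIMEOUT = 10.0  # illustrative value, not PVE's real configuration

class WatchdogSketch:
    """Toy model of self-fencing: if a node stops petting the watchdog
    (e.g. it lost quorum and can no longer renew its HA lock), the timer
    expires and the hardware would hard-reset the node. Other nodes may
    only recover its HA resources after this timeout has certainly passed."""

    def __init__(self):
        self.last_pet = time.monotonic()

    def pet(self, node_is_quorate: bool) -> None:
        # A non-quorate node must NOT refresh the timer - that is the
        # whole guarantee the rest of the cluster relies on.
        if node_is_quorate:
            self.last_pet = time.monotonic()

    def expired(self) -> bool:
        return time.monotonic() - self.last_pet > WATCHDOG_TIMEOUT
```

Replacing the reset in `expired()` with "kill guests and restart services" 
breaks the contract: the peers cannot observe whether the kills succeeded, yet 
they will still take over after the grace period.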

Could you maybe describe in a bit more detail how your clustering works and 
what your watchdog setup entails? The repo doesn't provide many high-level 
details, and I don't want to read through all the code to try to map it back 
to a rough design (feel free to link to documentation, of course!), since you 
can probably provide that overview much better and more easily.

Fabian


_______________________________________________
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel
