--- Begin Message ---

On 04/02/2025 16:46, Fabian Grünbichler wrote:
> Ivaylo Markov via pve-devel <pve-devel@lists.proxmox.com> wrote on 04.02.2025
> 13:44 CET:
>> Greetings,
>>
>> I was pointed here to discuss the StorPool storage plugin[0] with the dev
>> team. If I understand correctly, there is a concern with our HA watchdog
>> daemon, and I'd like to explain the why and how.
> Hi!
>
> I am not sure whether there were previous discussions on some other channel;
> it might be helpful to include pointers to them if there are! Thanks for
> reaching out to our devel list. IMHO it's always best to get to a common
> understanding and hopefully a solution together, instead of on our own :)
Apologies for the confusion. There was a conversation at the management level between our companies about StorPool becoming a solution provider partner, and my understanding was that the PVE team had some concerns regarding the HA changes in our storage plugin.

The storage plugin is functional by itself and is used by some customers with the stock PVE watchdog in non-HCI scenarios.

The replacement watchdog is meant to be used only in Proxmox+StorPool HCI deployments, where no other storage is used. All of these deployments are under continuous monitoring by our team, so we can put guard rails in place to avoid unsupported configurations, and we take responsibility for debugging HA issues in them. The watchdog is developed and tested by us and is already in production use at a couple of customers.

It makes sense to move the HCI-specific watchdog functionality into a separate repository so that the storage plugin repository stays cleaner. We will do so shortly.


>> As a distributed storage system, StorPool has its own internal clustering
>> mechanisms; it can run on networks that are independent from the PVE cluster
>> network, and thus remain unaffected by network partitions or other problems
>> that would cause the standard PVE watchdog to reboot a node. In the case of
>> HCI (compute + storage) nodes, this reboot can interrupt the normal operation
>> of the StorPool cluster, causing reduced performance or downtime, which could
>> be avoided if the host is not restarted. This is why we do our best to avoid
>> such behavior across the different cloud management platforms.
> This is similar to other storage providers like Ceph, which come with their
> own quorum/clustering/.. mechanism. In general, co-hosting two different
> systems like that will not increase overall availability or reliability,
> unless you can make them cooperate with each other, which is usually quite
> tricky/hard.
>
> E.g., in the case of Ceph+PVE (which I am obviously much more familiar with
> than your approach/solution):
> - PVE clustering uses corosync+pmxcfs+PVE's HA stack, with HA enabled this
>   entails fencing, otherwise the cluster mostly goes read-only
> - Ceph will use its own monitors to determine quorum, and go read-only or
>   inaccessible depending on how much of the cluster is up and how it is
>   configured
>
> Since the quorum mechanisms are mostly independent (which doesn't mean they
> can't go down at the same time for the same or unrelated reasons), you can
> have partial failure scenarios:
> - Ceph could go read-only or down, while PVE itself is fine, but guests using
>   Ceph are still experiencing I/O errors
> - PVE could go read-only, but already running guests can still write to the
>   Ceph storage
> - PVE could fence a node which only hosts OSDs, and the remaining cluster can
>   take over with just a short downtime of HA guests which were running on the
>   fenced node
> - PVE could fence all nodes running Ceph monitors, Ceph goes down hard, but
>   PVE itself remains operable with the remaining majority of nodes
> - ...

If you want to reduce this interference, then HCI is not the way to go, but 
separating compute and storage into entirely independent parts of you 
environment (you probably already know this ;) and we both know this can be a 
hard sell as it's the more expensive approach for small to medium setups).

I agree that non-HCI setups are simpler (and simple can often be better), but HCI also has advantages and is in demand from customers. We run a couple of KVM HCI clouds for our own production workloads and test/dev/lab use cases, so we understand why customers choose HCI.



>> Currently, when our daemon detects an unexpected exit of a resource manager,
>> it will SIGKILL PVE HA services and running VMs on the node, which should
>> prevent 2 instances of the same VM running at the same time. PVE services
>> and our block storage client daemon are restarted as well.
>>
>> We're open to discussion and suggestions for our approach and
>> implementation.
> I just took a very quick peek, and maybe I understood something wrong (please
> correct me if I did!). As far as I can tell, your watchdog implementation
> replaces ours, which means that there would be no more fencing in case a
> HA-enabled node leaves the quorate partition of the corosync cluster (this
> seems to be the whole point of your watchdog takeover - to avoid fencing)?
> Even if you kill all HA resources/guests and the HA services, this is still
> dangerous, as the other nodes in the cluster will assume that the node has
> fenced itself after the grace period is over. This self-fencing property is a
> hard requirement for our HA stack; if that is undesirable for your use case,
> you'd need to not allow HA in the first place (in which case, you also don't
> need to take over the watchdog, since it won't be armed). Note that while
> running guests and tasks are the most "high risk" parts, you simply cannot
> know what other processes/.. on the failing node are potentially accessing
> (writing to) state (such as VM disks) on shared storage(s) and thus can cause
> corruption if the node is not fully fenced by the time another node takes
> over.
>
> Could you maybe describe a bit more how your clustering works, and what your
> watchdog setup entails? The repo didn't provide much high-level detail and I
> don't want to read through all the code to try to map that back to a rough
> design (feel free to link to documentation of course!), since you can
> probably provide that overview much better and more easily.
>
> Fabian
The goal of our StorPool+Proxmox HCI efforts has been to enable HCI deployments without decreasing the availability of either the StorPool or the Proxmox cluster. This is achieved by making sure that Proxmox's clustering cannot restart nodes, and that VMs and other Proxmox services are killed when Proxmox wants to fence a node. The StorPool cluster doesn't need or use node fencing (how it avoids it is a topic for a separate, longer conversation), so it does not affect the Proxmox cluster directly.

In HCI scenarios with StorPool, which are supported only when StorPool is the only shared storage configured, we replace the standard PVE watchdog with our own implementation.

When a node needs to be fenced, our watchdog replacement performs the following actions (a rough sketch follows):
 - SIGKILLs all guests
 - force-detaches the StorPool volumes and ensures our client block device cannot submit new I/O; a "force detach" in StorPool guarantees that no further I/O can be submitted by the client, even if it was only temporarily disconnected
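Concretely, in Python and with hypothetical names (this is not the actual daemon code; the guest discovery is simplified, and force_detach_local_volumes() is a placeholder for the StorPool API call):

    import os
    import signal
    import subprocess


    def running_guest_pids():
        """PIDs of running QEMU guests; PVE's qemu-server starts them as 'kvm ... -id <vmid>'."""
        result = subprocess.run(["pgrep", "-a", "kvm"], capture_output=True, text=True)
        pids = []
        for line in result.stdout.splitlines():
            pid, _, cmdline = line.partition(" ")
            if " -id " in cmdline:
                pids.append(int(pid))
        return pids


    def force_detach_local_volumes():
        """Placeholder: ask the StorPool cluster to force-detach every volume
        attached to this host, so the local block client can no longer submit I/O."""
        raise NotImplementedError("wraps the StorPool management API")


    def fence_local_node():
        # 1. SIGKILL all guests so no further writes are issued from this node.
        for pid in running_guest_pids():
            try:
                os.kill(pid, signal.SIGKILL)
            except ProcessLookupError:
                pass  # guest already exited
        # 2. Force-detach the StorPool volumes; after this the client cannot
        #    submit new I/O even if it only lost connectivity temporarily.
        force_detach_local_volumes()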

Additionally, when a VM is started, the storage plugin first force-detaches its volumes from all hosts other than the one it is about to be started on. With these precautions in place, there should be sufficient protection against parallel writes from multiple nodes. Writes to pmxcfs are handled by PVE’s clustering components, and we don’t expect any problems there.
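A short sketch of that start-time precaution, where the storpool_* helpers are hypothetical stand-ins for the StorPool API rather than the plugin's real functions:

    def storpool_attached_hosts(volume: str) -> list[str]:
        """Placeholder: query the StorPool API for the hosts a volume is attached to."""
        raise NotImplementedError


    def storpool_force_detach(volume: str, host: str) -> None:
        """Placeholder: force-detach the volume from the given host."""
        raise NotImplementedError


    def storpool_attach(volume: str, host: str) -> None:
        """Placeholder: attach the volume to the given host."""
        raise NotImplementedError


    def prepare_volume_for_start(volume: str, target_host: str) -> None:
        """Detach a VM volume everywhere except the host it is about to start on."""
        for host in storpool_attached_hosts(volume):
            if host != target_host:
                # After a force-detach the remote client cannot submit further
                # I/O to this volume, even if it was only temporarily disconnected.
                storpool_force_detach(volume, host)
        storpool_attach(volume, target_host)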

We will also make sure, by monitoring the Proxmox storage configuration, that no other storages are configured.
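As an illustration, such a check could parse /etc/pve/storage.cfg and flag anything other than the expected storage types; the "storpool" type name below is an assumption, not necessarily what the plugin registers.

    ALLOWED_TYPES = {"storpool", "dir"}  # "dir" covers the default node-local storage


    def storage_sections(path="/etc/pve/storage.cfg"):
        """Yield (type, storage_id) pairs; sections start at column 0 as
        '<type>: <id>', while their properties are indented below them."""
        with open(path) as cfg:
            for line in cfg:
                if not line.strip() or line[0].isspace() or ":" not in line:
                    continue
                stype, _, sid = line.partition(":")
                yield stype.strip(), sid.strip()


    def unexpected_storages(path="/etc/pve/storage.cfg"):
        return [(t, s) for t, s in storage_sections(path) if t not in ALLOWED_TYPES]


    if __name__ == "__main__":
        for stype, sid in unexpected_storages():
            print(f"unexpected storage '{sid}' of type '{stype}' configured")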

What we've done so far seems sufficient to achieve these goals - it effectively removes the possibility of the Proxmox cluster rebooting a storage node, while still fencing VMs and other services. As with any piece of software, there are things that could make it even better. A few examples we have not committed to yet:
 - support for containers, not just VMs
 - automatic recovery so it has UX similar to the default watchdog

Please let us know your thoughts and any further concerns; we'd like to address them, as Proxmox HCI support is important to us.

Thank you,
Ivaylo


--
Ivaylo Markov
Quality & Automation Engineer
StorPool Storage
https://www.storpool.com


--- End Message ---
_______________________________________________
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel
