muthukrishnang1100 commented on issue #13510:
URL: https://github.com/apache/cloudstack/issues/13510#issuecomment-4841878780
Hi @kiranchavala
Thank you for checking.
Our environment is different from the original NFS-primary-storage case.
Environment
Apache CloudStack: `4.20.3.0`
Hypervisor: KVM on Ubuntu 24.04
Management servers: 3-node management cluster
Database: 3-node MariaDB Galera cluster behind HAProxy
VM primary storage: Ceph RBD
Ceph pool for VM disks: `cloudstack-primary`
Ceph health during the test: `HEALTH_OK`
QEMU version on all compute hosts:
```text
QEMU emulator version 8.2.2
(Debian 1:8.2.2+ds-0ubuntu1.16)
```
Important: NFS is not used as VM primary storage in our environment.
NFS usage
NFSv4.2 is used only for the KVM Host HA heartbeat path.
Example heartbeat mount from compute hosts:
```text
/mnt/<heartbeat-mount-id> from
<management-server-ip>:/var/cloudstack-ha-heartbeat
Flags:
rw,nosuid,nodev,noexec,relatime,vers=4.2,rsize=1048576,wsize=1048576,hard,proto=tcp,timeo=600,retrans=2,sec=sys,local_lock=none
```
We also use NFSv4.2 for CloudStack secondary storage. However, all user VM
root and data disks are on Ceph RBD.
Before adding the dedicated NFS heartbeat storage, Host HA did not trigger
correctly with Ceph RBD primary storage. After configuring the NFS heartbeat
storage, Host HA detection and fencing are working correctly.
Relevant HA settings
```text
vm.ha.enabled = true
vm.ha.alerts.enabled = true
ha.workers = 5
enable.ha.storage.migration = true
ha.fence.builders.exclude = RecreatableFencer
ha.investigators.order = KVMInvestigator, PingInvestigator,
SimpleInvestigator
ha.max.concurrent.activity.check.operations = 25
ha.max.concurrent.fence.operations = 25
ha.max.concurrent.health.check.operations = 50
ha.max.concurrent.recovery.operations = 25
kvm.ha.activity.check.failure.ratio = 0.7
kvm.ha.activity.check.interval = 60
kvm.ha.activity.check.max.attempts = 10
kvm.ha.activity.check.timeout = 60
kvm.ha.degraded.max.period = 300
kvm.ha.fence.on.storage.heartbeat.failure = true
kvm.ha.fence.timeout = 60
kvm.ha.health.check.timeout = 10
kvm.ha.recover.failure.threshold = 1
kvm.ha.recover.timeout = 60
kvm.ha.recover.wait.period = 600
ha.vm.restart.hostup = true
vm.ha.migration.max.retries = 5
```
Test performed
We performed repeated ungraceful host-failure tests by hard-powering off a
KVM compute host through OOBM/IPMI.
Example failed host:
```text
Host name: ltchl1pacscom04
CloudStack Host ID: 19
Host IP: <redacted>
```
At the time of one test, there were 9 user VMs running on this host.
Expected behavior:
```text
Host Down
-> fencing succeeds
-> all VMs restart on another KVM host
```
Actual behavior:
```text
Host Down
-> KVM fencing succeeds
-> most VMs restart successfully on other hosts
-> a few VMs remain permanently in Stopping state
-> HA work later becomes Done
-> manual recovery is required
```
For example, during one 9-VM test, 6 VMs restarted successfully and the
following VMs remained in `Stopping`:
```text
i-22-248-VM
i-22-257-VM
i-22-258-VM
```
Management-server log observations
CloudStack correctly identifies the failed host as Down and KVM fencing
succeeds:
```text
Fencer KVMFenceBuilder returned true
```
For an affected VM, the HA stop/network cleanup workflow encountered a
database transaction failure:
```text
Unable to release some network resources for the VM in Stopping state
com.cloud.utils.exception.CloudRuntimeException:
Unable to commit or close the connection.
Caused by:
com.mysql.cj.jdbc.exceptions.MySQLTransactionRollbackException:
Deadlock found when trying to get lock; try restarting transaction
```
For other affected VMs, HA logs:
```text
Encountered unhandled exception during HA process, reschedule work
com.cloud.utils.exception.CloudRuntimeException:
Unable to find by id on DB, due to:
Deadlock found when trying to get lock; try restarting transaction
```
The HA work is initially rescheduled. During the next HA attempt, CloudStack
detects that the VM state has changed from `Running` to `Stopping`:
```text
HA on VM instance {... state="Stopping" ...}
VM ... has been changed.
Current State = Stopping
Previous State = Running
```
Then the HA work is completed:
```text
Completed work HAWork[...]
```
However, the VM remains in `Stopping` and is not restarted on another host.
Database observation
For the affected VMs, the database shows:
```text
vm_instance.state = Stopping
vm_instance.host_id = failed host ID
Latest op_ha_work:
step = Done
```
The MariaDB Galera cluster was healthy after the test:
```text
wsrep_cluster_status = Primary
wsrep_cluster_size = 3
wsrep_connected = ON
wsrep_ready = ON
wsrep_local_state_comment = Synced
```
Current counters on one database node were:
```text
Innodb_deadlocks = 0
wsrep_local_bf_aborts = 68
wsrep_local_cert_failures = 75
wsrep_local_replays = 4
```
These Galera counters are cumulative and were not captured before and
immediately after every HA test, so we cannot confirm that they are directly
caused by this specific HA event. However, the CloudStack management-server
logs clearly show `MySQLTransactionRollbackException: Deadlock found when
trying to get lock` during the affected HA recovery workflow.
Ceph RBD lock observation
For VMs that restarted successfully, the RBD exclusive locks moved from the
failed host to the destination compute hosts.
For VMs stuck in `Stopping`, their root-volume RBD locks remained on the
failed source host. Example:
```text
rbd lock ls cloudstack-primary/<rbd-image>
There is 1 exclusive lock on this image.
Locker ID Address
client.<id> auto <id>
<failed-compute-host-ip>:0/<client-session>
```
Since the source host was hard powered off, this stale RBD lock is expected.
However, CloudStack does not complete the HA recovery because the VM remains in
`Stopping` and the HA work becomes `Done`.
Comparison with issue #13510
I understand that the original issue uses shared NFSv4.2 primary storage and
the first restart failure is caused by a QEMU/NFS write-lock error.
Our initial HA failure appears different:
```text
Issue #13510:
QEMU cannot acquire NFS write lock
-> initial restart fails
-> HA retry detects update counter mismatch
-> HA work completes without retry
Our environment:
CloudStack HA stop/network cleanup encounters DB transaction deadlock
-> VM remains Stopping
-> HA retry detects changed VM state/update state
-> HA work completes without successful restart
```
The storage backend is different, but the final behavior appears similar:
after an initial HA recovery failure, CloudStack stops retrying and completes
the HA work even though the VM remains unavailable.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]