muthukrishnang1100 commented on issue #13510:
URL: https://github.com/apache/cloudstack/issues/13510#issuecomment-4841878780

   Hi @kiranchavala 
   Thank you for checking.
   Our environment is different from the original NFS-primary-storage case.
   Environment
   Apache CloudStack: `4.20.3.0`
   Hypervisor: KVM on Ubuntu 24.04
   Management servers: 3-node management cluster
   Database: 3-node MariaDB Galera cluster behind HAProxy
   VM primary storage: Ceph RBD
   Ceph pool for VM disks: `cloudstack-primary`
   Ceph health during the test: `HEALTH_OK`
   QEMU version on all compute hosts:
   ```text
   QEMU emulator version 8.2.2
   (Debian 1:8.2.2+ds-0ubuntu1.16)
   ```
   Important: NFS is not used as VM primary storage in our environment.
   NFS usage
   NFSv4.2 is used only for the KVM Host HA heartbeat path.
   Example heartbeat mount from compute hosts:
   ```text
   /mnt/<heartbeat-mount-id> from 
<management-server-ip>:/var/cloudstack-ha-heartbeat
   Flags: 
rw,nosuid,nodev,noexec,relatime,vers=4.2,rsize=1048576,wsize=1048576,hard,proto=tcp,timeo=600,retrans=2,sec=sys,local_lock=none
   ```
   We also use NFSv4.2 for CloudStack secondary storage. However, all user VM 
root and data disks are on Ceph RBD.
   Before adding the dedicated NFS heartbeat storage, Host HA did not trigger 
correctly with Ceph RBD primary storage. After configuring the NFS heartbeat 
storage, Host HA detection and fencing are working correctly.
   Relevant HA settings
   ```text
   vm.ha.enabled = true
   vm.ha.alerts.enabled = true
   ha.workers = 5
   
   enable.ha.storage.migration = true
   ha.fence.builders.exclude = RecreatableFencer
   ha.investigators.order = KVMInvestigator, PingInvestigator, 
SimpleInvestigator
   
   ha.max.concurrent.activity.check.operations = 25
   ha.max.concurrent.fence.operations = 25
   ha.max.concurrent.health.check.operations = 50
   ha.max.concurrent.recovery.operations = 25
   
   kvm.ha.activity.check.failure.ratio = 0.7
   kvm.ha.activity.check.interval = 60
   kvm.ha.activity.check.max.attempts = 10
   kvm.ha.activity.check.timeout = 60
   kvm.ha.degraded.max.period = 300
   
   kvm.ha.fence.on.storage.heartbeat.failure = true
   kvm.ha.fence.timeout = 60
   kvm.ha.health.check.timeout = 10
   kvm.ha.recover.failure.threshold = 1
   kvm.ha.recover.timeout = 60
   kvm.ha.recover.wait.period = 600
   
   ha.vm.restart.hostup = true
   vm.ha.migration.max.retries = 5
   ```
   Test performed
   We performed repeated ungraceful host-failure tests by hard-powering off a 
KVM compute host through OOBM/IPMI.
   Example failed host:
   ```text
   Host name: ltchl1pacscom04
   CloudStack Host ID: 19
   Host IP: <redacted>
   ```
   At the time of one test, there were 9 user VMs running on this host.
   Expected behavior:
   ```text
   Host Down
   -> fencing succeeds
   -> all VMs restart on another KVM host
   ```
   Actual behavior:
   ```text
   Host Down
   -> KVM fencing succeeds
   -> most VMs restart successfully on other hosts
   -> a few VMs remain permanently in Stopping state
   -> HA work later becomes Done
   -> manual recovery is required
   ```
   For example, during one 9-VM test, 6 VMs restarted successfully and the 
following VMs remained in `Stopping`:
   ```text
   i-22-248-VM
   i-22-257-VM
   i-22-258-VM
   ```
   Management-server log observations
   CloudStack correctly identifies the failed host as Down and KVM fencing 
succeeds:
   ```text
   Fencer KVMFenceBuilder returned true
   ```
   For an affected VM, the HA stop/network cleanup workflow encountered a 
database transaction failure:
   ```text
   Unable to release some network resources for the VM in Stopping state
   
   com.cloud.utils.exception.CloudRuntimeException:
   Unable to commit or close the connection.
   
   Caused by:
   com.mysql.cj.jdbc.exceptions.MySQLTransactionRollbackException:
   Deadlock found when trying to get lock; try restarting transaction
   ```
   For other affected VMs, HA logs:
   ```text
   Encountered unhandled exception during HA process, reschedule work
   
   com.cloud.utils.exception.CloudRuntimeException:
   Unable to find by id on DB, due to:
   Deadlock found when trying to get lock; try restarting transaction
   ```
   The HA work is initially rescheduled. During the next HA attempt, CloudStack 
detects that the VM state has changed from `Running` to `Stopping`:
   ```text
   HA on VM instance {... state="Stopping" ...}
   
   VM ... has been changed.
   Current State = Stopping
   Previous State = Running
   ```
   Then the HA work is completed:
   ```text
   Completed work HAWork[...]
   ```
   However, the VM remains in `Stopping` and is not restarted on another host.
   Database observation
   For the affected VMs, the database shows:
   ```text
   vm_instance.state = Stopping
   vm_instance.host_id = failed host ID
   
   Latest op_ha_work:
   step = Done
   ```
   The MariaDB Galera cluster was healthy after the test:
   ```text
   wsrep_cluster_status      = Primary
   wsrep_cluster_size        = 3
   wsrep_connected           = ON
   wsrep_ready               = ON
   wsrep_local_state_comment = Synced
   ```
   Current counters on one database node were:
   ```text
   Innodb_deadlocks          = 0
   wsrep_local_bf_aborts     = 68
   wsrep_local_cert_failures = 75
   wsrep_local_replays       = 4
   ```
   These Galera counters are cumulative and were not captured before and 
immediately after every HA test, so we cannot confirm that they are directly 
caused by this specific HA event. However, the CloudStack management-server 
logs clearly show `MySQLTransactionRollbackException: Deadlock found when 
trying to get lock` during the affected HA recovery workflow.
   Ceph RBD lock observation
   For VMs that restarted successfully, the RBD exclusive locks moved from the 
failed host to the destination compute hosts.
   For VMs stuck in `Stopping`, their root-volume RBD locks remained on the 
failed source host. Example:
   ```text
   rbd lock ls cloudstack-primary/<rbd-image>
   
   There is 1 exclusive lock on this image.
   Locker          ID                    Address
   client.<id>     auto <id>             
<failed-compute-host-ip>:0/<client-session>
   ```
   Since the source host was hard powered off, this stale RBD lock is expected. 
However, CloudStack does not complete the HA recovery because the VM remains in 
`Stopping` and the HA work becomes `Done`.
   Comparison with issue #13510
   I understand that the original issue uses shared NFSv4.2 primary storage and 
the first restart failure is caused by a QEMU/NFS write-lock error.
   Our initial HA failure appears different:
   ```text
   Issue #13510:
   QEMU cannot acquire NFS write lock
   -> initial restart fails
   -> HA retry detects update counter mismatch
   -> HA work completes without retry
   
   Our environment:
   CloudStack HA stop/network cleanup encounters DB transaction deadlock
   -> VM remains Stopping
   -> HA retry detects changed VM state/update state
   -> HA work completes without successful restart
   ```
   The storage backend is different, but the final behavior appears similar: 
after an initial HA recovery failure, CloudStack stops retrying and completes 
the HA work even though the VM remains unavailable.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to