Hi Li,

You might want to take a look at the following PR from Wido: https://github.com/apache/cloudstack/pull/2985 (this is 4.12 only).
Also, you are using Mimic, a non-LTS release of Ceph, and I have a hard time believing that anyone is running it in production with CloudStack (it's a fairly recent Ceph release). Test with ACS 4.12 and see if your problem goes away.

@Wido den Hollander <w...@42on.com>, any thoughts?

Regards,
Andrija

On Tue, 28 May 2019 at 06:24, li jerry <div...@hotmail.com> wrote:

> Hello guys,
>
> We've deployed an environment with CloudStack 4.11.2 and KVM (CentOS 7.6),
> with Ceph 13.2.5 deployed as the primary storage.
> We found some issues with the HA solution, and we'd like to ask for your
> suggestions.
>
> We've enabled both the VM HA and Host HA features in CloudStack, and the
> compute offering is tagged as ha.
> When we perform a power failure test (unplugging 1 node out of 4), the
> VMs running on the removed node are automatically rescheduled to the other
> live nodes after 5 minutes, but none of them can boot into the OS. We
> found that the boot procedure gets stuck on IO read/write failures.
>
> The following message is shown after the VM starts:
>
> Generating "/run/initramfs/rdsosreport.txt"
>
> Entering emergency mode. Exit the shell to continue.
> Type "journalctl" to view system logs.
> You might want to save "/run/initramfs/rdsosreport.txt" to a USB stick or /boot
> after mounting them and attach it to a bug report
>
> :/#
>
> We found this is caused by the lock on the image:
> [root@cn01-nodea ~]# rbd lock list a93010b0-2be2-49bd-b25e-ec89b3a98b4b
> There is 1 exclusive lock on this image.
> Locker ID Address
> client.1164351 auto 94464726847232 10.226.16.128:0/3002249644
>
> If we remove the lock from the image and restart the VM under CloudStack,
> the VM boots successfully.
>
> We know that disabling the Exclusive Lock feature in Ceph (by setting
> rbd_default_features = 3) would solve this problem. But we don't
> think that's the best solution for HA, so could you please give us some
> ideas about how you handle this and what the best practice is for this
> feature?
>
> Thanks.

--
Andrija Panić
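For reference, a minimal Python sketch of the two mechanisms mentioned in the quoted report, under stated assumptions. `parse_lock_list` assumes the `rbd lock list` output layout shown above (a "Locker ID Address" header, then one row per lock, where the lock ID itself may contain a space, e.g. "auto 94464726847232"), and `lock_remove_command` only builds an `rbd lock remove <image-spec> <lock-id> <locker>` command line; it does not run anything. `decode_features` shows why `rbd_default_features = 3` avoids the problem: 3 is layering (1) + striping (2), so the exclusive-lock bit (4) is not set. All function names here are hypothetical helpers, not part of CloudStack or Ceph. Removing a lock is only safe once the original holder (here the dead host) is known to be gone; otherwise two clients may write to the same image.

```python
from __future__ import annotations

import shlex

# RBD image feature bits as used by rbd_default_features (bitmask values).
RBD_FEATURES = {
    1: "layering",
    2: "striping",
    4: "exclusive-lock",
    8: "object-map",
    16: "fast-diff",
    32: "deep-flatten",
    64: "journaling",
}


def decode_features(mask: int) -> list[str]:
    """Return the names of the RBD features enabled in a feature bitmask."""
    return [name for bit, name in RBD_FEATURES.items() if mask & bit]


def parse_lock_list(output: str) -> list[dict]:
    """Parse the table part of `rbd lock list <image>` output (assumed layout).

    Rows after the "Locker ..." header are split as
    <locker> <lock id, possibly with spaces> <address>.
    """
    locks = []
    in_table = False
    for line in output.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "Locker":  # header row starts the table
            in_table = True
            continue
        if in_table and len(fields) >= 3:
            locks.append({
                "locker": fields[0],
                "id": " ".join(fields[1:-1]),
                "address": fields[-1],
            })
    return locks


def lock_remove_command(image: str, lock: dict) -> str:
    """Build (but do not execute) an `rbd lock remove` command line."""
    return shlex.join(["rbd", "lock", "remove", image, lock["id"], lock["locker"]])


if __name__ == "__main__":
    sample = (
        "There is 1 exclusive lock on this image.\n"
        "Locker ID Address\n"
        "client.1164351 auto 94464726847232 10.226.16.128:0/3002249644\n"
    )
    image = "a93010b0-2be2-49bd-b25e-ec89b3a98b4b"
    for lock in parse_lock_list(sample):
        print(lock_remove_command(image, lock))
    # rbd lock remove a93010b0-2be2-49bd-b25e-ec89b3a98b4b 'auto 94464726847232' client.1164351
    print(decode_features(3))   # ['layering', 'striping']
    print(decode_features(61))  # ['layering', 'exclusive-lock', 'object-map', 'fast-diff', 'deep-flatten']
```

For comparison, 61 is the feature set commonly enabled by default in recent Ceph releases; it includes exclusive-lock, which is what leaves the stale lock behind when a host dies.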