When I stopped the NFS service, I was connected to a VM over ssh. I was also connected to one of the physical hosts over ssh, and was running top.
I observed that server load continued to increase over time on the physical host. Several of the VMs (perhaps all?), including the one I was connected to, went down due to an underlying storage issue. It appears to me that HA VMs were restarted automatically. For example, I see the following in the oVirt Manager Event Log (domain names changed / redacted):

Jun 4, 2021, 4:25:42 AM Highly Available VM server2.example.com failed. It will be restarted automatically.
Jun 4, 2021, 4:25:42 AM Highly Available VM mail.example.com failed. It will be restarted automatically.
Jun 4, 2021, 4:25:42 AM Highly Available VM core1.mgt.example.com failed. It will be restarted automatically.
Jun 4, 2021, 4:25:42 AM VM cha1-shared.example.com has been paused due to unknown storage error.
Jun 4, 2021, 4:25:42 AM VM server.example.org has been paused due to storage I/O problem.
Jun 4, 2021, 4:25:42 AM VM server.example.com has been paused.
Jun 4, 2021, 4:25:42 AM VM server.example.org has been paused.
Jun 4, 2021, 4:25:41 AM VM server.example.org has been paused due to unknown storage error.
Jun 4, 2021, 4:25:41 AM VM HostedEngine has been paused due to storage I/O problem.

During this outage, I also noticed that customer websites were not working. So I clearly took an outage.

> If you have a good way to reproduce the issue please file a bug with
> all the logs, we try to improve this situation.

I don't have a separate lab environment, but if I'm able to reproduce the issue off hours, I may try to do so. What logs would be helpful?

> NFS storage domain will always affect other storage domains, but if you mount
> your NFS storage outside of ovirt, the mount will not affect the system.
>
> Then you can backup to this mount, for example using backup_vm.py:
> https://github.com/oVirt/ovirt-engine-sdk/blob/master/sdk/examples/backup_vm.py

If I'm understanding you correctly, it sounds like you're suggesting that I just connect 1 (or multiple) hosts to the NFS mount manually, and don't use the oVirt manager to build the backup domain. Then just run this script on a cron or something - is that correct?

Sent with ProtonMail Secure Email.

‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
On Friday, June 4, 2021 12:29 PM, Nir Soffer <nsof...@redhat.com> wrote:

> On Fri, Jun 4, 2021 at 12:11 PM David White via Users users@ovirt.org wrote:
>
> > I'm trying to figure out how to keep a "broken" NFS mount point from
> > causing the entire HCI cluster to crash.
> > HCI is working beautifully.
> > Last night, I finished adding some NFS storage to the cluster - this is
> > storage that I don't necessarily need to be HA, and I was hoping to store
> > some backups and less-important VMs on, since my Gluster (SSD) storage
> > availability is pretty limited.
> > But as a test, after I got everything setup, I stopped the nfs-server.
> > This caused the entire cluster to go down, and several VMs - that are not
> > stored on the NFS storage - went belly up.
>
> Please explain in more detail "went belly up".
>
> In general vms not using the nfs storage domain should not be affected, but
> due to unfortunate design of vdsm, all storage domain share the same global
> lock and when one storage domain has trouble, it can cause delays in
> operations on other domains. This may lead to timeouts and vms reported as
> non-responsive, but the actual vms, should not be affected.
>
> If you have a good way to reproduce the issue please file a bug with
> all the logs, we try to improve this situation.
>
> > Once I started the NFS server process again, HCI did what it was supposed
> > to do, and was able to automatically recover.
> > My concern is that NFS is a single point of failure, and if VMs that don't
> > even rely on that storage are affected if the NFS storage goes away, then I
> > don't want anything to do with it.
>
> You need to understand the actual effect on the vms before you reject NFS.
>
> > On the other hand, I'm still struggling to come up with a good way to run
> > on-site backups and snapshots without using up more gluster space on my
> > (more expensive) SSD storage.
>
> NFS is useful for this purpose. You don't need synchronous replication, and
> you want the backups outside of your cluster so in case of disaster you can
> restore the backups on another system.
>
> Snapshots are always on the same storage so it will not help.
>
> > Is there any way to setup NFS storage for a Backup Domain - as well as a
> > Data domain (for lesser important VMs) - such that, if the NFS server
> > crashed, all of my non-NFS stuff would be unaffected?
>
> NFS storage domain will always affect other storage domains, but if you mount
> your NFS storage outside of ovirt, the mount will not affect the system.
>
> Then you can backup to this mount, for example using backup_vm.py:
> https://github.com/oVirt/ovirt-engine-sdk/blob/master/sdk/examples/backup_vm.py
>
> Or one of the backup solutions, all of them are not using a storage domain for
> keeping the backups so the mount should not affect the system.
>
> Nir
>
> Users mailing list -- users@ovirt.org
> To unsubscribe send an email to users-le...@ovirt.org
> Privacy Statement: https://www.ovirt.org/privacy-policy.html
> oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/
> List Archives: https://lists.ovirt.org/archives/list/users@ovirt.org/message/MYQAQTMXRAZT7EYAYCMYXBJYZHSNJT7G/
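
To make the "mount NFS outside of oVirt and run the backup from cron" idea discussed above more concrete, here is a rough Python sketch of a guard that a cron job could run before starting a backup. It is only an illustration under stated assumptions: the function names are hypothetical, the backup command is passed in by the caller rather than being a real backup_vm.py invocation, and it assumes a `stat` binary is available on the host. The point is simply to avoid starting a backup against an NFS mount that is missing or hung.

```python
import os
import subprocess
import sys


def path_responds(path, timeout=10.0):
    """Return True if `stat path` succeeds within `timeout` seconds.

    The check runs in a child process because a direct os.stat() on a hung
    NFS mount could block this script indefinitely.
    """
    try:
        proc = subprocess.Popen(
            ["stat", path],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except OSError:
        # The stat binary is missing or could not be started.
        return False
    try:
        return proc.wait(timeout=timeout) == 0
    except subprocess.TimeoutExpired:
        # Best effort only: a child stuck on a hard NFS mount may not exit,
        # so we deliberately do not wait for it after the kill.
        proc.kill()
        return False


def backup_target_ready(mount_point, timeout=10.0):
    """Return True if `mount_point` responds and really is a mount point."""
    return path_responds(mount_point, timeout) and os.path.ismount(mount_point)


def run_backup(mount_point, command, timeout=10.0):
    """Run `command` (a list of arguments) only if the backup mount is usable.

    Returns the exit status of the command, or 1 if the mount is not ready.
    """
    if not backup_target_ready(mount_point, timeout):
        print(f"backup mount {mount_point} is not ready; skipping",
              file=sys.stderr)
        return 1
    return subprocess.call(command)
```

A cron entry could then call `run_backup()` with the path where the NFS export is mounted on that host and whatever backup command is eventually chosen. Both of those are site-specific and are not taken from this thread.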