When I stopped the NFS service, I was connected to a VM over SSH.
I was also connected to one of the physical hosts over SSH, where I was running top.

I observed that server load continued to increase over time on the physical 
host.
Several of the VMs (perhaps all?), including the one I was connected to, went 
down due to an underlying storage issue.
It appears to me that HA VMs were restarted automatically. For example, I see 
the following in the oVirt Manager Event Log (domain names changed / redacted):


Jun 4, 2021, 4:25:42 AM
Highly Available VM server2.example.com failed. It will be restarted 
automatically.

Jun 4, 2021, 4:25:42 AM
Highly Available VM mail.example.com failed. It will be restarted automatically.

Jun 4, 2021, 4:25:42 AM
Highly Available VM core1.mgt.example.com failed. It will be restarted 
automatically.

Jun 4, 2021, 4:25:42 AM
VM cha1-shared.example.com has been paused due to unknown storage error.

Jun 4, 2021, 4:25:42 AM
VM server.example.org has been paused due to storage I/O problem.

Jun 4, 2021, 4:25:42 AM
VM server.example.com has been paused.

Jun 4, 2021, 4:25:42 AM
VM server.example.org has been paused.

Jun 4, 2021, 4:25:41 AM
VM server.example.org has been paused due to unknown storage error.

Jun 4, 2021, 4:25:41 AM
VM HostedEngine has been paused due to storage I/O problem.


During this window, I also noticed that customer websites were not working,
so I clearly took an outage.

> If you have a good way to reproduce the issue, please file a bug with
> all the logs, and we will try to improve this situation.

I don't have a separate lab environment, but if I'm able to reproduce the issue
off-hours, I may try to file a bug.
What logs would be helpful? 


> An NFS storage domain will always affect other storage domains, but if you mount
> your NFS storage outside of oVirt, the mount will not affect the system.
> 

> Then you can back up to this mount, for example using backup_vm.py:
> https://github.com/oVirt/ovirt-engine-sdk/blob/master/sdk/examples/backup_vm.py

If I'm understanding you correctly, you're suggesting that I mount the NFS
export manually on one (or more) hosts, without using the oVirt Manager to
create a backup storage domain, and then run this script from a cron job -
is that correct?
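
For reference, here's roughly what I'd wire up on my end - just a rough sketch,
assuming the NFS export is already mounted outside of oVirt (e.g. via /etc/fstab)
and assuming backup_vm.py accepts a config name via -c, a "full" subcommand, and
a --backup-dir option (I'd verify all of that against the script's --help first).
The paths, VM IDs, and config name below are placeholders:

#!/usr/bin/env python3
# Hypothetical cron-driven wrapper around the SDK example backup_vm.py.
# Assumptions (placeholders, not verified): the NFS export is mounted at
# BACKUP_MOUNT outside of oVirt, and backup_vm.py accepts
# "-c <config> full <vm-id> --backup-dir <dir>".

import os
import subprocess
import sys

BACKUP_VM = "/usr/local/src/ovirt-engine-sdk/sdk/examples/backup_vm.py"  # placeholder path
ENGINE_CONFIG = "engine"            # placeholder section name in ovirt.conf
BACKUP_MOUNT = "/mnt/nfs-backups"   # NFS mount managed outside of oVirt (placeholder)
VM_IDS = [
    "00000000-0000-0000-0000-000000000001",  # placeholder VM UUIDs
    "00000000-0000-0000-0000-000000000002",
]

def main():
    # Skip the run entirely if the NFS export is not mounted, so a dead
    # NFS server makes the job fail fast instead of hanging on I/O.
    if not os.path.ismount(BACKUP_MOUNT):
        sys.exit(f"{BACKUP_MOUNT} is not mounted; skipping backups")

    failures = 0
    for vm_id in VM_IDS:
        cmd = [
            sys.executable, BACKUP_VM,
            "-c", ENGINE_CONFIG,
            "full", vm_id,
            "--backup-dir", BACKUP_MOUNT,
        ]
        print("Running:", " ".join(cmd))
        if subprocess.run(cmd).returncode != 0:
            failures += 1

    sys.exit(1 if failures else 0)

if __name__ == "__main__":
    main()

A daily crontab entry on one of the hosts (or any machine that can reach both
the engine API and the NFS server) would then run this wrapper, and the backups
would never touch the Gluster storage.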


‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
On Friday, June 4, 2021 12:29 PM, Nir Soffer <nsof...@redhat.com> wrote:

> On Fri, Jun 4, 2021 at 12:11 PM David White via Users users@ovirt.org wrote:
> 

> > I'm trying to figure out how to keep a "broken" NFS mount point from
> > causing the entire HCI cluster to crash.
> > HCI is working beautifully.
> > Last night, I finished adding some NFS storage to the cluster - this is
> > storage that I don't necessarily need to be HA, and that I was hoping to
> > store some backups and less-important VMs on, since my Gluster (sssd)
> > storage availability is pretty limited.
> > But as a test, after I got everything set up, I stopped the nfs-server.
> > This caused the entire cluster to go down, and several VMs - that are not
> > stored on the NFS storage - went belly up.
> 

> Please explain in more detail "went belly up".
> 

> In general, VMs not using the NFS storage domain should not be affected, but
> due to an unfortunate design of vdsm, all storage domains share the same
> global lock, and when one storage domain has trouble, it can cause delays in
> operations on other domains. This may lead to timeouts and VMs reported as
> non-responsive, but the actual VMs should not be affected.
> 

> If you have a good way to reproduce the issue, please file a bug with
> all the logs, and we will try to improve this situation.
> 

> > Once I started the NFS server process again, HCI did what it was supposed 
> > to do, and was able to automatically recover.
> > My concern is that NFS is a single point of failure, and if VMs that don't 
> > even rely on that storage are affected if the NFS storage goes away, then I 
> > don't want anything to do with it.
> 

> You need to understand the actual effect on the vms before you reject NFS.
> 

> > On the other hand, I'm still struggling to come up with a good way to run 
> > on-site backups and snapshots without using up more gluster space on my 
> > (more expensive) sssd storage.
> 

> NFS is useful for this purpose. You don't need synchronous replication, and
> you want the backups outside of your cluster so in case of disaster you can
> restore the backups on another system.
> 

> Snapshots are always on the same storage, so they will not help.
> 

> > Is there any way to setup NFS storage for a Backup Domain - as well as a 
> > Data domain (for lesser important VMs) - such that, if the NFS server 
> > crashed, all of my non-NFS stuff would be unaffected?
> 

> An NFS storage domain will always affect other storage domains, but if you mount
> your NFS storage outside of oVirt, the mount will not affect the system.
> 

> Then you can back up to this mount, for example using backup_vm.py:
> https://github.com/oVirt/ovirt-engine-sdk/blob/master/sdk/examples/backup_vm.py
> 

> Or use one of the backup solutions; none of them uses a storage domain for
> keeping the backups, so the mount should not affect the system.
> 

> Nir
> 

