Hi Did,

Thanks for your comments.
Yes, I do have redundancy for network and storage connections.
I'm testing a catastrophic scenario: the host on which the SHE runs crashes 
and loses all communication.
I want to understand what to expect from the running VMs and the Engine 
application. 
As you said, all VMs running on the other hosts keep running unaffected.
I will try to collect more information from the logs and understand the 
reference codes and constants you mentioned.
Thanks again for your help.

Marcos Sungaila

-----Original Message-----
From: Yedidyah Bar David <d...@redhat.com> 
Sent: Wednesday, September 21, 2022 2:46 AM
To: Marcos Sungaila <marcos.sunga...@oracle.com>
Cc: users@ovirt.org
Subject: [External] : Re: [ovirt-users] Self-hosted-engine timeout and 
recovering time

On Wed, Sep 21, 2022 at 12:22 AM Marcos Sungaila <marcos.sunga...@oracle.com> 
wrote:
>
> Hi all,
>
> I have a cluster running the 4.4.10 release with 6 KVM hosts and 
> Self-Hosted-Engine.

What storage?

> I'm testing some network outage scenarios, and I faced strange behavior.

I suppose you have redundancy in your network.

It's important to clarify (for yourself, mainly) what exactly you test, what's 
important, what's expected, etc.

> After disconnecting the KVM host hosting the SHE, there was a long timeout 
> before the Self-Hosted-Engine switched to another host, as expected.

I suggest studying the ha-agent logs, /var/log/ovirt-hosted-engine-ha/agent.log.

Much of the relevant code is in ovirt_hosted_engine_ha/agent/states.py
(in the git repo, or under /usr/lib/python3.6/site-packages/ on your machine).
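To narrow down where the time goes, something like the following might help as a starting point. It is only a rough sketch: it filters the agent log by a regular expression and prints the matching lines with their line numbers. The search terms in DEFAULT_PATTERN are my guess, not something taken from the agent code, so adjust them after looking at your own log.

```python
import re
from pathlib import Path

# Log path as mentioned above.
AGENT_LOG = Path("/var/log/ovirt-hosted-engine-ha/agent.log")

# Assumed search terms (a guess, not taken from the agent code).
DEFAULT_PATTERN = r"state|timeout|score"


def matching_lines(lines, pattern=DEFAULT_PATTERN):
    """Yield (line_number, text) for each line matching the regex.

    Matching is case-insensitive; trailing newlines are stripped.
    """
    rx = re.compile(pattern, re.IGNORECASE)
    for number, line in enumerate(lines, start=1):
        if rx.search(line):
            yield number, line.rstrip("\n")


if __name__ == "__main__":
    if AGENT_LOG.exists():
        # errors="replace" avoids a crash on stray non-UTF-8 bytes.
        with AGENT_LOG.open(errors="replace") as handle:
            for number, text in matching_lines(handle):
                print("%d: %s" % (number, text))
    else:
        print("No agent log found at %s" % AGENT_LOG)
```

Comparing the timestamps of consecutive matching lines should show how long the agent spent before moving on.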

> Also, it took a relatively long time to take over the HA VMs from the 
> failing server.

That's a separate issue, about which I personally know very little.
You might want to start a separate thread about it.

I do know, though, that if you keep the storage connected, the host might be 
able to keep updating VM leases on the storage. See e.g.:

https://www.ovirt.org/develop/release-management/features/storage/vm-leases.html

I didn't check the admin guide, but I suppose it has some material about HA VMs.

> Is there a configuration where I can reduce the SHE timeout to make this 
> recovery process faster?

IIRC there is nothing user-configurable.

You can see most relevant constants in
ovirt_hosted_engine_ha/agent/constants.py{,.in}.
Nothing stops you from changing them, but please note that this is somewhat 
risky, and I strongly suggest testing your new settings very carefully. It 
might make sense to methodically go through all the possible state changes 
in the state machine in states.py mentioned above.
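Before changing anything, it may be useful to record the current values so you can compare them later. Below is a minimal sketch that lists the top-level UPPER_CASE assignments of a Python source file without importing it. The installed path in CONSTANTS_PY is an assumption (the site-packages directory mentioned above joined with the module path), and the helper name module_constants is made up for this example.

```python
import ast
from pathlib import Path

# Assumed installed location; verify it on your own host.
CONSTANTS_PY = Path(
    "/usr/lib/python3.6/site-packages/"
    "ovirt_hosted_engine_ha/agent/constants.py"
)


def module_constants(source):
    """Return {NAME: value} for top-level UPPER_CASE assignments.

    Values that are not plain literals (for example expressions that
    refer to other names) are reported as the string "<non-literal>".
    """
    constants = {}
    for node in ast.parse(source).body:
        if not isinstance(node, ast.Assign):
            continue
        for target in node.targets:
            if isinstance(target, ast.Name) and target.id.isupper():
                try:
                    value = ast.literal_eval(node.value)
                except (ValueError, TypeError):
                    value = "<non-literal>"
                constants[target.id] = value
    return constants


if __name__ == "__main__":
    if CONSTANTS_PY.exists():
        found = module_constants(CONSTANTS_PY.read_text())
        for name in sorted(found):
            print("%s = %r" % (name, found[name]))
    else:
        print("Not found: %s" % CONSTANTS_PY)
```

Saving this output before and after an edit gives a simple record of exactly which values were changed, which helps when testing each state change.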

The general assumption is that network and storage, for critical setups, are 
redundant, and that the engine itself is not considered critical, in the sense 
that if it's dead, all your VMs are still alive. And also, that it's more 
important to not corrupt VM disk images (e.g. by starting the VM concurrently 
on two hosts) than to keep the VM alive.

Best regards,
--
Didi

_______________________________________________
Users mailing list -- users@ovirt.org
To unsubscribe send an email to users-le...@ovirt.org
Privacy Statement: https://www.ovirt.org/privacy-policy.html
oVirt Code of Conduct: 
https://www.ovirt.org/community/about/community-guidelines/
List Archives: 
https://lists.ovirt.org/archives/list/users@ovirt.org/message/74I7QITRJOOSRHWZW226VRZG4DUK3LCU/
