[ovirt-users] Re: Ovirt 4.3.1 problem with HA agent

Strahil Nikolov Mon, 18 Mar 2019 04:57:32 -0700

Hi Alexei,

In order to debug it, check the following:

1. Check gluster:
1.1. Are all bricks up?
1.2. Are all bricks healed (gluster volume heal data info summary), with no split-brain?

2. Go to the problematic host and check that the mount point is there.
2.1. Check the permissions (should be vdsm:kvm) and fix them with chown -R if needed.
2.2. Check that the OVF_STORE mentioned in the logs exists.
2.3. Check that vdsm can extract the file:
sudo -u vdsm tar -tvf /rhev/data-center/mnt/glusterSD/msk-gluster-facility.xxxx:_data/DOMAIN-UUID/Volume-UUID/Image-ID

3. Configure a virsh alias, as it's quite helpful:
alias virsh='virsh -c qemu:///system?authfile=/etc/ovirt-hosted-engine/virsh_auth.conf'

4. If the VM is running, go to the host and get the XML:
virsh dumpxml HostedEngine > /root/HostedEngine.xml
4.1. Get the network:
virsh net-dumpxml vdsm-ovirtmgmt > /root/vdsm-ovirtmgmt.xml
4.2. If it is not running, here is mine:
[root@ovirt1 ~]# virsh net-dumpxml vdsm-ovirtmgmt
<network>
  <name>vdsm-ovirtmgmt</name>
  <uuid>7ae538ce-d419-4dae-93b8-3a4d27700227</uuid>
  <forward mode='bridge'/>
  <bridge name='ovirtmgmt'/>
</network>


The UUID is not important, as my first recovery was with a different one.

5. If your Hosted Engine is down:
5.1. Remove the VM (if it exists anywhere) on all nodes:
virsh undefine HostedEngine
5.2. Verify that the nodes are in global maintenance:
hosted-engine --vm-status
5.3. Define the Engine on only 1 machine:
virsh define HostedEngine.xml
virsh net-define vdsm-ovirtmgmt.xml
virsh start HostedEngine

Note: if it complains about the storage, there is no link in /var/run/vdsm/storage/DOMAIN-UUID/Volume-UUID to your Volume-UUID.
Here is how mine looks:
[root@ovirt1 808423f9-8a5c-40cd-bc9f-2568c85b8c74]# ll /var/run/vdsm/storage/808423f9-8a5c-40cd-bc9f-2568c85b8c74
total 24
lrwxrwxrwx. 1 vdsm kvm 139 Mar 17 07:42 2c74697a-8bd9-4472-8a98-bf624f3462d5 -> /rhev/data-center/mnt/glusterSD/ovirt1.localdomain:_engine/808423f9-8a5c-40cd-bc9f-2568c85b8c74/images/2c74697a-8bd9-4472-8a98-bf624f3462d5
lrwxrwxrwx. 1 vdsm kvm 139 Mar 17 07:45 3ec27d6d-921c-4348-b799-f50543b6f919 -> /rhev/data-center/mnt/glusterSD/ovirt1.localdomain:_engine/808423f9-8a5c-40cd-bc9f-2568c85b8c74/images/3ec27d6d-921c-4348-b799-f50543b6f919
lrwxrwxrwx. 1 vdsm kvm 139 Mar 17 08:28 441abdc8-6cb1-49a4-903f-a1ec0ed88429 -> /rhev/data-center/mnt/glusterSD/ovirt1.localdomain:_engine/808423f9-8a5c-40cd-bc9f-2568c85b8c74/images/441abdc8-6cb1-49a4-903f-a1ec0ed88429
lrwxrwxrwx. 1 vdsm kvm 139 Mar 17 21:15 8ec7a465-151e-4ac3-92a7-965ecf854501 -> /rhev/data-center/mnt/glusterSD/ovirt1.localdomain:_engine/808423f9-8a5c-40cd-bc9f-2568c85b8c74/images/8ec7a465-151e-4ac3-92a7-965ecf854501
lrwxrwxrwx. 1 vdsm kvm 139 Mar 17 08:28 94ade632-6ecc-4901-8cec-8e39f3d69cb0 -> /rhev/data-center/mnt/glusterSD/ovirt1.localdomain:_engine/808423f9-8a5c-40cd-bc9f-2568c85b8c74/images/94ade632-6ecc-4901-8cec-8e39f3d69cb0
lrwxrwxrwx. 1 vdsm kvm 139 Mar 17 07:42 fe62a281-51e9-4b23-87b3-2deb52357304 -> /rhev/data-center/mnt/glusterSD/ovirt1.localdomain:_engine/808423f9-8a5c-40cd-bc9f-2568c85b8c74/images/fe62a281-51e9-4b23-87b3-2deb52357304


Once you have created your link, start the VM again.

6. Wait until the OVF_STORE is fixed (it takes longer than the interval set in the engine :) )
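
The link check from the note under step 5 could be scripted roughly as follows. This is only a sketch: ensure_image_link is a made-up helper name (it is not part of vdsm), and it assumes the layout shown in the listing above, where each link is named after the image directory it points to. Check both paths by hand before running it as root on a real host.

```shell
#!/bin/bash
# Sketch: recreate a missing image link under /var/run/vdsm/storage.
# ensure_image_link is a hypothetical helper, not a vdsm tool.
#   $1 = run directory, e.g. /var/run/vdsm/storage/DOMAIN-UUID
#   $2 = image directory on the gluster mount (.../DOMAIN-UUID/images/IMAGE-UUID)
ensure_image_link() {
    local run_dir=$1 image_dir=$2
    local link
    link="$run_dir/$(basename "$image_dir")"

    # Refuse to link to a target that is not there.
    if [ ! -d "$image_dir" ]; then
        echo "no such image dir: $image_dir" >&2
        return 1
    fi

    # Nothing to do if a link already resolves to the same directory.
    if [ -L "$link" ] && [ "$(readlink -f "$link")" = "$(readlink -f "$image_dir")" ]; then
        echo "ok: $link"
        return 0
    fi

    mkdir -p "$run_dir"
    ln -sfn "$image_dir" "$link"
    # The links in the listing are owned by vdsm:kvm; ignore the error
    # on machines where that user or group does not exist.
    chown -h vdsm:kvm "$link" 2>/dev/null || true
    echo "created: $link"
}
```

It would be called with the two directories, for example ensure_image_link /var/run/vdsm/storage/DOMAIN-UUID followed by the matching images/IMAGE-UUID path on the gluster mount, with the placeholders replaced by the real UUIDs.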
Good Luck!

Best Regards,
Strahil Nikolov


    On Monday, 18 March 2019, 12:57:30 GMT+2, Николаев Алексей <[email protected]> wrote:
 
Hi all! I have a very similar problem after updating one of the two nodes to version 4.3.1. This node, node77-02, lost its connection to the gluster volume named DATA, but not to the volume with the hosted engine.

node77-02 /var/log/messages:

Mar 18 13:40:00 node77-02 journal: ovirt-ha-agent ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine.config.vm ERROR Failed scanning for OVF_STORE due to Command Volume.getInfo with args {'storagepoolID': '00000000-0000-0000-0000-000000000000', 'storagedomainID': '2ee71105-1810-46eb-9388-cc6caccf9fac', 'volumeID': u'224e4b80-2744-4d7f-bd9f-43eb8fe6cf11', 'imageID': u'43b75b50-cad4-411f-8f51-2e99e52f4c77'} failed:#012(code=201, message=Volume does not exist: (u'224e4b80-2744-4d7f-bd9f-43eb8fe6cf11',))
Mar 18 13:40:00 node77-02 journal: ovirt-ha-agent ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine.config.vm ERROR Unable to identify the OVF_STORE volume, falling back to initial vm.conf. Please ensure you already added your first data domain for regular VMs

The HostedEngine VM works fine on all nodes. But node77-02 failed with this error in the web UI:

ConnectStoragePoolVDS failed: Cannot find master domain: u'spUUID=5a5cca91-01f8-01af-0297-00000000025f, msdUUID=7d5de684-58ff-4fbc-905d-3048fc55b2b1'

node77-02 vdsm.log:

2019-03-18 13:51:46,287+0300 WARN  (jsonrpc/7) [storage.StorageServer.MountConnection] gluster server u'msk-gluster-facility.xxxx' is not in bricks ['node-msk-gluster203', 'node-msk-gluster205', 'node-msk-gluster201'], possibly mounting duplicate servers (storageServer:317)
2019-03-18 13:51:46,287+0300 INFO  (jsonrpc/7) [storage.Mount] mounting msk-gluster-facility.ipt.fsin.uis:/data at /rhev/data-center/mnt/glusterSD/msk-gluster-facility.xxxx:_data (mount:204)
2019-03-18 13:51:46,474+0300 ERROR (jsonrpc/7) [storage.HSM] Could not connect to storageServer (hsm:2415)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/hsm.py", line 2412, in connectStorageServer
    conObj.connect()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/storageServer.py", line 179, in connect
    six.reraise(t, v, tb)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/storageServer.py", line 171, in connect
    self._mount.mount(self.options, self._vfsType, cgroup=self.CGROUP)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/mount.py", line 207, in mount
    cgroup=cgroup)
  File "/usr/lib/python2.7/site-packages/vdsm/common/supervdsm.py", line 55, in __call__
    return callMethod()
  File "/usr/lib/python2.7/site-packages/vdsm/common/supervdsm.py", line 53, in <lambda>
    **kwargs)
  File "<string>", line 2, in mount
  File "/usr/lib64/python2.7/multiprocessing/managers.py", line 773, in _callmethod
    raise convert_to_error(kind, result)
MountError: (1, ';Running scope as unit run-10121.scope.\nMount failed. Please check the log file for more details.\n')

------------------------------

2019-03-18 13:51:46,830+0300 ERROR (jsonrpc/4) [storage.TaskManager.Task] (Task='fe81642e-2421-4169-a08b-51467e8f01fe') Unexpected error (task:875)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 882, in _run
    return fn(*args, **kargs)
  File "<string>", line 2, in connectStoragePool
  File "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 48, in method
    ret = func(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/hsm.py", line 1035, in connectStoragePool
    spUUID, hostID, msdUUID, masterVersion, domainsMap)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/hsm.py", line 1097, in _connectStoragePool
    res = pool.connect(hostID, msdUUID, masterVersion)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sp.py", line 700, in connect
    self.__rebuild(msdUUID=msdUUID, masterVersion=masterVersion)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sp.py", line 1274, in __rebuild
    self.setMasterDomain(msdUUID, masterVersion)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sp.py", line 1495, in setMasterDomain
    raise se.StoragePoolMasterNotFound(self.spUUID, msdUUID)
StoragePoolMasterNotFound: Cannot find master domain: u'spUUID=5a5cca91-01f8-01af-0297-00000000025f, msdUUID=7d5de684-58ff-4fbc-905d-3048fc55b2b1'

What is the best practice for recovering from this problem?

15.03.2019, 13:47, "Strahil Nikolov" <[email protected]>:
 On Fri, Mar 15, 2019 at 8:12 AM Strahil Nikolov <[email protected]> wrote:
Ok, I have managed to recover again and no issues are detected this time.
I guess this case is quite rare and nobody has experienced that.

>Hi,
>can you please explain how you fixed it?

I set global maintenance again, defined the HostedEngine from the old XML (taken from an old vdsm log), defined the network and powered it off.
I set the OVF update period to 5 min, but it took several hours until the OVF_STOREs were updated. Once this happened, I restarted ovirt-ha-agent and ovirt-ha-broker on both nodes.
Then I powered off the HostedEngine and undefined it from ovirt1. Then I set the maintenance to 'none' and the VM powered on on ovirt1.

In order to test a failure, I removed the global maintenance and powered off the HostedEngine from itself (via ssh). It was brought back on the other node.
In order to test failure of ovirt2, I set ovirt1 in local maintenance and removed it (mode 'none'), then again shut down the VM via ssh, and it started again on ovirt1.
It seems to be working, as I have later shut down the Engine several times and it managed to start without issues.

I'm not sure this is related, but I had detected that ovirt2 was out of sync with the vdsm-ovirtmgmt network; it got fixed easily via the UI.

Best Regards,
Strahil Nikolov
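
For reference, the last part of the recovery described above (the steps taken once the OVF_STORE had been updated) maps roughly onto the commands below. This is a sketch and not a transcript of what was actually typed: the run wrapper, the DRY_RUN variable and the finish_recovery name are illustrative, the connection URI is the one from the virsh alias in step 3 of the reply at the top of this thread, and using 'virsh shutdown' for "powered off" is an assumption. By default it only prints the commands.

```shell
#!/bin/bash
# Sketch only: prints each command unless DRY_RUN=0 is set.
# Shell aliases are not expanded inside scripts, so the URI from the
# virsh alias is passed explicitly.
VIRSH_URI='qemu:///system?authfile=/etc/ovirt-hosted-engine/virsh_auth.conf'

run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Steps taken once the OVF_STORE had been updated.
finish_recovery() {
    # Restart the HA services (the message did this on both nodes).
    run systemctl restart ovirt-ha-agent ovirt-ha-broker
    # Power off the manually defined VM, then drop its libvirt definition.
    # 'virsh shutdown' is an assumption; on a real host, wait until the VM
    # is really down before undefining it.
    run virsh -c "$VIRSH_URI" shutdown HostedEngine
    run virsh -c "$VIRSH_URI" undefine HostedEngine
    # Leave global maintenance so the HA agent starts the VM by itself.
    run hosted-engine --set-maintenance --mode=none
    run hosted-engine --vm-status
}
```

Calling finish_recovery with DRY_RUN left unset gives a checklist to read through before anything is executed; the same function with DRY_RUN=0 would run the commands for real.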
_______________________________________________
Users mailing list -- [email protected]
To unsubscribe send an email to [email protected]
Privacy Statement: https://www.ovirt.org/site/privacy-policy/
oVirt Code of Conduct: 
https://www.ovirt.org/community/about/community-guidelines/
List Archives: 
https://lists.ovirt.org/archives/list/[email protected]/message/3B7OQUA733ETUA66TB7HF5Y24BLSI4XO/