We have a number of clusters connected to ovirt-engine. Some of these are
single-host clusters (running ovirt-release43-4.3.5.2-1 on CentOS 7) with local
storage. Recently, ovirt-engine started reporting one of these hosts as
NonResponsive. The VMs were still running on the host, but oVirt seems unable
to communicate with it. Testing shows no issues connecting from the engine to
VDSM on the host, and likewise the host can reach the engine on ports 80 and
443.
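For anyone wanting to repeat the reachability test, a minimal Python sketch of a plain TCP check follows. The helper names are hypothetical, and port 54321 is assumed to be the VDSM default rather than something confirmed for this setup.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


def check_ports(host, ports, timeout=3.0):
    """Map each port to whether a TCP connection to it succeeded."""
    return {port: can_connect(host, port, timeout) for port in ports}


# Example usage (hostnames are placeholders):
#   on the engine: check_ports("compute01.ovirt.local", [54321])
#   on the host:   check_ports("<engine hostname>", [80, 443])
```

A successful TCP connection only shows that the port is reachable; it does not prove that the TLS handshake or the JSON-RPC exchange with VDSM completes in time.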
The host in question cannot be managed via IPMI for power management, but we
are able to perform an SSH reboot via the engine interface. We opted to log in
to the running virtual machines, shut them down and issue the SSH reboot from
the engine. The host changes to Rebooting status for some time and then goes
back to NonResponsive.
We are unable to put the host into maintenance or to confirm that the host has
been rebooted manually, as we are presented with the following error:
"Error while executing action: Cannot perform confirm 'Host has been rebooted'.
Another power management action is already in progress."
The VDSM logs on the host in question continually show:
2020-04-16 08:23:51,478+0000 INFO (vmrecovery) [vds] recovery: waiting for
storage pool to go up (clientIF:711)
2020-04-16 08:23:52,332+0000 INFO (jsonrpc/7) [vdsm.api] FINISH
getStoragePoolInfo error=Unknown pool id, pool not connected:
(u'6baea5dc-b049-47c2-a94f-5229c37c62d0',) from=::ffff:10.10.1.252,33680,
task_id=420249a4-55c0-436d-92c7-ea1286a0e287 (api:52)
2020-04-16 08:23:52,332+0000 ERROR (jsonrpc/7) [storage.TaskManager.Task]
(Task='420249a4-55c0-436d-92c7-ea1286a0e287') Unexpected error (task:875)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 882, in _run
    return fn(*args, **kargs)
  File "<string>", line 2, in getStoragePoolInfo
  File "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 50, in method
    ret = func(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/hsm.py", line 2550, in getStoragePoolInfo
    pool = self.getPool(spUUID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/hsm.py", line 351, in getPool
    raise se.StoragePoolUnknown(spUUID)
StoragePoolUnknown: Unknown pool id, pool not connected: (u'6baea5dc-b049-47c2-a94f-5229c37c62d0',)
2020-04-16 08:23:52,333+0000 INFO (jsonrpc/7) [storage.TaskManager.Task]
(Task='420249a4-55c0-436d-92c7-ea1286a0e287') aborting: Task is aborted:
"Unknown pool id, pool not connected:
(u'6baea5dc-b049-47c2-a94f-5229c37c62d0',)" - code 309 (task:1181)
During this period, the following is observed in the engine logs:
2020-04-16 08:23:52,307Z ERROR
[org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
(EE-ManagedThreadFactory-engineScheduled-Thread-31) [] EVENT_ID:
VDS_BROKER_COMMAND_FAILURE(10,802), VDSM compute01.ovirt.local command
SpmStatusVDS failed: Message timeout which can be caused by communication issues
2020-04-16 08:23:52,307Z ERROR
[org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStatusVDSCommand]
(EE-ManagedThreadFactory-engineScheduled-Thread-31) [] Command
'SpmStatusVDSCommand(HostName = compute01.ovirt.local,
SpmStatusVDSCommandParameters:{hostId='67dc53da-d5ee-461e-87de-2ca6dd78637f',
storagePoolId='6baea5dc-b049-47c2-a94f-5229c37c62d0'})' execution failed:
VDSGenericException: VDSNetworkException: Message timeout which can be caused
by communication issues
2020-04-16 08:23:52,346Z ERROR
[org.ovirt.engine.core.vdsbroker.irsbroker.GetStoragePoolInfoVDSCommand]
(EE-ManagedThreadFactory-engineScheduled-Thread-31) [] Failed in
'GetStoragePoolInfoVDS' method
2020-04-16 08:23:52,355Z ERROR
[org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
(EE-ManagedThreadFactory-engineScheduled-Thread-31) [] EVENT_ID:
IRS_BROKER_COMMAND_FAILURE(10,803), VDSM command GetStoragePoolInfoVDS failed:
Unknown pool id, pool not connected: (u'6baea5dc-b049-47c2-a94f-5229c37c62d0',)
2020-04-16 08:23:52,356Z ERROR
[org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand]
(EE-ManagedThreadFactory-engineScheduled-Thread-31) []
IrsBroker::Failed::GetStoragePoolInfoVDS: IRSGenericException:
IRSErrorException: Failed to GetStoragePoolInfoVDS, error = Unknown pool id,
pool not connected: (u'6baea5dc-b049-47c2-a94f-5229c37c62d0',), code = 309
The metadata file for the local storage domain looks fine as far as we can
tell:
ALIGNMENT=1048576
BLOCK_SIZE=512
CLASS=Data
DESCRIPTION=compute01_local_storage
IOOPTIMEOUTSEC=10
LEASERETRIES=3
LEASETIMESEC=60
LOCKPOLICY=
LOCKRENEWALINTERVALSEC=5
MASTER_VERSION=1
POOL_DESCRIPTION=compute01_local
POOL_DOMAINS=1cc26dea-688c-40cc-bda6-38b00054001e:Active
POOL_SPM_ID=-1
POOL_SPM_LVER=-1
POOL_UUID=6baea5dc-b049-47c2-a94f-5229c37c62d0
REMOTE_PATH=/mnt/ovirt_datastore
ROLE=Master
SDUUID=1cc26dea-688c-40cc-bda6-38b00054001e
TYPE=LOCALFS
VERSION=5
_SHA_CKSUM=24c85256b889d0b3384e7975c660f4a5cbb58d33
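One sanity check on this file is to compare its POOL_UUID against the pool id in the VDSM error. A minimal Python sketch follows; the function names are hypothetical, and it reads only the plain KEY=VALUE lines shown above without verifying _SHA_CKSUM.

```python
def parse_metadata(text):
    """Parse KEY=VALUE lines from a storage domain metadata file into a dict."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue
        # Split on the first '=' only, so values may themselves contain '='.
        key, _, value = line.partition("=")
        fields[key] = value
    return fields


def parse_pool_domains(fields):
    """Turn POOL_DOMAINS ('uuid:Status,uuid:Status') into {uuid: status}."""
    domains = {}
    for item in fields.get("POOL_DOMAINS", "").split(","):
        if not item:
            continue
        sd_uuid, _, status = item.partition(":")
        domains[sd_uuid] = status
    return domains


def pool_matches(fields, expected_pool_uuid):
    """Return True if the metadata belongs to the expected storage pool."""
    return fields.get("POOL_UUID") == expected_pool_uuid
```

With the values above, POOL_UUID is 6baea5dc-b049-47c2-a94f-5229c37c62d0, the same id VDSM reports as "Unknown pool id", and the only entry in POOL_DOMAINS is this domain's own SDUUID, marked Active.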
I assume this has happened because oVirt was unable to power cycle the machine
and now cannot confirm the SPM state. Normally in a case like this we would
confirm that the host has been manually rebooted, but we are unable to do that.
How can I clear the power management action that ovirt-engine thinks is in
progress?