I noticed I had a VM that was paused due to a 'Storage I/O error'. I inherited 
this system from another admin and have no idea where to start figuring this 
out. We have a 4-node oVirt cluster with a 5th node as the Manager. The VM in 
question is running on host vm-host-colo-4. Best I can tell, the VMs live on a 
Gluster volume replicated across all 4 nodes, with node 1 acting as the arbiter 
for the volume. Other VMs are running on host 4, so I'm not sure what the issue 
is with this one VM. When I look at the status of the Gluster volume for this 
host, the self-heal info for the bricks is listed as 'N/A', while all the other 
hosts in the cluster list it as 'OK'. When I cd into the gluster directory on 
host 4, I don't see the same things as on the other hosts; I'm not sure that's 
actually a problem, but it is different. When I run various gluster commands, 
gluster does seem to respond. See below:

[root@vm-host-colo-4 gluster]# gluster volume info all
 
Volume Name: gl-colo-1
Type: Replicate
Volume ID: 2c545e19-9468-487e-9e9b-cd3202fc24c4
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x (2 + 1) = 3
Transport-type: tcp
Bricks:
Brick1: 10.20.101.181:/gluster/gl-colo-1/brick1
Brick2: 10.20.101.183:/gluster/gl-colo-1/brick1
Brick3: 10.20.101.185:/gluster/gl-colo-1/brick1 (arbiter)
Options Reconfigured:
network.ping-timeout: 30
cluster.granular-entry-heal: enable
performance.strict-o-direct: on
storage.owner-gid: 36
storage.owner-uid: 36
cluster.choose-local: off
features.shard: on
cluster.shd-wait-qlength: 10000
cluster.shd-max-threads: 8
cluster.locking-scheme: granular
cluster.data-self-heal-algorithm: full
cluster.server-quorum-type: server
cluster.quorum-type: auto
cluster.eager-lock: enable
network.remote-dio: off
performance.low-prio-threads: 32
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off
auth.allow: *
user.cifs: off
transport.address-family: inet
nfs.disable: on
performance.client-io-threads: off
 
Volume Name: gl-vm-host-4
Type: Distribute
Volume ID: a2ba6b29-2366-4a7e-bda8-2e0574cf4afa
Status: Started
Snapshot Count: 0
Number of Bricks: 1
Transport-type: tcp
Bricks:
Brick1: 10.20.101.187:/gluster/gl-vm-host-colo-4
Options Reconfigured:
network.ping-timeout: 30
cluster.granular-entry-heal: enable
network.remote-dio: off
performance.strict-o-direct: on
storage.owner-gid: 36
storage.owner-uid: 36
auth.allow: *
user.cifs: disable
transport.address-family: inet
nfs.disable: on
[root@vm-host-colo-4 gluster]# 

[root@vm-host-colo-4 gluster]# gluster-eventsapi status
Webhooks: 
http://mydesktop.altn.int:80/ovirt-engine/services/glusterevents

+-------------------------+-------------+-----------------------+
|           NODE          | NODE STATUS | GLUSTEREVENTSD STATUS |
+-------------------------+-------------+-----------------------+
| vm-host-colo-1.altn.int |          UP |                    OK |
| vm-idev-colo-1.altn.int |          UP |                    OK |
| vm-host-colo-2.altn.int |          UP |                    OK |
|        localhost        |          UP |                    OK |
+-------------------------+-------------+-----------------------+
[root@vm-host-colo-4 gluster]# gluster volume status gl-vm-host-4
Status of volume: gl-vm-host-4
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.20.101.187:/gluster/gl-vm-host-col
o-4                                         49152     0          Y       33221
 
Task Status of Volume gl-vm-host-4
------------------------------------------------------------------------------
There are no active volume tasks
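
Based on reading the Gluster docs, these are the checks I was planning to run 
next on this node for the replicated volume (gl-colo-1 is the volume name from 
the output above); please tell me if this is the wrong approach. I can also 
paste an 'ls -laR /gluster/' from each host if the directory differences matter.

# Is the management daemon running, and are all bricks/self-heal daemons up?
systemctl status glusterd
gluster volume status gl-colo-1

# Are any files pending heal on the replicated volume?
gluster volume heal gl-colo-1 info
gluster volume heal gl-colo-1 info summary

# Do the peers agree on cluster membership from this node's point of view?
gluster peer status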

I also get a timeout error when running 'gluster volume status' on this node, 
so while some aspects of the Gluster volume seem fine, others don't. Should I 
restart the glusterd daemon, or will that make things worse? I am not sure 
whether this is a problem with the Gluster volume itself or with the host's 
ability to access the data for the VM disk, i.e. a true I/O problem. There are 
two VMs in this state, both running on this host, and I am not sure how to 
proceed to get them running again. Should I force these VMs onto a different 
host by editing the VM, or should I try to make them work on the host they are 
on? As mentioned, many other VMs are running on this host, so I'm not sure why 
these two have an issue.
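
For what it's worth, this is roughly what I was considering trying on 
vm-host-colo-4 if nobody objects. My understanding (which may well be wrong) is 
that restarting glusterd only restarts the management daemon and leaves the 
brick (glusterfsd) processes serving data, and that the virsh commands below 
are read-only. The log file name is a guess based on what I see in 
/var/log/glusterfs, so that part may be off. A sanity check before I touch 
anything would be much appreciated.

# Restart only the Gluster management daemon (bricks should keep running)
systemctl restart glusterd
systemctl status glusterd

# Ask libvirt why it thinks the VMs were paused (read-only queries;
# <vm-name> is whatever shows up in 'virsh -r list --all')
virsh -r list --all
virsh -r domstate <vm-name> --reason

# Look for I/O errors in the Gluster mount log for the storage domain
grep ' E ' /var/log/glusterfs/rhev-data-center-mnt-glusterSD-*.log | tail -n 50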

Apologies up front: I am a network engineer, not a VM/oVirt expert. This was 
dropped in my lap due to a layoff, and I could use some help on where to go 
from here. Thanks in advance for any help.