On Mon, May 17, 2021 at 4:22 PM Marco Fais <[email protected]> wrote:
> Hi,
>
> I am having significant issues with glustershd with releases 8.4 and 9.1.
>
> My oVirt clusters are using gluster storage backends, and were running
> fine with Gluster 7.x (shipped with earlier versions of oVirt Node 4.4.x).
> Recently the oVirt project moved to Gluster 8.4 for the nodes, and hence I
> have moved to this release when upgrading my clusters.
>
> Since then I am having issues whenever one of the nodes is brought down;
> when the nodes come back up online the bricks are typically back up and
> working, but some (random) glustershd processes in the various nodes seem
> to have issues connecting to some of them.

When the issue happens, can you check whether the TCP port numbers of the brick (glusterfsd) processes displayed in `gluster volume status` match the actual port numbers those processes are running with (i.e. the --brick-port argument shown by `ps aux | grep glusterfsd`)? If they don't match, then glusterd has incorrect brick port information in its memory and is serving it to glustershd. Restarting glusterd (instead of killing the bricks and running `volume start force`) should fix it, although we still need to find out why glusterd serves incorrect port numbers.
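For example, something like this quick sketch (the volume name `myvol` is a placeholder for your actual volume):

    # Ports glusterd believes the bricks are listening on ("TCP Port" column):
    gluster volume status myvol

    # Ports the brick processes were actually started with
    # (the --brick-port argument on each glusterfsd command line):
    ps aux | grep '[g]lusterfsd' | grep -oE -- '--brick-port [0-9]+'

The two lists should agree brick for brick; any mismatch means glusterd is advertising a stale port.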
If they do match, then can you take a statedump of glustershd to check whether it is indeed disconnected from the bricks? You will need to verify that 'connected=1' appears in the statedump; see the "Self-heal is stuck/not getting completed." section in https://docs.gluster.org/en/latest/Troubleshooting/troubleshooting-afr/. A statedump can be taken with `kill -SIGUSR1 $pid-of-glustershd` and will be generated in the /var/run/gluster/ directory.
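A sketch of the whole check (the `pgrep -f` pattern is an assumption about how the glustershd process is named on your nodes, and the dump filename pattern is the one client-side daemons typically use):

    # Trigger a statedump of the self-heal daemon:
    pid=$(pgrep -f glustershd)
    kill -SIGUSR1 "$pid"

    # The dump is written to /var/run/gluster/ (typically as
    # glusterdump.<pid>.dump.<timestamp>). Each brick connection should show
    # connected=1; connected=0 means glustershd has lost that brick:
    grep 'connected' /var/run/gluster/glusterdump."$pid".dump.*

Regards,
Ravi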
