On 29/01/20 9:56 pm, Cox, Jason wrote:
I have glusterfs (v6.6) deployed with 3-way replication, used by oVirt (v4.3).
I recently updated one of the nodes (now at gluster v6.7) and rebooted it.
When it came back online, glusterfs reported there were entries to be
healed under the 2 nodes that had stayed online.
After 2+ days, the 2 nodes still show entries that need healing, so
I’m trying to determine what the issue is.
The files shown in the heal info output are small, so healing should not take long. Also, ‘gluster v heal <vol>’ and ‘gluster v heal <vol> full’ both return successfully, but the entries persist.
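(A rough sketch of how I'm triggering and watching the heals; ‘engine’ is the volume shown below, and ‘heal-count’ reports the per-brick number of pending entries:)
gluster volume heal engine
gluster volume heal engine full
gluster volume heal engine statistics heal-count
watch -n 60 'gluster volume heal engine statistics heal-count'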
So first off, I’m a little confused by what ‘gluster volume heal <vol> info’ is reporting.
The following is what I see from heal info:
# gluster v heal engine info
Brick repo0:/gluster_bricks/engine/engine
/372501f5-062c-4790-afdb-dd7e761828ac/images/968daf61-6858-454a-9ed4-3d3db2ae1805/4317dd0d-fd35-4176-9353-7ff69e3a8dc3.meta
/372501f5-062c-4790-afdb-dd7e761828ac/images/4e3e8ca5-0edf-42ae-ac7b-e9a51ad85922/ceb42742-eaaa-4867-aa54-da525629aae4.meta
Status: Connected
Number of entries: 2
Brick repo1:/gluster_bricks/engine/engine
/372501f5-062c-4790-afdb-dd7e761828ac/images/968daf61-6858-454a-9ed4-3d3db2ae1805/4317dd0d-fd35-4176-9353-7ff69e3a8dc3.meta
/372501f5-062c-4790-afdb-dd7e761828ac/images/4e3e8ca5-0edf-42ae-ac7b-e9a51ad85922/ceb42742-eaaa-4867-aa54-da525629aae4.meta
Status: Connected
Number of entries: 2
Brick repo2:/gluster_bricks/engine/engine
Status: Connected
Number of entries: 0
Repo0 and repo1 were not rebooted, but repo2 was.
Since repo2 went offline, I would expect it to have entries that need healing once it came back online. That is not what the heal info output shows, so maybe heal info isn’t reporting what I think it is.
*When ‘gluster volume heal <vol> info’ reports entries as above, what is it saying?
In the heal info output, it is usually the nodes that were up that display the list of files needing heal. The way to interpret it: while repo2 was down, repo0 and repo1 witnessed modifications to the files and therefore marked them as needing heal; that list is what the CLI displays.
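If you want to confirm which brick is being blamed, you can also dump the AFR changelog xattrs directly on a brick path (not the FUSE mount); a non-zero trusted.afr.<volname>-client-N value marks pending heals against brick N. A minimal sketch, with <path-to-file> standing in for the file's path on the brick:
getfattr -d -m trusted.afr -e hex /gluster_bricks/engine/engine/<path-to-file>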
From the above output, I was reading it as: repo0 has 2 entries that need to be healed from the other bricks, and repo1 has 2 entries that need healing from the other bricks. However, that doesn’t make sense, since repo2 was the one that was rebooted, and a ‘stat’ on the files in the bricks shows repo2 holds the older versions (checksums also show that repo0 and repo1 match). Trying to access the files through the FUSE mount on any node gives input/output errors.
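(The comparison was along these lines, assuming passwordless ssh between the nodes; stat for timestamp and size, md5sum for content:)
f=/gluster_bricks/engine/engine/372501f5-062c-4790-afdb-dd7e761828ac/images/4e3e8ca5-0edf-42ae-ac7b-e9a51ad85922/ceb42742-eaaa-4867-aa54-da525629aae4.meta
for host in repo0 repo1 repo2; do
    ssh "$host" "stat -c '%y %s' $f; md5sum $f"
done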
Getfattr output:
repo0 glusterfs]# getfattr -d -m. -e hex
/gluster_bricks/engine/engine/372501f5-062c-4790-afdb-dd7e761828ac/images/4e3e8ca5-0edf-42ae-ac7b-e9a51ad85922/ceb42742-eaaa-4867-aa54-da525629aae4.meta
getfattr: Removing leading '/' from absolute path names
# file:
gluster_bricks/engine/engine/372501f5-062c-4790-afdb-dd7e761828ac/images/4e3e8ca5-0edf-42ae-ac7b-e9a51ad85922/ceb42742-eaaa-4867-aa54-da525629aae4.meta
security.selinux=0x73797374656d5f753a6f626a6563745f723a676c7573746572645f627269636b5f743a733000
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.engine-client-2=0x000000020000000200000000
trusted.bit-rot.signature=0x0102000000000000009338ff61a57fcb452b92ae816b8e5ff672be6d340e7da0a0dcfa34e26b26933b
trusted.bit-rot.version=0x02000000000000005e09d54f000ef84f
trusted.gfid=0xb85edc187d594872a594c25419154d05
trusted.gfid2path.ff2d749198341aff=0x32393564303861372d386437352d343638392d393239332d3339336434346362656233342f63656234323734322d656161612d343836372d616135342d6461353235363239616165342e6d657461
trusted.glusterfs.mdata=0x010000000000000000000000005e2f88ce000000002d1fe613000000005e2f88ce000000002d10ae53000000005e2f88ce000000002d067b3b
trusted.glusterfs.shard.block-size=0x0000000004000000
trusted.glusterfs.shard.file-size=0x00000000000001ad000000000000000000000000000000010000000000000000
trusted.pgfid.295d08a7-8d75-4689-9293-393d44cbeb34=0x00000001
repo1 glusterfs]# getfattr -d -m. -e hex
/gluster_bricks/engine/engine/372501f5-062c-4790-afdb-dd7e761828ac/images/4e3e8ca5-0edf-42ae-ac7b-e9a51ad85922/ceb42742-eaaa-4867-aa54-da525629aae4.meta
getfattr: Removing leading '/' from absolute path names
# file:
gluster_bricks/engine/engine/372501f5-062c-4790-afdb-dd7e761828ac/images/4e3e8ca5-0edf-42ae-ac7b-e9a51ad85922/ceb42742-eaaa-4867-aa54-da525629aae4.meta
security.selinux=0x73797374656d5f753a6f626a6563745f723a676c7573746572645f627269636b5f743a733000
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.engine-client-2=0x000000020000000200000000
trusted.bit-rot.signature=0x0102000000000000009338ff61a57fcb452b92ae816b8e5ff672be6d340e7da0a0dcfa34e26b26933b
trusted.bit-rot.version=0x02000000000000005e09db580000709b
trusted.gfid=0xb85edc187d594872a594c25419154d05
trusted.gfid2path.ff2d749198341aff=0x32393564303861372d386437352d343638392d393239332d3339336434346362656233342f63656234323734322d656161612d343836372d616135342d6461353235363239616165342e6d657461
trusted.glusterfs.mdata=0x010000000000000000000000005e2f88ce000000002d1fe613000000005e2f88ce000000002d10ae53000000005e2f88ce000000002d067b3b
trusted.glusterfs.shard.block-size=0x0000000004000000
trusted.glusterfs.shard.file-size=0x00000000000001ad000000000000000000000000000000010000000000000000
trusted.pgfid.295d08a7-8d75-4689-9293-393d44cbeb34=0x00000001
repo2 glusterfs]# getfattr -d -m. -e hex
/gluster_bricks/engine/engine/372501f5-062c-4790-afdb-dd7e761828ac/images/4e3e8ca5-0edf-42ae-ac7b-e9a51ad85922/ceb42742-eaaa-4867-aa54-da525629aae4.meta
getfattr: Removing leading '/' from absolute path names
# file:
gluster_bricks/engine/engine/372501f5-062c-4790-afdb-dd7e761828ac/images/4e3e8ca5-0edf-42ae-ac7b-e9a51ad85922/ceb42742-eaaa-4867-aa54-da525629aae4.meta
security.selinux=0x73797374656d5f753a6f626a6563745f723a676c7573746572645f627269636b5f743a733000
trusted.afr.dirty=0x000000000000000000000000
trusted.bit-rot.signature=0x010200000000000000413e794bfbaf54b98bc00df95ce540fb6affe56ab5f5ddbb1fdb9eec096e0232
trusted.bit-rot.version=0x02000000000000005e09d553000151af
trusted.gfid=0xd36b1a8f63bc4a4bbcd0433882866733
trusted.gfid2path.ff2d749198341aff=0x32393564303861372d386437352d343638392d393239332d3339336434346362656233342f63656234323734322d656161612d343836372d616135342d6461353235363239616165342e6d657461
trusted.glusterfs.mdata=0x010000000000000000000000005e1df4a80000000020b618bf000000005e1df4a80000000020a1633c000000005e1df4a80000000020950ed4
trusted.glusterfs.shard.block-size=0x0000000004000000
trusted.glusterfs.shard.file-size=0x00000000000001ad000000000000000000000000000000010000000000000000
trusted.pgfid.295d08a7-8d75-4689-9293-393d44cbeb34=0x00000001
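Two things stand out to me in the dumps above: repo0 and repo1 both carry a non-zero trusted.afr.engine-client-2 (i.e., blaming repo2), and repo2’s trusted.gfid differs from the other two bricks. If I’m reading the AFR changelog format right (three big-endian 32-bit counters: data, metadata, entry), the value decodes as:
x=000000020000000200000000
echo "data=0x${x:0:8} metadata=0x${x:8:8} entry=0x${x:16:8}"
# data=0x00000002 metadata=0x00000002 entry=0x00000000
i.e., repo0 and repo1 each record 2 pending data and 2 pending metadata operations against repo2, and the mismatched gfids (b85edc18... on repo0/repo1 vs d36b1a8f... on repo2) suggest the bricks hold two different files under the same name.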
Searching the gluster logs on each node for ceb42742-eaaa-4867-aa54-da525629aae4.meta turned up the following.
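(The search was roughly this, run on each node, assuming the default log directory:)
cd /var/log/glusterfs && grep -r ceb42742-eaaa-4867-aa54-da525629aae4.meta .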
repo0:
./rhev-data-center-mnt-glusterSD-repo0:_engine.log:The message "E
[MSGID: 108008]
[afr-self-heal-common.c:384:afr_gfid_split_brain_source]
0-engine-replicate-0: Gfid mismatch detected for
<gfid:295d08a7-8d75-4689-9293-393d44cbeb34>/ceb42742-eaaa-4867-aa54-da525629aae4.meta>,
d36b1a8f-63bc-4a4b-bcd0-433882866733 on engine-client-2 and
b85edc18-7d59-4872-a594-c25419154d05 on engine-client-1." repeated 5
times between [2020-01-28 23:01:22.912513] and [2020-01-28
23:02:10.907716]
./rhev-data-center-mnt-glusterSD-repo0:_engine.log:[2020-01-28
23:24:12.808924] E [MSGID: 108008]
[afr-self-heal-common.c:384:afr_gfid_split_brain_source]
0-engine-replicate-0: Gfid mismatch detected for
<gfid:295d08a7-8d75-4689-9293-393d44cbeb34>/ceb42742-eaaa-4867-aa54-da525629aae4.meta>,
d36b1a8f-63bc-4a4b-bcd0-433882866733 on engine-client-2 and
b85edc18-7d59-4872-a594-c25419154d05 on engine-client-1.
repo1:
nothing
repo2:
./rhev-data-center-mnt-glusterSD-repo0:_engine.log:The message "E
[MSGID: 108008]
[afr-self-heal-common.c:384:afr_gfid_split_brain_source]
0-engine-replicate-0: Gfid mismatch detected for
<gfid:295d08a7-8d75-4689-9293-393d44cbeb34>/ceb42742-eaaa-4867-aa54-da525629aae4.meta>,
d36b1a8f-63bc-4a4b-bcd0-433882866733 on engine-client-2 and
b85edc18-7d59-4872-a594-c25419154d05 on engine-client-1." repeated 23
times between [2020-01-29 15:42:46.201849] and [2020-01-29
15:44:36.873793]
./rhev-data-center-mnt-glusterSD-repo0:_engine.log:[2020-01-29
15:44:47.016466] E [MSGID: 108008]
[afr-self-heal-common.c:384:afr_gfid_split_brain_source]
0-engine-replicate-0: Gfid mismatch detected for
<gfid:295d08a7-8d75-4689-9293-393d44cbeb34>/ceb42742-eaaa-4867-aa54-da525629aae4.meta>,
d36b1a8f-63bc-4a4b-bcd0-433882866733 on engine-client-2 and
b85edc18-7d59-4872-a594-c25419154d05 on engine-client-1.
So it looks like a split-brain issue according to the log message. However:
*Why doesn’t heal info show a split-brain condition?
*Why do the logs on repo1 contain nothing about ceb42742-eaaa-4867-aa54-da525629aae4.meta?
*If repo0 and repo1 match, why is there a split-brain issue at all?
I think that for some reason the AFR xattrs on the parent directory were not set, which is why the files are stuck in split-brain (instead of being recreated on repo2 from the copies on repo0 or repo1). You can resolve it using the split-brain CLI, e.g.: `gluster volume heal $volname split-brain source-brick repo0:/gluster_bricks/engine/engine /372501f5-062c-4790-afdb-dd7e761828ac/images/968daf61-6858-454a-9ed4-3d3db2ae1805/4317dd0d-fd35-4176-9353-7ff69e3a8dc3.meta`
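Run the same command for the second listed file, then verify that the queue drains. For this volume that would look roughly like:
`gluster volume heal engine split-brain source-brick repo0:/gluster_bricks/engine/engine /372501f5-062c-4790-afdb-dd7e761828ac/images/4e3e8ca5-0edf-42ae-ac7b-e9a51ad85922/ceb42742-eaaa-4867-aa54-da525629aae4.meta`
`gluster volume heal engine info`
After that, repo0 and repo1 should both report 0 entries.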
Thanks,
Ravi
‘gluster peer status’ on each node shows it connected to each of the other 2 nodes.
‘gluster volume heal engine info’ on each node shows each brick as connected.
‘gluster volume status engine’ on each node shows all 3 bricks online, all 3 self-heal daemons online, all 3 bitrot daemons online, and all 3 scrubber daemons online.
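(Concretely, those checks were:)
gluster peer status
gluster volume heal engine info
gluster volume status engine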
Thanks,
Jason
________
Community Meeting Calendar:
APAC Schedule -
Every 2nd and 4th Tuesday at 11:30 AM IST
Bridge: https://bluejeans.com/441850968
NA/EMEA Schedule -
Every 1st and 3rd Tuesday at 01:00 PM EDT
Bridge: https://bluejeans.com/441850968
Gluster-users mailing list
[email protected]
https://lists.gluster.org/mailman/listinfo/gluster-users