Hi,

I have been using glusterfs for approximately 18 months and need some help 
finding the culprit for stale file content. By stale file content I mean that 
I read the same file from two clients and get different content.

I have a distributed volume 

Volume Name: apptivegrid
Type: Distribute
Volume ID: 7087ee24-6603-477a-a822-29d011bca78e
Status: Started
Snapshot Count: 0
Number of Bricks: 2
Transport-type: tcp
Bricks:
Brick1: 10.1.2.1:/bricks/apptivegrid-base
Brick2: 10.1.2.8:/bricks/apptivegrid-base
Options Reconfigured:
performance.cache-invalidation: on
performance.cache-samba-metadata: on
performance.strict-o-direct: on
performance.open-behind: off
performance.read-ahead: off
performance.write-behind: off
performance.readdir-ahead: off
performance.parallel-readdir: off
performance.quick-read: off
performance.stat-prefetch: off
performance.io-cache: off
performance.flush-behind: off
performance.client-io-threads: off
locks.mandatory-locking: off
features.cache-invalidation-timeout: 600
features.cache-invalidation: on
diagnostics.count-fop-hits: on
diagnostics.latency-measurement: on
storage.fips-mode-rchecksum: on
transport.address-family: inet

and sometimes I get stale file content as explained above. The file on which I 
discovered the problem is a small file of 32 bytes. The change to that file is 
an increased version number, so I could verify that the data was changed at the 
right position and also see that one of the machines still had the old content. 
I applied the settings above to the volume precisely to make sure this cannot 
happen.

I created that volume a while ago and did not change the layout of the bricks 
(but upgraded from gluster 10 to 11 last week). Running a rebalance command 
showed a lot of errors and failures in the status report of the rebalance. I 
also get quite a lot of this error in the logfile ("settings" is the small file 
I'm talking about):

[2023-08-17 21:57:31.910494 +0000] W [MSGID: 114031] 
[client-rpc-fops_v2.c:2561:client4_0_lookup_cbk] 0-apptivegrid-client-0: remote 
operation failed. [{path=/62/03/8c/62038cd116e9a6857794aa14/settings}, 
{gfid=1d38410a-1c14-4346-a7e5-68856ed310e9}, {errno=2}, {error=No such file or 
directory}]

and this 

[2023-08-17 21:57:31.902676 +0000] I [MSGID: 109018] 
[dht-common.c:1838:dht_revalidate_cbk] 0-apptivegrid-dht: Mismatching layouts 
for /62/03/8c/62038cd116e9a6857794aa14, gfid = 
f7f8eef0-bc19-4936-8c0c-fd0a497c5e69
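
In case it helps the diagnosis, this is the check I was planning to run 
directly on the bricks to look at the stored DHT layout of that directory. 
Just a minimal sketch in Python, assuming the brick path from the volume info 
above exists on both servers; reading trusted.* xattrs needs root:

# read the DHT layout xattr of the affected directory straight from a brick
# (run on each of the two brick servers)
import os

brick_dir = "/bricks/apptivegrid-base/62/03/8c/62038cd116e9a6857794aa14"
layout = os.getxattr(brick_dir, "trusted.glusterfs.dht")
# each brick stores its own hash range; the two values should differ between
# the bricks but stay stable between runs
print(layout.hex())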

This morning I found another occurrence of a stale file which I wanted to 
diagnose, but a couple of minutes later it seemed to have healed itself. In 
order to diagnose it I had shut down the processes that could access it, just 
to be sure. So I have no idea what triggered the refresh: my action (releasing 
fds/locks) or a timeout.

In order to narrow down the culprit I would need to verify or falsify some of 
my assumptions:

- a process P1 on one machine opens a couple of files and keeps them open 
until 30 minutes of inactivity on them. If a process P2 running on another 
machine changes one of those files, P1 should see the change on its next read, 
right? So my assumption is that an open fd receives content updates and that 
open fds do not prevent the file from being updated on the client machine 
running P1
- same question for locks. I take advisory locks on the small file when 
updating it. This shouldn't conflict with a content update, as I assume 
glusterfs does not itself lock the ranges it updates (see the sketch right 
after this list)
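
To make this concrete, the two-client test I have in mind looks roughly like 
the following minimal Python sketch. /mnt/apptivegrid is a placeholder for the 
real mount point and the 32-byte payload is made up:

# client A: simulate P1 -- open the file once and keep re-reading it
# through the same descriptor, taking an advisory lock like the application
import fcntl, os, time

path = "/mnt/apptivegrid/62/03/8c/62038cd116e9a6857794aa14/settings"
fd = os.open(path, os.O_RDONLY)
try:
    while True:
        fcntl.lockf(fd, fcntl.LOCK_SH)        # advisory shared lock
        os.lseek(fd, 0, os.SEEK_SET)
        print(time.strftime("%H:%M:%S"), os.read(fd, 32))
        fcntl.lockf(fd, fcntl.LOCK_UN)
        time.sleep(5)
finally:
    os.close(fd)

# client B: simulate P2 -- bump the version under an exclusive advisory lock
import fcntl, os

path = "/mnt/apptivegrid/62/03/8c/62038cd116e9a6857794aa14/settings"
fd = os.open(path, os.O_RDWR)
fcntl.lockf(fd, fcntl.LOCK_EX)
os.lseek(fd, 0, os.SEEK_SET)
os.write(fd, b"version-00000043".ljust(32))   # made-up 32-byte record
fcntl.lockf(fd, fcntl.LOCK_UN)
os.close(fd)

If my assumptions hold, the reader on client A should print the new version 
within one loop iteration; if it keeps printing the old content, the stale 
read is reproducible and I can dig further.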

If my assumptions are valid, I would suspect that a cache on the client is the 
culprit. I read that there is a default cache of 32 MB for small files. But 
then I thought this would be invalidated by an upcall, since cache-invalidation 
is on.
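
One check I could run the next time it happens, to confirm it really is a 
client-side cache: compare what the fuse mount returns with what sits on the 
brick. A rough sketch with placeholder paths; the file only lives on one of 
the two bricks, and the brick-side read of course has to happen on that 
server:

# compare the client's view of the file with the copy on the brick backend
def read_file(p):
    with open(p, "rb") as f:
        return f.read()

mount_view = read_file("/mnt/apptivegrid/62/03/8c/62038cd116e9a6857794aa14/settings")
brick_view = read_file("/bricks/apptivegrid-base/62/03/8c/62038cd116e9a6857794aa14/settings")
print("stale on the client" if mount_view != brick_view else "client matches brick")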

Is there a command to flush the client cache? Something that isn't nice but 
should work is unmounting and mounting again.
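
The only other thing I know of myself is dropping the kernel caches on the 
client, which, if I understand it correctly, does not reach gluster's own 
md-cache, so I am not sure it would help. A sketch of what I mean:

# drop page cache, dentries and inodes on the client (needs root)
with open("/proc/sys/vm/drop_caches", "w") as f:
    f.write("3\n")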

I hope the information provided is sufficient.

thanks in advance,

Norbert
