Hello Martín,

Try disabling "performance.readdir-ahead". We had a similar issue, and disabling it solved the problem for us:

gluster volume set tapeless performance.readdir-ahead off
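One note: your volume info below shows performance.parallel-readdir is also on, and as far as I know that option only works on top of readdir-ahead, so you will probably want to turn both off and then confirm the values took effect. A rough sketch (double-check the option names against the docs for your Gluster version):

# parallel-readdir depends on readdir-ahead, so disable it too
gluster volume set tapeless performance.parallel-readdir off
gluster volume set tapeless performance.readdir-ahead off
# verify the new values
gluster volume get tapeless performance.parallel-readdir
gluster volume get tapeless performance.readdir-ahead

Also, since the missing files don't show up as pending heal entries, it may be worth checking directly on the brick servers (not through the FUSE mount) whether one of the affected files exists on every replica and whether its trusted.gfid xattr matches across the bricks. Something like the following sketch, using a path from your logs and the brick_6 brick path from your volume info (adjust the brick path on each node):

# run on each brick server, against the brick directory itself
stat "/data/glusterfs/tapeless/brick_6/brick/PN/arribagente/PLAYER 2020/ARRIBA GENTE martes 20 de octubre/PANEO NIÑOS ESCUELAS CON TAPABOCAS.mpg"
# dump all xattrs in hex; compare trusted.gfid across the replicas
getfattr -d -m . -e hex "/data/glusterfs/tapeless/brick_6/brick/PN/arribagente/PLAYER 2020/ARRIBA GENTE martes 20 de octubre/PANEO NIÑOS ESCUELAS CON TAPABOCAS.mpg"

If the gfids differ between replicas, or the file is missing from some bricks, that would point at something more than client-side caching.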
On Tue, Oct 27, 2020 at 8:23 PM Martín Lorenzo <[email protected]> wrote:

Hi Strahil, today we have the same number of clients on all nodes, but the problem persists. I have the impression that it gets more frequent as the server capacity fills up; now we are having at least one incident per day.
Regards,
Martin

On Mon, Oct 26, 2020 at 8:09 AM Martín Lorenzo <[email protected]> wrote:

Hi Strahil, thanks for your reply.
I had one node with 13 clients and the rest with 14. I've just restarted the services on that node, so now I have 14; let's see what happens.
Regarding the Samba repos, I wasn't aware of that; I was using the CentOS main repo. I'll check them out.
Best Regards,
Martin

On Tue, Oct 20, 2020 at 3:19 PM Strahil Nikolov <[email protected]> wrote:

Do you have the same number of clients connected to each brick?

I guess something like this can show it:

gluster volume status VOL clients
gluster volume status VOL client-list

Best Regards,
Strahil Nikolov

On Tuesday, October 20, 2020 at 15:41:45 GMT+3, Martín Lorenzo <[email protected]> wrote:

Hi, I have the following problem: I have a distributed-replicated cluster set up with Samba and CTDB over FUSE mount points.
I am seeing inconsistencies across the FUSE mounts; users report that files are disappearing after being copied/moved. When I take a look at the mount points on each node, they don't display the same data.

#### faulty mount point ####
[root@gluster6 ARRIBA GENTE martes 20 de octubre]# ll
ls: cannot access PANEO VUELTA A CLASES CON TAPABOCAS.mpg: No such file or directory
ls: cannot access PANEO NIÑOS ESCUELAS CON TAPABOCAS.mpg: No such file or directory
total 633723
drwxr-xr-x. 5 arribagente PN        4096 Oct 19 10:52 COMERCIAL AG martes 20 de octubre
-rw-r--r--. 1 arribagente PN   648927236 Jun  3 07:16 PANEO FACHADA PALACIO LEGISLATIVO DRONE DIA Y NOCHE.mpg
-?????????? ? ?           ?            ?            ? PANEO NIÑOS ESCUELAS CON TAPABOCAS.mpg
-?????????? ? ?           ?            ?            ? PANEO VUELTA A CLASES CON TAPABOCAS.mpg

#### healthy mount point ####
[root@gluster7 ARRIBA GENTE martes 20 de octubre]# ll
total 3435596
drwxr-xr-x. 5 arribagente PN        4096 Oct 19 10:52 COMERCIAL AG martes 20 de octubre
-rw-r--r--. 1 arribagente PN   648927236 Jun  3 07:16 PANEO FACHADA PALACIO LEGISLATIVO DRONE DIA Y NOCHE.mpg
-rw-r--r--. 1 arribagente PN  2084415492 Aug 18 09:14 PANEO NIÑOS ESCUELAS CON TAPABOCAS.mpg
-rw-r--r--. 1 arribagente PN   784701444 Sep  4 07:23 PANEO VUELTA A CLASES CON TAPABOCAS.mpg

- So far the only way to solve this is to create a directory in the healthy mount point, on the same path:
[root@gluster7 ARRIBA GENTE martes 20 de octubre]# mkdir hola

- When you refresh the other mount point, the issue is resolved:
[root@gluster6 ARRIBA GENTE martes 20 de octubre]# ll
total 3435600
drwxr-xr-x. 5 arribagente PN        4096 Oct 19 10:52 COMERCIAL AG martes 20 de octubre
drwxr-xr-x. 2 root        root      4096 Oct 20 08:45 hola
-rw-r--r--. 1 arribagente PN   648927236 Jun  3 07:16 PANEO FACHADA PALACIO LEGISLATIVO DRONE DIA Y NOCHE.mpg
-rw-r--r--. 1 arribagente PN  2084415492 Aug 18 09:14 PANEO NIÑOS ESCUELAS CON TAPABOCAS.mpg
-rw-r--r--. 1 arribagente PN   784701444 Sep  4 07:23 PANEO VUELTA A CLASES CON TAPABOCAS.mpg

Interestingly, the error occurs on the mount point where the files were copied. The files don't show up as pending heal entries. I have around 15 people using the mounts over Samba, and I get this issue reported roughly every two days.

I have an older cluster with similar issues, on a different Gluster version but with a very similar topology (4 bricks, initially two bricks and then expanded).
Please note the bricks aren't the same size (but their replicas are), so my other suspicion is that rebalancing has something to do with it.

I'm trying to reproduce it on a small virtualized cluster; so far no results.

Here are the cluster details: four nodes, replica 2, plus one arbiter hosting 2 bricks. Two bricks have ~20 TB capacity and the other pair is ~48 TB.

Volume Name: tapeless
Type: Distributed-Replicate
Volume ID: 53bfa86d-b390-496b-bbd7-c4bba625c956
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x (2 + 1) = 6
Transport-type: tcp
Bricks:
Brick1: gluster6.glustersaeta.net:/data/glusterfs/tapeless/brick_6/brick
Brick2: gluster7.glustersaeta.net:/data/glusterfs/tapeless/brick_7/brick
Brick3: kitchen-store.glustersaeta.net:/data/glusterfs/tapeless/brick_1a/brick (arbiter)
Brick4: gluster12.glustersaeta.net:/data/glusterfs/tapeless/brick_12/brick
Brick5: gluster13.glustersaeta.net:/data/glusterfs/tapeless/brick_13/brick
Brick6: kitchen-store.glustersaeta.net:/data/glusterfs/tapeless/brick_2a/brick (arbiter)
Options Reconfigured:
features.quota-deem-statfs: on
performance.client-io-threads: on
nfs.disable: on
transport.address-family: inet
features.quota: on
features.inode-quota: on
features.cache-invalidation: on
features.cache-invalidation-timeout: 600
performance.cache-samba-metadata: on
performance.stat-prefetch: on
performance.cache-invalidation: on
performance.md-cache-timeout: 600
network.inode-lru-limit: 200000
performance.nl-cache: on
performance.nl-cache-timeout: 600
performance.readdir-ahead: on
performance.parallel-readdir: on
performance.cache-size: 1GB
client.event-threads: 4
server.event-threads: 4
performance.normal-prio-threads: 16
performance.io-thread-count: 32
performance.write-behind-window-size: 8MB
storage.batch-fsync-delay-usec: 0
cluster.data-self-heal: on
cluster.metadata-self-heal: on
cluster.entry-self-heal: on
cluster.self-heal-daemon: on
performance.write-behind: on
performance.open-behind: on

Here is a log section from the faulty mount point.
I think the [File exists] entries are from people trying to copy the missing files over and over:

[2020-10-20 11:31:03.034220] I [MSGID: 108031] [afr-common.c:2581:afr_local_discovery_cbk] 0-tapeless-replicate-0: selecting local read_child tapeless-client-0
[2020-10-20 11:32:06.684329] I [MSGID: 108031] [afr-common.c:2581:afr_local_discovery_cbk] 0-tapeless-replicate-0: selecting local read_child tapeless-client-0
[2020-10-20 11:33:02.191863] I [MSGID: 108031] [afr-common.c:2581:afr_local_discovery_cbk] 0-tapeless-replicate-0: selecting local read_child tapeless-client-0
[2020-10-20 11:34:05.841608] I [MSGID: 108031] [afr-common.c:2581:afr_local_discovery_cbk] 0-tapeless-replicate-0: selecting local read_child tapeless-client-0
[2020-10-20 11:35:20.736633] I [MSGID: 108026] [afr-self-heal-metadata.c:52:__afr_selfheal_metadata_do] 0-tapeless-replicate-1: performing metadata selfheal on 958dbd7a-3cd7-4b66-9038-76e5c5669644
[2020-10-20 11:35:20.741213] I [MSGID: 108026] [afr-self-heal-common.c:1750:afr_log_selfheal] 0-tapeless-replicate-1: Completed metadata selfheal on 958dbd7a-3cd7-4b66-9038-76e5c5669644. sources=[0] 1 sinks=2
[2020-10-20 11:35:04.278043] I [MSGID: 108031] [afr-common.c:2581:afr_local_discovery_cbk] 0-tapeless-replicate-0: selecting local read_child tapeless-client-0
The message "I [MSGID: 108026] [afr-self-heal-metadata.c:52:__afr_selfheal_metadata_do] 0-tapeless-replicate-1: performing metadata selfheal on 958dbd7a-3cd7-4b66-9038-76e5c5669644" repeated 3 times between [2020-10-20 11:35:20.736633] and [2020-10-20 11:35:26.733298]
The message "I [MSGID: 108026] [afr-self-heal-common.c:1750:afr_log_selfheal] 0-tapeless-replicate-1: Completed metadata selfheal on 958dbd7a-3cd7-4b66-9038-76e5c5669644. sources=[0] 1 sinks=2 " repeated 3 times between [2020-10-20 11:35:20.741213] and [2020-10-20 11:35:26.737629]
[2020-10-20 11:36:02.548350] I [MSGID: 108031] [afr-common.c:2581:afr_local_discovery_cbk] 0-tapeless-replicate-0: selecting local read_child tapeless-client-0
[2020-10-20 11:36:57.365537] I [MSGID: 108026] [afr-self-heal-metadata.c:52:__afr_selfheal_metadata_do] 0-tapeless-replicate-1: performing metadata selfheal on f4907af2-1775-4c46-89b5-e9776df6d5c7
[2020-10-20 11:36:57.370824] I [MSGID: 108026] [afr-self-heal-common.c:1750:afr_log_selfheal] 0-tapeless-replicate-1: Completed metadata selfheal on f4907af2-1775-4c46-89b5-e9776df6d5c7. sources=[0] 1 sinks=2
[2020-10-20 11:37:01.363925] I [MSGID: 108026] [afr-self-heal-metadata.c:52:__afr_selfheal_metadata_do] 0-tapeless-replicate-1: performing metadata selfheal on f4907af2-1775-4c46-89b5-e9776df6d5c7
[2020-10-20 11:37:01.368069] I [MSGID: 108026] [afr-self-heal-common.c:1750:afr_log_selfheal] 0-tapeless-replicate-1: Completed metadata selfheal on f4907af2-1775-4c46-89b5-e9776df6d5c7. sources=[0] 1 sinks=2
The message "I [MSGID: 108031] [afr-common.c:2581:afr_local_discovery_cbk] 0-tapeless-replicate-0: selecting local read_child tapeless-client-0" repeated 3 times between [2020-10-20 11:36:02.548350] and [2020-10-20 11:37:36.389208]
[2020-10-20 11:38:07.367113] I [MSGID: 108031] [afr-common.c:2581:afr_local_discovery_cbk] 0-tapeless-replicate-0: selecting local read_child tapeless-client-0
[2020-10-20 11:39:01.595981] I [MSGID: 108031] [afr-common.c:2581:afr_local_discovery_cbk] 0-tapeless-replicate-0: selecting local read_child tapeless-client-0
[2020-10-20 11:40:04.184899] I [MSGID: 108031] [afr-common.c:2581:afr_local_discovery_cbk] 0-tapeless-replicate-0: selecting local read_child tapeless-client-0
[2020-10-20 11:41:07.833470] I [MSGID: 108031] [afr-common.c:2581:afr_local_discovery_cbk] 0-tapeless-replicate-0: selecting local read_child tapeless-client-0
[2020-10-20 11:42:01.871621] I [MSGID: 108031] [afr-common.c:2581:afr_local_discovery_cbk] 0-tapeless-replicate-0: selecting local read_child tapeless-client-0
[2020-10-20 11:43:04.399194] I [MSGID: 108031] [afr-common.c:2581:afr_local_discovery_cbk] 0-tapeless-replicate-0: selecting local read_child tapeless-client-0
[2020-10-20 11:44:04.558647] I [MSGID: 108031] [afr-common.c:2581:afr_local_discovery_cbk] 0-tapeless-replicate-0: selecting local read_child tapeless-client-0
[2020-10-20 11:44:15.953600] W [MSGID: 114031] [client-rpc-fops_v2.c:2114:client4_0_create_cbk] 0-tapeless-client-5: remote operation failed. Path: /PN/arribagente/PLAYER 2020/ARRIBA GENTE martes 20 de octubre/PANEO NIÑOS ESCUELAS CON TAPABOCAS.mpg [File exists]
[2020-10-20 11:44:15.953819] W [MSGID: 114031] [client-rpc-fops_v2.c:2114:client4_0_create_cbk] 0-tapeless-client-2: remote operation failed. Path: /PN/arribagente/PLAYER 2020/ARRIBA GENTE martes 20 de octubre/PANEO NIÑOS ESCUELAS CON TAPABOCAS.mpg [File exists]
[2020-10-20 11:44:15.954072] W [MSGID: 114031] [client-rpc-fops_v2.c:2114:client4_0_create_cbk] 0-tapeless-client-3: remote operation failed. Path: /PN/arribagente/PLAYER 2020/ARRIBA GENTE martes 20 de octubre/PANEO NIÑOS ESCUELAS CON TAPABOCAS.mpg [File exists]
[2020-10-20 11:44:15.954680] W [fuse-bridge.c:2606:fuse_create_cbk] 0-glusterfs-fuse: 31043294: /PN/arribagente/PLAYER 2020/ARRIBA GENTE martes 20 de octubre/PANEO NIÑOS ESCUELAS CON TAPABOCAS.mpg => -1 (File exists)
[2020-10-20 11:44:15.963175] W [fuse-bridge.c:2606:fuse_create_cbk] 0-glusterfs-fuse: 31043306: /PN/arribagente/PLAYER 2020/ARRIBA GENTE martes 20 de octubre/PANEO NIÑOS ESCUELAS CON TAPABOCAS.mpg => -1 (File exists)
[2020-10-20 11:44:15.971839] W [fuse-bridge.c:2606:fuse_create_cbk] 0-glusterfs-fuse: 31043318: /PN/arribagente/PLAYER 2020/ARRIBA GENTE martes 20 de octubre/PANEO NIÑOS ESCUELAS CON TAPABOCAS.mpg => -1 (File exists)
[2020-10-20 11:44:16.010242] W [fuse-bridge.c:2606:fuse_create_cbk] 0-glusterfs-fuse: 31043403: /PN/arribagente/PLAYER 2020/ARRIBA GENTE martes 20 de octubre/PANEO NIÑOS ESCUELAS CON TAPABOCAS.mpg => -1 (File exists)
[2020-10-20 11:44:16.020291] W [fuse-bridge.c:2606:fuse_create_cbk] 0-glusterfs-fuse: 31043415: /PN/arribagente/PLAYER 2020/ARRIBA GENTE martes 20 de octubre/PANEO NIÑOS ESCUELAS CON TAPABOCAS.mpg => -1 (File exists)
[2020-10-20 11:44:16.028857] W [fuse-bridge.c:2606:fuse_create_cbk] 0-glusterfs-fuse: 31043427: /PN/arribagente/PLAYER 2020/ARRIBA GENTE martes 20 de octubre/PANEO NIÑOS ESCUELAS CON TAPABOCAS.mpg => -1 (File exists)
The message "W [MSGID: 114031] [client-rpc-fops_v2.c:2114:client4_0_create_cbk] 0-tapeless-client-5: remote operation failed. Path: /PN/arribagente/PLAYER 2020/ARRIBA GENTE martes 20 de octubre/PANEO NIÑOS ESCUELAS CON TAPABOCAS.mpg [File exists]" repeated 5 times between [2020-10-20 11:44:15.953600] and [2020-10-20 11:44:16.027785]
The message "W [MSGID: 114031] [client-rpc-fops_v2.c:2114:client4_0_create_cbk] 0-tapeless-client-2: remote operation failed. Path: /PN/arribagente/PLAYER 2020/ARRIBA GENTE martes 20 de octubre/PANEO NIÑOS ESCUELAS CON TAPABOCAS.mpg [File exists]" repeated 5 times between [2020-10-20 11:44:15.953819] and [2020-10-20 11:44:16.028331]
The message "W [MSGID: 114031] [client-rpc-fops_v2.c:2114:client4_0_create_cbk] 0-tapeless-client-3: remote operation failed. Path: /PN/arribagente/PLAYER 2020/ARRIBA GENTE martes 20 de octubre/PANEO NIÑOS ESCUELAS CON TAPABOCAS.mpg [File exists]" repeated 5 times between [2020-10-20 11:44:15.954072] and [2020-10-20 11:44:16.028355]
[2020-10-20 11:45:03.572106] I [MSGID: 108031] [afr-common.c:2581:afr_local_discovery_cbk] 0-tapeless-replicate-0: selecting local read_child tapeless-client-0
[2020-10-20 11:45:40.080010] I [MSGID: 108031] [afr-common.c:2581:afr_local_discovery_cbk] 0-tapeless-replicate-0: selecting local read_child tapeless-client-0
The message "I [MSGID: 108031] [afr-common.c:2581:afr_local_discovery_cbk] 0-tapeless-replicate-0: selecting local read_child tapeless-client-0" repeated 2 times between [2020-10-20 11:45:40.080010] and [2020-10-20 11:47:10.871801]
[2020-10-20 11:48:03.913129] I [MSGID: 108031] [afr-common.c:2581:afr_local_discovery_cbk] 0-tapeless-replicate-0: selecting local read_child tapeless-client-0
[2020-10-20 11:49:05.082165] I [MSGID: 108031] [afr-common.c:2581:afr_local_discovery_cbk] 0-tapeless-replicate-0: selecting local read_child tapeless-client-0
[2020-10-20 11:50:06.725722] I [MSGID: 108031] [afr-common.c:2581:afr_local_discovery_cbk] 0-tapeless-replicate-0: selecting local read_child tapeless-client-0
[2020-10-20 11:51:04.254685] I [MSGID: 108031] [afr-common.c:2581:afr_local_discovery_cbk] 0-tapeless-replicate-0: selecting local read_child tapeless-client-0
[2020-10-20 11:52:07.903617] I [MSGID: 108031] [afr-common.c:2581:afr_local_discovery_cbk] 0-tapeless-replicate-0: selecting local read_child tapeless-client-0
[2020-10-20 11:53:01.420513] I [MSGID: 108026] [afr-self-heal-metadata.c:52:__afr_selfheal_metadata_do] 0-tapeless-replicate-0: performing metadata selfheal on 3c316533-5f47-4267-ac19-58b3be305b94
[2020-10-20 11:53:01.428657] I [MSGID: 108026] [afr-self-heal-common.c:1750:afr_log_selfheal] 0-tapeless-replicate-0: Completed metadata selfheal on 3c316533-5f47-4267-ac19-58b3be305b94. sources=[0] sinks=1 2
The message "I [MSGID: 108031] [afr-common.c:2581:afr_local_discovery_cbk] 0-tapeless-replicate-0: selecting local read_child tapeless-client-0" repeated 3 times between [2020-10-20 11:52:07.903617] and [2020-10-20 11:53:12.037835]
[2020-10-20 11:54:02.208354] I [MSGID: 108031] [afr-common.c:2581:afr_local_discovery_cbk] 0-tapeless-replicate-0: selecting local read_child tapeless-client-0
[2020-10-20 11:55:04.360284] I [MSGID: 108031] [afr-common.c:2581:afr_local_discovery_cbk] 0-tapeless-replicate-0: selecting local read_child tapeless-client-0
[2020-10-20 11:56:09.508092] I [MSGID: 108031] [afr-common.c:2581:afr_local_discovery_cbk] 0-tapeless-replicate-0: selecting local read_child tapeless-client-0
[2020-10-20 11:57:02.580970] I [MSGID: 108031] [afr-common.c:2581:afr_local_discovery_cbk] 0-tapeless-replicate-0: selecting local read_child tapeless-client-0
[2020-10-20 11:58:06.230698] I [MSGID: 108031] [afr-common.c:2581:afr_local_discovery_cbk] 0-tapeless-replicate-0: selecting local read_child tapeless-client-0

Let me know if you need anything else. Thank you for your support!
Best Regards,
Martin Lorenzo

--
Respectfully
Mahdi
________

Community Meeting Calendar:

Schedule -
Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
Bridge: https://meet.google.com/cpu-eiue-hvk

Gluster-users mailing list
[email protected]
https://lists.gluster.org/mailman/listinfo/gluster-users
