Hi Shwetha,
thank you for your reply...
I ran a few tests in debug mode but found no real indication of the
problem. After each start of the geo-replication, some files are
transferred at the beginning, but then no further transfers occur.
A few minutes after the start, the number of changelog files in <brick>
looks like this:
[ 06:42:26 ] - root@gl-master-02 ~/tmp $./var_gluster.sh
/var/lib/misc/gluster/gsyncd/mvol1_gl-slave-01-int_svol1/brick1-mvol1/.processed
: 0
/var/lib/misc/gluster/gsyncd/mvol1_gl-slave-01-int_svol1/brick1-mvol1/.processing
: 27 ### growing the whole time
/var/lib/misc/gluster/gsyncd/mvol1_gl-slave-01-int_svol1/brick1-mvol1/.history
: 324861
/var/lib/misc/gluster/gsyncd/mvol1_gl-slave-01-int_svol1/brick1-mvol1/.history/.processed
: 1
/var/lib/misc/gluster/gsyncd/mvol1_gl-slave-01-int_svol1/brick1-mvol1/.history/.processing
: 324859 ### finished building changelog files
/var/lib/misc/gluster/gsyncd/mvol1_gl-slave-01-int_svol1/brick1-mvol1/.history/.current
: 0
/var/lib/misc/gluster/gsyncd/mvol1_gl-slave-01-int_svol1/brick1-mvol1/.current
: 0
/var/lib/misc/gluster/gsyncd/mvol1_gl-slave-01-int_svol1/brick1-mvol1/xsync
: 0
---
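For reference, a script like var_gluster.sh above could look roughly like
this (a minimal, hypothetical reconstruction; only the session path is
taken from the output above, the rest is an assumption):

```shell
#!/bin/bash
# Count the direct entries in each gsyncd working directory of the brick.
# BASE is the session path shown in the output above (assumed layout).
BASE=/var/lib/misc/gluster/gsyncd/mvol1_gl-slave-01-int_svol1/brick1-mvol1
for d in .processed .processing .history .history/.processed \
         .history/.processing .history/.current .current xsync; do
    # Count entries one level deep; suppress errors for missing dirs.
    n=$(find "$BASE/$d" -mindepth 1 -maxdepth 1 2>/dev/null | wc -l)
    printf '%s : %s\n' "$BASE/$d" "$n"
done
```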
As far as I remember, at the beginning I saw a few changelog files
in <brick>/.processed for a short moment, but always with size 0. Even
after some hours there are no files in <brick>/.processed.
The strace output shows a lot of messages like 'failed: No data
available' and 'rsync error: some files/attrs were not transferred ...
(code 23)' for about the first 5-10 minutes after the geo-rep start.
For example, for gfid 8d601e5b-180c....:
19361 1615530800.812727 select(7, [6], [], [], NULL <unfinished ...>
19357 1615530800.812779 select(0, NULL, NULL, NULL, {tv_sec=0,
tv_usec=235797} <unfinished ...>
19352 1615530800.816522
lstat(".gfid/f0ed7d0e-83be-4c3f-b2c8-f763e9aada12",
{st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
19352 1615530800.817723
lstat(".gfid/c5b44852-9cf9-441b-8766-d87bfac774c8",
{st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
19352 1615530800.819507
lstat(".gfid/b0b71bcc-7653-4ab8-b863-a83d395f5e91",
{st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
19352 1615530800.821106
lstat(".gfid/c4e80ff5-2e08-4e68-9a4d-ea7f45ed290d",
{st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
19352 1615530800.822874
lstat(".gfid/aaa468d8-9d6a-4aaf-8344-c57440286f5c", <unfinished ...>
19418 1615530800.823466 <... restart_syscall resumed> ) = 1
19418 1615530800.823519 read(14, "rsync: get_xattr_data:
lgetxattr(\"\"/tmp/gsyncd-aux-mount-46pc26b9/.gfid/8d601e5b-180c-46c8-b64f-ae6224542234\"\",\"trusted.glusterfs.mdata\",0)
failed: No data available (61)\n", 32768) = 171
19418 1615530800.823587 poll([{fd=14, events=POLLIN}], 1, -1 <unfinished
...>
19352 1615530800.823830 <... lstat resumed> {st_mode=S_IFREG|0644,
st_size=4226767, ...}) = 0
19352 1615530800.823882
lstat(".gfid/8164ea3b-44f6-4ea2-a75f-501cea0024cc",
{st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
19352 1615530800.897938
lstat(".gfid/01da73ae-1f88-498d-8fe5-84ea76db12f3",
{st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
19352 1615530800.934281
lstat(".gfid/be48f891-cdc1-4e4c-a141-7001ae3f592e",
{st_mode=S_IFDIR|0777, st_size=4096, ...}) = 0
19352 1615530800.935938
lstat(".gfid/501fab77-5e83-42cb-9edf-ce30bc3a86a9",
{st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
19352 1615530800.937481
lstat(".gfid/668f6bd1-bdb0-46e0-9cd4-c7ebc38fbaf9", <unfinished ...>
19417 1615530800.961937 <... restart_syscall resumed> ) = 1
19417 1615530800.962042 read(13, "rsync error: some files/attrs were not
transferred (see previous errors) (code 23) at main.c(1196)
[sender=3.1.2]\n", 32768) = 114
According to gsyncd.log this gfid is a candidate for syncing, like many
others; only very few are reported as 'synced':
[2021-03-12 06:33:20.651147] D [master(worker
/brick1/mvol1):318:a_syncdata] _GMaster: candidate for syncing
file=.gfid/8d601e5b-180c-46c8-b64f-ae6224542234
[2021-03-12 06:35:17.419920] D [master(worker
/brick1/mvol1):318:a_syncdata] _GMaster: candidate for syncing
file=.gfid/8d601e5b-180c-46c8-b64f-ae6224542234
[2021-03-12 06:35:03.382977] D [master(worker /brick1/mvol1):324:regjob]
_GMaster: synced file=.gfid/a3656075-784c-4377-a482-4aad8378acf0
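To quantify the imbalance, one could count both kinds of entries in the
debug log. A rough sketch; the log path below is an assumption based on
typical gluster setups and must be adjusted to the actual session:

```shell
# Count 'candidate for syncing' vs. 'synced' lines in gsyncd.log.
# LOG is a guessed path; replace it with your session's gsyncd.log.
LOG=/var/log/glusterfs/geo-replication/mvol1_gl-slave-01-int_svol1/gsyncd.log
echo "candidates: $(grep -c 'candidate for syncing' "$LOG")"
echo "synced:     $(grep -c '_GMaster: synced' "$LOG")"
```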
When I try to fetch extended attributes, I get 'No such attribute' for
the mentioned gfid, while for a synced gfid the attributes are available:
[ 09:58:54 ] - root@gl-master-02 ~/tmp $getfattr -m . -d -e hex
/tmp/gsyncd-aux-mount-46pc26b9/.gfid/8d601e5b-180c-46c8-b64f-ae6224542234
/tmp/gsyncd-aux-mount-46pc26b9/.gfid/8d601e5b-180c-46c8-b64f-ae6224542234:
trusted.glusterfs.mdata: No such attribute
[ 09:59:38 ] - root@gl-master-02 ~/tmp $getfattr -m . -d -e hex
/tmp/gsyncd-aux-mount-46pc26b9/.gfid/a3656075-784c-4377-a482-4aad8378acf0
getfattr: Removing leading '/' from absolute path names
# file:
tmp/gsyncd-aux-mount-46pc26b9/.gfid/a3656075-784c-4377-a482-4aad8378acf0
trusted.glusterfs.mdata=0x010000000000000000000000005d1f73ff000000000da35668000000005d1f73fd0000000015811b46000000005f29050b000000001d5363a6
I can run a stat on that directory, for example:
[ 10:07:19 ] - root@gl-master-02 ~/tmp $stat
/tmp/gsyncd-aux-mount-46pc26b9/.gfid/8d601e5b-180c-46c8-b64f-ae6224542234
File:
/tmp/gsyncd-aux-mount-46pc26b9/.gfid/8d601e5b-180c-46c8-b64f-ae6224542234
Size: 4096 Blocks: 8 IO Block: 131072 directory
Device: 37h/55d Inode: 9394601311212820456 Links: 2
Access: (0755/drwxr-xr-x) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2021-01-17 07:26:09.596743288 +0000
Modify: 2021-03-12 07:34:39.382122663 +0000
Change: 2021-03-12 07:34:39.383446790 +0000
Birth: -
Currently I have no clue how to get this fixed. Obviously the extended
attributes are missing for most entries in /tmp/gsyncd-aux-mount.../
while some others have them.
I believe rsync itself is not the cause; rather, the problem is the
missing attributes?!
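To see how widespread the missing xattr is, one could take the candidate
gfids from gsyncd.log and probe each one on the aux mount. A sketch only;
both paths are assumptions from this session and may differ after a
geo-rep restart:

```shell
# For each 'candidate for syncing' gfid found in gsyncd.log, check on
# the aux mount whether trusted.glusterfs.mdata is present.
# MNT and LOG are assumed paths; adjust to the running session.
MNT=/tmp/gsyncd-aux-mount-46pc26b9
LOG=/var/log/glusterfs/geo-replication/mvol1_gl-slave-01-int_svol1/gsyncd.log
grep -o 'file=\.gfid/[0-9a-f-]*' "$LOG" | sort -u | cut -d= -f2 |
while read -r g; do
    if ! getfattr -n trusted.glusterfs.mdata --only-values \
            "$MNT/$g" >/dev/null 2>&1; then
        echo "missing mdata: $g"
    fi
done
```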
Lastly, the gfid points to a directory; when I try to get the attributes
for this directory via the brick path, it succeeds:
[ 10:15:25 ] - root@gl-master-02 ~/tmp $ls -l
/brick1/mvol1/.glusterfs/8d/60/8d601e5b-180c-46c8-b64f-ae6224542234
lrwxrwxrwx 1 root root 56 Nov 5 17:54
/brick1/mvol1/.glusterfs/8d/60/8d601e5b-180c-46c8-b64f-ae6224542234 ->
../../f0/94/f094bf06-2806-4f90-9a79-489827c6cdf9/2217547
[ 10:38:01 ] - root@gl-master-02 ~ $getfattr -m . -d -e hex
/brick1/mvol1/2137/files/20/11/2217547
getfattr: Removing leading '/' from absolute path names
# file: brick1/mvol1/2137/files/20/11/2217547
trusted.gfid=0x8d601e5b180c46c8b64fae6224542234
trusted.glusterfs.2f5de6e4-66de-40a7-9f24-4762aad3ca96.xtime=0x604b198f0005e528
trusted.glusterfs.dht=0x001ed359000000007a2d37c1a8b9af89
trusted.glusterfs.dht.mds=0x00000000
[ 10:38:42 ] - root@gl-master-02 ~ $getfattr -m . -d -e hex
/brick1/mvol1/2137/files/20/11
getfattr: Removing leading '/' from absolute path names
# file: brick1/mvol1/2137/files/20/11
trusted.gfid=0xf094bf0628064f909a79489827c6cdf9
trusted.glusterfs.2f5de6e4-66de-40a7-9f24-4762aad3ca96.xtime=0x604b198f0005e528
trusted.glusterfs.dht=0x001ed35900000000d1738834ffffffff
trusted.glusterfs.mdata=0x010000000000000000000000005fc5378000000000077ba08a000000005fc535b60000000038d942cc000000005f9ffc610000000007b08744
[ 10:39:54 ] - root@gl-master-02 ~ $
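As a side note, the step from a directory gfid to its brick path can be
scripted. This sketch assumes the usual .glusterfs layout, where a
directory's gfid file is a symlink into the tree (as seen in the ls -l
output above):

```shell
# Resolve a directory gfid to its real path inside the brick.
# For directories, .glusterfs/<first2>/<next2>/<gfid> is a symlink.
BRICK=/brick1/mvol1
GFID=8d601e5b-180c-46c8-b64f-ae6224542234
# ${GFID:0:2} and ${GFID:2:2} are the first two hash-path components.
readlink -f "$BRICK/.glusterfs/${GFID:0:2}/${GFID:2:2}/$GFID"
```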
But the geo-rep ends up in a loop, though without any 'E' (error) messages:
[2021-03-12 10:46:40.572500] D [repce(worker
/brick1/mvol1):215:__call__] RepceClient: call
19352:140387951818496:1615546000.5609794 keep_alive -> 256
[2021-03-12 10:46:41.23154] D [master(worker
/brick2/mvol1):554:crawlwrap] _GMaster: ... crawl #0 done, took 5.017846
seconds
[2021-03-12 10:46:41.35729] D [master(worker
/brick2/mvol1):578:crawlwrap] _GMaster: Crawl info
cluster_stime=(1609281098, 0) brick_stime=(1609281900, 0)
[2021-03-12 10:46:46.41012] D [master(worker
/brick2/mvol1):554:crawlwrap] _GMaster: ... crawl #0 done, took 5.017512
seconds
[2021-03-12 10:46:46.53818] D [master(worker
/brick2/mvol1):578:crawlwrap] _GMaster: Crawl info
cluster_stime=(1609281098, 0) brick_stime=(1609281900, 0)
[2021-03-12 10:46:48.269174] D [repce(worker /brick2/mvol1):195:push]
RepceClient: call 19354:140476158043904:1615546008.2690222
keep_alive({'version': (1, 0), 'uuid':
'2f5de6e4-66de-40a7-9f24-4762aad3ca96', 'retval': 0, 'volume_mark':
(1609275788, 819193), 'timeout': 1615546128},) ...
Does anyone have an idea how to solve this problem?

Best regards,
Dietmar
On 04.03.21 08:48, Shwetha Acharya wrote:
> Hi Dietmar,
>
>> batch-fsync-delay-usec was already set to 0 and I increased the
>> sync_jobs from 3 to 6. The moment I increased the sync_jobs, the
>> following error appeared in gsyncd.log:
>>
>> [2021-03-03 23:17:46.59727] E [syncdutils(worker
>> /brick1/mvol1):312:log_raise_exception] <top>: connection to peer
>> is broken
>> [2021-03-03 23:17:46.59912] E [syncdutils(worker
>> /brick2/mvol1):312:log_raise_exception] <top>: connection to peer
>> is broken
>
> If the geo-rep session is currently not in a faulty state, we need not
> be bothered about this log message. It is normal: when the config is
> updated, a geo-rep restart occurs and the above message pops up.
>
>> passive nodes became active and the content in <brick>/.processing
>> was removed. currently new changelog files are created in this
>> directory. shortly before I changed the sync_jobs I had checked
>> the <brick>/.processing directory on the master nodes. the result
>> was the same for every master node.
>> since the last error about 12 hours ago nearly 2400 changelog
>> files were created on each node, but it looks like none of them
>> were consumed.
>
> Processed changelogs that are synced are archived under the
> <brick>/.processed directory. Verify whether the latest file is created
> there.
>
>> at the moment I'm not sure what is right and what is
>> wrong... should at least the oldest changelog files in this
>> directory have been processed gradually?
>
> Also you can try to set the log-level to debug for a while and set it
> back to info (to avoid flooding the logs) and check the logs to get a
> better picture of the scenario.
>
> #gluster volume geo-replication <primary> <ip>::<secondary> config
> log-level DEBUG
> #gluster volume geo-replication <primary> <ip>::<secondary> config
> log-level INFO
>
> Regards,
> Shwetha
________
Community Meeting Calendar:
Schedule -
Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
Bridge: https://meet.google.com/cpu-eiue-hvk
Gluster-users mailing list
[email protected]
https://lists.gluster.org/mailman/listinfo/gluster-users