Hi Shwetha,
thank you for your reply...
I ran a few tests in debug mode but found no real indication of the
problem. After each start of the geo-replication, some files are
transferred at the beginning, but then no further transfers occur.
A few minutes after the start, the number of changelog files in <brick>
looks like this:
[ 06:42:26 ] - root@gl-master-02 ~/tmp $./var_gluster.sh
/var/lib/misc/gluster/gsyncd/mvol1_gl-slave-01-int_svol1/brick1-mvol1/.processed
: 0
/var/lib/misc/gluster/gsyncd/mvol1_gl-slave-01-int_svol1/brick1-mvol1/.processing
: 27 ### growing the whole time
/var/lib/misc/gluster/gsyncd/mvol1_gl-slave-01-int_svol1/brick1-mvol1/.history
: 324861
/var/lib/misc/gluster/gsyncd/mvol1_gl-slave-01-int_svol1/brick1-mvol1/.history/.processed
: 1
/var/lib/misc/gluster/gsyncd/mvol1_gl-slave-01-int_svol1/brick1-mvol1/.history/.processing
: 324859 ### finished building changelog files
/var/lib/misc/gluster/gsyncd/mvol1_gl-slave-01-int_svol1/brick1-mvol1/.history/.current
: 0
/var/lib/misc/gluster/gsyncd/mvol1_gl-slave-01-int_svol1/brick1-mvol1/.current
: 0
/var/lib/misc/gluster/gsyncd/mvol1_gl-slave-01-int_svol1/brick1-mvol1/xsync
: 0
---
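For reference, a script like var_gluster.sh above could look roughly like
this (a minimal, hypothetical reconstruction; only the session path is
taken from the output above, the rest is an assumption):

```shell
#!/bin/bash
# Count the direct entries in each gsyncd working directory of the brick.
# BASE is the session path shown in the output above (assumed layout).
BASE=/var/lib/misc/gluster/gsyncd/mvol1_gl-slave-01-int_svol1/brick1-mvol1
for d in .processed .processing .history .history/.processed \
         .history/.processing .history/.current .current xsync; do
    # Count entries one level deep; suppress errors for missing dirs.
    n=$(find "$BASE/$d" -mindepth 1 -maxdepth 1 2>/dev/null | wc -l)
    printf '%s : %s\n' "$BASE/$d" "$n"
done
```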
As far as I remember, at the beginning I saw a few changelog files
in <brick>/.processed for a short moment, but always with size 0. Even
after some hours there are no files in <brick>/.processed.
The strace output shows a lot of messages like 'failed: No data
available' and 'rsync error: some files/attrs were not transferred ...
(code 23)' for about the first 5-10 minutes after the geo-rep start.
For example, for gfid 8d601e5b-180c....:
19361 1615530800.812727 select(7, [6], [], [], NULL <unfinished ...>
19357 1615530800.812779 select(0, NULL, NULL, NULL, {tv_sec=0,
tv_usec=235797} <unfinished ...>
19352 1615530800.816522
lstat(".gfid/f0ed7d0e-83be-4c3f-b2c8-f763e9aada12",
{st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
19352 1615530800.817723
lstat(".gfid/c5b44852-9cf9-441b-8766-d87bfac774c8",
{st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
19352 1615530800.819507
lstat(".gfid/b0b71bcc-7653-4ab8-b863-a83d395f5e91",
{st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
19352 1615530800.821106
lstat(".gfid/c4e80ff5-2e08-4e68-9a4d-ea7f45ed290d",
{st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
19352 1615530800.822874
lstat(".gfid/aaa468d8-9d6a-4aaf-8344-c57440286f5c", <unfinished ...>
19418 1615530800.823466 <... restart_syscall resumed> ) = 1
19418 1615530800.823519 read(14, "rsync: get_xattr_data:
lgetxattr(\"\"/tmp/gsyncd-aux-mount-46pc26b9/.gfid/8d601e5b-180c-46c8-b64f-ae6224542234\"\",\"trusted.glusterfs.mdata\",0)
failed: No data available (61)\n", 32768) = 171
19418 1615530800.823587 poll([{fd=14, events=POLLIN}], 1, -1 <unfinished
...>
19352 1615530800.823830 <... lstat resumed> {st_mode=S_IFREG|0644,
st_size=4226767, ...}) = 0
19352 1615530800.823882
lstat(".gfid/8164ea3b-44f6-4ea2-a75f-501cea0024cc",
{st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
19352 1615530800.897938
lstat(".gfid/01da73ae-1f88-498d-8fe5-84ea76db12f3",
{st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
19352 1615530800.934281
lstat(".gfid/be48f891-cdc1-4e4c-a141-7001ae3f592e",
{st_mode=S_IFDIR|0777, st_size=4096, ...}) = 0
19352 1615530800.935938
lstat(".gfid/501fab77-5e83-42cb-9edf-ce30bc3a86a9",
{st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
19352 1615530800.937481
lstat(".gfid/668f6bd1-bdb0-46e0-9cd4-c7ebc38fbaf9", <unfinished ...>
19417 1615530800.961937 <... restart_syscall resumed> ) = 1
19417 1615530800.962042 read(13, "rsync error: some files/attrs were not
transferred (see previous errors) (code 23) at main.c(1196)
[sender=3.1.2]\n", 32768) = 114
According to gsyncd.log this gfid is a candidate for syncing, like many
others; only very few are reported as 'synced':
[2021-03-12 06:33:20.651147] D [master(worker
/brick1/mvol1):318:a_syncdata] _GMaster: candidate for syncing
file=.gfid/8d601e5b-180c-46c8-b64f-ae6224542234
[2021-03-12 06:35:17.419920] D [master(worker
/brick1/mvol1):318:a_syncdata] _GMaster: candidate for syncing
file=.gfid/8d601e5b-180c-46c8-b64f-ae6224542234
[2021-03-12 06:35:03.382977] D [master(worker /brick1/mvol1):324:regjob]
_GMaster: synced file=.gfid/a3656075-784c-4377-a482-4aad8378acf0
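To quantify the imbalance, one could count both kinds of entries in the
debug log. A rough sketch; the log path below is an assumption based on
typical gluster setups and must be adjusted to the actual session:

```shell
# Count 'candidate for syncing' vs. 'synced' lines in gsyncd.log.
# LOG is a guessed path; replace it with your session's gsyncd.log.
LOG=/var/log/glusterfs/geo-replication/mvol1_gl-slave-01-int_svol1/gsyncd.log
echo "candidates: $(grep -c 'candidate for syncing' "$LOG")"
echo "synced:     $(grep -c '_GMaster: synced' "$LOG")"
```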
When I try to fetch extended attributes, I get 'No such attribute' for
the mentioned gfid, while for a synced gfid the attributes are available:
[ 09:58:54 ] - root@gl-master-02 ~/tmp $getfattr -m . -d -e hex
/tmp/gsyncd-aux-mount-46pc26b9/.gfid/8d601e5b-180c-46c8-b64f-ae6224542234
/tmp/gsyncd-aux-mount-46pc26b9/.gfid/8d601e5b-180c-46c8-b64f-ae6224542234:
trusted.glusterfs.mdata: No such attribute
[ 09:59:38 ] - root@gl-master-02 ~/tmp $getfattr -m . -d -e hex
/tmp/gsyncd-aux-mount-46pc26b9/.gfid/a3656075-784c-4377-a482-4aad8378acf0
getfattr: Removing leading '/' from absolute path names
# file:
tmp/gsyncd-aux-mount-46pc26b9/.gfid/a3656075-784c-4377-a482-4aad8378acf0
trusted.glusterfs.mdata=0x010000000000000000000000005d1f73ff000000000da35668000000005d1f73fd0000000015811b46000000005f29050b000000001d5363a6
I can run a stat on that directory, for example:
[ 10:07:19 ] - root@gl-master-02 ~/tmp $stat
/tmp/gsyncd-aux-mount-46pc26b9/.gfid/8d601e5b-180c-46c8-b64f-ae6224542234
File:
/tmp/gsyncd-aux-mount-46pc26b9/.gfid/8d601e5b-180c-46c8-b64f-ae6224542234
Size: 4096 Blocks: 8 IO Block: 131072 directory
Device: 37h/55d Inode: 9394601311212820456 Links: 2
Access: (0755/drwxr-xr-x) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2021-01-17 07:26:09.596743288 +0000
Modify: 2021-03-12 07:34:39.382122663 +0000
Change: 2021-03-12 07:34:39.383446790 +0000
Birth: -
Currently I have no clue how to get this fixed. Obviously the extended
attributes are missing for most entries in /tmp/gsyncd-aux-mount.../
while some others have them.
I believe rsync itself is not the cause; rather, the problem is the
missing attributes?!
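To see how widespread the missing xattr is, one could take the candidate
gfids from gsyncd.log and probe each one on the aux mount. A sketch only;
both paths are assumptions from this session and may differ after a
geo-rep restart:

```shell
# For each 'candidate for syncing' gfid found in gsyncd.log, check on
# the aux mount whether trusted.glusterfs.mdata is present.
# MNT and LOG are assumed paths; adjust to the running session.
MNT=/tmp/gsyncd-aux-mount-46pc26b9
LOG=/var/log/glusterfs/geo-replication/mvol1_gl-slave-01-int_svol1/gsyncd.log
grep -o 'file=\.gfid/[0-9a-f-]*' "$LOG" | sort -u | cut -d= -f2 |
while read -r g; do
    if ! getfattr -n trusted.glusterfs.mdata --only-values \
            "$MNT/$g" >/dev/null 2>&1; then
        echo "missing mdata: $g"
    fi
done
```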
Lastly, the gfid points to a directory; when I try to get the attributes
for this directory via the brick path, it succeeds:
[ 10:15:25 ] - root@gl-master-02 ~/tmp $ls -l
/brick1/mvol1/.glusterfs/8d/60/8d601e5b-180c-46c8-b64f-ae6224542234
lrwxrwxrwx 1 root root 56 Nov 5 17:54
/brick1/mvol1/.glusterfs/8d/60/8d601e5b-180c-46c8-b64f-ae6224542234 ->
../../f0/94/f094bf06-2806-4f90-9a79-489827c6cdf9/2217547
[ 10:38:01 ] - root@gl-master-02 ~ $getfattr -m . -d -e hex
/brick1/mvol1/2137/files/20/11/2217547
getfattr: Removing leading '/' from absolute path names
# file: brick1/mvol1/2137/files/20/11/2217547
trusted.gfid=0x8d601e5b180c46c8b64fae6224542234
trusted.glusterfs.2f5de6e4-66de-40a7-9f24-4762aad3ca96.xtime=0x604b198f0005e528
trusted.glusterfs.dht=0x001ed359000000007a2d37c1a8b9af89
trusted.glusterfs.dht.mds=0x00000000
[ 10:38:42 ] - root@gl-master-02 ~ $getfattr -m . -d -e hex
/brick1/mvol1/2137/files/20/11
getfattr: Removing leading '/' from absolute path names
# file: brick1/mvol1/2137/files/20/11
trusted.gfid=0xf094bf0628064f909a79489827c6cdf9
trusted.glusterfs.2f5de6e4-66de-40a7-9f24-4762aad3ca96.xtime=0x604b198f0005e528
trusted.glusterfs.dht=0x001ed35900000000d1738834ffffffff
trusted.glusterfs.mdata=0x010000000000000000000000005fc5378000000000077ba08a000000005fc535b60000000038d942cc000000005f9ffc610000000007b08744
[ 10:39:54 ] - root@gl-master-02 ~ $
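As a side note, the step from a directory gfid to its brick path can be
scripted. This sketch assumes the usual .glusterfs layout, where a
directory's gfid file is a symlink into the tree (as seen in the ls -l
output above):

```shell
# Resolve a directory gfid to its real path inside the brick.
# For directories, .glusterfs/<first2>/<next2>/<gfid> is a symlink.
BRICK=/brick1/mvol1
GFID=8d601e5b-180c-46c8-b64f-ae6224542234
# ${GFID:0:2} and ${GFID:2:2} are the first two hash-path components.
readlink -f "$BRICK/.glusterfs/${GFID:0:2}/${GFID:2:2}/$GFID"
```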
But the geo-rep ends up in a loop, though without any 'E' (error) messages:
[2021-03-12 10:46:40.572500] D [repce(worker
/brick1/mvol1):215:__call__] RepceClient: call
19352:140387951818496:1615546000.5609794 keep_alive -> 256
[2021-03-12 10:46:41.23154] D [master(worker
/brick2/mvol1):554:crawlwrap] _GMaster: ... crawl #0 done, took 5.017846
seconds
[2021-03-12 10:46:41.35729] D [master(worker
/brick2/mvol1):578:crawlwrap] _GMaster: Crawl info
cluster_stime=(1609281098, 0) brick_stime=(1609281900, 0)
[2021-03-12 10:46:46.41012] D [master(worker
/brick2/mvol1):554:crawlwrap] _GMaster: ... crawl #0 done, took 5.017512
seconds
[2021-03-12 10:46:46.53818] D [master(worker
/brick2/mvol1):578:crawlwrap] _GMaster: Crawl info
cluster_stime=(1609281098, 0) brick_stime=(1609281900, 0)
[2021-03-12 10:46:48.269174] D [repce(worker /brick2/mvol1):195:push]
RepceClient: call 19354:140476158043904:1615546008.2690222
keep_alive({'version': (1, 0), 'uuid':
'2f5de6e4-66de-40a7-9f24-4762aad3ca96', 'retval': 0, 'volume_mark':
(1609275788, 819193), 'timeout': 1615546128},) ...
Does anyone have an idea how to solve this problem?

Best regards,
Dietmar
On 04.03.21 08:48, Shwetha Acharya wrote:
> Hi Dietmar,
>
>> batch-fsync-delay-usec was already set to 0 and I increased the
>> sync_jobs from 3 to 6. The moment I increased the sync_jobs, the
>> following error appeared in gsyncd.log:
>>
>> [2021-03-03 23:17:46.59727] E [syncdutils(worker
>> /brick1/mvol1):312:log_raise_exception] <top>: connection to peer
>> is broken
>> [2021-03-03 23:17:46.59912] E [syncdutils(worker
>> /brick2/mvol1):312:log_raise_exception] <top>: connection to peer
>> is broken
>
> If the geo-rep session is currently not in a faulty state, we need not
> be bothered about this log message. It is normal: when the config is
> updated, a geo-rep restart occurs and the above message pops up.
>
>> passive nodes became active and the content in <brick>/.processing
>> was removed. currently new changelog files are created in this
>> directory. shortly before I changed the sync_jobs I had checked
>> the <brick>/.processing directory on the master nodes. the result
>> was the same for every master node.
>> since the last error about 12 hours ago nearly 2400 changelog
>> files were created on each node, but it looks like none of them
>> were consumed.
>
> Processed changelogs that are synced are archived under the
> <brick>/.processed directory. Verify whether the latest file is created
> there.
>
>> at the moment I'm not sure what is right and what is
>> wrong... should at least the oldest changelog files in this
>> directory have been processed gradually?
>
> Also you can try to set the log-level to debug for a while and set it
> back to info (to avoid flooding the logs) and check the logs to get a
> better picture of the scenario.
>
> #gluster volume geo-replication <primary> <ip>::<secondary> config
> log-level DEBUG
> #gluster volume geo-replication <primary> <ip>::<secondary> config
> log-level INFO
>
> Regards,
> Shwetha
________
Community Meeting Calendar:
Schedule -
Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
Bridge: https://meet.google.com/cpu-eiue-hvk
Gluster-users mailing list
[email protected]
https://lists.gluster.org/mailman/listinfo/gluster-users