[Gluster-users] Geo-replication completely broken

Rob.Quagliozzi Thu, 25 Jun 2020 04:17:24 -0700

Hi All,

We've got two six-node RHEL 7.8 clusters, and geo-replication between them 
appears to be completely broken. I've deleted the session, removed and 
recreated the pem files, removed the old changelogs/htime (after removing the 
relevant options from the volume), and set up geo-rep again from scratch, but 
the new session comes up as Initializing, then goes Faulty, and keeps looping. 
The volume (on both sides) is a 4 x 2 disperse, running Gluster v6 (RH latest). 
Gsyncd reports:


[2020-06-25 07:07:14.701423] I [gsyncdstatus(monitor):248:set_worker_status] 
GeorepStatus: Worker Status Change status=Initializing...
[2020-06-25 07:07:14.701744] I [monitor(monitor):159:monitor] Monitor: starting 
gsyncd worker   brick=/rhgs/brick20/brick       
slave_node=bxts470194.eu.rabonet.com
[2020-06-25 07:07:14.707997] D [monitor(monitor):230:monitor] Monitor: Worker 
would mount volume privately
[2020-06-25 07:07:14.757181] I [gsyncd(agent /rhgs/brick20/brick):318:main] 
<top>: Using session config file    
path=/var/lib/glusterd/geo-replication/prd_mx_intvol_bxts470190_prd_mx_intvol/gsyncd.conf
[2020-06-25 07:07:14.758126] D [subcmds(agent 
/rhgs/brick20/brick):107:subcmd_agent] <top>: RPC FD      rpc_fd='5,12,11,10'
[2020-06-25 07:07:14.758627] I [changelogagent(agent 
/rhgs/brick20/brick):72:__init__] ChangelogAgent: Agent listining...
[2020-06-25 07:07:14.764234] I [gsyncd(worker /rhgs/brick20/brick):318:main] 
<top>: Using session config file   
path=/var/lib/glusterd/geo-replication/prd_mx_intvol_bxts470190_prd_mx_intvol/gsyncd.conf
[2020-06-25 07:07:14.779409] I [resource(worker 
/rhgs/brick20/brick):1386:connect_remote] SSH: Initializing SSH connection 
between master and slave...
[2020-06-25 07:07:14.841793] D [repce(worker /rhgs/brick20/brick):195:push] 
RepceClient: call 6799:140380783982400:1593068834.84 __repce_version__() ...
[2020-06-25 07:07:16.148725] D [repce(worker /rhgs/brick20/brick):215:__call__] 
RepceClient: call 6799:140380783982400:1593068834.84 __repce_version__ -> 1.0
[2020-06-25 07:07:16.148911] D [repce(worker /rhgs/brick20/brick):195:push] 
RepceClient: call 6799:140380783982400:1593068836.15 version() ...
[2020-06-25 07:07:16.149574] D [repce(worker /rhgs/brick20/brick):215:__call__] 
RepceClient: call 6799:140380783982400:1593068836.15 version -> 1.0
[2020-06-25 07:07:16.149735] D [repce(worker /rhgs/brick20/brick):195:push] 
RepceClient: call 6799:140380783982400:1593068836.15 pid() ...
[2020-06-25 07:07:16.150588] D [repce(worker /rhgs/brick20/brick):215:__call__] 
RepceClient: call 6799:140380783982400:1593068836.15 pid -> 30703
[2020-06-25 07:07:16.150747] I [resource(worker 
/rhgs/brick20/brick):1435:connect_remote] SSH: SSH connection between master 
and slave established.     duration=1.3712
[2020-06-25 07:07:16.150819] I [resource(worker 
/rhgs/brick20/brick):1105:connect] GLUSTER: Mounting gluster volume locally...
[2020-06-25 07:07:16.265860] D [resource(worker 
/rhgs/brick20/brick):879:inhibit] DirectMounter: auxiliary glusterfs mount in 
place
[2020-06-25 07:07:17.272511] D [resource(worker 
/rhgs/brick20/brick):953:inhibit] DirectMounter: auxiliary glusterfs mount 
prepared
[2020-06-25 07:07:17.272708] I [resource(worker 
/rhgs/brick20/brick):1128:connect] GLUSTER: Mounted gluster volume      
duration=1.1218
[2020-06-25 07:07:17.272794] I [subcmds(worker 
/rhgs/brick20/brick):84:subcmd_worker] <top>: Worker spawn successful. 
Acknowledging back to monitor
[2020-06-25 07:07:17.272973] D [master(worker 
/rhgs/brick20/brick):104:gmaster_builder] <top>: setting up change detection 
mode mode=xsync
[2020-06-25 07:07:17.273063] D [monitor(monitor):273:monitor] Monitor: 
worker(/rhgs/brick20/brick) connected
[2020-06-25 07:07:17.273678] D [master(worker 
/rhgs/brick20/brick):104:gmaster_builder] <top>: setting up change detection 
mode mode=changelog
[2020-06-25 07:07:17.274224] D [master(worker 
/rhgs/brick20/brick):104:gmaster_builder] <top>: setting up change detection 
mode mode=changeloghistory
[2020-06-25 07:07:17.276484] D [repce(worker /rhgs/brick20/brick):195:push] 
RepceClient: call 6799:140380783982400:1593068837.28 version() ...
[2020-06-25 07:07:17.276916] D [repce(worker /rhgs/brick20/brick):215:__call__] 
RepceClient: call 6799:140380783982400:1593068837.28 version -> 1.0
[2020-06-25 07:07:17.277009] D [master(worker 
/rhgs/brick20/brick):777:setup_working_dir] _GMaster: changelog working dir 
/var/lib/misc/gluster/gsyncd/prd_mx_intvol_bxts470190_prd_mx_intvol/rhgs-brick20-brick
[2020-06-25 07:07:17.277098] D [repce(worker /rhgs/brick20/brick):195:push] 
RepceClient: call 6799:140380783982400:1593068837.28 init() ...
[2020-06-25 07:07:17.292944] D [repce(worker /rhgs/brick20/brick):215:__call__] 
RepceClient: call 6799:140380783982400:1593068837.28 init -> None
[2020-06-25 07:07:17.293097] D [repce(worker /rhgs/brick20/brick):195:push] 
RepceClient: call 6799:140380783982400:1593068837.29 
register('/rhgs/brick20/brick', 
'/var/lib/misc/gluster/gsyncd/prd_mx_intvol_bxts470190_prd_mx_intvol/rhgs-brick20-brick',
 
'/var/log/glusterfs/geo-replication/prd_mx_intvol_bxts470190_prd_mx_intvol/changes-rhgs-brick20-brick.log',
 8, 5) ...
[2020-06-25 07:07:19.296294] E [repce(agent /rhgs/brick20/brick):121:worker] 
<top>: call failed:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 117, in worker
    res = getattr(self.obj, rmeth)(*in_data[2:])
  File "/usr/libexec/glusterfs/python/syncdaemon/changelogagent.py", line 40, 
in register
    return Changes.cl_register(cl_brick, cl_dir, cl_log, cl_level, retries)
  File "/usr/libexec/glusterfs/python/syncdaemon/libgfchangelog.py", line 46, 
in cl_register
    cls.raise_changelog_err()
  File "/usr/libexec/glusterfs/python/syncdaemon/libgfchangelog.py", line 30, 
in raise_changelog_err
    raise ChangelogException(errn, os.strerror(errn))
ChangelogException: [Errno 2] No such file or directory
[2020-06-25 07:07:19.297161] E [repce(worker /rhgs/brick20/brick):213:__call__] 
RepceClient: call failed        call=6799:140380783982400:1593068837.29 
method=register error=ChangelogException
[2020-06-25 07:07:19.297338] E [resource(worker 
/rhgs/brick20/brick):1286:service_loop] GLUSTER: Changelog register failed      
error=[Errno 2] No such file or directory
[2020-06-25 07:07:19.315074] I [repce(agent 
/rhgs/brick20/brick):96:service_loop] RepceServer: terminating on reaching EOF.
[2020-06-25 07:07:20.275701] I [monitor(monitor):280:monitor] Monitor: worker 
died in startup phase     brick=/rhgs/brick20/brick
[2020-06-25 07:07:20.277383] I [gsyncdstatus(monitor):248:set_worker_status] 
GeorepStatus: Worker Status Change status=Faulty
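
For anyone who wants to triage a log like the one above, here is a minimal 
sketch that regroups the wrapped lines into entries and keeps only the E-level 
ones. It assumes the "[timestamp] LEVEL [...]" layout seen in this excerpt; the 
names group_entries and errors_only are made up for illustration and are not 
part of gsyncd.

```python
import re
import sys

# Start of a log entry as it appears in the excerpt above:
# "[YYYY-MM-DD HH:MM:SS.ffffff] L [" where L is a one-letter level (I, D, E, ...).
ENTRY_START = re.compile(
    r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+)\] ([A-Z]) \["
)


def group_entries(lines):
    """Group physical lines into (timestamp, level, text) entries.

    Lines that do not start a new entry (mail-wrapped text, traceback
    lines) are appended to the entry that precedes them.
    """
    entries = []
    for raw in lines:
        line = raw.rstrip("\n")
        match = ENTRY_START.match(line)
        if match:
            entries.append([match.group(1), match.group(2), line])
        elif entries and line.strip():
            entries[-1][2] += "\n" + line
    return [tuple(entry) for entry in entries]


def errors_only(lines):
    """Return only the entries logged at level E."""
    return [entry for entry in group_entries(lines) if entry[1] == "E"]


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python triage.py /path/to/gsyncd.log
    with open(sys.argv[1]) as handle:
        for _timestamp, _level, text in errors_only(handle):
            print(text)
            print()
```

Run against the excerpt above, this should leave just the three E entries: the 
agent traceback, the RepceClient "call failed" line, and "Changelog register 
failed".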

We've tried everything we can think of, including running "strace -f" on the 
pid, and we can't find anything useful. I'm about to lose the last of my hair 
over this, so does anyone have any ideas at all? We've even removed the entire 
slave volume and rebuilt it.
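
Since the failure is a plain Errno 2, one cheap check is whether the paths 
handed to register() in the log actually exist on the master node. The sketch 
below is only that: the helper names are hypothetical, and the 
.glusterfs/changelogs and htime locations are an assumption about the default 
brick layout, not necessarily the path libgfchangelog tripped over.

```python
import os


def changelog_candidates(brick, working_dir, log_file):
    """Paths worth checking, built from the register() arguments.

    The two .glusterfs entries assume the default changelog layout
    under the brick; adjust if changelog-dir was customised.
    """
    return [
        brick,
        os.path.join(brick, ".glusterfs", "changelogs"),
        os.path.join(brick, ".glusterfs", "changelogs", "htime"),
        working_dir,
        os.path.dirname(log_file),
    ]


def missing_paths(paths):
    """Return the subset of paths that do not exist on this node."""
    return [path for path in paths if not os.path.exists(path)]


if __name__ == "__main__":
    # Values copied from the register() call in the log above.
    session = "prd_mx_intvol_bxts470190_prd_mx_intvol"
    candidates = changelog_candidates(
        "/rhgs/brick20/brick",
        "/var/lib/misc/gluster/gsyncd/" + session + "/rhgs-brick20-brick",
        "/var/log/glusterfs/geo-replication/" + session
        + "/changes-rhgs-brick20-brick.log",
    )
    for path in missing_paths(candidates):
        print("missing:", path)
```

An empty result does not rule much out, but any path it prints is a concrete 
place to start looking.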

Thanks
Rob

Rob Quagliozzi
Specialised Application Support




