Hi David,
I never solved this issue, as I couldn't figure out what was wrong. I just
went ahead and removed the second site, and I'll set up a new multisite
configuration once Luminous is out, hoping the weirdness has been sorted by then.

Sorry I didn't have any good answers :/

/andreas

On 24 Aug 2017 20:35, "David Turner" <drakonst...@gmail.com> wrote:

> Andreas, did you find a solution to your multisite sync issues with the
> stuck shards? I'm also on 10.2.7 and having this problem. One realm has
> stuck shards for data sync, and another realm says it's up to date but
> isn't receiving new users via metadata sync. I ran metadata sync init on
> it, and it had all the metadata up to date when it finished, but then new
> users weren't synced again. I don't know what to do to get these working
> stably. There are two RGWs for each realm in each zone, set up
> master/master so data can sync in both directions.
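>
> (For reference, a rough sketch of the commands involved in checking and
> re-kicking metadata sync on the secondary side; "myrealm" is a
> placeholder for your own realm name, and behaviour may vary a bit
> between versions:)
>
>   # overall sync state for the realm, as seen from this cluster
>   radosgw-admin sync status --rgw-realm=myrealm
>
>   # metadata sync details on the non-master zone
>   radosgw-admin metadata sync status --rgw-realm=myrealm
>
>   # re-initialize metadata sync, then either restart the local radosgw
>   # instances or run the sync in the foreground
>   radosgw-admin metadata sync init --rgw-realm=myrealm
>   radosgw-admin metadata sync run --rgw-realm=myrealm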
>
> On Mon, Jun 5, 2017 at 3:05 AM Andreas Calminder <
> andreas.calmin...@klarna.com> wrote:
>
>> Hello,
>> I'm using Ceph Jewel (10.2.7) and, as far as I know, the Jewel
>> multisite setup (multiple zones) as described at
>> http://docs.ceph.com/docs/master/radosgw/multisite/, with two Ceph
>> clusters, one in each site. Stretching a single cluster over multiple
>> sites is seldom, if ever, worth the hassle in my opinion. The reason the
>> replication ended up in a bad state seems to be a mix of issues. First,
>> if you shove a lot of objects into a single bucket (more than about 1M),
>> the bucket index starts to drag the rados gateways down. There's also
>> some kind of memory leak in rgw when the sync has failed
>> (http://tracker.ceph.com/issues/19446), causing the rgw daemons to die
>> left and right with out-of-memory errors, and sometimes other parts of
>> the system get dragged down with them.
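>>
>> (A rough sketch of how to spot buckets that have grown past that ~1M
>> object mark; BUCKETNAME is a placeholder, and the exact JSON layout of
>> the output may differ between releases:)
>>
>>   # per-bucket object counts show up under usage -> rgw.main -> num_objects
>>   radosgw-admin bucket stats --bucket=BUCKETNAME
>>
>>   # or dump stats for every bucket and look for the big ones
>>   radosgw-admin bucket stats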
>>
>> On 4 June 2017 at 22:22,  <ceph.nov...@habmalnefrage.de> wrote:
>> > Hi Andreas.
>> >
>> > Well, we do _NOT_ need multisite in our environment, but unfortunately
>> > it is the basis for the announced "metasearch", based on ElasticSearch...
>> > so we have been trying to implement a "multisite" config on Kraken
>> > (v11.2.0) for weeks, but have never succeeded so far. We have purged and
>> > started all over with the multisite config about five times by now.
>> >
>> > We have one CEPH cluster with two RadosGWs on top (so NOT two CEPH
>> > clusters!), not sure if this makes a difference!?
>> >
>> > Can you please share some info about your (formerly working?!?) setup?
>> > Like:
>> > - which CEPH version you are on
>> > - the old deprecated "federated" setup or the "new from Jewel" multisite setup
>> > - one or multiple CEPH clusters
>> >
>> > Great to see that multisite seems to work somehow, somewhere. We were
>> > really in doubt :O
>> >
>> > Thanks & regards
>> >  Anton
>> >
>> > P.S.: If someone reading this has a working multisite setup based on a
>> > single Kraken CEPH cluster (or, let me dream, even a working
>> > ElasticSearch setup :| ), please step out of the dark and enlighten us :O
>> >
>> > Sent: Tuesday, 30 May 2017 at 11:02
>> > From: "Andreas Calminder" <andreas.calmin...@klarna.com>
>> > To: ceph-users@lists.ceph.com
>> > Subject: [ceph-users] RGW multisite sync data sync shard stuck
>> > Hello,
>> > I've got a sync issue with my multisite setup. There are 2 zones in 1
>> > zonegroup in 1 realm. Data sync in the non-master zone is stuck on
>> > "Incremental sync is behind by 1 shard". This wasn't noticed until the
>> > radosgw instances in the master zone started dying from out-of-memory
>> > issues; all radosgw instances in the non-master zone were then shut
>> > down to keep services in the master zone up while troubleshooting the
>> > issue.
>> >
>> > From the rgw logs in the master zone I see entries like:
>> >
>> > 2017-05-29 16:10:34.717988 7fbbc1ffb700 0 ERROR: failed to sync object: 12354/BUCKETNAME:be8fa19b-ad79-4cd8-ac7b-1e14fdc882f6.2374181.27/dirname_1/dirname_2/filename_1.ext
>> > 2017-05-29 16:10:34.718016 7fbbc1ffb700 0 ERROR: failed to sync object: 12354/BUCKETNAME:be8fa19b-ad79-4cd8-ac7b-1e14fdc882f6.2374181.27/dirname_1/dirname_2/filename_2.ext
>> > 2017-05-29 16:10:34.718504 7fbbc1ffb700 0 ERROR: failed to fetch remote data log info: ret=-5
>> > 2017-05-29 16:10:34.719443 7fbbc1ffb700 0 ERROR: a sync operation returned error
>> > 2017-05-29 16:10:34.720291 7fbc167f4700 0 store->fetch_remote_obj() returned r=-5
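>> >
>> > (ret=-5 is EIO, i.e. the fetch from the remote zone's gateway is
>> > failing outright. If your build has it, the accumulated sync errors
>> > can also be listed with:)
>> >
>> >   radosgw-admin sync error list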
>> >
>> > sync status in the non-master zone reports that the metadata is in
>> > sync, that the data sync is behind on 1 shard, and that the oldest
>> > incremental change not applied is about 2 weeks old.
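>> >
>> > (For reference, roughly the commands that expose the per-shard data
>> > sync state from the non-master side; the source zone name below is a
>> > placeholder:)
>> >
>> >   # summary: metadata sync plus data sync per source zone
>> >   radosgw-admin sync status
>> >
>> >   # more detail, including the per-shard sync markers against the
>> >   # master zone
>> >   radosgw-admin data sync status --source-zone=master-zone-name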
>> >
>> > I'm not quite sure how to proceed. Is there a way to find out the id
>> > of the stuck shard and force some kind of re-sync of its data from the
>> > master zone? I can't keep the non-master zone rgw's running, because
>> > that leaves the master zone in a bad state with rgw dying every now
>> > and then.
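>> >
>> > (The only blunt option I know of is re-initializing data sync against
>> > the master zone, which, as far as I understand, restarts sync for all
>> > shards rather than just the stuck one and can take a long time on a
>> > large zone; the zone name is a placeholder:)
>> >
>> >   # reset the data sync state against the master zone
>> >   radosgw-admin data sync init --source-zone=master-zone-name
>> >
>> >   # then restart the local radosgw instances, or run the sync in the
>> >   # foreground
>> >   radosgw-admin data sync run --source-zone=master-zone-name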
>> >
>> > Regards,
>> > Andreas
>>
>>
>>
>> --
>> Andreas Calminder
>> System Administrator
>> IT Operations Core Services
>>
>> Klarna AB (publ)
>> Sveavägen 46, 111 34 Stockholm
>> Tel: +46 8 120 120 00
>> Reg no: 556737-0431
>> klarna.com
>
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
