Hi David, I never solved this issue as I couldn't figure out what was wrong. I just went ahead and removed the second site, and I will set up a new multisite configuration once Luminous is out, hoping the weirdness will have been sorted by then.
Sorry I didn't have any good answers :/
/andreas

On 24 Aug 2017 20:35, "David Turner" <drakonst...@gmail.com> wrote:

> Andreas, did you find a solution to your multisite sync issues with the
> stuck shards? I'm also on 10.2.7 and having this problem. One realm has
> stuck shards for data sync, and another realm says it's up to date but
> isn't receiving new users via metadata sync. I ran metadata sync init on
> it, and it had all up-to-date metadata information when it finished, but
> then new users weren't synced again. I don't know what to do to get these
> working stably. There are 2 RGWs for each realm in each zone, in
> master/master, allowing data to sync in both directions.
>
> On Mon, Jun 5, 2017 at 3:05 AM Andreas Calminder
> <andreas.calmin...@klarna.com> wrote:
>
>> Hello,
>> I'm using Ceph Jewel (10.2.7) and, as far as I know, the Jewel
>> multisite setup (multiple zones) as described here:
>> http://docs.ceph.com/docs/master/radosgw/multisite/
>> with two Ceph clusters, one in each site. Stretching a single cluster
>> over multiple sites is seldom, if ever, worth the hassle in my opinion.
>> The reason the replication ended up in a bad state seems to be a mix of
>> issues: first, if you shove a lot of objects into a bucket (1M+), the
>> bucket index starts to drag the rados gateways down; there is also some
>> kind of memory leak in rgw when the sync has failed
>> (http://tracker.ceph.com/issues/19446), causing the rgw daemons to die
>> left and right due to out-of-memory errors, and sometimes other parts
>> of the system would be dragged down with them.
>>
>> On 4 June 2017 at 22:22, <ceph.nov...@habmalnefrage.de> wrote:
>> > Hi Andreas.
>> >
>> > Well, we do _NOT_ need multisite in our environment, but unfortunately
>> > it is the basis for the announced "metasearch", based on
>> > ElasticSearch... so we have been trying to implement a "multisite"
>> > config on Kraken (v11.2.0) for weeks, but have never succeeded so far.
>> > We have purged and started all over with the multisite config about
>> > 5 times by now.
>> >
>> > We have one Ceph cluster with two RadosGWs on top (so NOT two Ceph
>> > clusters!), not sure if this makes a difference!?
>> >
>> > Can you please share some info about your (formerly working?!?) setup?
>> > Like:
>> > - which Ceph version you are on
>> > - old deprecated "federated" or "new from Jewel" multisite setup
>> > - one or multiple Ceph clusters
>> >
>> > Great to see that multisite seems to work somehow, somewhere. We were
>> > really in doubt :O
>> >
>> > Thanks & regards
>> > Anton
>> >
>> > P.S.: If someone reads this who has a working "one Kraken Ceph
>> > cluster" based multisite setup (or, let me dream, even a working
>> > ElasticSearch setup :| ), please step out of the dark and enlighten
>> > us :O
>> >
>> > Sent: Tuesday, 30 May 2017 at 11:02
>> > From: "Andreas Calminder" <andreas.calmin...@klarna.com>
>> > To: ceph-users@lists.ceph.com
>> > Subject: [ceph-users] RGW multisite sync data sync shard stuck
>> >
>> > Hello,
>> > I've got a sync issue with my multisite setup. There are 2 zones in 1
>> > zone group in 1 realm. The data sync in the non-master zone has been
>> > stuck on "Incremental sync is behind by 1 shard"; this wasn't noticed
>> > until the radosgw instances in the master zone started dying from
>> > out-of-memory issues. All radosgw instances in the non-master zone
>> > were then shut down to keep services in the master zone available
>> > while troubleshooting the issue.
>> >
>> > From the rgw logs in the master zone I see entries like:
>> >
>> > 2017-05-29 16:10:34.717988 7fbbc1ffb700 0 ERROR: failed to sync object: 12354/BUCKETNAME:be8fa19b-ad79-4cd8-ac7b-1e14fdc882f6.2374181.27/dirname_1/dirname_2/filename_1.ext
>> > 2017-05-29 16:10:34.718016 7fbbc1ffb700 0 ERROR: failed to sync object: 12354/BUCKETNAME:be8fa19b-ad79-4cd8-ac7b-1e14fdc882f6.2374181.27/dirname_1/dirname_2/filename_2.ext
>> > 2017-05-29 16:10:34.718504 7fbbc1ffb700 0 ERROR: failed to fetch remote data log info: ret=-5
>> > 2017-05-29 16:10:34.719443 7fbbc1ffb700 0 ERROR: a sync operation returned error
>> > 2017-05-29 16:10:34.720291 7fbc167f4700 0 store->fetch_remote_obj() returned r=-5
>> >
>> > sync status in the non-master zone reports that the metadata is up to
>> > date and that the data sync is behind on 1 shard, with the oldest
>> > incremental change not applied being about 2 weeks old.
>> >
>> > I'm not quite sure how to proceed. Is there a way to find out the id
>> > of the shard and force some kind of re-sync of its data from the
>> > master zone? I'm unable to keep the non-master zone rgw instances
>> > running because it leaves the master zone in a bad state, with rgw
>> > dying every now and then.
>> >
>> > Regards,
>> > Andreas
>>
>> --
>> Andreas Calminder
>> System Administrator
>> IT Operations Core Services
>>
>> Klarna AB (publ)
>> Sveavägen 46, 111 34 Stockholm
>> Tel: +46 8 120 120 00
>> Reg no: 556737-0431
>> klarna.com
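For the question at the end of the quoted post (finding the lagging data sync shard and forcing a re-sync), a minimal command sketch, assuming a Jewel (10.2.x) multisite setup; the source zone name "us-east" is a placeholder, and flags and output vary between point releases, so treat this as a starting point rather than a verified recovery procedure:

  # High-level view: shows per sync type how many shards are behind,
  # and for which source zone.
  radosgw-admin sync status

  # Data sync detail against the master zone; run on the zone that is behind.
  radosgw-admin data sync status --source-zone=us-east

  # List recorded sync errors (failed objects, shard ids) to see what is
  # actually failing on the stuck shard.
  radosgw-admin sync error list

  # Last resort: mark data sync for a full re-sync from the master zone.
  radosgw-admin data sync init --source-zone=us-east

As far as I understand the Jewel behaviour, "data sync init" only resets the sync state; the actual transfer happens once the radosgw daemons in the affected zone are restarted, or if the sync is driven manually with "radosgw-admin data sync run --source-zone=us-east".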
_______________________________________________ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
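For the metadata sync symptom David describes above (metadata reported as caught up while new users never appear in the other zone), a similar hedged sketch of the usual re-initialisation steps on the zone that is behind; nothing here is confirmed in the thread, it is just the standard radosgw-admin workflow, and the endpoint URL and keys are placeholders:

  # Inspect metadata sync state and the current period on the lagging zone.
  radosgw-admin metadata sync status
  radosgw-admin period get

  # If the local period is stale, pull the current one from the master zone
  # (URL, access key and secret below are placeholders).
  radosgw-admin period pull --url=http://master-rgw:8080 --access-key=KEY --secret=SECRET

  # Reset metadata sync to a full sync, then restart the local radosgw
  # daemons (or run the sync manually) so users and bucket metadata are
  # fetched again.
  radosgw-admin metadata sync init
  radosgw-admin metadata sync run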