For those following along at home, already done:
http://tracker.ceph.com/issues/22440#change-103779


On Fri, Dec 15, 2017 at 1:57 AM Brad Hubbard <bhubb...@redhat.com> wrote:

> On Wed, Dec 13, 2017 at 11:39 PM, Nick Fisk <n...@fisk.me.uk> wrote:
> > Boom!! Fixed it. Not sure if the behavior I stumbled across is correct,
> > but it has the potential to break a few things for people moving from
> > Jewel to Luminous if they had a few too many PGs.
> >
> >
> >
> > Firstly, how I stumbled across it. I whacked the logging up to max on
> > OSD 68 and saw this mentioned in the logs:
> >
> >
> >
> > osd.68 106454 maybe_wait_for_max_pg withhold creation of pg 0.1cf: 403 >= 400
> >
> >
> >
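> > That was just the standard OSD debug logging, nothing clever; something
> > along these lines should turn it up at runtime and put it back to the
> > default 0/5 afterwards:
> >
> > ceph tell osd.68 injectargs '--debug_osd 20/20'
> > ceph tell osd.68 injectargs '--debug_osd 0/5'
> >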
> > This made me search through the code for this warning string
> >
> >
> >
> > https://github.com/ceph/ceph/blob/master/src/osd/OSD.cc#L4221
> >
> >
> >
> > Which jogged my memory about the changes in Luminous regarding the max
> > PGs per OSD warning, and in particular these two config options:
> >
> > mon_max_pg_per_osd
> >
> > osd_max_pg_per_osd_hard_ratio
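> >
> > If I'm reading it right, the OSD withholds PG creation once its PG count
> > hits mon_max_pg_per_osd * osd_max_pg_per_osd_hard_ratio, i.e. 200 x 2 =
> > 400 with the Luminous defaults. The values an OSD is actually running
> > with can be checked on its admin socket, e.g.:
> >
> > ceph daemon osd.68 config get mon_max_pg_per_osd
> > ceph daemon osd.68 config get osd_max_pg_per_osd_hard_ratio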
> >
> >
> >
> > In my cluster I have just over 200 PGs per OSD, but the node containing
> > OSD.68 has 8TB disks instead of the 3TB disks in the rest of the
> > cluster. This means those OSDs were taking on a lot more PGs than the
> > average would suggest. With the Luminous defaults, 200 x 2 gives a hard
> > limit of 400, which is what the error message in the log suggests is the
> > limit. I set the osd_max_pg_per_osd_hard_ratio option to 3, restarted
> > the OSD, and hey presto, everything fell into line.
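> >
> > Roughly what that change looked like, for anyone hitting the same thing
> > (paths and unit names assume a standard systemd deployment, and the
> > option can equally live under [global]):
> >
> > # ceph.conf on the OSD node
> > [osd]
> > osd max pg per osd hard ratio = 3
> >
> > systemctl restart ceph-osd@68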
> >
> >
> >
> > Now a question. I get the idea behind these settings: to stop people
> > creating too many pools, or pools with too many PGs. But is it correct
> > that they can break an existing pool which may be creating a new PG on
> > an OSD because the CRUSH layout has been modified?
>
> It would be good to capture this in a tracker, Nick, so it can be
> explored in more depth.
>
> >
> >
> >
> > Nick
> >
> >
> >
> > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> > Nick Fisk
> > Sent: 13 December 2017 11:14
> > To: 'Gregory Farnum' <gfar...@redhat.com>
> > Cc: 'ceph-users' <ceph-users@lists.ceph.com>
> > Subject: Re: [ceph-users] Odd object blocking IO on PG
> >
> >
> >
> >
> >
> > On Tue, Dec 12, 2017 at 12:33 PM Nick Fisk <n...@fisk.me.uk> wrote:
> >
> >
> >> That doesn't look like an RBD object -- any idea who is
> >> "client.34720596.1:212637720"?
> >
> > So I think these might be proxy ops from the cache tier, as there are
> > also blocked ops on one of the cache tier OSDs, but this time it
> > actually lists the object name. Blocked op on the cache tier:
> >
> >             "description": "osd_op(client.34720596.1:212637720 17.ae78c1cf
> > 17:f3831e75:::rbd_data.15a5e20238e1f29.00000000000388ad:head
> > [set-alloc-hint object_size 4194304 write_size 4194304,write 2584576~16384]
> > snapc 0=[] RETRY=2 ondisk+retry+write+known_if_redirected e104841)",
> >             "initiated_at": "2017-12-12 16:25:32.435718",
> >             "age": 13996.681147,
> >             "duration": 13996.681203,
> >             "type_data": {
> >                 "flag_point": "reached pg",
> >                 "client_info": {
> >                     "client": "client.34720596",
> >                     "client_addr": "10.3.31.41:0/2600619462",
> >                     "tid": 212637720
> >
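> > For reference, that output is the sort of thing the OSD admin socket
> > gives you; something like the following should pull the same info, with
> > osd.X standing in for whichever OSD is holding the op:
> >
> > ceph daemon osd.X dump_ops_in_flight
> > ceph daemon osd.X dump_historic_ops
> >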
> > I'm a bit baffled at the moment about what's going on. The pg query
> > (attached) is not showing in the main status that it has been blocked
> > from peering or that there are any missing objects. I've tried
> > restarting all OSDs I can see relating to the PG in case they needed a
> > bit of a nudge.
> >
> >
> >
> > Did that fix anything? I don't see anything immediately obvious, but
> > I'm not practiced in quickly reading that pg state output.
> >
> >
> >
> > What's the output of "ceph -s"?
> >
> >
> >
> > Hi Greg,
> >
> >
> >
> > No, restarting OSDs didn't seem to help. But I did make some progress
> > late last night. By stopping OSD.68 the cluster unlocks itself and IO
> > can progress. However, as soon as it starts back up, 0.1cf and a couple
> > of other PGs again get stuck in an activating state. If I out the OSD,
> > either with it up or down, then some other PGs seem to get hit by the
> > same problem as CRUSH moves PG mappings around to other OSDs.
> >
> >
> >
> > So there definitely seems to be some sort of weird peering issue
> > somewhere. I have seen a very similar issue before on this cluster
> > where, after running the crush reweight script to balance OSD
> > utilization, the weight got set too low and PGs were unable to peer.
> > I'm not convinced this is what's happening here, as all the weights
> > haven't changed, but I'm intending to explore this further just in case.
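> >
> > For the record, an easy way to compare per-OSD PG counts against the
> > CRUSH weights is something along the lines of the following (the PGS
> > column alongside WEIGHT/REWEIGHT):
> >
> > ceph osd df tree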
> >
> >
> >
> > With 68 down
> >
> >     pgs:     1071783/48650631 objects degraded (2.203%)
> >
> >              5923 active+clean
> >
> >              399  active+undersized+degraded
> >
> >              7    active+clean+scrubbing+deep
> >
> >              7    active+clean+remapped
> >
> >
> >
> > With it up
> >
> >     pgs:     0.047% pgs not active
> >
> >              67271/48651279 objects degraded (0.138%)
> >
> >              15602/48651279 objects misplaced (0.032%)
> >
> >              6051 active+clean
> >
> >              273  active+recovery_wait+degraded
> >
> >              4    active+clean+scrubbing+deep
> >
> >              4    active+remapped+backfill_wait
> >
> >              3    activating+remapped
> >
> > active+recovering+degraded
> >
> >
> >
> > PG Dump
> >
> > ceph pg dump | grep activatin
> >
> > dumped all
> >
> > 2.389         0                  0        0         0       0           0
> > 1500     1500           activating+remapped 2017-12-13 11:08:50.990526
> > 76271'34230    106239:160310 [68,60,58,59,29,23]         68
> > [62,60,58,59,29,23]             62      76271'34230 2017-12-13
> > 09:00:08.359690      76271'34230 2017-12-10 10:05:10.931366
> >
> > 0.1cf      3947                  0        0         0       0 16472186880
> > 1577     1577           activating+remapped 2017-12-13 11:08:50.641034
> > 106236'7512915   106239:6176548           [34,68,8]         34
> > [34,8,53]             34   106138'7512682 2017-12-13 10:27:37.400613
> > 106138'7512682 2017-12-13 10:27:37.400613
> >
> > 2.210         0                  0        0         0       0           0
> > 1500     1500           activating+remapped 2017-12-13 11:08:50.686193
> > 76271'33304     106239:96797 [68,67,34,36,16,15]         68
> > [62,67,34,36,16,15]             62      76271'33304 2017-12-12
> > 00:49:21.038437      76271'33304 2017-12-10 16:05:12.751425
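> >
> > For what it's worth, that dump shows 0.1cf with up [34,68,8] but acting
> > [34,8,53], i.e. CRUSH wants to place a copy on OSD.68 that never makes
> > it to active. The up/acting mapping for a single PG can also be pulled
> > with something like:
> >
> > ceph pg map 0.1cf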
> >
> >
> >
> >
> >
> >
> >>
> >> On Tue, Dec 12, 2017 at 12:36 PM, Nick Fisk <n...@fisk.me.uk> wrote:
> >> > Does anyone know what this object (0.ae78c1cf) might be? It's not
> >> > your normal run-of-the-mill RBD object and I can't seem to find it in
> >> > the pool using rados --all ls. It seems to be leaving the 0.1cf PG
> >> > stuck in an activating+remapped state and blocking IO. Pool 0 is just
> >> > a pure RBD pool with a cache tier above it. There is no current
> >> > mention of unfound objects or any other obvious issues.
> >> >
> >> > There is some backfilling going on on another OSD, which was upgraded
> >> > to bluestore; that is when the issue started. But I can't see any
> >> > link in the PG dump with the upgraded OSD. My only thought so far is
> >> > to wait for the backfilling to finish and then deep-scrub this PG and
> >> > see if that reveals anything.
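> >> >
> >> > (If it comes to that, a manual deep-scrub can be kicked off on the PG
> >> > with something like: ceph pg deep-scrub 0.1cf)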
> >> >
> >> > Thanks,
> >> > Nick
> >> >
> >> >  "description": "osd_op(client.34720596.1:212637720 0.1cf 0.ae78c1cf
> >> > (undecoded)
> >> > ondisk+retry+write+ignore_cache+ignore_overlay+known_if_redirected
> >> > e105014)",
> >> >             "initiated_at": "2017-12-12 17:10:50.030660",
> >> >             "age": 335.948290,
> >> >             "duration": 335.948383,
> >> >             "type_data": {
> >> >                 "flag_point": "delayed",
> >> >                 "events": [
> >> >                     {
> >> >                         "time": "2017-12-12 17:10:50.030660",
> >> >                         "event": "initiated"
> >> >                     },
> >> >                     {
> >> >                         "time": "2017-12-12 17:10:50.030692",
> >> >                         "event": "queued_for_pg"
> >> >                     },
> >> >                     {
> >> >                         "time": "2017-12-12 17:10:50.030719",
> >> >                         "event": "reached_pg"
> >> >                     },
> >> >                     {
> >> >                         "time": "2017-12-12 17:10:50.030727",
> >> >                         "event": "waiting for peered"
> >> >                     },
> >> >                     {
> >> >                         "time": "2017-12-12 17:10:50.197353",
> >> >                         "event": "reached_pg"
> >> >                     },
> >> >                     {
> >> >                         "time": "2017-12-12 17:10:50.197355",
> >> >                         "event": "waiting for peered"
> >> >
> >>
> >>
> >>
> >> --
> >> Jason
> >
> >
> >
>
>
>
> --
> Cheers,
> Brad
>
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
