Check out the Weighted Priority Queue option in Jewel; in my testing it
really helped reduce the impact of recovery and backfill on client
traffic. I think it addresses a lot of the pain points you mention.
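For the archives: as far as I remember the weighted priority queue is
selected with the osd_op_queue option in Jewel. The snippet below is a
minimal sketch from memory, so treat the exact option names and values
as assumptions and verify them against the Jewel documentation first.

    # ceph.conf on the OSD nodes -- option names as I recall them for Jewel
    [osd]
    osd op queue = wpq
    osd op queue cut off = high   # optional: favors client ops over recovery more strongly

    # the queue implementation is only read at OSD start, so restart the
    # OSDs afterwards, e.g.:  systemctl restart ceph-osd.target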
----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1


On Thu, May 19, 2016 at 5:26 AM, Oliver Dzombic <i...@ip-interactive.de> wrote:
> Hi,
>
> a spare disk is a nice idea.
>
> But I think that's something you can also do with a shell script:
> check whether an OSD is down or out, and just bring in your spare disk.
>
> Maybe the programming resources should not be used for something most
> of us can do with a simple shell script checking the situation every
> five seconds.
>
> ----
>
> Maybe a better idea (in my humble opinion) is to solve this by
> optimizing the code for recovery situations.
>
> Currently we have things like
>
> client-op-priority,
> recovery-op-priority,
> max-backfills,
> recovery-max-active and so on
>
> to limit the performance impact in a recovery situation.
>
> And still, during recovery, performance goes downhill (a lot) when all
> OSDs start to refill the to-be-recovered OSD.
>
> In my case, I was removing old HDDs from a cluster.
>
> If I mark them down/out (6 TB drives, 40-50% full), the cluster's
> performance drops very dramatically. So I had to reduce the weight in
> 0.1 steps to ease the pain, but I could not remove it completely.
>
> So I think the tools / code that protect the cluster's performance
> (even in a recovery situation) can be improved.
>
> Of course, on one hand we want to make sure that the configured number
> of replicas, and with it data security, is restored as soon as
> possible.
>
> But on the other hand, it does not help much if the recovery procedure
> impacts the cluster's performance to the point where usability is
> reduced too far.
>
> So maybe introduce another config option to control this ratio?
>
> That is, control more precisely how much IOPS/bandwidth is used (maybe
> directly as numbers, in the form of an IO rate limit), so that
> administrators have the chance to configure the "perfect" settings for
> their individual hardware environment and use case.
>
> Because right now, when I reduce the weight of a 6 TB HDD from 1.0 to
> 0.9, with ~30 OSDs in the cluster, around 3-5% of the data gets moved
> around the cluster (replication 2).
>
> While it is moving, there is a real performance hit on the virtual
> servers.
>
> So if this could be solved by an IOPS / HDD bandwidth rate limit, so
> that I can simply tell the cluster to use at most 10 IOPS and/or
> 10 MB/s for recovery, then I think it would be a great help for any
> use case and administrator.
>
> Thanks!
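For anyone reading along: the knobs Oliver lists above can be turned
down at runtime, and the gradual drain he describes is just repeated
reweighting. A rough sketch; osd.12/12 and all the numbers are
placeholders, not recommendations:

    # throttle recovery/backfill cluster-wide at runtime (example values)
    ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1'
    ceph tell osd.* injectargs '--osd-recovery-op-priority 1 --osd-client-op-priority 63'

    # gradual drain instead of a single down/out, stepping the override
    # weight down as Oliver describes ("ceph osd crush reweight" is the
    # alternative if you weight by capacity instead):
    ceph osd reweight 12 0.9
    # ...wait for HEALTH_OK, then 0.8, 0.7, ... and finally mark it out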
>
> --
> Mit freundlichen Gruessen / Best regards
>
> Oliver Dzombic
> IP-Interactive
>
> mailto:i...@ip-interactive.de
>
> Address:
>
> IP Interactive UG ( haftungsbeschraenkt )
> Zum Sonnenberg 1-3
> 63571 Gelnhausen
>
> HRB 93402, Amtsgericht Hanau (commercial register)
> Managing director: Oliver Dzombic
>
> Tax no.: 35 236 3622 1
> VAT ID: DE274086107
>
>
> On 19.05.2016 at 04:57, Christian Balzer wrote:
> >
> > Hello Sage,
> >
> > On Wed, 18 May 2016 17:23:00 -0400 (EDT) Sage Weil wrote:
> >
> >> Currently, after an OSD has been down for 5 minutes, we mark the OSD
> >> "out", which redistributes the data to other OSDs in the cluster. If
> >> the OSD comes back up, it marks the OSD back in (with the same
> >> reweight value, usually 1.0).
> >>
> >> The good thing about marking OSDs out is that exactly the amount of
> >> data on the OSD moves (well, pretty close), and it is uniformly
> >> distributed across all other devices.
> >>
> > Others have already commented on how to improve your initial
> > suggestion (retaining CRUSH weights) etc.
> > Let me butt in here with an even more invasive but impact-reducing
> > suggestion.
> >
> > Your "good thing" up there is good as far as total data movement
> > goes, but it can still unduly impact client performance when one OSD
> > becomes both the target and the source of data movement at the same
> > time during backfill/recovery.
> >
> > So how about upping the ante with the (of course optional) concept of
> > a "spare OSD" per node?
> > People are already used to the concept, and it also makes a full
> > cluster situation massively less likely.
> >
> > Expanding on the concept below, let's say we have one spare OSD per
> > node by default.
> > It sits on a disk of the same size or larger than all the other OSDs
> > in the node, and it is fully prepared but has no ID yet.
> >
> > So we're experiencing an OSD failure and it's about to be set out by
> > the MON. Consider this sequence (OSD X is the dead one, S the spare):
> >
> > 1. Set nobackfill/norecovery
> > 2. OSD X gets weighted 0
> > 3. OSD X gets set out
> > 4. OSD S gets activated with the original weight of X and its ID
> > 5. Unset nobackfill/norecovery
> >
> > Now data will flow only to the new OSD; other OSDs will not be
> > subject to simultaneous reads and writes by backfills.
> >
> > Of course, if no spare is available (not replaced yet, or multiple
> > OSD failures), Ceph can go ahead and do its usual thing, hopefully
> > enhanced by the logic below.
> >
> > Alternatively, instead of just limiting the number of backfills per
> > OSD, make them directionally aware, i.e. don't allow concurrent read
> > and write backfills on the same OSD.
> >
> > Regards,
> >
> > Christian
> >
> >> The bad thing is that if the OSD really is dead and you remove it
> >> from the cluster, or replace it and recreate the new OSD with a new
> >> OSD id, there is a second data migration that sucks data out of the
> >> part of the crush tree where the removed OSD was. This move is
> >> non-optimal: if the drive is size X, some data "moves" from the dead
> >> OSD to the other N OSDs on the host (X/N to each), and the same
> >> amount of data (X) moves off the host (coming uniformly from all N+1
> >> drives it used to live on). The same thing happens at the layer up:
> >> some data will move from the host to peer hosts in the rack, and the
> >> same amount will move out of the rack. This is a byproduct of
> >> CRUSH's hierarchical placement.
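To make Christian's 1-5 sequence above concrete: most of it can already
be typed in by hand today; only step 4 (activating a prepared spare
under the dead OSD's ID and weight) has no tooling, so that part is
purely hypothetical. osd.7 stands in for the dead OSD X.

    ceph osd set nobackfill
    ceph osd set norecover            # the actual CLI flag is "norecover"
    ceph osd crush reweight osd.7 0   # step 2: weight the dead OSD to 0
    ceph osd out 7                    # step 3
    # step 4: bring up spare S with osd.7's ID and original CRUSH weight --
    # no such tool exists today; this is the part Christian is proposing
    ceph osd unset nobackfill         # step 5
    ceph osd unset norecover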
> >>
> >> If the lifecycle is to let drives fail, mark them out, and leave
> >> them there forever in the 'out' state, then the current behavior is
> >> fine, although over time you'll have lots of dead+out OSDs that slow
> >> things down marginally.
> >>
> >> If the procedure is to replace dead OSDs and re-use the same OSD id,
> >> then this also works fine. Unfortunately the tools don't make this
> >> easy (that I know of).
> >>
> >> But if the procedure is to remove dead OSDs, or to remove dead OSDs
> >> and recreate new OSDs in their place, probably with a fresh OSD id,
> >> then you get this extra movement. In that case, I'm wondering if we
> >> should allow the mons to *instead* set the crush weight to 0 after
> >> the osd is down for too long. For that to work we need to set a
> >> flag so that if the OSD comes back up it will restore the old crush
> >> weight (or, more likely, have the normal osd startup crush location
> >> update do so with the OSD's advertised capacity). Does that seem
> >> sensible?
> >>
> >> And/or, does anybody have a good idea how the tools can/should be
> >> changed to make OSD replacement re-use the OSD id?
> >>
> >> sage
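If it helps as a comparison, Sage's proposed behaviour can be
approximated by hand today, which is roughly what the mons would
automate. osd.7 and the 5.46 weight are placeholders (CRUSH weight is
normally the drive's capacity in TiB, which is what the "advertised
capacity" update at startup would restore):

    ceph osd set noout                  # keep the mon from marking it out (cluster-wide flag)
    ceph osd crush reweight osd.7 0     # drain via CRUSH weight instead of marking out
    # if the drive turns out to be fine and the OSD comes back up:
    ceph osd crush reweight osd.7 5.46  # restore the original weight (~6 TB drive)
    ceph osd unset noout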
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com