Check out the Weighted Priority Queue option in Jewel; in my testing it
really helped reduce the impact of recovery and backfill on client
traffic. I think it addresses a lot of the pain points you mention.
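For the archives: as far as I remember the weighted priority queue is
selected with the osd_op_queue option in Jewel. The snippet below is a
minimal sketch from memory, so treat the exact option names and values
as assumptions and verify them against the Jewel documentation first.

    # ceph.conf on the OSD nodes -- option names as I recall them for Jewel
    [osd]
    osd op queue = wpq
    osd op queue cut off = high   # optional: favors client ops over recovery more strongly

    # the queue implementation is only read at OSD start, so restart the
    # OSDs afterwards, e.g.:  systemctl restart ceph-osd.target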
----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1


On Thu, May 19, 2016 at 5:26 AM, Oliver Dzombic <i...@ip-interactive.de> wrote:
> Hi,
>
> a spare disk is a nice idea.
>
> But I think that's something you can also do with a shell script:
> check whether an OSD is down or out, and just bring in your spare disk.
>
> Maybe the programming resources should not be used for something most
> of us can do with a simple shell script checking the situation every
> five seconds.
>
> ----
>
> Maybe a better idea (in my humble opinion) is to solve this by
> optimizing the code for recovery situations.
>
> Currently we have things like
>
> client-op-priority,
> recovery-op-priority,
> max-backfills,
> recovery-max-active and so on
>
> to limit the performance impact in a recovery situation.
>
> And still, during recovery, performance goes downhill (a lot) when all
> OSDs start to refill the to-be-recovered OSD.
>
> In my case, I was removing old HDDs from a cluster.
>
> If I mark them down/out (6 TB drives, 40-50% full), the cluster's
> performance drops very dramatically. So I had to reduce the weight in
> 0.1 steps to ease the pain, but I could not remove it completely.
>
> So I think the tools / code that protect the cluster's performance
> (even in a recovery situation) can be improved.
>
> Of course, on one hand we want to make sure that the configured number
> of replicas, and with it data security, is restored as soon as
> possible.
>
> But on the other hand, it does not help much if the recovery procedure
> impacts the cluster's performance to the point where usability is
> reduced too far.
>
> So maybe introduce another config option to control this ratio?
>
> That is, control more precisely how much IOPS/bandwidth is used (maybe
> directly as numbers, in the form of an IO rate limit), so that
> administrators have the chance to configure the "perfect" settings for
> their individual hardware environment and use case.
>
> Because right now, when I reduce the weight of a 6 TB HDD from 1.0 to
> 0.9, with ~30 OSDs in the cluster, around 3-5% of the data gets moved
> around the cluster (replication 2).
>
> While it is moving, there is a real performance hit on the virtual
> servers.
>
> So if this could be solved by an IOPS / HDD bandwidth rate limit, so
> that I can simply tell the cluster to use at most 10 IOPS and/or
> 10 MB/s for recovery, then I think it would be a great help for any
> use case and administrator.
>
> Thanks!
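For anyone reading along: the knobs Oliver lists above can be turned
down at runtime, and the gradual drain he describes is just repeated
reweighting. A rough sketch; osd.12/12 and all the numbers are
placeholders, not recommendations:

    # throttle recovery/backfill cluster-wide at runtime (example values)
    ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1'
    ceph tell osd.* injectargs '--osd-recovery-op-priority 1 --osd-client-op-priority 63'

    # gradual drain instead of a single down/out, stepping the override
    # weight down as Oliver describes ("ceph osd crush reweight" is the
    # alternative if you weight by capacity instead):
    ceph osd reweight 12 0.9
    # ...wait for HEALTH_OK, then 0.8, 0.7, ... and finally mark it out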
>
> --
> Mit freundlichen Gruessen / Best regards
>
> Oliver Dzombic
> IP-Interactive
>
> mailto:i...@ip-interactive.de
>
> Address:
>
> IP Interactive UG ( haftungsbeschraenkt )
> Zum Sonnenberg 1-3
> 63571 Gelnhausen
>
> HRB 93402, Amtsgericht Hanau (commercial register)
> Managing director: Oliver Dzombic
>
> Tax no.: 35 236 3622 1
> VAT ID: DE274086107
>
>
> On 19.05.2016 at 04:57, Christian Balzer wrote:
> >
> > Hello Sage,
> >
> > On Wed, 18 May 2016 17:23:00 -0400 (EDT) Sage Weil wrote:
> >
> >> Currently, after an OSD has been down for 5 minutes, we mark the OSD
> >> "out", which redistributes the data to other OSDs in the cluster. If
> >> the OSD comes back up, it marks the OSD back in (with the same
> >> reweight value, usually 1.0).
> >>
> >> The good thing about marking OSDs out is that exactly the amount of
> >> data on the OSD moves (well, pretty close), and it is uniformly
> >> distributed across all other devices.
> >>
> > Others have already commented on how to improve your initial
> > suggestion (retaining CRUSH weights) etc.
> > Let me butt in here with an even more invasive but impact-reducing
> > suggestion.
> >
> > Your "good thing" up there is good as far as total data movement
> > goes, but it can still unduly impact client performance when one OSD
> > becomes both the target and the source of data movement at the same
> > time during backfill/recovery.
> >
> > So how about upping the ante with the (of course optional) concept of
> > a "spare OSD" per node?
> > People are already used to the concept, and it also makes a full
> > cluster situation massively less likely.
> >
> > Expanding on the concept below, let's say we have one spare OSD per
> > node by default.
> > It sits on a disk of the same size or larger than all the other OSDs
> > in the node, and it is fully prepared but has no ID yet.
> >
> > So we're experiencing an OSD failure and it's about to be set out by
> > the MON. Consider this sequence (OSD X is the dead one, S the spare):
> >
> > 1. Set nobackfill/norecovery
> > 2. OSD X gets weighted 0
> > 3. OSD X gets set out
> > 4. OSD S gets activated with the original weight of X and its ID
> > 5. Unset nobackfill/norecovery
> >
> > Now data will flow only to the new OSD; other OSDs will not be
> > subject to simultaneous reads and writes by backfills.
> >
> > Of course, if no spare is available (not replaced yet, or multiple
> > OSD failures), Ceph can go ahead and do its usual thing, hopefully
> > enhanced by the logic below.
> >
> > Alternatively, instead of just limiting the number of backfills per
> > OSD, make them directionally aware, i.e. don't allow concurrent read
> > and write backfills on the same OSD.
> >
> > Regards,
> >
> > Christian
> >
> >> The bad thing is that if the OSD really is dead and you remove it
> >> from the cluster, or replace it and recreate the new OSD with a new
> >> OSD id, there is a second data migration that sucks data out of the
> >> part of the crush tree where the removed OSD was. This move is
> >> non-optimal: if the drive is size X, some data "moves" from the dead
> >> OSD to the other N OSDs on the host (X/N to each), and the same
> >> amount of data (X) moves off the host (coming uniformly from all N+1
> >> drives it used to live on). The same thing happens at the layer up:
> >> some data will move from the host to peer hosts in the rack, and the
> >> same amount will move out of the rack. This is a byproduct of
> >> CRUSH's hierarchical placement.
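To make Christian's 1-5 sequence above concrete: most of it can already
be typed in by hand today; only step 4 (activating a prepared spare
under the dead OSD's ID and weight) has no tooling, so that part is
purely hypothetical. osd.7 stands in for the dead OSD X.

    ceph osd set nobackfill
    ceph osd set norecover            # the actual CLI flag is "norecover"
    ceph osd crush reweight osd.7 0   # step 2: weight the dead OSD to 0
    ceph osd out 7                    # step 3
    # step 4: bring up spare S with osd.7's ID and original CRUSH weight --
    # no such tool exists today; this is the part Christian is proposing
    ceph osd unset nobackfill         # step 5
    ceph osd unset norecover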
> >>
> >> If the lifecycle is to let drives fail, mark them out, and leave
> >> them there forever in the 'out' state, then the current behavior is
> >> fine, although over time you'll have lots of dead+out OSDs that slow
> >> things down marginally.
> >>
> >> If the procedure is to replace dead OSDs and re-use the same OSD id,
> >> then this also works fine. Unfortunately the tools don't make this
> >> easy (that I know of).
> >>
> >> But if the procedure is to remove dead OSDs, or to remove dead OSDs
> >> and recreate new OSDs in their place, probably with a fresh OSD id,
> >> then you get this extra movement. In that case, I'm wondering if we
> >> should allow the mons to *instead* set the crush weight to 0 after
> >> the osd is down for too long. For that to work we need to set a
> >> flag so that if the OSD comes back up it will restore the old crush
> >> weight (or, more likely, have the normal osd startup crush location
> >> update do so with the OSD's advertised capacity). Does that seem
> >> sensible?
> >>
> >> And/or, does anybody have a good idea how the tools can/should be
> >> changed to make OSD replacement re-use the OSD id?
> >>
> >> sage
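If it helps as a comparison, Sage's proposed behaviour can be
approximated by hand today, which is roughly what the mons would
automate. osd.7 and the 5.46 weight are placeholders (CRUSH weight is
normally the drive's capacity in TiB, which is what the "advertised
capacity" update at startup would restore):

    ceph osd set noout                  # keep the mon from marking it out (cluster-wide flag)
    ceph osd crush reweight osd.7 0     # drain via CRUSH weight instead of marking out
    # if the drive turns out to be fine and the OSD comes back up:
    ceph osd crush reweight osd.7 5.46  # restore the original weight (~6 TB drive)
    ceph osd unset noout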
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com