After doing this, I've found that I'm having problems with a few specific PGs. If I set nodeep-scrub and then manually deep-scrub one specific PG, the OSDs responsible for that PG get kicked out of the cluster. I'm starting a new discussion for that, subject: "I have PGs that I can't deep-scrub"
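For anyone who wants to follow along in that thread, the sequence I'm running to trigger it is roughly the following (the PG id is just a placeholder for one of the problem PGs):

  # stop the scheduler from starting any new deep scrubs cluster-wide
  ceph osd set nodeep-scrub

  # manually deep-scrub one of the problem PGs (placeholder id)
  ceph pg deep-scrub 11.2d4

  # watch the cluster; the acting OSDs for that PG get marked down shortly after
  ceph -w

  # when done testing, re-enable scheduled deep scrubs
  ceph osd unset nodeep-scrub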
I'll re-test this correlation after I fix the broken PGs.

On Mon, Jun 9, 2014 at 10:20 PM, Gregory Farnum <g...@inktank.com> wrote:
> On Mon, Jun 9, 2014 at 6:42 PM, Mike Dawson <mike.daw...@cloudapt.com> wrote:
>> Craig,
>>
>> I've struggled with the same issue for quite a while. If your i/o is
>> similar to mine, I believe you are on the right track. For the past
>> month or so, I have been running this cronjob:
>>
>>   * * * * *  for strPg in `ceph pg dump | egrep '^[0-9]\.[0-9a-f]{1,4}' | sort -k20 | awk '{ print $1 }' | head -2`; do ceph pg deep-scrub $strPg; done
>>
>> That roughly handles my 20672 PGs that are set to be deep-scrubbed
>> every 7 days. Your script may be a bit better, but this quick and dirty
>> method has helped my cluster maintain more consistency.
>>
>> The real key for me is to avoid the "clumpiness" I have observed
>> without that hack, where the number of concurrent deep-scrubs sits at
>> zero for a long period of time (despite having PGs that were months
>> overdue for a deep-scrub), then suddenly spikes into the teens and
>> stays there for hours, killing client writes/second.
>>
>> The scrubbing behavior table[0] indicates that a periodic tick
>> initiates scrubs on a per-PG basis. Perhaps the timing of those ticks
>> isn't sufficiently randomized when you restart lots of OSDs
>> concurrently (for instance via pdsh).
>>
>> On my cluster I suffer a significant drag on client writes/second when
>> I exceed perhaps four or five concurrent PGs in deep-scrub. When
>> concurrent deep-scrubs get into the teens, I see a massive drop in
>> client writes/second.
>>
>> Greg, is there locking involved when a PG enters deep-scrub? If so, is
>> the entire PG locked for the duration, or is each individual object
>> inside the PG locked as it is processed? Some of my PGs will be in
>> deep-scrub for minutes at a time.
>
> It locks very small regions of the key space, but the expensive part is
> that deep scrub actually has to read all the data off disk, and that
> often means a lot more disk seeks than simply examining the metadata
> does.
> -Greg
>
>> 0: http://ceph.com/docs/master/dev/osd_internals/scrub/
>>
>> Thanks,
>> Mike Dawson
>>
>> On 6/9/2014 6:22 PM, Craig Lewis wrote:
>>>
>>> I've correlated a large deep scrubbing operation to cluster stability
>>> problems.
>>>
>>> My primary cluster does a small amount of deep scrubs all the time,
>>> spread out over the whole week. It has no stability problems.
>>>
>>> My secondary cluster doesn't spread them out. It saves them up, and
>>> tries to do all of the deep scrubs over the weekend. The secondary
>>> starts losing OSDs about an hour after these deep scrubs start.
>>>
>>> To avoid this, I'm thinking of writing a script that continuously
>>> scrubs the oldest outstanding PG. In pseudo-bash:
>>>
>>>   # Sort by the deep-scrub timestamp, taking the single oldest PG
>>>   while ceph pg dump | awk '$1 ~ /[0-9a-f]+\.[0-9a-f]+/ {print $20, $21, $1}' | sort | head -1 | read date time pg
>>>   do
>>>     ceph pg deep-scrub ${pg}
>>>     while ceph status | grep scrubbing+deep
>>>     do
>>>       sleep 5
>>>     done
>>>     sleep 30
>>>   done
>>>
>>> Does anybody think this will solve my problem?
>>>
>>> I'm also considering disabling deep-scrubbing until the secondary
>>> finishes replicating from the primary. Once it's caught up, the write
>>> load should drop enough that opportunistic deep scrubs should have a
>>> chance to run. It should only take another week or two to catch up.
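One note on the pseudo-bash above, in case anyone wants to run it as-is: in bash, a plain pipe into `read` runs the read in a subshell, so ${pg} never reaches the loop body. Here's a rough, untested rework using process substitution instead; it still assumes columns 20 and 21 of `ceph pg dump` are the deep-scrub date and time, as in the original:

  # Continuously deep-scrub whichever PG has the oldest deep-scrub stamp.
  # Process substitution keeps $pg visible in the loop body (a plain pipe
  # into `read` would only set it inside a subshell).
  while read date time pg < <(ceph pg dump \
        | awk '$1 ~ /^[0-9a-f]+\.[0-9a-f]+$/ {print $20, $21, $1}' \
        | sort | head -1)
  do
    ceph pg deep-scrub ${pg}
    sleep 30   # give the scrub a moment to show up in `ceph status`
    while ceph status | grep -q scrubbing+deep
    do
      sleep 5
    done
  done

The sleep moved in front of the inner loop so the wait doesn't race past a deep scrub that hasn't appeared in `ceph status` yet.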