What's the output of "ceph -s" while this is happening?

Is there some identifiable difference between these two states, like you
get a lot of throughput on the data pools but then metadata recovery is
slower?

Are you sure the recovery is actually going slower, or are the individual
ops larger or more expensive?
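
One rough way to check this is to compare the per-object cost of recovery
between the two states. A minimal sketch in Python, assuming the "pgmap"
section of "ceph -s --format json" reports recovering_objects_per_sec,
recovering_bytes_per_sec and recovering_keys_per_sec while recovery is
running (field names are from memory, so verify them against your release).
The helper name recovery_cost_per_object and the sample numbers are made up
for illustration:

```python
import json


def recovery_cost_per_object(status):
    """Return (bytes_per_object, keys_per_object) from a parsed status dict.

    `status` is the dict parsed from `ceph -s --format json`.
    Returns None when no object recovery rate is reported.
    """
    pgmap = status.get("pgmap", {})
    objs = pgmap.get("recovering_objects_per_sec", 0)
    if not objs:
        return None
    bytes_per_obj = pgmap.get("recovering_bytes_per_sec", 0) / objs
    keys_per_obj = pgmap.get("recovering_keys_per_sec", 0) / objs
    return bytes_per_obj, keys_per_obj


# Illustrative numbers only, not taken from a real cluster.
sample = json.loads(
    '{"pgmap": {"recovering_objects_per_sec": 200,'
    ' "recovering_bytes_per_sec": 4096000,'
    ' "recovering_keys_per_sec": 50000}}'
)
print(recovery_cost_per_object(sample))
```

If the keys-per-object figure is much higher in the slow state, that would
suggest the ops are more expensive rather than that recovery is being
throttled.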

My WAG is that recovering the metadata pool, which is composed mostly of
directories stored in omap objects, is going much slower for some reason.
You can adjust the cost of those individual ops somewhat by
changing osd_recovery_max_omap_entries_per_chunk (default: 8096), but I'm
not sure which way you want to go, or indeed whether this has anything to do
with the problem you're seeing. (E.g., it could be that reading out the
omaps is expensive, so you can get higher recovery op numbers by turning
down the number of entries per request, but not actually see faster
backfilling because you have to issue more requests.)
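
To put numbers on that trade-off, here is a back-of-the-envelope sketch in
Python. It assumes, as a simplification, that each recovery request carries
at most osd_recovery_max_omap_entries_per_chunk omap entries. The directory
size and the helper name requests_needed are hypothetical:

```python
import math


def requests_needed(omap_entries, entries_per_chunk):
    """Number of recovery requests needed to move all omap entries."""
    return math.ceil(omap_entries / entries_per_chunk)


directory_entries = 1_000_000  # hypothetical omap size of one large directory

for chunk in (8096, 1024):
    print(chunk, requests_needed(directory_entries, chunk))
```

With the default of 8096 that directory needs 124 requests; at 1024 it needs
977. The op counter goes up, but the total number of entries moved is the
same, so the backfill does not necessarily finish any sooner.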
-Greg

On Wed, Feb 21, 2018 at 2:57 PM Reed Dier <reed.d...@focusvq.com> wrote:

> Hi all,
>
> I am running into an odd situation that I cannot easily explain.
> I am currently in the midst of destroying and rebuilding OSDs to move
> from filestore to bluestore.
> With my HDDs, I am seeing expected behavior, but with my SSDs I am seeing
> unexpected behavior. The HDDs and SSDs are set in crush accordingly.
>
> My path to replacing the OSDs is to set the noout, norecover, norebalance
> flags, destroy the OSD, create the OSD back, (iterate n times, all within a
> single failure domain), unset the flags, and let it go. It finishes, rinse,
> repeat.
>
> For the SSD OSDs, they are SATA SSDs (Samsung SM863a), 10 to a node, with
> 2 NVMe drives (Intel P3700), 5 SATA SSDs to 1 NVMe drive, 16G partitions
> for block.db (previously filestore journals).
> 2x10GbE networking between the nodes. SATA backplane caps out at around 10
> Gb/s as it’s 2x 6 Gb/s controllers. Luminous 12.2.2.
>
> When the flags are unset, recovery starts and I see a very large rush of
> traffic. However, after the first machine completed, the performance
> tapered off rapidly to a trickle. By comparison, I’m getting
> 100-200 recovery ops on 3 HDDs, backfilling from 21 other HDDs, whereas
> I’m getting 150-250 recovery ops on 5 SSDs, backfilling from 40 other SSDs.
> Every once in a while I will see a spike up to 500, 1000, or even 2000 ops
> on the SSDs, often a few hundred recovery ops from one OSD, and 8-15 ops
> from the others that are backfilling.
>
> This is a far cry from the 15-30k or more recovery ops that it started
> off with, at 1-3k recovery ops from a single OSD to the backfilling
> OSD(s). And an even farther cry from the >15k recovery ops I was sustaining
> for over an hour or more before. I was able to rebuild a 1.9T SSD (1.1T
> used) in a little under an hour, and I could do about 5 at a time and still
> keep it at roughly an hour to backfill all of them, but then I hit a
> roadblock after the first machine, when I tried to do 10 at a time (single
> machine). I am now still experiencing the same thing on the third node,
> while doing 5 OSDs at a time.
>
> The pools associated with these SSDs are cephfs-metadata, as well as a
> pure rados object pool we use for our own internal applications. Both are
> size=3, min_size=2.
>
> It appears I am not the first to run into this, but it looks like there
> was no resolution: https://www.spinics.net/lists/ceph-users/msg41493.html
>
> Recovery parameters for the OSDs match what was in the previous thread,
> sans the osd conf block listed. And current osd_max_backfills = 30 and
> osd_recovery_max_active = 35. There is very little activity on the OSDs
> during this period, so there should not be any contention for iops on the
> SSDs.
>
> The only oddity I can point to is that we had a few
> periods of time where the disk load on one of the mons was high enough to
> cause the mon to drop out of quorum for a brief amount of time, a few
> times. But I wouldn’t think backfills would just get throttled due to mons
> flapping.
>
> Hopefully someone has some experience or can steer me in a path to improve
> the performance of the backfills so that I’m not stuck in backfill
> purgatory longer than I need to be.
>
> Linking an imgur album with some screen grabs of the recovery ops over
> time for the first machine, versus the second and third machines to
> demonstrate the delta between them.
> https://imgur.com/a/OJw4b
>
> Also including a ceph osd df of the SSDs, highlighted in red are the OSDs
> currently backfilling. Could this possibly be PG overdose? I don’t ever run
> into ‘stuck activating’ PGs, it’s just painfully slow backfills, like they
> are being throttled by ceph, that are causing me to worry. Drives aren’t
> worn, <30 P/E cycles on the drives, so plenty of life left in them.
>
> Thanks,
> Reed
>
> $ ceph osd df
> ID CLASS WEIGHT  REWEIGHT SIZE  USE   AVAIL %USE  VAR  PGS
> 24   ssd 1.76109  1.00000 1803G 1094G  708G 60.69 1.08 260
> 25   ssd 1.76109  1.00000 1803G 1136G  667G 63.01 1.12 271
> 26   ssd 1.76109  1.00000 1803G 1018G  785G 56.46 1.01 243
> 27   ssd 1.76109  1.00000 1803G 1065G  737G 59.10 1.05 253
> 28   ssd 1.76109  1.00000 1803G 1026G  776G 56.94 1.02 245
> 29   ssd 1.76109  1.00000 1803G 1132G  671G 62.79 1.12 270
> 30   ssd 1.76109  1.00000 1803G  944G  859G 52.35 0.93 224
> 31   ssd 1.76109  1.00000 1803G 1061G  742G 58.85 1.05 252
> 32   ssd 1.76109  1.00000 1803G 1003G  799G 55.67 0.99 239
> 33   ssd 1.76109  1.00000 1803G 1049G  753G 58.20 1.04 250
> 34   ssd 1.76109  1.00000 1803G 1086G  717G 60.23 1.07 257
> 35   ssd 1.76109  1.00000 1803G  978G  824G 54.26 0.97 232
> 36   ssd 1.76109  1.00000 1803G 1057G  745G 58.64 1.05 252
> 37   ssd 1.76109  1.00000 1803G 1025G  777G 56.88 1.01 244
> 38   ssd 1.76109  1.00000 1803G 1047G  756G 58.06 1.04 250
> 39   ssd 1.76109  1.00000 1803G 1031G  771G 57.20 1.02 246
> 40   ssd 1.76109  1.00000 1803G 1029G  774G 57.07 1.02 245
> 41   ssd 1.76109  1.00000 1803G 1033G  770G 57.28 1.02 245
> 42   ssd 1.76109  1.00000 1803G  993G  809G 55.10 0.98 236
> 43   ssd 1.76109  1.00000 1803G 1072G  731G 59.45 1.06 256
> 44   ssd 1.76109  1.00000 1803G 1039G  763G 57.64 1.03 248
> 45   ssd 1.76109  1.00000 1803G  992G  810G 55.06 0.98 236
> 46   ssd 1.76109  1.00000 1803G 1068G  735G 59.23 1.06 254
> 47   ssd 1.76109  1.00000 1803G 1020G  783G 56.57 1.01 242
> 48   ssd 1.76109  1.00000 1803G  945G  857G 52.44 0.94 225
> 49   ssd 1.76109  1.00000 1803G  649G 1154G 36.01 0.64 139
> 50   ssd 1.76109  1.00000 1803G  426G 1377G 23.64 0.42  83
> 51   ssd 1.76109  1.00000 1803G  610G 1193G 33.84 0.60 131
> 52   ssd 1.76109  1.00000 1803G  558G 1244G 30.98 0.55 118
> 53   ssd 1.76109  1.00000 1803G  731G 1072G 40.54 0.72 161
> 54   ssd 1.74599  1.00000 1787G  859G  928G 48.06 0.86 229
> 55   ssd 1.74599  1.00000 1787G  942G  844G 52.74 0.94 252
> 56   ssd 1.74599  1.00000 1787G  928G  859G 51.94 0.93 246
> 57   ssd 1.74599  1.00000 1787G 1039G  748G 58.15 1.04 277
> 58   ssd 1.74599  1.00000 1787G  963G  824G 53.87 0.96 255
> 59   ssd 1.74599  1.00000 1787G  909G  877G 50.89 0.91 241
> 60   ssd 1.74599  1.00000 1787G 1039G  748G 58.15 1.04 277
> 61   ssd 1.74599  1.00000 1787G  892G  895G 49.91 0.89 238
> 62   ssd 1.74599  1.00000 1787G  927G  859G 51.90 0.93 245
> 63   ssd 1.74599  1.00000 1787G  864G  922G 48.39 0.86 229
> 64   ssd 1.74599  1.00000 1787G  968G  819G 54.16 0.97 257
> 65   ssd 1.74599  1.00000 1787G  892G  894G 49.93 0.89 237
> 66   ssd 1.74599  1.00000 1787G  951G  836G 53.23 0.95 252
> 67   ssd 1.74599  1.00000 1787G  878G  908G 49.16 0.88 232
> 68   ssd 1.74599  1.00000 1787G  899G  888G 50.29 0.90 238
> 69   ssd 1.74599  1.00000 1787G  948G  839G 53.04 0.95 252
> 70   ssd 1.74599  1.00000 1787G  914G  873G 51.15 0.91 246
> 71   ssd 1.74599  1.00000 1787G 1004G  782G 56.21 1.00 266
> 72   ssd 1.74599  1.00000 1787G  812G  974G 45.47 0.81 216
> 73   ssd 1.74599  1.00000 1787G  932G  855G 52.15 0.93 247
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>