Probably unrelated, but I do keep seeing this odd negative objects degraded message on the fs-metadata pool:
> pool fs-metadata-ssd id 16
>   -34/3 objects degraded (-1133.333%)
>   recovery io 0 B/s, 89 keys/s, 2 objects/s
>   client io 51289 B/s rd, 101 kB/s wr, 0 op/s rd, 0 op/s wr

Don't mean to clutter the ML/thread, but it did seem odd; maybe it's a culprit? Or maybe it's some weird sampling-interval issue that's already been fixed in 12.2.3?

Thanks,
Reed

> On Feb 23, 2018, at 8:26 AM, Reed Dier <[email protected]> wrote:
>
> Below is ceph -s
>
>> cluster:
>>   id:     {id}
>>   health: HEALTH_WARN
>>           noout flag(s) set
>>           260610/1068004947 objects misplaced (0.024%)
>>           Degraded data redundancy: 23157232/1068004947 objects degraded (2.168%), 332 pgs unclean, 328 pgs degraded, 328 pgs undersized
>>
>> services:
>>   mon: 3 daemons, quorum mon02,mon01,mon03
>>   mgr: mon03(active), standbys: mon02
>>   mds: cephfs-1/1/1 up {0=mon03=up:active}, 1 up:standby
>>   osd: 74 osds: 74 up, 74 in; 332 remapped pgs
>>        flags noout
>>
>> data:
>>   pools:   5 pools, 5316 pgs
>>   objects: 339M objects, 46627 GB
>>   usage:   154 TB used, 108 TB / 262 TB avail
>>   pgs:     23157232/1068004947 objects degraded (2.168%)
>>            260610/1068004947 objects misplaced (0.024%)
>>            4984 active+clean
>>            183  active+undersized+degraded+remapped+backfilling
>>            145  active+undersized+degraded+remapped+backfill_wait
>>            3    active+remapped+backfill_wait
>>            1    active+remapped+backfilling
>>
>> io:
>>   client:   8428 kB/s rd, 47905 B/s wr, 130 op/s rd, 0 op/s wr
>>   recovery: 37057 kB/s, 50 keys/s, 217 objects/s
>
> Also, the two pools on the SSDs are the objects pool at 4096 PGs and the fs-metadata pool at 32 PGs.
>
>> Are you sure the recovery is actually going slower, or are the individual ops larger or more expensive?
>
> The objects should not vary wildly in size.
> Even if they did differ in size, the SSDs are roughly idle in their current state of backfilling when looking at I/O wait in iotop, atop, or sysstat/iostat.
>
> This compares to when I was fully saturating the SATA backplane with over 1000 MB/s of writes to multiple disks while the backfills were going "full speed."
>
> Here is a breakdown of recovery io by pool:
>
>> pool objects-ssd id 20
>>   recovery io 6779 kB/s, 92 objects/s
>>   client io 3071 kB/s rd, 50 op/s rd, 0 op/s wr
>>
>> pool fs-metadata-ssd id 16
>>   recovery io 0 B/s, 28 keys/s, 2 objects/s
>>   client io 109 kB/s rd, 67455 B/s wr, 1 op/s rd, 0 op/s wr
>>
>> pool cephfs-hdd id 17
>>   recovery io 40542 kB/s, 158 objects/s
>>   client io 10056 kB/s rd, 142 op/s rd, 0 op/s wr
>
> So the 24 HDDs are outperforming the 50 SSDs for recovery and client traffic at the moment, which seems suspicious to me.
>
> Most of the OSDs with recovery ops to the SSDs are reporting 8-12 ops, with one OSD occasionally spiking up to 300-500 for a few minutes. Stats are being pulled both by local collectd instances on each node and by the influx plugin in ceph-mgr, as we evaluate the latter against collectd.
>
> Thanks,
>
> Reed
>
>
>> On Feb 22, 2018, at 6:21 PM, Gregory Farnum <[email protected]> wrote:
>>
>> What's the output of "ceph -s" while this is happening?
>>
>> Is there some identifiable difference between these two states, like you get a lot of throughput on the data pools but then metadata recovery is slower?
>>
>> Are you sure the recovery is actually going slower, or are the individual ops larger or more expensive?
>>
>> My WAG is that recovering the metadata pool, composed mostly of directories stored in omap objects, is going much slower for some reason.
>> You can adjust the cost of those individual ops somewhat by changing osd_recovery_max_omap_entries_per_chunk (default: 8096), but I'm not sure which way you want to go, or indeed whether this has anything to do with the problem you're seeing. (E.g., it could be that reading out the omaps is expensive, so you can get higher recovery op numbers by turning down the number of entries per request, but not actually see faster backfilling because you have to issue more requests.)
>> -Greg
>>
>> On Wed, Feb 21, 2018 at 2:57 PM Reed Dier <[email protected]> wrote:
>> Hi all,
>>
>> I am running into an odd situation that I cannot easily explain.
>> I am currently in the midst of destroying and rebuilding OSDs to migrate from filestore to bluestore.
>> With my HDDs I am seeing expected behavior, but with my SSDs I am seeing unexpected behavior. The HDDs and SSDs are split accordingly in crush.
>>
>> My process for replacing the OSDs is to set the noout, norecover, and norebalance flags, destroy an OSD, create it back (iterating n times, all within a single failure domain), unset the flags, and let it go until it finishes; rinse, repeat.
>>
>> The SSD OSDs are SATA SSDs (Samsung SM863a), 10 to a node, with 2 NVMe drives (Intel P3700) per node, 5 SATA SSDs to 1 NVMe drive, and 16G partitions for block.db (previously filestore journals).
>> 2x10GbE networking between the nodes. The SATA backplane caps out at around 10 Gb/s, as it is 2x 6 Gb/s controllers. Luminous 12.2.2.
>>
>> When the flags are unset, recovery starts and I see a very large rush of traffic; however, after the first machine completed, the performance tapered off rapidly and now trickles. Comparatively, I'm getting 100-200 recovery ops on 3 HDDs backfilling from 21 other HDDs, whereas I'm getting 150-250 recovery ops on 5 SSDs backfilling from 40 other SSDs. Every once in a while I will see a spike up to 500, 1000, or even 2000 ops on the SSDs, often a few hundred recovery ops from one OSD and 8-15 ops from the others that are backfilling.
>>
>> This is a far cry from the 15-30k recovery ops it started off recovering at, with 1-3k recovery ops from a single OSD to the backfilling OSD(s). And an even farther cry from the >15k recovery ops I was sustaining for an hour or more before. I was able to rebuild a 1.9T SSD (1.1T used) in a little under an hour, and I could do about 5 at a time and still keep it at roughly an hour to backfill all of them, but then I hit a roadblock after the first machine, when I tried to do 10 at a time (a whole machine). I am now experiencing the same thing on the third node while doing 5 OSDs at a time.
>>
>> The pools associated with these SSDs are cephfs-metadata, as well as a pure rados object pool we use for our own internal applications. Both are size=3, min_size=2.
>>
>> It appears I am not the first to run into this, but it looks like there was no resolution: https://www.spinics.net/lists/ceph-users/msg41493.html
>>
>> Recovery parameters for the OSDs match what was in the previous thread, minus the osd conf block listed there. Currently osd_max_backfills = 30 and osd_recovery_max_active = 35. There is very little other activity on the OSDs during this period, so there should not be any contention for iops on the SSDs.
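
For reference, here is a minimal sketch of how these throttles can be checked and adjusted at runtime on a Luminous cluster, covering both the osd_recovery_max_omap_entries_per_chunk knob Greg mentions and the osd_max_backfills / osd_recovery_max_active values above. osd.24 and the value 1024 are purely illustrative examples, not recommendations:

# read the current value from one OSD's admin socket (run on that OSD's host)
$ ceph daemon osd.24 config get osd_recovery_max_omap_entries_per_chunk

# push a new value to all running OSDs without restarting them
$ ceph tell 'osd.*' injectargs '--osd_recovery_max_omap_entries_per_chunk 1024'

# the backfill/recovery concurrency knobs work the same way
$ ceph tell 'osd.*' injectargs '--osd_max_backfills 30 --osd_recovery_max_active 35'

Injected values do not survive an OSD restart, so anything that turns out to help would also want to land in ceph.conf.
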
>>
>> The only oddity I can point to is that we had a few periods where disk load on one of the mons was high enough to knock that mon out of quorum briefly, a few times. But I wouldn't think backfills would get throttled just because a mon was flapping.
>>
>> Hopefully someone has some experience, or can steer me down a path to improve the performance of the backfills, so that I'm not stuck in backfill purgatory longer than I need to be.
>>
>> Linking an imgur album with some screen grabs of the recovery ops over time for the first machine versus the second and third machines, to demonstrate the delta between them:
>> https://imgur.com/a/OJw4b
>>
>> Also including a ceph osd df of the SSDs; highlighted in red are the OSDs currently backfilling. Could this possibly be PG overdose (see the note after the osd df output below)? I don't ever run into 'stuck activating' PGs; it's just the painfully slow backfills, as if they are being throttled by ceph, that are causing me to worry. The drives aren't worn (<30 P/E cycles), so plenty of life left in them.
>>
>> Thanks,
>> Reed
>>
>>> $ ceph osd df
>>> ID CLASS WEIGHT REWEIGHT SIZE USE AVAIL %USE VAR PGS
>>> 24 ssd 1.76109 1.00000 1803G 1094G 708G 60.69 1.08 260
>>> 25 ssd 1.76109 1.00000 1803G 1136G 667G 63.01 1.12 271
>>> 26 ssd 1.76109 1.00000 1803G 1018G 785G 56.46 1.01 243
>>> 27 ssd 1.76109 1.00000 1803G 1065G 737G 59.10 1.05 253
>>> 28 ssd 1.76109 1.00000 1803G 1026G 776G 56.94 1.02 245
>>> 29 ssd 1.76109 1.00000 1803G 1132G 671G 62.79 1.12 270
>>> 30 ssd 1.76109 1.00000 1803G 944G 859G 52.35 0.93 224
>>> 31 ssd 1.76109 1.00000 1803G 1061G 742G 58.85 1.05 252
>>> 32 ssd 1.76109 1.00000 1803G 1003G 799G 55.67 0.99 239
>>> 33 ssd 1.76109 1.00000 1803G 1049G 753G 58.20 1.04 250
>>> 34 ssd 1.76109 1.00000 1803G 1086G 717G 60.23 1.07 257
>>> 35 ssd 1.76109 1.00000 1803G 978G 824G 54.26 0.97 232
>>> 36 ssd 1.76109 1.00000 1803G 1057G 745G 58.64 1.05 252
>>> 37 ssd 1.76109 1.00000 1803G 1025G 777G 56.88 1.01 244
>>> 38 ssd 1.76109 1.00000 1803G 1047G 756G 58.06 1.04 250
>>> 39 ssd 1.76109 1.00000 1803G 1031G 771G 57.20 1.02 246
>>> 40 ssd 1.76109 1.00000 1803G 1029G 774G 57.07 1.02 245
>>> 41 ssd 1.76109 1.00000 1803G 1033G 770G 57.28 1.02 245
>>> 42 ssd 1.76109 1.00000 1803G 993G 809G 55.10 0.98 236
>>> 43 ssd 1.76109 1.00000 1803G 1072G 731G 59.45 1.06 256
>>> 44 ssd 1.76109 1.00000 1803G 1039G 763G 57.64 1.03 248
>>> 45 ssd 1.76109 1.00000 1803G 992G 810G 55.06 0.98 236
>>> 46 ssd 1.76109 1.00000 1803G 1068G 735G 59.23 1.06 254
>>> 47 ssd 1.76109 1.00000 1803G 1020G 783G 56.57 1.01 242
>>> 48 ssd 1.76109 1.00000 1803G 945G 857G 52.44 0.94 225
>>> 49 ssd 1.76109 1.00000 1803G 649G 1154G 36.01 0.64 139
>>> 50 ssd 1.76109 1.00000 1803G 426G 1377G 23.64 0.42 83
>>> 51 ssd 1.76109 1.00000 1803G 610G 1193G 33.84 0.60 131
>>> 52 ssd 1.76109 1.00000 1803G 558G 1244G 30.98 0.55 118
>>> 53 ssd 1.76109 1.00000 1803G 731G 1072G 40.54 0.72 161
>>> 54 ssd 1.74599 1.00000 1787G 859G 928G 48.06 0.86 229
>>> 55 ssd 1.74599 1.00000 1787G 942G 844G 52.74 0.94 252
>>> 56 ssd 1.74599 1.00000 1787G 928G 859G 51.94 0.93 246
>>> 57 ssd 1.74599 1.00000 1787G 1039G 748G 58.15 1.04 277
>>> 58 ssd 1.74599 1.00000 1787G 963G 824G 53.87 0.96 255
>>> 59 ssd 1.74599 1.00000 1787G 909G 877G 50.89 0.91 241
>>> 60 ssd 1.74599 1.00000 1787G 1039G 748G 58.15 1.04 277
>>> 61 ssd 1.74599 1.00000 1787G 892G 895G 49.91 0.89 238
>>> 62 ssd 1.74599 1.00000 1787G 927G 859G 51.90 0.93 245
>>> 63 ssd 1.74599 1.00000 1787G 864G 922G 48.39 0.86 229
>>> 64 ssd 1.74599 1.00000 1787G 968G 819G 54.16 0.97 257
>>> 65 ssd 1.74599 1.00000 1787G 892G 894G 49.93 0.89 237
>>> 66 ssd 1.74599 1.00000 1787G 951G 836G 53.23 0.95 252
>>> 67 ssd 1.74599 1.00000 1787G 878G 908G 49.16 0.88 232
>>> 68 ssd 1.74599 1.00000 1787G 899G 888G 50.29 0.90 238
>>> 69 ssd 1.74599 1.00000 1787G 948G 839G 53.04 0.95 252
>>> 70 ssd 1.74599 1.00000 1787G 914G 873G 51.15 0.91 246
>>> 71 ssd 1.74599 1.00000 1787G 1004G 782G 56.21 1.00 266
>>> 72 ssd 1.74599 1.00000 1787G 812G 974G 45.47 0.81 216
>>> 73 ssd 1.74599 1.00000 1787G 932G 855G 52.15 0.93 247
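
Regarding the PG overdose question above: Luminous added overdose protection governed by mon_max_pg_per_osd (default 200, if memory serves) and osd_max_pg_per_osd_hard_ratio. A rough sanity check, assuming default admin socket paths and using mon01 purely as an example host, would be something like:

# ask a mon for the configured limits (run on that mon's host)
$ ceph daemon mon.mon01 config get mon_max_pg_per_osd
$ ceph daemon mon.mon01 config get osd_max_pg_per_osd_hard_ratio

# compare against the per-OSD PG counts (the PGS column in the osd df above)
$ ceph osd df | awk '{print $1, $NF}'

That said, hitting the overdose limit normally shows up as PGs stuck in activating rather than as slow-but-progressing backfill, so the absence of stuck-activating PGs suggests it is not the limiter here; it is just cheap to rule out.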
