Hi Reed,

"Changing/injecting osd_recovery_sleep_hdd into the running SSD OSD’s on
bluestore opened the floodgates."

What exactly did you change/inject here?

We have a cluster with 10TB SATA HDDs, each of which has a 100GB SSD-based
block.db.

Looking at ceph osd metadata for each of those:

        "bluefs_db_model": "SAMSUNG MZ7KM960",
        "bluefs_db_rotational": "0",
        "bluefs_db_type": "ssd",
        "bluefs_slow_model": "ST10000NM0086-2A",
        "bluefs_slow_rotational": "1",
        "bluefs_slow_type": "hdd",
        "bluestore_bdev_rotational": "1",
        "bluestore_bdev_type": "hdd",
        "default_device_class": "hdd",
*        "journal_rotational": "1",*
        "osd_objectstore": "bluestore",
        "rotational": "1"

It looks to me like I'm hitting the same issue, doesn't it?

PS: An upgrade of Ceph is planned in the near future, but for now I would
like to use the workaround if it is applicable to me.
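
I am guessing the runtime injection would be something along these lines
(osd_recovery_sleep_hdd and the value of 0 are my assumption of what you
used, so please correct me if that is not it):

ceph tell osd.* injectargs '--osd_recovery_sleep_hdd 0'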

Thank you in advance.

Kind regards,
Caspar Smit

2018-02-26 23:22 GMT+01:00 Reed Dier <reed.d...@focusvq.com>:

> Quick turn around,
>
> Changing/injecting osd_recovery_sleep_hdd into the running SSD OSD’s on
> bluestore opened the floodgates.
>
> pool objects-ssd id 20
>   recovery io 1512 MB/s, 21547 objects/s
>
> pool fs-metadata-ssd id 16
>   recovery io 0 B/s, 6494 keys/s, 271 objects/s
>   client io 82325 B/s rd, 68146 B/s wr, 1 op/s rd, 0 op/s wr
>
>
> Graph of performance jump. Extremely marked.
> https://imgur.com/a/LZR9R
>
> So at least we now have the gun to go with the smoke.
>
> Thanks for the help, and I appreciate you pointing me in some directions that
> I was able to use to figure out the issue.
>
> Adding to ceph.conf for future OSD conversions.
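>
> Presumably something along these lines (just a sketch; note that a global
> osd_recovery_sleep_hdd = 0 would also remove the throttle on the real HDD
> OSDs, so it may be better scoped to the specific OSD sections):
>
> [osd]
> osd recovery sleep hdd = 0
> osd recovery sleep hybrid = 0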
>
> Thanks,
>
> Reed
>
>
> On Feb 26, 2018, at 4:12 PM, Reed Dier <reed.d...@focusvq.com> wrote:
>
> For the record, I am not seeing a demonstrable fix from injecting the value
> of 0 into the running OSDs.
>
> osd_recovery_sleep_hybrid = '0.000000' (not observed, change may require
> restart)
>
>
> If it does indeed need to be restarted, I will need to wait for the
> current backfills to finish their process as restarting an OSD would bring
> me under min_size.
>
> However, doing config show on the osd daemon appears to have taken the
> value of 0.
>
> ceph daemon osd.24 config show | grep recovery_sleep
>     "osd_recovery_sleep": "0.000000",
>     "osd_recovery_sleep_hdd": "0.100000",
>     "osd_recovery_sleep_hybrid": "0.000000",
>     "osd_recovery_sleep_ssd": "0.000000",
>
>
> I may take the restart as an opportunity to also move to 12.2.3 at the
> same time, since that is not expected to affect this issue.
>
> I could also attempt to change osd_recovery_sleep_hdd as well; since these
> are SSD OSDs, it shouldn’t make a difference, but it’s a free move.
>
> Thanks,
>
> Reed
>
> On Feb 26, 2018, at 3:42 PM, Gregory Farnum <gfar...@redhat.com> wrote:
>
> On Mon, Feb 26, 2018 at 12:26 PM Reed Dier <reed.d...@focusvq.com> wrote:
>
>> I will try to set the hybrid sleeps to 0 on the affected OSDs as an
>> interim solution to getting the metadata configured correctly.
>>
>
> Yes, that's a good workaround as long as you don't have any actual hybrid
> OSDs (or aren't worried about them sleeping...I'm not sure if that setting
> came from experience or not).
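>
> (At runtime that would presumably be an injectargs call along the lines of
> ceph tell osd.* injectargs '--osd_recovery_sleep_hybrid 0', though I have
> not verified whether that particular option is picked up without a
> restart.)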
>
>
>>
>> For reference, here is the complete metadata for osd.24, bluestore SATA
>> SSD with NVMe block.db.
>>
>> {
>>         "id": 24,
>>         "arch": "x86_64",
>>         "back_addr": "",
>>         "back_iface": "bond0",
>>         "bluefs": "1",
>>         "bluefs_db_access_mode": "blk",
>>         "bluefs_db_block_size": "4096",
>>         "bluefs_db_dev": "259:0",
>>         "bluefs_db_dev_node": "nvme0n1",
>>         "bluefs_db_driver": "KernelDevice",
>>         "bluefs_db_model": "INTEL SSDPEDMD400G4                     ",
>>         "bluefs_db_partition_path": "/dev/nvme0n1p4",
>>         "bluefs_db_rotational": "0",
>>         "bluefs_db_serial": " ",
>>         "bluefs_db_size": "16000221184",
>>         "bluefs_db_type": "nvme",
>>         "bluefs_single_shared_device": "0",
>>         "bluefs_slow_access_mode": "blk",
>>         "bluefs_slow_block_size": "4096",
>>         "bluefs_slow_dev": "253:8",
>>         "bluefs_slow_dev_node": "dm-8",
>>         "bluefs_slow_driver": "KernelDevice",
>>         "bluefs_slow_model": "",
>>         "bluefs_slow_partition_path": "/dev/dm-8",
>>         "bluefs_slow_rotational": "0",
>>         "bluefs_slow_size": "1920378863616",
>>         "bluefs_slow_type": "ssd",
>>         "bluestore_bdev_access_mode": "blk",
>>         "bluestore_bdev_block_size": "4096",
>>         "bluestore_bdev_dev": "253:8",
>>         "bluestore_bdev_dev_node": "dm-8",
>>         "bluestore_bdev_driver": "KernelDevice",
>>         "bluestore_bdev_model": "",
>>         "bluestore_bdev_partition_path": "/dev/dm-8",
>>         "bluestore_bdev_rotational": "0",
>>         "bluestore_bdev_size": "1920378863616",
>>         "bluestore_bdev_type": "ssd",
>>         "ceph_version": "ceph version 12.2.2
>> (cf0baeeeeba3b47f9427c6c97e2144b094b7e5ba) luminous (stable)",
>>         "cpu": "Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz",
>>         "default_device_class": "ssd",
>>         "distro": "ubuntu",
>>         "distro_description": "Ubuntu 16.04.3 LTS",
>>         "distro_version": "16.04",
>>         "front_addr": "",
>>         "front_iface": "bond0",
>>         "hb_back_addr": "",
>>         "hb_front_addr": "",
>>         "hostname": "host00",
>>         "journal_rotational": "1",
>>         "kernel_description": "#29~16.04.2-Ubuntu SMP Tue Jan 9 22:00:44
>> UTC 2018",
>>         "kernel_version": "4.13.0-26-generic",
>>         "mem_swap_kb": "124999672",
>>         "mem_total_kb": "131914008",
>>         "os": "Linux",
>>         "osd_data": "/var/lib/ceph/osd/ceph-24",
>>         "osd_objectstore": "bluestore",
>>         "rotational": "0"
>>     }
>>
>>
>> So it looks like it guessed(?) the
>> bluestore_bdev_type/default_device_class correctly (though it may have
>> been an inherited value?), and bluefs_db_type likewise got set to nvme correctly.
>>
>> So I’m not sure why journal_rotational is still showing 1.
>> Maybe something in the ceph-volume lvm piece that isn’t correctly setting
>> that flag on OSD creation?
>> It also seems like the journal_rotational field should be deprecated for
>> bluestore, since bluefs_db_rotational should cover that; and if there were a
>> WAL partition as well, I assume there would be something to the tune of
>> bluefs_wal_rotational, and journal would never be
>> used for bluestore?
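>>
>> It might also be worth double-checking what the kernel itself reports for
>> the db device, e.g. (nvme0n1 per the metadata above):
>>
>> cat /sys/block/nvme0n1/queue/rotational
>>
>> If that prints 0, then presumably the flag is being mangled somewhere on
>> the Ceph side rather than at the OS level.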
>>
>
> Thanks to both of you for helping diagnose this issue. I created a ticket
> and have a PR up to fix it: http://tracker.ceph.com/issues/23141,
> https://github.com/ceph/ceph/pull/20602
>
> Until that gets backported into another Luminous release you'll need to do
> some kind of workaround though. :/
> -Greg
>
>
>>
>> Appreciate the help.
>>
>> Thanks,
>> Reed
>>
>> On Feb 26, 2018, at 1:28 PM, Gregory Farnum <gfar...@redhat.com> wrote:
>>
>> On Mon, Feb 26, 2018 at 11:21 AM Reed Dier <reed.d...@focusvq.com> wrote:
>>
>>> The ‘good perf’ that I reported below was the result of beginning 5 new
>>> bluestore conversions which results in a leading edge of ‘good’
>>> performance, before trickling off.
>>>
>>> This performance lasted about 20 minutes, where it backfilled a small
>>> set of PGs off of non-bluestore OSDs.
>>>
>>> Current performance is now hovering around:
>>>
>>> pool objects-ssd id 20
>>>   recovery io 14285 kB/s, 202 objects/s
>>>
>>> pool fs-metadata-ssd id 16
>>>   recovery io 0 B/s, 262 keys/s, 12 objects/s
>>>   client io 412 kB/s rd, 67593 B/s wr, 5 op/s rd, 0 op/s wr
>>>
>>>
>>> What are you referencing when you talk about recovery ops per second?
>>>
>>> These are recovery ops as reported by ceph -s or via stats exported via
>>> influx plugin in mgr, and via local collectd collection.
>>>
>>> Also, what are the values for osd_recovery_sleep_hdd
>>> and osd_recovery_sleep_hybrid, and can you validate via "ceph osd metadata"
>>> that your BlueStore SSD OSDs are correctly reporting both themselves and
>>> their journals as non-rotational?
>>>
>>>
>>> This yields more interesting results.
>>> Pasting results for 3 sets of OSDs in this order
>>>  {0}hdd+nvme block.db
>>> {24}ssd+nvme block.db
>>> {59}ssd+nvme journal
>>>
>>> ceph osd metadata | grep 'id\|rotational'
>>> "id": 0,
>>>         "bluefs_db_rotational": "0",
>>>         "bluefs_slow_rotational": "1",
>>>         "bluestore_bdev_rotational": "1",
>>> *        "journal_rotational": "1",*
>>>         "rotational": "1"
>>>
>>> "id": 24,
>>>         "bluefs_db_rotational": "0",
>>>         "bluefs_slow_rotational": "0",
>>>         "bluestore_bdev_rotational": "0",
>>> *        "journal_rotational": "1",*
>>>         "rotational": "0"
>>>
>>> "id": 59,
>>>         "journal_rotational": "0",
>>>         "rotational": "0"
>>>
>>>
>>> I wonder if it matters/is correct to see "journal_rotational": "1" for
>>> the bluestore OSD’s {0,24} with nvme block.db.
>>>
>>> Hope this may be helpful in determining the root cause.
>>>
>>
>> If you have an SSD main store and a hard drive ("rotational") journal,
>> the OSD will insert recovery sleeps from the osd_recovery_sleep_hybrid
>> config option. By default that is .025 (seconds).
>>
>> I believe you can override the setting (I'm not sure how), but you really
>> want to correct that flag at the OS layer. Generally when we see this
>> there's a RAID card or something between the solid-state device and the
>> host which is lying about the state of the world.
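>>
>> A quick way to check, and temporarily override, what the kernel is
>> reporting (device name is just a placeholder):
>>
>> cat /sys/block/sdX/queue/rotational
>> echo 0 | sudo tee /sys/block/sdX/queue/rotational
>>
>> That override won't survive a reboot on its own (you'd need a udev rule or
>> similar to make it persistent), and I believe the OSD only samples the
>> flag at startup, so a restart would be needed in any case.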
>> -Greg
>>
>>
>>>
>>> If it helps, all of the OSD’s were originally deployed with ceph-deploy,
>>> but are now being redone with ceph-volume locally on each host.
>>>
>>> Thanks,
>>>
>>> Reed
>>>
>>> On Feb 26, 2018, at 1:00 PM, Gregory Farnum <gfar...@redhat.com> wrote:
>>>
>>> On Mon, Feb 26, 2018 at 9:12 AM Reed Dier <reed.d...@focusvq.com> wrote:
>>>
>>>> After my last round of backfills completed, I started 5 more bluestore
>>>> conversions, which helped me recognize a very specific pattern of
>>>> performance.
>>>>
>>>> pool objects-ssd id 20
>>>>   recovery io 757 MB/s, 10845 objects/s
>>>>
>>>> pool fs-metadata-ssd id 16
>>>>   recovery io 0 B/s, 36265 keys/s, 1633 objects/s
>>>>   client io 2544 kB/s rd, 36788 B/s wr, 1 op/s rd, 0 op/s wr
>>>>
>>>>
>>>> The “non-throttled” backfills are only coming from filestore SSD OSD’s.
>>>> When backfilling from bluestore SSD OSD’s, they appear to be throttled
>>>> at the aforementioned <20 ops per OSD.
>>>>
>>>
>>> Wait, is that the current state? What are you referencing when you talk
>>> about recovery ops per second?
>>>
>>> Also, what are the values for osd_recovery_sleep_hdd
>>> and osd_recovery_sleep_hybrid, and can you validate via "ceph osd metadata"
>>> that your BlueStore SSD OSDs are correctly reporting both themselves and
>>> their journals as non-rotational?
>>> -Greg
>>>
>>>
>>>>
>>>> This would corroborate why the first batch of SSD’s I migrated to
>>>> bluestore were all at “full” speed, as all of the OSD’s they were
>>>> backfilling from were filestore based; with the backfill sources
>>>> increasingly on bluestore, backfill times get increasingly long as I move
>>>> from one host to the next.
>>>>
>>>> Looking at the recovery settings, the recovery_sleep and
>>>> recovery_sleep_ssd values across bluestore or filestore OSDs are showing as
>>>> 0 values, which means no sleep/throttle if I am reading everything
>>>> correctly.
>>>>
>>>> sudo ceph daemon osd.73 config show | grep recovery
>>>>     "osd_allow_recovery_below_min_size": "true",
>>>>     "osd_debug_skip_full_check_in_recovery": "false",
>>>>     "osd_force_recovery_pg_log_entries_factor": "1.300000",
>>>>     "osd_min_recovery_priority": "0",
>>>>     "osd_recovery_cost": "20971520",
>>>>     "osd_recovery_delay_start": "0.000000",
>>>>     "osd_recovery_forget_lost_objects": "false",
>>>>     "osd_recovery_max_active": "35",
>>>>     "osd_recovery_max_chunk": "8388608",
>>>>     "osd_recovery_max_omap_entries_per_chunk": "64000",
>>>>     "osd_recovery_max_single_start": "1",
>>>>     "osd_recovery_op_priority": "3",
>>>>     "osd_recovery_op_warn_multiple": "16",
>>>>     "osd_recovery_priority": "5",
>>>>     "osd_recovery_retry_interval": "30.000000",
>>>> *    "osd_recovery_sleep": "0.000000",*
>>>>     "osd_recovery_sleep_hdd": "0.100000",
>>>>     "osd_recovery_sleep_hybrid": "0.025000",
>>>> *    "osd_recovery_sleep_ssd": "0.000000",*
>>>>     "osd_recovery_thread_suicide_timeout": "300",
>>>>     "osd_recovery_thread_timeout": "30",
>>>>     "osd_scrub_during_recovery": "false",
>>>>
>>>>
>>>> As far as I know, the device class is configured correctly; it all
>>>> shows as ssd/hdd correctly in ceph osd tree.
>>>>
>>>> So hopefully this may be enough of a smoking gun to help narrow down
>>>> where this may be stemming from.
>>>>
>>>> Thanks,
>>>>
>>>> Reed
>>>>
>>>> On Feb 23, 2018, at 10:04 AM, David Turner <drakonst...@gmail.com>
>>>> wrote:
>>>>
>>>> Here is a [1] link to a ML thread tracking some slow backfilling on
>>>> bluestore.  It came down to the backfill sleep setting for them.  Maybe it
>>>> will help.
>>>>
>>>> [1] https://www.mail-archive.com/ceph-users@lists.ceph.com/msg40256.html
>>>>
>>>> On Fri, Feb 23, 2018 at 10:46 AM Reed Dier <reed.d...@focusvq.com>
>>>> wrote:
>>>>
>>>>> Probably unrelated, but I do keep seeing this odd negative objects
>>>>> degraded message on the fs-metadata pool:
>>>>>
>>>>> pool fs-metadata-ssd id 16
>>>>>   -34/3 objects degraded (-1133.333%)
>>>>>   recovery io 0 B/s, 89 keys/s, 2 objects/s
>>>>>   client io 51289 B/s rd, 101 kB/s wr, 0 op/s rd, 0 op/s wr
>>>>>
>>>>>
>>>>> Don’t mean to clutter the ML/thread, however it did seem odd; maybe
>>>>> it’s a culprit? Maybe it’s some weird sampling interval issue that’s been
>>>>> solved in 12.2.3?
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Reed
>>>>>
>>>>>
>>>>> On Feb 23, 2018, at 8:26 AM, Reed Dier <reed.d...@focusvq.com> wrote:
>>>>>
>>>>> Below is ceph -s
>>>>>
>>>>>   cluster:
>>>>>     id:     {id}
>>>>>     health: HEALTH_WARN
>>>>>             noout flag(s) set
>>>>>             260610/1068004947 objects misplaced (0.024%)
>>>>>             Degraded data redundancy: 23157232/1068004947 objects
>>>>> degraded (2.168%), 332 pgs unclean, 328 pgs degraded, 328 pgs undersized
>>>>>
>>>>>   services:
>>>>>     mon: 3 daemons, quorum mon02,mon01,mon03
>>>>>     mgr: mon03(active), standbys: mon02
>>>>>     mds: cephfs-1/1/1 up  {0=mon03=up:active}, 1 up:standby
>>>>>     osd: 74 osds: 74 up, 74 in; 332 remapped pgs
>>>>>          flags noout
>>>>>
>>>>>   data:
>>>>>     pools:   5 pools, 5316 pgs
>>>>>     objects: 339M objects, 46627 GB
>>>>>     usage:   154 TB used, 108 TB / 262 TB avail
>>>>>     pgs:     23157232/1068004947 objects degraded (2.168%)
>>>>>              260610/1068004947 objects misplaced (0.024%)
>>>>>              4984 active+clean
>>>>>              183  active+undersized+degraded+remapped+backfilling
>>>>>              145  active+undersized+degraded+remapped+backfill_wait
>>>>>              3    active+remapped+backfill_wait
>>>>>              1    active+remapped+backfilling
>>>>>
>>>>>   io:
>>>>>     client:   8428 kB/s rd, 47905 B/s wr, 130 op/s rd, 0 op/s wr
>>>>>     recovery: 37057 kB/s, 50 keys/s, 217 objects/s
>>>>>
>>>>>
>>>>> Also, the two pools on the SSDs are the objects pool at 4096 PGs and
>>>>> the fs-metadata pool at 32 PGs.
>>>>>
>>>>> Are you sure the recovery is actually going slower, or are the
>>>>> individual ops larger or more expensive?
>>>>>
>>>>> The objects should not vary wildly in size.
>>>>> Even if they were differing in size, the SSDs are roughly idle in
>>>>> their current state of backfilling when examining wait in iotop, or atop,
>>>>> or sysstat/iostat.
>>>>>
>>>>> This compares to when I was fully saturating the SATA backplane with
>>>>> over 1000MB/s of writes to multiple disks when the backfills were going
>>>>> “full speed.”
>>>>>
>>>>> Here is a breakdown of recovery io by pool:
>>>>>
>>>>> pool objects-ssd id 20
>>>>>   recovery io 6779 kB/s, 92 objects/s
>>>>>   client io 3071 kB/s rd, 50 op/s rd, 0 op/s wr
>>>>>
>>>>> pool fs-metadata-ssd id 16
>>>>>   recovery io 0 B/s, 28 keys/s, 2 objects/s
>>>>>   client io 109 kB/s rd, 67455 B/s wr, 1 op/s rd, 0 op/s wr
>>>>>
>>>>> pool cephfs-hdd id 17
>>>>>   recovery io 40542 kB/s, 158 objects/s
>>>>>   client io 10056 kB/s rd, 142 op/s rd, 0 op/s wr
>>>>>
>>>>>
>>>>> So the 24 HDD’s are outperforming the 50 SSD’s for recovery and client
>>>>> traffic at the moment, which seems conspicuous to me.
>>>>>
>>>>> Most of the OSD’s with recovery ops to the SSDs are reporting 8-12
>>>>> ops, with one OSD occasionally spiking up to 300-500 for a few minutes.
>>>>> Stats are being pulled both by local CollectD instances on each node and
>>>>> by the Influx plugin in MGR, as we evaluate the latter against collectd.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Reed
>>>>>
>>>>>
>>>>> On Feb 22, 2018, at 6:21 PM, Gregory Farnum <gfar...@redhat.com>
>>>>> wrote:
>>>>>
>>>>> What's the output of "ceph -s" while this is happening?
>>>>>
>>>>> Is there some identifiable difference between these two states, like
>>>>> you get a lot of throughput on the data pools but then metadata recovery is
>>>>> slower?
>>>>>
>>>>> Are you sure the recovery is actually going slower, or are the
>>>>> individual ops larger or more expensive?
>>>>>
>>>>> My WAG is that recovering the metadata pool, composed mostly of
>>>>> directories stored in omap objects, is going much slower for some reason.
>>>>> You can adjust the cost of those individual ops some by
>>>>> changing osd_recovery_max_omap_entries_per_chunk (default: 8096), but
>>>>> I'm not sure which way you want to go or indeed if this has anything to do
>>>>> with the problem you're seeing. (eg, it could be that reading out the omaps
>>>>> is expensive, so you can get higher recovery op numbers by turning down the
>>>>> number of entries per request, but not actually see faster backfilling
>>>>> because you have to issue more requests.)
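>>>>>
>>>>> If you do want to experiment with it, an injectargs sketch (the value here
>>>>> is purely illustrative):
>>>>>
>>>>> ceph tell osd.* injectargs '--osd_recovery_max_omap_entries_per_chunk 1024'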
>>>>> -Greg
>>>>>
>>>>> On Wed, Feb 21, 2018 at 2:57 PM Reed Dier <reed.d...@focusvq.com>
>>>>> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> I am running into an odd situation that I cannot easily explain.
>>>>>> I am currently in the midst of a destroy and rebuild of OSDs from
>>>>>> filestore to bluestore.
>>>>>> With my HDDs, I am seeing expected behavior, but with my SSDs I am
>>>>>> seeing unexpected behavior. The HDDs and SSDs are set in crush 
>>>>>> accordingly.
>>>>>>
>>>>>> My path to replacing the OSDs is to set the noout, norecover, and
>>>>>> norebalance flags, destroy the OSD, recreate the OSD (iterating n times,
>>>>>> all within a single failure domain), unset the flags, and let it go. It
>>>>>> finishes; rinse, repeat.
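>>>>>>
>>>>>> Roughly, per batch, that looks something like the following (device paths,
>>>>>> the osd id, and the exact ceph-volume invocation are placeholders rather
>>>>>> than the literal commands):
>>>>>>
>>>>>> # pause data movement while the OSDs are being rebuilt
>>>>>> ceph osd set noout
>>>>>> ceph osd set norecover
>>>>>> ceph osd set norebalance
>>>>>> # mark the OSD destroyed so its id can be reused by the new bluestore OSD
>>>>>> ceph osd destroy {osd-id} --yes-i-really-mean-it
>>>>>> # wipe the old device and recreate it as bluestore with an NVMe block.db
>>>>>> ceph-volume lvm zap /dev/sdX
>>>>>> ceph-volume lvm create --bluestore --data /dev/sdX --block.db /dev/nvme0n1pY
>>>>>> # let backfill/recovery loose again once the batch is recreated
>>>>>> ceph osd unset norebalance
>>>>>> ceph osd unset norecover
>>>>>> ceph osd unset noout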
>>>>>>
>>>>>> For the SSD OSDs, they are SATA SSDs (Samsung SM863a), 10 to a node,
>>>>>> with 2 NVMe drives (Intel P3700), 5 SATA SSDs to 1 NVMe drive, 16G
>>>>>> partitions for block.db (previously filestore journals).
>>>>>> 2x10GbE networking between the nodes. SATA backplane caps out at
>>>>>> around 10 Gb/s, as it’s 2x 6 Gb/s controllers. Luminous 12.2.2.
>>>>>>
>>>>>> When the flags are unset, recovery starts and I see a very large rush
>>>>>> of traffic; however, after the first machine completed, the performance
>>>>>> tapered off at a rapid pace and now just trickles. Comparatively, I’m getting
>>>>>> 100-200 recovery ops on 3 HDDs backfilling from 21 other HDDs, whereas
>>>>>> I’m getting 150-250 recovery ops on 5 SSDs backfilling from 40 other
>>>>>> SSDs.
>>>>>> Every once in a while I will see a spike up to 500, 1000, or even 2000 ops
>>>>>> on the SSDs, often a few hundred recovery ops from one OSD, and 8-15 ops
>>>>>> from the others that are backfilling.
>>>>>>
>>>>>> This is a far cry from the more than 15-30k recovery ops that it
>>>>>> started off recovering, with 1-3k recovery ops from a single OSD to the
>>>>>> backfilling OSD(s). And an even farther cry from the >15k recovery ops I
>>>>>> was sustaining for over an hour or more before. I was able to rebuild a
>>>>>> 1.9T SSD (1.1T used) in a little under an hour, and I could do about 5 
>>>>>> at a
>>>>>> time and still keep it at roughly an hour to backfill all of them, but 
>>>>>> then
>>>>>> I hit a roadblock after the first machine, when I tried to do 10 at a 
>>>>>> time
>>>>>> (single machine). I am now still experiencing the same thing on the third
>>>>>> node, while doing 5 OSDs at a time.
>>>>>>
>>>>>> The pools associated with these SSDs are cephfs-metadata, as well as
>>>>>> a pure rados object pool we use for our own internal applications. Both are
>>>>>> size=3, min_size=2.
>>>>>>
>>>>>> It appears I am not the first to run into this, but it looks like
>>>>>> there was no resolution:
>>>>>> https://www.spinics.net/lists/ceph-users/msg41493.html
>>>>>>
>>>>>> Recovery parameters for the OSDs match what was in the previous
>>>>>> thread, sans the osd conf block listed. And currently osd_max_backfills = 30
>>>>>> and osd_recovery_max_active = 35. Very little activity on the OSDs during
>>>>>> this period, so there should not be any contention for iops on the SSDs.
>>>>>>
>>>>>> The only oddity I can point to is that we had a few periods where the
>>>>>> disk load on one of the mons was high enough to cause the mon to drop out
>>>>>> of quorum briefly, a few times. But I wouldn’t think backfills would just
>>>>>> get throttled due to mons flapping.
>>>>>>
>>>>>> Hopefully someone has some experience or can steer me in a path to
>>>>>> improve the performance of the backfills so that I’m not stuck in backfill
>>>>>> purgatory longer than I need to be.
>>>>>>
>>>>>> Linking an imgur album with some screen grabs of the recovery ops
>>>>>> over time for the first machine, versus the second and third machines to
>>>>>> demonstrate the delta between them.
>>>>>> https://imgur.com/a/OJw4b
>>>>>>
>>>>>> Also including a ceph osd df of the SSDs; highlighted in red are the
>>>>>> OSDs currently backfilling. Could this possibly be PG overdose? I don’t
>>>>>> ever run into ‘stuck activating’ PGs; it’s just the painfully slow backfills,
>>>>>> as if they are being throttled by ceph, that are causing me to worry. Drives
>>>>>> aren’t worn, <30 P/E cycles on the drives, so plenty of life left in them.
>>>>>>
>>>>>> Thanks,
>>>>>> Reed
>>>>>>
>>>>>> $ ceph osd df
>>>>>> ID CLASS WEIGHT  REWEIGHT SIZE  USE   AVAIL %USE  VAR  PGS
>>>>>> 24   ssd 1.76109  1.00000 1803G 1094G  708G 60.69 1.08 260
>>>>>> 25   ssd 1.76109  1.00000 1803G 1136G  667G 63.01 1.12 271
>>>>>> 26   ssd 1.76109  1.00000 1803G 1018G  785G 56.46 1.01 243
>>>>>> 27   ssd 1.76109  1.00000 1803G 1065G  737G 59.10 1.05 253
>>>>>> 28   ssd 1.76109  1.00000 1803G 1026G  776G 56.94 1.02 245
>>>>>> 29   ssd 1.76109  1.00000 1803G 1132G  671G 62.79 1.12 270
>>>>>> 30   ssd 1.76109  1.00000 1803G  944G  859G 52.35 0.93 224
>>>>>> 31   ssd 1.76109  1.00000 1803G 1061G  742G 58.85 1.05 252
>>>>>> 32   ssd 1.76109  1.00000 1803G 1003G  799G 55.67 0.99 239
>>>>>> 33   ssd 1.76109  1.00000 1803G 1049G  753G 58.20 1.04 250
>>>>>> 34   ssd 1.76109  1.00000 1803G 1086G  717G 60.23 1.07 257
>>>>>> 35   ssd 1.76109  1.00000 1803G  978G  824G 54.26 0.97 232
>>>>>> 36   ssd 1.76109  1.00000 1803G 1057G  745G 58.64 1.05 252
>>>>>> 37   ssd 1.76109  1.00000 1803G 1025G  777G 56.88 1.01 244
>>>>>> 38   ssd 1.76109  1.00000 1803G 1047G  756G 58.06 1.04 250
>>>>>> 39   ssd 1.76109  1.00000 1803G 1031G  771G 57.20 1.02 246
>>>>>> 40   ssd 1.76109  1.00000 1803G 1029G  774G 57.07 1.02 245
>>>>>> 41   ssd 1.76109  1.00000 1803G 1033G  770G 57.28 1.02 245
>>>>>> 42   ssd 1.76109  1.00000 1803G  993G  809G 55.10 0.98 236
>>>>>> 43   ssd 1.76109  1.00000 1803G 1072G  731G 59.45 1.06 256
>>>>>> 44   ssd 1.76109  1.00000 1803G 1039G  763G 57.64 1.03 248
>>>>>> 45   ssd 1.76109  1.00000 1803G  992G  810G 55.06 0.98 236
>>>>>> 46   ssd 1.76109  1.00000 1803G 1068G  735G 59.23 1.06 254
>>>>>> 47   ssd 1.76109  1.00000 1803G 1020G  783G 56.57 1.01 242
>>>>>> 48   ssd 1.76109  1.00000 1803G  945G  857G 52.44 0.94 225
>>>>>> 49   ssd 1.76109  1.00000 1803G  649G 1154G 36.01 0.64 139
>>>>>> 50   ssd 1.76109  1.00000 1803G  426G 1377G 23.64 0.42  83
>>>>>> 51   ssd 1.76109  1.00000 1803G  610G 1193G 33.84 0.60 131
>>>>>> 52   ssd 1.76109  1.00000 1803G  558G 1244G 30.98 0.55 118
>>>>>> 53   ssd 1.76109  1.00000 1803G  731G 1072G 40.54 0.72 161
>>>>>> 54   ssd 1.74599  1.00000 1787G  859G  928G 48.06 0.86 229
>>>>>> 55   ssd 1.74599  1.00000 1787G  942G  844G 52.74 0.94 252
>>>>>> 56   ssd 1.74599  1.00000 1787G  928G  859G 51.94 0.93 246
>>>>>> 57   ssd 1.74599  1.00000 1787G 1039G  748G 58.15 1.04 277
>>>>>> 58   ssd 1.74599  1.00000 1787G  963G  824G 53.87 0.96 255
>>>>>> 59   ssd 1.74599  1.00000 1787G  909G  877G 50.89 0.91 241
>>>>>> 60   ssd 1.74599  1.00000 1787G 1039G  748G 58.15 1.04 277
>>>>>> 61   ssd 1.74599  1.00000 1787G  892G  895G 49.91 0.89 238
>>>>>> 62   ssd 1.74599  1.00000 1787G  927G  859G 51.90 0.93 245
>>>>>> 63   ssd 1.74599  1.00000 1787G  864G  922G 48.39 0.86 229
>>>>>> 64   ssd 1.74599  1.00000 1787G  968G  819G 54.16 0.97 257
>>>>>> 65   ssd 1.74599  1.00000 1787G  892G  894G 49.93 0.89 237
>>>>>> 66   ssd 1.74599  1.00000 1787G  951G  836G 53.23 0.95 252
>>>>>> 67   ssd 1.74599  1.00000 1787G  878G  908G 49.16 0.88 232
>>>>>> 68   ssd 1.74599  1.00000 1787G  899G  888G 50.29 0.90 238
>>>>>> 69   ssd 1.74599  1.00000 1787G  948G  839G 53.04 0.95 252
>>>>>> 70   ssd 1.74599  1.00000 1787G  914G  873G 51.15 0.91 246
>>>>>> 71   ssd 1.74599  1.00000 1787G 1004G  782G 56.21 1.00 266
>>>>>> 72   ssd 1.74599  1.00000 1787G  812G  974G 45.47 0.81 216
>>>>>> 73   ssd 1.74599  1.00000 1787G  932G  855G 52.15 0.93 247
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>>
>>
>
>
>
>
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
