On Mon, Feb 26, 2018 at 11:21 AM Reed Dier <reed.d...@focusvq.com> wrote:
> The ‘good perf’ that I reported below was the result of beginning 5 new
> bluestore conversions which results in a leading edge of ‘good’
> performance, before trickling off.
>
> This performance lasted about 20 minutes, where it backfilled a small set
> of PGs off of non-bluestore OSDs.
>
> Current performance is now hovering around:
>
> pool objects-ssd id 20
>   recovery io 14285 kB/s, 202 objects/s
>
> pool fs-metadata-ssd id 16
>   recovery io 0 B/s, 262 keys/s, 12 objects/s
>   client io 412 kB/s rd, 67593 B/s wr, 5 op/s rd, 0 op/s wr
>
>
> What are you referencing when you talk about recovery ops per second?
>
> These are recovery ops as reported by ceph -s or via stats exported via
> influx plugin in mgr, and via local collectd collection.
>
> Also, what are the values for osd_recovery_sleep_hdd
> and osd_recovery_sleep_hybrid, and can you validate via "ceph osd metadata"
> that your BlueStore SSD OSDs are correctly reporting both themselves and
> their journals as non-rotational?
>
>
> This yields more interesting results.
> Pasting results for 3 sets of OSDs in this order
> {0}hdd+nvme block.db
> {24}ssd+nvme block.db
> {59}ssd+nvme journal
>
> ceph osd metadata | grep 'id\|rotational'
> "id": 0,
> "bluefs_db_rotational": "0",
> "bluefs_slow_rotational": "1",
> "bluestore_bdev_rotational": "1",
> * "journal_rotational": "1",*
> "rotational": "1"
>
> "id": 24,
> "bluefs_db_rotational": "0",
> "bluefs_slow_rotational": "0",
> "bluestore_bdev_rotational": "0",
> * "journal_rotational": "1",*
> "rotational": "0"
>
> "id": 59,
> "journal_rotational": "0",
> "rotational": "0"
>
>
> I wonder if it matters/is correct to see "journal_rotational": "1" for the
> bluestore OSD’s {0,24} with nvme block.db.
>
> Hope this may be helpful in determining the root cause.
>

If you have an SSD main store and a hard drive ("rotational") journal, the
OSD will insert recovery sleeps from the osd_recovery_sleep_hybrid config
option. By default that is .025 (seconds).
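
To make the check above concrete, here is a minimal sketch in Python. It assumes the JSON array that "ceph osd metadata" prints when called without an OSD id, and uses the three OSDs quoted above as sample data. The helper names mismatched_osds and max_ops_per_second are hypothetical, and the rate ceiling is only a back-of-the-envelope model (one sleep after every recovery op), not a description of the OSD's actual scheduler.

```python
import json

# Sample data copied from the `ceph osd metadata | grep 'id\|rotational'`
# output quoted above (OSDs 0, 24 and 59).
SAMPLE = """[
  {"id": 0, "bluefs_db_rotational": "0", "bluefs_slow_rotational": "1",
   "bluestore_bdev_rotational": "1", "journal_rotational": "1",
   "rotational": "1"},
  {"id": 24, "bluefs_db_rotational": "0", "bluefs_slow_rotational": "0",
   "bluestore_bdev_rotational": "0", "journal_rotational": "1",
   "rotational": "0"},
  {"id": 59, "journal_rotational": "0", "rotational": "0"}
]"""


def mismatched_osds(metadata):
    """Ids of OSDs whose main store reports non-rotational ("0")
    while the journal is flagged rotational ("1")."""
    return [
        osd["id"]
        for osd in metadata
        if osd.get("rotational") == "0"
        and osd.get("journal_rotational") == "1"
    ]


def max_ops_per_second(sleep_seconds):
    """Rough ceiling on recovery ops/s for one OSD, ASSUMING exactly one
    sleep of `sleep_seconds` follows every recovery op."""
    if sleep_seconds <= 0:
        return float("inf")  # no sleep configured, so no ceiling from it
    return 1.0 / sleep_seconds


if __name__ == "__main__":
    osds = json.loads(SAMPLE)
    print("SSD OSDs with a rotational journal flag:", mismatched_osds(osds))
    for sleep in (0.0, 0.025, 0.1):
        print(f"sleep={sleep}s -> ceiling {max_ops_per_second(sleep)} ops/s")
```

On a live cluster the same function could be fed the parsed output of "ceph osd metadata" instead of SAMPLE. Under the stated assumption, any non-zero sleep caps a single OSD at a few tens of recovery ops per second, which is the order of magnitude being reported for the bluestore SSD OSDs.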
I believe you can override the setting (I'm not sure how), but you really
want to correct that flag at the OS layer. Generally when we see this
there's a RAID card or something between the solid-state device and the
host which is lying about the state of the world.
-Greg

>
> If it helps, all of the OSD’s were originally deployed with ceph-deploy,
> but are now being redone with ceph-volume locally on each host.
>
> Thanks,
>
> Reed
>
> On Feb 26, 2018, at 1:00 PM, Gregory Farnum <gfar...@redhat.com> wrote:
>
> On Mon, Feb 26, 2018 at 9:12 AM Reed Dier <reed.d...@focusvq.com> wrote:
>
>> After my last round of backfills completed, I started 5 more bluestore
>> conversions, which helped me recognize a very specific pattern of
>> performance.
>>
>> pool objects-ssd id 20
>>   recovery io 757 MB/s, 10845 objects/s
>>
>> pool fs-metadata-ssd id 16
>>   recovery io 0 B/s, 36265 keys/s, 1633 objects/s
>>   client io 2544 kB/s rd, 36788 B/s wr, 1 op/s rd, 0 op/s wr
>>
>>
>> The “non-throttled” backfills are only coming from filestore SSD OSD’s.
>> When backfilling from bluestore SSD OSD’s, they appear to be throttled at
>> the aforementioned <20 ops per OSD.
>>
>
> Wait, is that the current state? What are you referencing when you talk
> about recovery ops per second?
>
> Also, what are the values for osd_recovery_sleep_hdd
> and osd_recovery_sleep_hybrid, and can you validate via "ceph osd metadata"
> that your BlueStore SSD OSDs are correctly reporting both themselves and
> their journals as non-rotational?
> -Greg
>
>
>>
>> This would corroborate why the first batch of SSD’s I migrated to
>> bluestore were all at “full” speed, as all of the OSD’s they were
>> backfilling from were filestore based, compared to increasingly bluestore
>> backfill targets, leading to increasingly long backfill times as I move
>> from one host to the next.
>>
>> Looking at the recovery settings, the recovery_sleep and
>> recovery_sleep_ssd values across bluestore or filestore OSDs are showing as
>> 0 values, which means no sleep/throttle if I am reading everything
>> correctly.
>>
>> sudo ceph daemon osd.73 config show | grep recovery
>> "osd_allow_recovery_below_min_size": "true",
>> "osd_debug_skip_full_check_in_recovery": "false",
>> "osd_force_recovery_pg_log_entries_factor": "1.300000",
>> "osd_min_recovery_priority": "0",
>> "osd_recovery_cost": "20971520",
>> "osd_recovery_delay_start": "0.000000",
>> "osd_recovery_forget_lost_objects": "false",
>> "osd_recovery_max_active": "35",
>> "osd_recovery_max_chunk": "8388608",
>> "osd_recovery_max_omap_entries_per_chunk": "64000",
>> "osd_recovery_max_single_start": "1",
>> "osd_recovery_op_priority": "3",
>> "osd_recovery_op_warn_multiple": "16",
>> "osd_recovery_priority": "5",
>> "osd_recovery_retry_interval": "30.000000",
>> * "osd_recovery_sleep": "0.000000",*
>> "osd_recovery_sleep_hdd": "0.100000",
>> "osd_recovery_sleep_hybrid": "0.025000",
>> * "osd_recovery_sleep_ssd": "0.000000",*
>> "osd_recovery_thread_suicide_timeout": "300",
>> "osd_recovery_thread_timeout": "30",
>> "osd_scrub_during_recovery": "false",
>>
>>
>> As far as I know, the device class is configured correctly; it all shows
>> as ssd/hdd correctly in ceph osd tree.
>>
>> So hopefully this may be enough of a smoking gun to help narrow down
>> where this may be stemming from.
>>
>> Thanks,
>>
>> Reed
>>
>> On Feb 23, 2018, at 10:04 AM, David Turner <drakonst...@gmail.com> wrote:
>>
>> Here is a [1] link to a ML thread tracking some slow backfilling on
>> bluestore. It came down to the backfill sleep setting for them. Maybe it
>> will help.
>>
>> [1] https://www.mail-archive.com/ceph-users@lists.ceph.com/msg40256.html
>>
>> On Fri, Feb 23, 2018 at 10:46 AM Reed Dier <reed.d...@focusvq.com> wrote:
>>
>>> Probably unrelated, but I do keep seeing this odd negative objects
>>> degraded message on the fs-metadata pool:
>>>
>>> pool fs-metadata-ssd id 16
>>>   -34/3 objects degraded (-1133.333%)
>>>   recovery io 0 B/s, 89 keys/s, 2 objects/s
>>>   client io 51289 B/s rd, 101 kB/s wr, 0 op/s rd, 0 op/s wr
>>>
>>>
>>> Don’t mean to clutter the ML/thread, however it did seem odd, maybe it’s
>>> a culprit? Maybe it’s some weird sampling interval issue that’s been solved
>>> in 12.2.3?
>>>
>>> Thanks,
>>>
>>> Reed
>>>
>>>
>>> On Feb 23, 2018, at 8:26 AM, Reed Dier <reed.d...@focusvq.com> wrote:
>>>
>>> Below is ceph -s
>>>
>>>   cluster:
>>>     id:     {id}
>>>     health: HEALTH_WARN
>>>             noout flag(s) set
>>>             260610/1068004947 objects misplaced (0.024%)
>>>             Degraded data redundancy: 23157232/1068004947 objects degraded (2.168%), 332 pgs unclean, 328 pgs degraded, 328 pgs undersized
>>>
>>>   services:
>>>     mon: 3 daemons, quorum mon02,mon01,mon03
>>>     mgr: mon03(active), standbys: mon02
>>>     mds: cephfs-1/1/1 up {0=mon03=up:active}, 1 up:standby
>>>     osd: 74 osds: 74 up, 74 in; 332 remapped pgs
>>>          flags noout
>>>
>>>   data:
>>>     pools:   5 pools, 5316 pgs
>>>     objects: 339M objects, 46627 GB
>>>     usage:   154 TB used, 108 TB / 262 TB avail
>>>     pgs:     23157232/1068004947 objects degraded (2.168%)
>>>              260610/1068004947 objects misplaced (0.024%)
>>>              4984 active+clean
>>>              183  active+undersized+degraded+remapped+backfilling
>>>              145  active+undersized+degraded+remapped+backfill_wait
>>>              3    active+remapped+backfill_wait
>>>              1    active+remapped+backfilling
>>>
>>>   io:
>>>     client:   8428 kB/s rd, 47905 B/s wr, 130 op/s rd, 0 op/s wr
>>>     recovery: 37057 kB/s, 50 keys/s, 217 objects/s
>>>
>>>
>>> Also, the two pools on the SSDs are the objects pool at 4096 PG, and the
>>> fs-metadata pool at 32 PG.
>>>
>>> Are you sure the recovery is actually going slower, or are the
>>> individual ops larger or more expensive?
>>>
>>> The objects should not vary wildly in size.
>>> Even if they were differing in size, the SSDs are roughly idle in their
>>> current state of backfilling when examining wait in iotop, or atop, or
>>> sysstat/iostat.
>>>
>>> This compares to when I was fully saturating the SATA backplane with
>>> over 1000MB/s of writes to multiple disks when the backfills were going
>>> “full speed.”
>>>
>>> Here is a breakdown of recovery io by pool:
>>>
>>> pool objects-ssd id 20
>>>   recovery io 6779 kB/s, 92 objects/s
>>>   client io 3071 kB/s rd, 50 op/s rd, 0 op/s wr
>>>
>>> pool fs-metadata-ssd id 16
>>>   recovery io 0 B/s, 28 keys/s, 2 objects/s
>>>   client io 109 kB/s rd, 67455 B/s wr, 1 op/s rd, 0 op/s wr
>>>
>>> pool cephfs-hdd id 17
>>>   recovery io 40542 kB/s, 158 objects/s
>>>   client io 10056 kB/s rd, 142 op/s rd, 0 op/s wr
>>>
>>>
>>> So the 24 HDD’s are outperforming the 50 SSD’s for recovery and client
>>> traffic at the moment, which seems conspicuous to me.
>>>
>>> Most of the OSD’s with recovery ops to the SSDs are reporting 8-12 ops,
>>> with one OSD occasionally spiking up to 300-500 for a few minutes. Stats
>>> being pulled by both local CollectD instances on each node, as well as the
>>> Influx plugin in MGR as we evaluate that against collectd.
>>>
>>> Thanks,
>>>
>>> Reed
>>>
>>>
>>> On Feb 22, 2018, at 6:21 PM, Gregory Farnum <gfar...@redhat.com> wrote:
>>>
>>> What's the output of "ceph -s" while this is happening?
>>>
>>> Is there some identifiable difference between these two states, like you
>>> get a lot of throughput on the data pools but then metadata recovery is
>>> slower?
>>>
>>> Are you sure the recovery is actually going slower, or are the
>>> individual ops larger or more expensive?
>>>
>>> My WAG is that recovering the metadata pool, composed mostly of
>>> directories stored in omap objects, is going much slower for some reason.
>>> You can adjust the cost of those individual ops some by
>>> changing osd_recovery_max_omap_entries_per_chunk (default: 8096), but I'm
>>> not sure which way you want to go or indeed if this has anything to do with
>>> the problem you're seeing. (eg, it could be that reading out the omaps is
>>> expensive, so you can get higher recovery op numbers by turning down the
>>> number of entries per request, but not actually see faster backfilling
>>> because you have to issue more requests.)
>>> -Greg
>>>
>>> On Wed, Feb 21, 2018 at 2:57 PM Reed Dier <reed.d...@focusvq.com> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I am running into an odd situation that I cannot easily explain.
>>>> I am currently in the midst of destroy and rebuild of OSDs from
>>>> filestore to bluestore.
>>>> With my HDDs, I am seeing expected behavior, but with my SSDs I am
>>>> seeing unexpected behavior. The HDDs and SSDs are set in crush accordingly.
>>>>
>>>> My path to replacing the OSDs is to set the noout, norecover,
>>>> norebalance flag, destroy the OSD, create the OSD back, (iterate n times,
>>>> all within a single failure domain), unset the flags, and let it go. It
>>>> finishes, rinse, repeat.
>>>>
>>>> For the SSD OSDs, they are SATA SSDs (Samsung SM863a), 10 to a node,
>>>> with 2 NVMe drives (Intel P3700), 5 SATA SSDs to 1 NVMe drive, 16G
>>>> partitions for block.db (previously filestore journals).
>>>> 2x10GbE networking between the nodes. SATA backplane caps out at around
>>>> 10 Gb/s as it’s 2x 6 Gb/s controllers. Luminous 12.2.2.
>>>>
>>>> When the flags are unset, recovery starts and I see a very large rush
>>>> of traffic, however, after the first machine completed, the performance
>>>> tapered off at a rapid pace and trickles.
>>>> Comparatively, I’m getting 100-200 recovery ops on 3 HDDs, backfilling
>>>> from 21 other HDDs, whereas I’m getting 150-250 recovery ops on 5 SSDs,
>>>> backfilling from 40 other SSDs.
>>>> Every once in a while I will see a spike up to 500, 1000, or even 2000 ops
>>>> on the SSDs, often a few hundred recovery ops from one OSD, and 8-15 ops
>>>> from the others that are backfilling.
>>>>
>>>> This is a far cry from the more than 15-30k recovery ops that it
>>>> started off recovering with 1-3k recovery ops from a single OSD to the
>>>> backfilling OSD(s). And an even farther cry from the >15k recovery ops I
>>>> was sustaining for over an hour or more before. I was able to rebuild a
>>>> 1.9T SSD (1.1T used) in a little under an hour, and I could do about 5 at a
>>>> time and still keep it at roughly an hour to backfill all of them, but then
>>>> I hit a roadblock after the first machine, when I tried to do 10 at a time
>>>> (single machine). I am now still experiencing the same thing on the third
>>>> node, while doing 5 OSDs at a time.
>>>>
>>>> The pools associated with these SSDs are cephfs-metadata, as well as a
>>>> pure rados object pool we use for our own internal applications. Both are
>>>> size=3, min_size=2.
>>>>
>>>> It appears I am not the first to run into this, but it looks like there
>>>> was no resolution:
>>>> https://www.spinics.net/lists/ceph-users/msg41493.html
>>>>
>>>> Recovery parameters for the OSDs match what was in the previous thread,
>>>> sans the osd conf block listed. And current osd_max_backfills = 30 and
>>>> osd_recovery_max_active = 35. Very little activity on the OSDs during this
>>>> period, so should not be any contention for iops on the SSDs.
>>>>
>>>> The only oddity that I can attribute to things is that we had a few
>>>> periods of time where the disk load on one of the mons was high enough to
>>>> cause the mon to drop out of quorum for a brief amount of time, a few
>>>> times.
>>>> But I wouldn’t think backfills would just get throttled due to mons
>>>> flapping.
>>>>
>>>> Hopefully someone has some experience or can steer me in a path to
>>>> improve the performance of the backfills so that I’m not stuck in backfill
>>>> purgatory longer than I need to be.
>>>>
>>>> Linking an imgur album with some screen grabs of the recovery ops over
>>>> time for the first machine, versus the second and third machines to
>>>> demonstrate the delta between them.
>>>> https://imgur.com/a/OJw4b
>>>>
>>>> Also including a ceph osd df of the SSDs, highlighted in red are the
>>>> OSDs currently backfilling. Could this possibly be PG overdose? I don’t
>>>> ever run into ‘stuck activating’ PGs, it’s just painfully slow backfills,
>>>> like they are being throttled by ceph, that are causing me to worry. Drives
>>>> aren’t worn, <30 P/E cycles on the drives, so plenty of life left in them.
>>>>
>>>> Thanks,
>>>> Reed
>>>>
>>>> $ ceph osd df
>>>> ID CLASS WEIGHT  REWEIGHT SIZE  USE   AVAIL %USE  VAR  PGS
>>>> 24   ssd 1.76109  1.00000 1803G 1094G  708G 60.69 1.08 260
>>>> 25   ssd 1.76109  1.00000 1803G 1136G  667G 63.01 1.12 271
>>>> 26   ssd 1.76109  1.00000 1803G 1018G  785G 56.46 1.01 243
>>>> 27   ssd 1.76109  1.00000 1803G 1065G  737G 59.10 1.05 253
>>>> 28   ssd 1.76109  1.00000 1803G 1026G  776G 56.94 1.02 245
>>>> 29   ssd 1.76109  1.00000 1803G 1132G  671G 62.79 1.12 270
>>>> 30   ssd 1.76109  1.00000 1803G  944G  859G 52.35 0.93 224
>>>> 31   ssd 1.76109  1.00000 1803G 1061G  742G 58.85 1.05 252
>>>> 32   ssd 1.76109  1.00000 1803G 1003G  799G 55.67 0.99 239
>>>> 33   ssd 1.76109  1.00000 1803G 1049G  753G 58.20 1.04 250
>>>> 34   ssd 1.76109  1.00000 1803G 1086G  717G 60.23 1.07 257
>>>> 35   ssd 1.76109  1.00000 1803G  978G  824G 54.26 0.97 232
>>>> 36   ssd 1.76109  1.00000 1803G 1057G  745G 58.64 1.05 252
>>>> 37   ssd 1.76109  1.00000 1803G 1025G  777G 56.88 1.01 244
>>>> 38   ssd 1.76109  1.00000 1803G 1047G  756G 58.06 1.04 250
>>>> 39   ssd 1.76109  1.00000 1803G 1031G  771G 57.20 1.02 246
>>>> 40   ssd 1.76109  1.00000 1803G 1029G  774G 57.07 1.02 245
>>>> 41   ssd 1.76109  1.00000 1803G 1033G  770G 57.28 1.02 245
>>>> 42   ssd 1.76109  1.00000 1803G  993G  809G 55.10 0.98 236
>>>> 43   ssd 1.76109  1.00000 1803G 1072G  731G 59.45 1.06 256
>>>> 44   ssd 1.76109  1.00000 1803G 1039G  763G 57.64 1.03 248
>>>> 45   ssd 1.76109  1.00000 1803G  992G  810G 55.06 0.98 236
>>>> 46   ssd 1.76109  1.00000 1803G 1068G  735G 59.23 1.06 254
>>>> 47   ssd 1.76109  1.00000 1803G 1020G  783G 56.57 1.01 242
>>>> 48   ssd 1.76109  1.00000 1803G  945G  857G 52.44 0.94 225
>>>> 49   ssd 1.76109  1.00000 1803G  649G 1154G 36.01 0.64 139
>>>> 50   ssd 1.76109  1.00000 1803G  426G 1377G 23.64 0.42  83
>>>> 51   ssd 1.76109  1.00000 1803G  610G 1193G 33.84 0.60 131
>>>> 52   ssd 1.76109  1.00000 1803G  558G 1244G 30.98 0.55 118
>>>> 53   ssd 1.76109  1.00000 1803G  731G 1072G 40.54 0.72 161
>>>> 54   ssd 1.74599  1.00000 1787G  859G  928G 48.06 0.86 229
>>>> 55   ssd 1.74599  1.00000 1787G  942G  844G 52.74 0.94 252
>>>> 56   ssd 1.74599  1.00000 1787G  928G  859G 51.94 0.93 246
>>>> 57   ssd 1.74599  1.00000 1787G 1039G  748G 58.15 1.04 277
>>>> 58   ssd 1.74599  1.00000 1787G  963G  824G 53.87 0.96 255
>>>> 59   ssd 1.74599  1.00000 1787G  909G  877G 50.89 0.91 241
>>>> 60   ssd 1.74599  1.00000 1787G 1039G  748G 58.15 1.04 277
>>>> 61   ssd 1.74599  1.00000 1787G  892G  895G 49.91 0.89 238
>>>> 62   ssd 1.74599  1.00000 1787G  927G  859G 51.90 0.93 245
>>>> 63   ssd 1.74599  1.00000 1787G  864G  922G 48.39 0.86 229
>>>> 64   ssd 1.74599  1.00000 1787G  968G  819G 54.16 0.97 257
>>>> 65   ssd 1.74599  1.00000 1787G  892G  894G 49.93 0.89 237
>>>> 66   ssd 1.74599  1.00000 1787G  951G  836G 53.23 0.95 252
>>>> 67   ssd 1.74599  1.00000 1787G  878G  908G 49.16 0.88 232
>>>> 68   ssd 1.74599  1.00000 1787G  899G  888G 50.29 0.90 238
>>>> 69   ssd 1.74599  1.00000 1787G  948G  839G 53.04 0.95 252
>>>> 70   ssd 1.74599  1.00000 1787G  914G  873G 51.15 0.91 246
>>>> 71   ssd 1.74599  1.00000 1787G 1004G  782G 56.21 1.00 266
>>>> 72   ssd 1.74599  1.00000 1787G  812G  974G 45.47 0.81 216
>>>> 73   ssd 1.74599  1.00000 1787G  932G  855G 52.15 0.93 247
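
The red highlighting mentioned above does not survive in plain text. As a rough stand-in, the sketch below (Python) picks out OSDs whose VAR column sits well below 1.0 in "ceph osd df" style output. It is an assumption, not something stated in the thread, that the under-filled OSDs are the ones still backfilling; the 0.75 threshold and the name low_var_osds are likewise hypothetical. The sample rows are a subset copied from the table above.

```python
# Subset of rows copied from the `ceph osd df` table above.
SAMPLE = """\
ID CLASS WEIGHT  REWEIGHT SIZE  USE   AVAIL %USE  VAR  PGS
48   ssd 1.76109  1.00000 1803G  945G  857G 52.44 0.94 225
49   ssd 1.76109  1.00000 1803G  649G 1154G 36.01 0.64 139
50   ssd 1.76109  1.00000 1803G  426G 1377G 23.64 0.42  83
51   ssd 1.76109  1.00000 1803G  610G 1193G 33.84 0.60 131
52   ssd 1.76109  1.00000 1803G  558G 1244G 30.98 0.55 118
53   ssd 1.76109  1.00000 1803G  731G 1072G 40.54 0.72 161
54   ssd 1.74599  1.00000 1787G  859G  928G 48.06 0.86 229
72   ssd 1.74599  1.00000 1787G  812G  974G 45.47 0.81 216
"""


def low_var_osds(df_text, threshold=0.75):
    """Return ids of OSDs whose VAR is below `threshold`.

    The threshold is an arbitrary illustration value, chosen only
    because it separates the visibly under-filled rows in this table.
    """
    lines = df_text.strip().splitlines()
    header = lines[0].split()
    id_col = header.index("ID")
    var_col = header.index("VAR")
    flagged = []
    for line in lines[1:]:
        fields = line.split()
        if len(fields) != len(header):
            continue  # skip summary or malformed rows
        if float(fields[var_col]) < threshold:
            flagged.append(int(fields[id_col]))
    return flagged


if __name__ == "__main__":
    print(low_var_osds(SAMPLE))
```

With the full table, the same threshold singles out OSDs 49 through 53, which all sit on the 1.76109-weight host and hold far fewer PGs than their neighbours.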
>>>>
>>>> _______________________________________________
>>>> ceph-users mailing list
>>>> ceph-users@lists.ceph.com
>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com