Probably unrelated, but I do keep seeing this odd negative objects degraded message on the fs-metadata pool:
> pool fs-metadata-ssd id 16
>   -34/3 objects degraded (-1133.333%)
>   recovery io 0 B/s, 89 keys/s, 2 objects/s
>   client io 51289 B/s rd, 101 kB/s wr, 0 op/s rd, 0 op/s wr

Don't mean to clutter the ML/thread, but it did seem odd; maybe it's a culprit? Or maybe it's some weird sampling-interval issue that's already been fixed in 12.2.3?

Thanks,
Reed

> On Feb 23, 2018, at 8:26 AM, Reed Dier <[email protected]> wrote:
>
> Below is ceph -s
>
>> cluster:
>>   id:     {id}
>>   health: HEALTH_WARN
>>           noout flag(s) set
>>           260610/1068004947 objects misplaced (0.024%)
>>           Degraded data redundancy: 23157232/1068004947 objects degraded (2.168%), 332 pgs unclean, 328 pgs degraded, 328 pgs undersized
>>
>> services:
>>   mon: 3 daemons, quorum mon02,mon01,mon03
>>   mgr: mon03(active), standbys: mon02
>>   mds: cephfs-1/1/1 up {0=mon03=up:active}, 1 up:standby
>>   osd: 74 osds: 74 up, 74 in; 332 remapped pgs
>>        flags noout
>>
>> data:
>>   pools:   5 pools, 5316 pgs
>>   objects: 339M objects, 46627 GB
>>   usage:   154 TB used, 108 TB / 262 TB avail
>>   pgs:     23157232/1068004947 objects degraded (2.168%)
>>            260610/1068004947 objects misplaced (0.024%)
>>            4984 active+clean
>>            183  active+undersized+degraded+remapped+backfilling
>>            145  active+undersized+degraded+remapped+backfill_wait
>>            3    active+remapped+backfill_wait
>>            1    active+remapped+backfilling
>>
>> io:
>>   client:   8428 kB/s rd, 47905 B/s wr, 130 op/s rd, 0 op/s wr
>>   recovery: 37057 kB/s, 50 keys/s, 217 objects/s
>
> Also, the two pools on the SSDs are the objects pool at 4096 PGs and the fs-metadata pool at 32 PGs.
>
>> Are you sure the recovery is actually going slower, or are the individual ops larger or more expensive?
>
> The objects should not vary wildly in size.
> Even if they did differ in size, the SSDs are roughly idle in their current state of backfilling when looking at I/O wait in iotop, atop, or sysstat/iostat.
>
> This compares to when I was fully saturating the SATA backplane with over 1000 MB/s of writes to multiple disks while the backfills were going "full speed."
>
> Here is a breakdown of recovery io by pool:
>
>> pool objects-ssd id 20
>>   recovery io 6779 kB/s, 92 objects/s
>>   client io 3071 kB/s rd, 50 op/s rd, 0 op/s wr
>>
>> pool fs-metadata-ssd id 16
>>   recovery io 0 B/s, 28 keys/s, 2 objects/s
>>   client io 109 kB/s rd, 67455 B/s wr, 1 op/s rd, 0 op/s wr
>>
>> pool cephfs-hdd id 17
>>   recovery io 40542 kB/s, 158 objects/s
>>   client io 10056 kB/s rd, 142 op/s rd, 0 op/s wr
>
> So the 24 HDDs are outperforming the 50 SSDs for recovery and client traffic at the moment, which seems suspicious to me.
>
> Most of the OSDs with recovery ops to the SSDs are reporting 8-12 ops, with one OSD occasionally spiking up to 300-500 for a few minutes. Stats are being pulled both by local collectd instances on each node and by the influx plugin in ceph-mgr, as we evaluate the latter against collectd.
>
> Thanks,
>
> Reed
>
>
>> On Feb 22, 2018, at 6:21 PM, Gregory Farnum <[email protected]> wrote:
>>
>> What's the output of "ceph -s" while this is happening?
>>
>> Is there some identifiable difference between these two states, like you get a lot of throughput on the data pools but then metadata recovery is slower?
>>
>> Are you sure the recovery is actually going slower, or are the individual ops larger or more expensive?
>>
>> My WAG is that recovering the metadata pool, composed mostly of directories stored in omap objects, is going much slower for some reason.
>> You can adjust the cost of those individual ops somewhat by changing osd_recovery_max_omap_entries_per_chunk (default: 8096), but I'm not sure which way you want to go, or indeed whether this has anything to do with the problem you're seeing. (E.g., it could be that reading out the omaps is expensive, so you can get higher recovery op numbers by turning down the number of entries per request, but not actually see faster backfilling because you have to issue more requests.)
>> -Greg
>>
>> On Wed, Feb 21, 2018 at 2:57 PM Reed Dier <[email protected]> wrote:
>> Hi all,
>>
>> I am running into an odd situation that I cannot easily explain.
>> I am currently in the midst of destroying and rebuilding OSDs to migrate from filestore to bluestore.
>> With my HDDs I am seeing expected behavior, but with my SSDs I am seeing unexpected behavior. The HDDs and SSDs are split accordingly in crush.
>>
>> My process for replacing the OSDs is to set the noout, norecover, and norebalance flags, destroy an OSD, create it back (iterating n times, all within a single failure domain), unset the flags, and let it go until it finishes; rinse, repeat.
>>
>> The SSD OSDs are SATA SSDs (Samsung SM863a), 10 to a node, with 2 NVMe drives (Intel P3700) per node, 5 SATA SSDs to 1 NVMe drive, and 16G partitions for block.db (previously filestore journals).
>> 2x10GbE networking between the nodes. The SATA backplane caps out at around 10 Gb/s, as it is 2x 6 Gb/s controllers. Luminous 12.2.2.
>>
>> When the flags are unset, recovery starts and I see a very large rush of traffic; however, after the first machine completed, the performance tapered off rapidly and now trickles. Comparatively, I'm getting 100-200 recovery ops on 3 HDDs backfilling from 21 other HDDs, whereas I'm getting 150-250 recovery ops on 5 SSDs backfilling from 40 other SSDs. Every once in a while I will see a spike up to 500, 1000, or even 2000 ops on the SSDs, often a few hundred recovery ops from one OSD and 8-15 ops from the others that are backfilling.
>>
>> This is a far cry from the 15-30k recovery ops it started off recovering at, with 1-3k recovery ops from a single OSD to the backfilling OSD(s). And an even farther cry from the >15k recovery ops I was sustaining for an hour or more before. I was able to rebuild a 1.9T SSD (1.1T used) in a little under an hour, and I could do about 5 at a time and still keep it at roughly an hour to backfill all of them, but then I hit a roadblock after the first machine, when I tried to do 10 at a time (a whole machine). I am now experiencing the same thing on the third node while doing 5 OSDs at a time.
>>
>> The pools associated with these SSDs are cephfs-metadata, as well as a pure rados object pool we use for our own internal applications. Both are size=3, min_size=2.
>>
>> It appears I am not the first to run into this, but it looks like there was no resolution: https://www.spinics.net/lists/ceph-users/msg41493.html
>>
>> Recovery parameters for the OSDs match what was in the previous thread, minus the osd conf block listed there. Currently osd_max_backfills = 30 and osd_recovery_max_active = 35. There is very little other activity on the OSDs during this period, so there should not be any contention for iops on the SSDs.
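
For reference, here is a minimal sketch of how these throttles can be checked and adjusted at runtime on a Luminous cluster, covering both the osd_recovery_max_omap_entries_per_chunk knob Greg mentions and the osd_max_backfills / osd_recovery_max_active values above. osd.24 and the value 1024 are purely illustrative examples, not recommendations:

# read the current value from one OSD's admin socket (run on that OSD's host)
$ ceph daemon osd.24 config get osd_recovery_max_omap_entries_per_chunk

# push a new value to all running OSDs without restarting them
$ ceph tell 'osd.*' injectargs '--osd_recovery_max_omap_entries_per_chunk 1024'

# the backfill/recovery concurrency knobs work the same way
$ ceph tell 'osd.*' injectargs '--osd_max_backfills 30 --osd_recovery_max_active 35'

Injected values do not survive an OSD restart, so anything that turns out to help would also want to land in ceph.conf.
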
>>
>> The only oddity I can point to is that we had a few periods where disk load on one of the mons was high enough to knock that mon out of quorum briefly, a few times. But I wouldn't think backfills would get throttled just because a mon was flapping.
>>
>> Hopefully someone has some experience, or can steer me down a path to improve the performance of the backfills, so that I'm not stuck in backfill purgatory longer than I need to be.
>>
>> Linking an imgur album with some screen grabs of the recovery ops over time for the first machine versus the second and third machines, to demonstrate the delta between them:
>> https://imgur.com/a/OJw4b
>>
>> Also including a ceph osd df of the SSDs; highlighted in red are the OSDs currently backfilling. Could this possibly be PG overdose (see the note after the osd df output below)? I don't ever run into 'stuck activating' PGs; it's just the painfully slow backfills, as if they are being throttled by ceph, that are causing me to worry. The drives aren't worn (<30 P/E cycles), so plenty of life left in them.
>>
>> Thanks,
>> Reed
>>
>>> $ ceph osd df
>>> ID CLASS WEIGHT REWEIGHT SIZE USE AVAIL %USE VAR PGS
>>> 24 ssd 1.76109 1.00000 1803G 1094G 708G 60.69 1.08 260
>>> 25 ssd 1.76109 1.00000 1803G 1136G 667G 63.01 1.12 271
>>> 26 ssd 1.76109 1.00000 1803G 1018G 785G 56.46 1.01 243
>>> 27 ssd 1.76109 1.00000 1803G 1065G 737G 59.10 1.05 253
>>> 28 ssd 1.76109 1.00000 1803G 1026G 776G 56.94 1.02 245
>>> 29 ssd 1.76109 1.00000 1803G 1132G 671G 62.79 1.12 270
>>> 30 ssd 1.76109 1.00000 1803G 944G 859G 52.35 0.93 224
>>> 31 ssd 1.76109 1.00000 1803G 1061G 742G 58.85 1.05 252
>>> 32 ssd 1.76109 1.00000 1803G 1003G 799G 55.67 0.99 239
>>> 33 ssd 1.76109 1.00000 1803G 1049G 753G 58.20 1.04 250
>>> 34 ssd 1.76109 1.00000 1803G 1086G 717G 60.23 1.07 257
>>> 35 ssd 1.76109 1.00000 1803G 978G 824G 54.26 0.97 232
>>> 36 ssd 1.76109 1.00000 1803G 1057G 745G 58.64 1.05 252
>>> 37 ssd 1.76109 1.00000 1803G 1025G 777G 56.88 1.01 244
>>> 38 ssd 1.76109 1.00000 1803G 1047G 756G 58.06 1.04 250
>>> 39 ssd 1.76109 1.00000 1803G 1031G 771G 57.20 1.02 246
>>> 40 ssd 1.76109 1.00000 1803G 1029G 774G 57.07 1.02 245
>>> 41 ssd 1.76109 1.00000 1803G 1033G 770G 57.28 1.02 245
>>> 42 ssd 1.76109 1.00000 1803G 993G 809G 55.10 0.98 236
>>> 43 ssd 1.76109 1.00000 1803G 1072G 731G 59.45 1.06 256
>>> 44 ssd 1.76109 1.00000 1803G 1039G 763G 57.64 1.03 248
>>> 45 ssd 1.76109 1.00000 1803G 992G 810G 55.06 0.98 236
>>> 46 ssd 1.76109 1.00000 1803G 1068G 735G 59.23 1.06 254
>>> 47 ssd 1.76109 1.00000 1803G 1020G 783G 56.57 1.01 242
>>> 48 ssd 1.76109 1.00000 1803G 945G 857G 52.44 0.94 225
>>> 49 ssd 1.76109 1.00000 1803G 649G 1154G 36.01 0.64 139
>>> 50 ssd 1.76109 1.00000 1803G 426G 1377G 23.64 0.42 83
>>> 51 ssd 1.76109 1.00000 1803G 610G 1193G 33.84 0.60 131
>>> 52 ssd 1.76109 1.00000 1803G 558G 1244G 30.98 0.55 118
>>> 53 ssd 1.76109 1.00000 1803G 731G 1072G 40.54 0.72 161
>>> 54 ssd 1.74599 1.00000 1787G 859G 928G 48.06 0.86 229
>>> 55 ssd 1.74599 1.00000 1787G 942G 844G 52.74 0.94 252
>>> 56 ssd 1.74599 1.00000 1787G 928G 859G 51.94 0.93 246
>>> 57 ssd 1.74599 1.00000 1787G 1039G 748G 58.15 1.04 277
>>> 58 ssd 1.74599 1.00000 1787G 963G 824G 53.87 0.96 255
>>> 59 ssd 1.74599 1.00000 1787G 909G 877G 50.89 0.91 241
>>> 60 ssd 1.74599 1.00000 1787G 1039G 748G 58.15 1.04 277
>>> 61 ssd 1.74599 1.00000 1787G 892G 895G 49.91 0.89 238
>>> 62 ssd 1.74599 1.00000 1787G 927G 859G 51.90 0.93 245
>>> 63 ssd 1.74599 1.00000 1787G 864G 922G 48.39 0.86 229
>>> 64 ssd 1.74599 1.00000 1787G 968G 819G 54.16 0.97 257
>>> 65 ssd 1.74599 1.00000 1787G 892G 894G 49.93 0.89 237
>>> 66 ssd 1.74599 1.00000 1787G 951G 836G 53.23 0.95 252
>>> 67 ssd 1.74599 1.00000 1787G 878G 908G 49.16 0.88 232
>>> 68 ssd 1.74599 1.00000 1787G 899G 888G 50.29 0.90 238
>>> 69 ssd 1.74599 1.00000 1787G 948G 839G 53.04 0.95 252
>>> 70 ssd 1.74599 1.00000 1787G 914G 873G 51.15 0.91 246
>>> 71 ssd 1.74599 1.00000 1787G 1004G 782G 56.21 1.00 266
>>> 72 ssd 1.74599 1.00000 1787G 812G 974G 45.47 0.81 216
>>> 73 ssd 1.74599 1.00000 1787G 932G 855G 52.15 0.93 247
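
Regarding the PG overdose question above: Luminous added overdose protection governed by mon_max_pg_per_osd (default 200, if memory serves) and osd_max_pg_per_osd_hard_ratio. A rough sanity check, assuming default admin socket paths and using mon01 purely as an example host, would be something like:

# ask a mon for the configured limits (run on that mon's host)
$ ceph daemon mon.mon01 config get mon_max_pg_per_osd
$ ceph daemon mon.mon01 config get osd_max_pg_per_osd_hard_ratio

# compare against the per-OSD PG counts (the PGS column in the osd df above)
$ ceph osd df | awk '{print $1, $NF}'

That said, hitting the overdose limit normally shows up as PGs stuck in activating rather than as slow-but-progressing backfill, so the absence of stuck-activating PGs suggests it is not the limiter here; it is just cheap to rule out.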
