Hi,

a quick addition as I kept poking around, mulling over the CPU theory. I see that OSD 49 (Micron, Host Type 2) does max out close to the theoretical bandwidth limit, with ~450 MiB/s written at ~4k IOPS. It happily uses up to 200% CPU time and shows a somewhat lower “await” in iostat than in the load profile from my previous post about OSD 30. This is interesting because OSD 30 is also on a Host Type 2 (featuring a different controller than Type 1).
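(For reference, something like the following is enough to watch this - “sdq” is just a placeholder for the disk behind the OSD; the interesting iostat columns are await, avgqu-sz and %util:)

  # extended per-device stats, MB units, timestamped, every 5 seconds
  iostat -xmt 5 sdq

  # per-process CPU of the OSD daemons on that host, every 5 seconds
  pidstat -u -C ceph-osd 5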
Yes, specific traffic patterns can have massively differing performance profiles, so, probably to quite a degree: apples and oranges. However, this suggests it’s more a question of which traffic pattern is hitting the OSDs than of the hardware not delivering what I expect from it. To me this means I still don’t have an indicator of a specific bottleneck.

Cheers and good night, :)
Christian

> On 6 Dec 2016, at 03:37, Christian Theune <c...@flyingcircus.io> wrote:
>
> Hi Christian (heh),
>
> thanks for picking this up. :)
>
> This has become a rather long post as I added more details and gave our history, but if we make progress then maybe this can help others in the future. I find slow requests extremely hard to debug and, as I said: aside from scratching my own itch, I’d be happy to help future travellers.
>
>> On 6 Dec 2016, at 00:59, Christian Balzer <ch...@gol.com> wrote:
>>
>> Hello,
>>
>> On Mon, 5 Dec 2016 15:25:37 +0100 Christian Theune wrote:
>>
>>> Hi,
>>>
>>> we’re currently expanding our cluster to grow the number of IOPS we can provide to clients. We’re still on Hammer but in the process of upgrading to Jewel.
>>
>> You might want to wait until the next Jewel release, given the current number of issues.
>
> Our issues or Jewel’s? (This is only half a joke; a friend’s Ceph cluster’s issues on Jewel are making me quite nervous and I’m already re-evaluating postponing.)
>
>>> We started adding pure-SSD OSDs in the last days (based on MICRON S610DC-3840) and the slow requests we’ve seen in the past have started to show a different pattern.
>>>
>> I looked in the archives and can’t find a full description of your cluster hardware in any posts from you or the other Christian (hint, hint).
>> Slow requests can nearly all the time be traced back to HW issues/limitations.
>
> We’re currently running on 42% Christians. ;)
>
> We currently have hosts of 2 generations in our cluster. They’re both SuperMicro, sold by Thomas Krenn.
>
> Type 1 (4 Hosts)
>
> SuperMicro X9DR3-F
> 64 GiB RAM
> 2 Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10GHz
> LSI MegaRAID SAS 9271-8i
> Dual 10-Gigabit X540-AT2
> OS on RAID 1 SATA HGST HUS724020ALS640 1.818 TB 7.2k
> Journal on 400G Intel MLC NVMe PCI-E 3.0 (DC P3600) (I thought those should be DC P3700, maybe my inventory is wrong)
>
> Pool “rbd.ssd”
>
> 2 OSDs on 800GB Intel DC S3510 SSDSC2BX80 (jbod/raid 0)
>
> Pool “rbd.hdd”
>
> 5-6 OSDs on SATA HGST HUS724020ALS640 1.818 TB 7.2k (jbod/raid 0)
> 1 OSD on SAS MICRON S610DC-3840 3.492 TB (jbod/raid 0)
>
> Type 2 (2 Hosts)
>
> SuperMicro X9SRL-F
> 32 GiB RAM
> 2 Intel(R) Xeon(R) CPU E5-2630L v2 @ 2.40GHz
> LSI MegaRAID SAS 9260-4i
> Dual 10-Gigabit X540-AT2
> OS on RAID 1 SATA (HGST HUS724020ALS640 1.818 TB)
> Journal on 400G Intel MLC NVMe PCI-E 3.0 (DC P3600) (I thought those should be DC P3700, maybe my inventory is wrong)
>
> Pool “rbd.ssd”
>
> 2 OSDs on 800GB Intel DC S3510 SSDSC2BX80 (jbod/raid 0)
>
> Pool “rbd.hdd”
>
> 7 OSDs on HITACHI HUS156060VLS600 558.406 GB 15k (jbod/raid 0)
> 1 OSD on SAS MICRON S610DC-3840 3.492 TB (jbod/raid 0)
>
> Network
>
> 10 GE Copper on Brocade VDX 6740-T Switches w/ 20 GBit interconnect
> Storage Access and Cluster network on 2 physically separated ports
> Some KVM servers are still running 1 GE interfaces for access, some 10 GE
>
> Software
>
> Ceph: 0.94.7
> Filesystems: OS on ext4, OSDs on xfs
> Kernel: 4.4.27 (one host of type 2 still waiting for a reboot to upgrade from 4.1.16 in the next days)
> Qemu: 2.7 with librbd
>
> Overall we’ve been happier with the second type of hosts (the 15k disks obviously don’t reach their limit in a mixed environment), but generally we’ve been unhappy with having RAID controllers in there at all. We’re going for pure HBA setups with newer machines.
>
> This cluster setup has been with us for a few years now and has seen various generations of configurations, ranging from an initial huge OSD on RAID 5, over a temporary 2-disk RAID 0, to using CacheCade (until it died ungracefully). Now it’s just single-disk jbod/raid 0 with journals on NVMe SSD. We added a pure SSD pool with inline journals a while ago which has not shown any slow requests for the last 14 days or so. We added 10 GE about 18 months ago and migrated that from a temporary solution to Brocade in September.
>
>> Have you tested these SSDs for their suitability as a Ceph journal (I assume inline journals)?
>> This has been discussed here countless times, google.
>
> We’re testing them right now. We stopped testing this in our development environment as any learning aside from “works generally”/“doesn’t work at all” has not been portable to our production load at all. All our experiments trying to reliably replicate production traffic in a reasonable way have failed.
>
> I’ve FIOed them and was able to see their spec’d write IOPS (18k random 4k IO at QD 32) and throughput. I wasn’t able to see 100k read IOPS but maxed out around 30k.
>
>> If those SSDs aren’t up to snuff, they may be worse than plain disks in some cases.
>>
>> In addition, these SSDs have an endurance of 1 DWPD, less than 0.5 when factoring in journal and FS overheads and write amplification scenarios.
>> I’d be worried about wearing these out long before 5 years are over.
>
> I reviewed the endurance of our existing hard drives. The hard drives had less than 0.05 DWPD since we added them.
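> (For reference, that DWPD figure is just simple arithmetic over the lifetime writes the drives report via smartctl - the SAS disks show a “Gigabytes processed” write counter in the error counter log, and most SSDs expose something like Total_LBAs_Written with vendor-specific units. The numbers below are made up purely for illustration:)
>
> # lifetime writes as reported by the drive
> smartctl -a /dev/sdq | grep -iE 'write|lbas'
>
> # DWPD = bytes written / days in service / drive capacity
> # e.g. 66 TB written over 730 days on a 1.8 TB drive:
> awk 'BEGIN { print 66e12 / 730 / 1.8e12 }'   # ~0.05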
> I wasn’t able to get data from the NVMe drives about how much data they have actually written, to be able to compute endurance. The Intel tool currently shows an endurance analyzer of 14.4 years. This might be a completely useless number due to aggressive over-provisioning of the SSD.
>
> However, I’ll be monitoring this more closely on the Microns and not worry about the full 5 years just yet.
>
>>> I’m currently seeing those:
>>>
>>> 2016-12-05 15:13:37.527469 osd.60 172.22.4.46:6818/19894 8080 : cluster [WRN] 5 slow requests, 1 included below; oldest blocked for > 31.675358 secs
>>> 2016-12-05 15:13:37.527478 osd.60 172.22.4.46:6818/19894 8081 : cluster [WRN] slow request 31.674886 seconds old, received at 2016-12-05 15:13:05.852525: osd_op(client.518589944.0:2734750 rbd_data.1e2b40f879e2a9e3.00000000000000a2 [stat,set-alloc-hint object_size 4194304 write_size 4194304,write 1892352~4096] 277.ceaf1c22 ack+ondisk+write+known_if_redirected e1107736) currently waiting for rw locks
>>>
>> Is osd.60 one of these SSDs?
>
> It is. However, I could show any other of the OSDs from the same pool as well - this isn’t restricted to the Micron SSDs but also shows up for the 1.8 TB SATA drives. I reviewed all the logs I got from the last two weeks, and out of all complaints those have always made up ~30% of the slow requests we see. Others include:
>
> currently waiting for subops from … (~35%)
> currently waiting for rw (~30%)
> currently no flag points reached (~20%)
> currently commit_sent (<5%)
> currently reached_pg (<3%)
> currently started (<1%)
> currently waiting for scrub (extremely rare)
>
> The statistics for waiting for sub operations include OSD 60 (~1886) more than any others, even though all of the others do show a significant amount (~100-400). My current guess is that OSD 60 temporarily got more than its fair share of PGs and a lot of the movement on the cluster just affected it much more. I added 3 of the Micron SSDs at the same time and the two others are well within the normal range. I don’t have the distribution over time, so I can’t prove that right now. The Microns do have a) a higher primary affinity and b) double the size of the normal SATA drives, so that should result in more requests ending up there.
>
> Looking at iostat I don’t see them creating large backlogs - at least I can’t spot them interactively.
>
> Looking at the specific traffic that Ceph with an inline journal puts on the device, during backfill it appears to be saturated from a bandwidth perspective. I see around 100 MB/s going in with around 1k write IOPS. During this time, iostat shows an average queue length of around 200 requests and an average wait of around 150 ms, which is quite high. Adding a 4k FIO [1] at the same time doesn’t change throughput or IOPS significantly. Stopping the OSD then shows IOPS going up to 12k r/w, but of course bandwidth is much lower at that point.
>
> Something we have looked out for again, but weren’t able to find anything on, is whether the controller might be limiting us in some way. Then again: the Intel SSDs are attached to the same controller and have not shown *any* slow requests at all. All the slow requests are confined to the rbd.hdd pool (which now includes the Microns as part of a migration strategy).
>
> I’m also wondering whether we may be CPU bound, but the stats do not indicate that. We’re running the “ondemand” governor and I’ve seen references to people preferring the “performance” governor.
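> (Checking and flipping the governor is cheap to try, so for reference - the sysfs path is the standard cpufreq one, and cpupower needs the matching kernel tools package installed:)
>
> # current governor on all cores
> cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor | sort | uniq -c
>
> # switch to "performance" (revert by setting "ondemand" again)
> cpupower frequency-set -g performance
> # or without cpupower, as root:
> for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do echo performance > $g; done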
> However, I would expect that we’d see significantly reduced idle times when this becomes an issue.
>
>> Tools like atop and iostat can give you a good insight on how your storage subsystem is doing.
>
> Right. We have collectd running, including the Ceph plugin, which helps with finding historical data. At the moment I’m not getting much out of the interactive tools, as any issue will have resolved itself by the time I’m logged in. ;)
>
>> The moment you see slow requests, your cluster has issues you need to address.
>> Especially if this happens when no scrubs are going on (restrict scrubs to off-peak hours).
>
> I’ve seen this happen when no scrubs are around. I currently can’t pinpoint performance issues to ongoing scrubs. We have limited scrubs to a low load threshold but run them regularly during the day.
>
>> The 30 seconds is an arbitrary value and the WRN would warrant an ERR in my book, as some applications take a very dim view of being blocked for more than a few seconds.
>
> Absolutely: we see adverse effects within VMs much earlier and have lowered the complaint time to 15 seconds because of that.
>
> Here’s the Ceph config we’re running with. Some of the values are temporarily tuned at runtime to support operations (like max backfills, etc.):
>
> [global]
> fsid = ...
> public network = 172.22.4.0/24
> cluster network = 172.22.8.0/24
> pid file = /run/ceph/$type-$id.pid
> # should correspond to the setting in /etc/init.d/ceph
> max open files = 262144
> mon host = <3 mons>
> mon osd nearfull ratio = .9
> osd pool default min size = 1
> osd pool default size = 2
> # recovery sleep becomes available in infernalis.
> # osd recovery sleep = 0.005
> osd snap trim sleep = 0.05
> osd scrub sleep = 0.05
>
> [client]
> log file = /var/log/ceph/client.log
> rbd cache = true
> rbd default format = 2
> admin socket = /run/ceph/rbd-$pid-$cctid.asok
>
> [mon]
> admin socket = /run/ceph/$cluster-$name.asok
> mon addr = 172.22.4.42:6789,172.22.4.43:6789,172.22.4.54:6789
> mon data = /srv/ceph/mon/$cluster-$id
> mon data avail crit = 2
> mon data avail warn = 5
> mon osd allow primary affinity = true
> mon pg warn max per osd = 3000
>
> [mon.cartman07]
> host = cartman07
> mon addr = 172.22.4.43:6789
> public addr = 172.22.4.43:6789
> cluster addr = 172.22.8.13:6789
>
> [osd]
> host = cartman07
> admin socket = /run/ceph/$cluster-$name.asok
> public addr = 172.22.4.43
> cluster addr = 172.22.8.13
> filestore fiemap = true
> filestore max sync interval = 5
> # slightly more than 1 week to avoid hitting the same day at the same time
> osd deep scrub interval = 700000
> # place this between "nearfull" and "full"
> osd backfill full ratio = 0.92
> # 2 * 5 seconds * 200 MB/s disk throughput = 2000 MB
> osd journal size = 2000
> osd max backfills = 1
> osd op complaint time = 15
> osd recovery max active = 1
> osd recovery op priority = 1
> osd scrub load threshold = 2
>
> Something I started playing around with today was looking up specific slow requests via the admin sockets. Here’s what happened a bit earlier. Something I see is that the request flaps between “reached pg” and “waiting for rw locks”. I’ll provide a set of 3 dumps that are correlated. I started with OSD 30 (Micron) and pulled the ones from OSD 33 and 22 as they are referenced from the “waiting for subops from”.
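> (For reference, the dumps below were pulled via the admin socket, roughly like this - the socket path follows the “admin socket” setting above, and the command has to run on the host carrying the respective OSD:)
>
> ceph --admin-daemon /run/ceph/ceph-osd.30.asok dump_historic_ops | gzip > historic-ops-osd30.txt.gz
>
> # the currently stuck ops are available the same way:
> ceph --admin-daemon /run/ceph/ceph-osd.30.asok dump_ops_in_flight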
>
> <historic-ops-osd30.txt.gz><historic-ops-osd22.txt.gz><historic-ops-osd33.txt.gz>
>
> I’m also looking forward to seeing how the cluster behaves tomorrow when it finishes its current backfill, after we finished adding the last 3 Microns tonight (so 6 in total now).
>
> Thanks for listening,
> Christian
>
> [1] fio --rw=readwrite --name=test --size=5000M --nrfiles=1 --numjobs=3 --loops=1000 --direct=1 --rwmixread=50 --bs=4k --runtime=20000 --iodepth=32
>
> --
> Christian Theune · c...@flyingcircus.io · +49 345 219401 0
> Flying Circus Internet Operations GmbH · http://flyingcircus.io
> Forsterstraße 29 · 06112 Halle (Saale) · Deutschland
> HR Stendal HRB 21169 · Geschäftsführer: Christian Theune, Christian Zagrodnick

--
Christian Theune · c...@flyingcircus.io · +49 345 219401 0
Flying Circus Internet Operations GmbH · http://flyingcircus.io
Forsterstraße 29 · 06112 Halle (Saale) · Deutschland
HR Stendal HRB 21169 · Geschäftsführer: Christian Theune, Christian Zagrodnick
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com