Hi,

A quick addition as I kept poking around, mulling over the CPU theory. I see 
that OSD 49 (Micron, Host Type 2) does max out close to the theoretical 
bandwidth limit, with ~450 MiB/s written at ~4k IOPS. It happily uses up to 
200% CPU time and has a somewhat lower “await” in iostat than in the load 
profile from my previous post about OSD 30. This is interesting because OSD 30 
is also on a Type 2 host (which features a different controller than Type 1). 
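
For completeness, here is roughly what I’m watching while this happens (a 
sketch; the OSD id, the process match and the device name are placeholders 
for our setup):

    # CPU time of the single OSD process (adjust the pattern to however
    # your init system starts the daemon)
    pidstat -u -p $(pgrep -f 'ceph-osd -i 49') 5

    # per-device latency and queue depth; sdX is the data disk behind that OSD
    iostat -xm 5 /dev/sdX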

Yes, specific traffic patterns can have massively differing performance 
profiles, so to quite a degree this is apples and oranges. However, it 
indicates that the issue depends more on which traffic pattern hits the disks 
than on the hardware not delivering what I expect from it. To me this means I 
still don’t have an indicator of a specific bottleneck.

Cheers and good night, :)
Christian 

> On 6 Dec 2016, at 03:37, Christian Theune <c...@flyingcircus.io> wrote:
> 
> Hi Christian (heh),
> 
> thanks for picking this up. :)
> 
> This has become a rather long post as I added more details and gave our 
> history, but if we make progress then maybe this can help others in the 
> future. I find slow requests extremely hard to debug and, as I said, aside 
> from scratching my own itch I’d be happy to help future travellers.
> 
>> On 6 Dec 2016, at 00:59, Christian Balzer <ch...@gol.com> wrote:
>> 
>> Hello,
>> 
>> On Mon, 5 Dec 2016 15:25:37 +0100 Christian Theune wrote:
>> 
>>> Hi,
>>> 
>>> we’re currently expanding our cluster to grow the number of IOPS we can 
>>> provide to clients. We’re still on Hammer but in the process of upgrading 
>>> to Jewel. 
>> 
>> You might want to wait until the next Jewel release, given the current
>> number of issues.
> 
> Our issues or Jewel’s? (This is only half a joke; a friend’s Ceph cluster’s 
> issues on Jewel are making me quite nervous and I’m already re-evaluating 
> whether to postpone.)
> 
>>> We started adding pure-SSD OSDs in the last few days (based on MICRON
>>> S610DC-3840) and the slow requests we’ve seen in the past have started to
>>> show a different pattern.
>>> 
>> I looked in the archives and can't find a full description of your cluster
>> hardware in any posts from you or the other Christian (hint, hint). 
>> Slow requests can nearly always be traced back to HW
>> issues/limitations.
> 
> We’re currently running on 42% Christians. ;)
> 
> We currently have hosts of 2 generations in our cluster. They’re both 
> SuperMicro, sold by Thomas Krenn.
> 
> Type 1 (4 Hosts)
> 
>    SuperMicro X9DR3-F
>    64 GiB RAM
>    2 Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10GHz
>    LSI MegaRAID SAS 9271-8i
>    Dual 10-Gigabit X540-AT2
>    OS on RAID 1 SATA HGST HUS724020ALS640 1.818 TB 7.2k
>    Journal on 400G Intel MLC NVMe PCI-E 3.0 (DC P3600) (I thought those 
> should be DC P3700, maybe my inventory is wrong)
> 
>    Pool “rbd.ssd”
> 
>        2 OSDs on 800GB Intel DC S3510 SSDSC2BX80 (jbod/raid 0)
> 
>    Pool “rbd.hdd”
> 
>        5-6 OSDs on SATA HGST HUS724020ALS640 1.818 TB 7.2k (jbod/raid 0)
>        1 OSD on SAS MICRON S610DC-3840 3.492 TB (jbod/raid 0)
> 
> Type 2 (2 Hosts)
> 
>    SuperMicro X9SRL-F
>    32 GiB RAM
>    2 Intel(R) Xeon(R) CPU E5-2630L v2 @ 2.40GHz
>    LSI MegaRAID SAS 9260-4i
>    Dual 10-Gigabit X540-AT2
>    OS on RAID 1 SATA (HGST HUS724020ALS640 1.818 TB)
>    Journal on 400G Intel MLC NVMe PCI-E 3.0 (DC P3600) (I thought those 
> should be DC P3700, maybe my inventory is wrong)
> 
>    Pool “rbd.ssd”
> 
>        2 OSDs on 800GB Intel DC S3510 SSDSC2BX80 (jbod/raid 0)
> 
>    Pool “rbd.hdd”
> 
>        7 OSDs on HITACHI HUS156060VLS600 558.406 GB 15k (jbod/raid 0)
>        1 OSD on SAS MICRON S610DC-3840 3.492 TB (jbod/raid 0)
> 
> Network
> 
>    10 GE Copper on Brocade VDX 6740-T Switches w/ 20 GBit interconnect
>    Storage Access and Cluster network on 2 physically separated ports
>    Some KVM servers are still running 1 GE interfaces for access, some 10 GE
> 
> Software
> 
>    Ceph: 0.94.7
>    Filesystems: OS on ext4, OSDs on xfs
>    Kernel: 4.4.27 (one host of type 2 still waiting for a reboot to upgrade 
> from 4.1.16 in the next few days)
>    Qemu: 2.7 with librbd
> 
> Overall we’ve been happier with the second type of hosts (the 15k disks 
> obviously don’t reach their limit in a mixed environment), but generally 
> we’ve been unhappy with having RAID controllers in there at all. We’re going 
> for pure HBA setups with newer machines.
> 
> This cluster setup has been with us for a few years now and has seen various 
> generations of configurations, ranging from an initial huge OSD on RAID 5, 
> via temporary 2-disk RAID 0 sets, to using CacheCade (until it died 
> ungracefully). Now it’s just single-disk jbod/raid 0 with journals on NVMe 
> SSD. We added a pure SSD pool with inline journals a while ago, which has not 
> shown any slow requests for the last 14 days or so. We added 10GE about 18 
> months ago and migrated that from a temporary solution to Brocade in 
> September.
> 
>> Have you tested these SSDs for their suitability as a Ceph journal (I
>> assume inline journals)? 
>> This has been discussed here countless times, google.
> 
> We’re testing them right now. We stopped testing this in our development 
> environment, as any insight beyond “works generally”/“doesn’t work at all” 
> has not been transferable to our production load at all. All our experiments 
> trying to reliably replicate production traffic in a reasonable way have 
> failed.
> 
> I’ve run fio against them and was able to see their spec’d write IOPS (18k 
> random 4k writes at QD 32) and throughput. I wasn’t able to see the spec’d 
> 100k read IOPS but maxed out around 30k.
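> 
> For reference, the standalone tests were along these lines (a sketch of the 
> kind of fio invocation I used, not the exact command; /dev/sdX is a 
> placeholder and this overwrites whatever is on the device):
> 
>    # 4k random writes at QD 32 (the S610DC is spec'd at ~18k)
>    fio --name=randwrite --filename=/dev/sdX --rw=randwrite --bs=4k \
>        --iodepth=32 --ioengine=libaio --direct=1 --runtime=60 --time_based
> 
>    # 4k random reads at QD 32 (spec'd at 100k, I saw ~30k)
>    fio --name=randread --filename=/dev/sdX --rw=randread --bs=4k \
>        --iodepth=32 --ioengine=libaio --direct=1 --runtime=60 --time_based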
> 
>> If those SSDs aren't up to snuff, they may be worse than plain disks in
>> some cases.
>> 
>> In addition, these SSDs have an endurance of 1 DWPD, less than 0.5 when
>> factoring journal and FS overheads and write amplification scenarios.
>> I'd be worried about wearing these out long before 5 years are over.
> 
> I reviewed the endurance of our existing hard drives. The hard drives had 
> less than 0.05 DWPD since we added them. I wasn’t able to get the NVMe 
> drives to report how much data they have actually written, which I’d need to 
> compute endurance. The Intel tool currently shows an endurance analyzer of 
> 14.4 years. This might be a completely useless number due to aggressive 
> over-provisioning of the SSD.
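> 
> For the record, my back-of-the-envelope check on the existing drives looks 
> roughly like this (a sketch; the SMART attribute names differ between SATA, 
> SAS and NVMe, so adjust accordingly):
> 
>    # SATA: attribute 241 (Total_LBAs_Written); SAS drives report
>    # "Gigabytes processed" in the error counter log instead
>    smartctl -a /dev/sdX | grep -i -e 'Total_LBAs_Written' -e 'processed'
> 
>    # rough endurance estimate:
>    #   DWPD ~= bytes_written / (days_in_service * drive_capacity_in_bytes)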
> 
> However, I’ll be monitoring this more closely on the Microns and not worry 
> about the full 5 years just yet. 
> 
>>> I’m currently seeing those: 
>>> 
>>> 2016-12-05 15:13:37.527469 osd.60 172.22.4.46:6818/19894 8080 : cluster 
>>> [WRN] 5 slow requests, 1 included below; oldest blocked for > 31.675358 secs
>>> 2016-12-05 15:13:37.527478 osd.60 172.22.4.46:6818/19894 8081 : cluster 
>>> [WRN] slow request 31.674886 seconds old, received at 2016-12-05 
>>> 15:13:05.852525: osd_op(client.518589944.0:2734750 
>>> rbd_data.1e2b40f879e2a9e3.00000000000000a2 [stat,set-alloc-hint object_size 
>>> 4194304 write_size 4194304,write 1892352~4096] 277.ceaf1c22 
>>> ack+ondisk+write+known_if_redirected e1107736) currently waiting for rw 
>>> locks
>>> 
>> Is osd.60 one of these SSDs?
> 
> It is. However, I can show the same for any other OSD from the same pool - 
> this isn’t restricted to the Micron SSDs but also shows up for the 1.8 TB 
> SATA drives. I reviewed all the logs I got from the last two weeks, and out 
> of all complaints the “rw locks” ones have always made up ~30% of the slow 
> requests we see. The full breakdown (see the snippet after the list for how 
> I counted):
> 
> currently waiting for subops from … (~35%)
> currently waiting for rw (~30%)
> currently no flag points reached (~20%)
> currently commit_sent (<5%)
> currently reached_pg (<3%)
> currently started (<1%)
> currently waiting for scrub (extremely rare)
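> 
> The percentages above come from simply grepping the archived cluster log; 
> roughly like this (a sketch, the log path is whatever your mons write to):
> 
>    grep -h 'slow request' /var/log/ceph/ceph.log* \
>        | grep -o 'currently.*' \
>        | sort | uniq -c | sort -rn
>    # (the "waiting for subops from ..." lines still need the OSD ids
>    #  stripped before counting, but this is the gist)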
> 
> The “waiting for subops” statistics name OSD 60 far more often (~1886 times) 
> than any other OSD, even though all of the others do show a significant 
> count (~100-400). My current guess is that OSD 60 temporarily got more than 
> its fair share of PGs and a lot of the data movement on the cluster just 
> affected it much more. I added 3 of the Micron SSDs at the same time and the 
> two others are well within the normal range. I don’t have the distribution 
> over time, so I can’t prove that right now. The Microns do have a) a higher 
> primary affinity and b) double the size of the normal SATA drives, so more 
> requests ending up there is expected.
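> 
> To check the “fair share of PGs” theory I’m looking at something like this 
> (a sketch; “ceph pg ls-by-primary” should be available on Hammer, but treat 
> that as an assumption):
> 
>    # PG count and utilisation per OSD
>    ceph osd df
> 
>    # how many PGs have osd.60 as their primary (minus the header line)
>    ceph pg ls-by-primary osd.60 | wc -l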
> 
> Looking at iostat I don’t see them building up large backlogs - at least I 
> can’t spot any interactively.
> 
> Looking at the specific traffic that Ceph with inline journals puts on the 
> device, it appears to be saturated from a bandwidth perspective during 
> backfill. I see around 100 MB/s going in at around 1k write IOPS. During 
> this time, iostat shows an average queue length of around 200 requests and 
> an average wait of around 150 ms, which is quite high. Adding a 4k fio run 
> [1] at the same time doesn’t change throughput or IOPS significantly. 
> Stopping the OSD then shows IOPS going up to 12k r/w, but of course 
> bandwidth is much lower at that point.
> 
> Something we have looked into again, without finding anything conclusive, is 
> whether the controller might be limiting us in some way. Then again: the 
> Intel SSDs are attached to the same controller and have not shown *any* slow 
> requests at all. All the slow requests are confined to the rbd.hdd pool 
> (which now includes the Microns as part of a migration strategy).
> 
> I’m also wondering whether we may be CPU bound, but the stats do not 
> indicate that. We’re running the “ondemand” governor and I’ve seen 
> references to people preferring the “performance” governor. However, I would 
> expect us to see significantly reduced idle times if this were the issue.
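> 
> In case someone wants to compare: checking and switching the governor is a 
> one-liner each (a sketch; cpupower comes with the kernel tools, cpufrequtils 
> works the same way):
> 
>    # what each core is currently running
>    cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor | sort | uniq -c
> 
>    # switch everything to "performance"
>    cpupower frequency-set -g performance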
> 
>> Tools like atop and iostat can give you a good insight on how your storage
>> subsystem is doing.
> 
> Right. We have collectd running, including the Ceph plugin, which helps with 
> finding historical data. At the moment I’m not getting much out of the 
> interactive tools, as any issue will have resolved itself by the time I’m 
> logged in. ;)
> 
>> The moment you see slow requests, your cluster has issues you need to
>> address.
>> Especially if this happens when no scrubs are going on (restrict scrubs
>> to off-peak hours).
> 
> I’ve seen this happen when no scrubs are running. I currently cannot 
> pinpoint performance issues to ongoing scrubs. We have limited scrubs to a 
> low load threshold but run them regularly during the day.
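> 
> To take scrubs out of the picture completely for a while, something like 
> this should do (a sketch):
> 
>    # count PGs currently scrubbing
>    ceph pg dump 2>/dev/null | grep -c scrubbing
> 
>    # temporarily disable scrubbing, re-enable with "unset"
>    ceph osd set noscrub
>    ceph osd set nodeep-scrub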
> 
>> The 30 seconds is an arbitrary value and the WRN would warrant an ERR in
>> my book, as some applications take a very dim view on being blocked for
>> more than few seconds.
> 
> Absolutely: we see adverse effects within VMs much earlier and have lowered 
> the complaint time to 15 seconds due to that. 
> 
> Here’s the Ceph config we’re running with. Some of the values are temporarily 
> tuned at runtime to support operations (like max backfills, etc.):
> 
> [global]
> fsid = ...
> public network = 172.22.4.0/24
> cluster network = 172.22.8.0/24
> pid file = /run/ceph/$type-$id.pid
> # should correspond to the setting in /etc/init.d/ceph
> max open files = 262144
> mon host = <3 mons>
> mon osd nearfull ratio = .9
> osd pool default min size = 1
> osd pool default size = 2
> # recovery sleep becomes available in infernalis.
> # osd recovery sleep = 0.005
> osd snap trim sleep = 0.05
> osd scrub sleep = 0.05
> 
> [client]
> log file = /var/log/ceph/client.log
> rbd cache = true
> rbd default format = 2
> admin socket = /run/ceph/rbd-$pid-$cctid.asok
> 
> [mon]
> admin socket = /run/ceph/$cluster-$name.asok
> mon addr = 172.22.4.42:6789,172.22.4.43:6789,172.22.4.54:6789
> mon data = /srv/ceph/mon/$cluster-$id
> mon data avail crit = 2
> mon data avail warn = 5
> mon osd allow primary affinity = true
> mon pg warn max per osd = 3000
> 
> [mon.cartman07]
> host = cartman07
> mon addr = 172.22.4.43:6789
> public addr = 172.22.4.43:6789
> cluster addr = 172.22.8.13:6789
> 
> [osd]
> host = cartman07
> admin socket = /run/ceph/$cluster-$name.asok
> public addr = 172.22.4.43
> cluster addr = 172.22.8.13
> filestore fiemap = true
> filestore max sync interval = 5
> # slightly more than 1 week to avoid hitting the same day the same time
> osd deep scrub interval = 700000
> # place this between "nearfull" and "full"
> osd backfill full ratio = 0.92
> # 2* 5 seconds * 200MB/s disk throughput = 2000MB
> osd journal size = 2000
> osd max backfills = 1
> osd op complaint time = 15
> osd recovery max active = 1
> osd recovery op priority = 1
> osd scrub load threshold = 2
> 
> Something I started playing around with today was looking up specific slow 
> requests from the admin sockets. Here’s what happened a bit earlier. 
> Something I see is that the request flaps between “reached pg” and “waiting 
> for rw locks”. I’ll provide a set of 3 correlated dumps. I started with OSD 
> 30 (Micron) and pulled the ones from OSD 33 and 22 as they are referenced by 
> the “waiting for subops from” entries.
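> 
> For reference, the dumps below were pulled via the admin socket (path as per 
> the [osd] section above; adjust if your $cluster/$name expansion differs):
> 
>    # in-flight and recently completed ops on a given OSD
>    ceph --admin-daemon /run/ceph/ceph-osd.30.asok dump_ops_in_flight
>    ceph --admin-daemon /run/ceph/ceph-osd.30.asok dump_historic_ops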
> 
> <historic-ops-osd30.txt.gz><historic-ops-osd22.txt.gz><historic-ops-osd33.txt.gz>
> 
> I’m also looking forward to seeing how the cluster behaves tomorrow when it 
> finishes its current backfill, after we added the last 3 Microns tonight (so 
> 6 in total now).
> 
> Thanks for listening, 
> Christian
> 
> [1] fio --rw=readwrite --name=test --size=5000M --nrfiles=1 --numjobs=3 
> --loops=1000 --direct=1  --rwmixread=50 --bs=4k --runtime=20000 --iodepth=32
> 
> -- 
> Christian Theune · c...@flyingcircus.io · +49 345 219401 0
> Flying Circus Internet Operations GmbH · http://flyingcircus.io
> Forsterstraße 29 · 06112 Halle (Saale) · Deutschland
> HR Stendal HRB 21169 · Geschäftsführer: Christian Theune, Christian 
> Zagrodnick
> 
> _______________________________________________
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

-- 
Christian Theune · c...@flyingcircus.io · +49 345 219401 0
Flying Circus Internet Operations GmbH · http://flyingcircus.io
Forsterstraße 29 · 06112 Halle (Saale) · Deutschland
HR Stendal HRB 21169 · Geschäftsführer: Christian Theune, Christian Zagrodnick

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
