Hi John,

I am trying to squeeze extra performance from my test cluster too: Dell R620 with PERC H710, RAID0, 10Gb network.
Would you be willing to share your controller and kernel configuration?

For example, I am using the BIOS profile "Performance" with the following added to /etc/default/kernel:

  intel_pstate=disable intel_idle.max_cstate=0 processor.max_cstate=0 idle=poll

and the tuned profile throughput-performance.

All disks are configured with nr_requests=1024 and read_ahead_kb=4096. The SSD uses scheduler=noop while the HDDs use deadline.

Cache policy for the SSD:

  megacli -LDSetProp -WT -Immediate -L0 -a0
  megacli -LDSetProp -NORA -Immediate -L0 -a0
  megacli -LDSetProp -Direct -Immediate -L0 -a0

The HDD cache policy has all caches enabled (WB and ADRA); the commands I use for the HDDs are at the bottom of this mail, below your quoted message.

Many thanks,
Steven

On 16 February 2018 at 19:06, John Petrini <jpetr...@coredial.com> wrote:
> I thought I'd follow up on this just in case anyone else experiences
> similar issues. We ended up increasing the tcmalloc thread cache size and
> saw a huge improvement in latency. This got us out of the woods because we
> were finally in a state where performance was good enough that it was no
> longer impacting services.
>
> The tcmalloc issues are pretty well documented on this mailing list and I
> don't believe they impact newer versions of Ceph, but I thought I'd at
> least give a data point. After making this change our average apply latency
> dropped to 3.46ms during peak business hours. To give you an idea of how
> significant that is, here's a graph of the apply latency prior to the
> change: https://imgur.com/KYUETvD
>
> This however did not resolve all of our issues. We were still seeing high
> iowait (repeated spikes up to 400ms) on three of our OSD nodes on all
> disks. We tried replacing the RAID controller (PERC H730) on these nodes
> and while this resolved the issue on one server the two others remained
> problematic. These two nodes were configured differently than the rest.
> They'd been configured in non-raid mode while the others were configured as
> individual raid-0. This turned out to be the problem. We ended up removing
> the two nodes one at a time and rebuilding them with their disks configured
> in independent raid-0 instead of non-raid. After this change iowait rarely
> spikes above 15ms and averages <1ms.
>
> I was really surprised at the performance impact when using non-raid mode.
> While I realize non-raid bypasses the controller cache, I still would never
> have expected such high latency. Dell has a whitepaper that recommends
> using individual raid-0, but their own tests show only a small performance
> advantage over non-raid. Note that we are running SAS disks; they actually
> recommend non-raid mode for SATA, but I have not tested this. You can view
> the whitepaper here:
> http://en.community.dell.com/techcenter/cloud/m/dell_cloud_resources/20442913/download
>
> I hope this helps someone.
>
> John Petrini
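P.S. To make the disk settings above concrete, here is roughly what I apply on each node. The device names (sdb for the SSD, sdc for an HDD) and the virtual disk number in the HDD commands are only placeholders from my own layout, and the HDD lines are my best recollection of the MegaCli flags for write-back, adaptive read-ahead and cached IO, so please double-check them against your controller before relying on them:

  # Block layer settings, applied to every OSD disk
  echo 1024     > /sys/block/sdb/queue/nr_requests
  echo 4096     > /sys/block/sdb/queue/read_ahead_kb
  echo noop     > /sys/block/sdb/queue/scheduler      # SSD
  echo 1024     > /sys/block/sdc/queue/nr_requests
  echo 4096     > /sys/block/sdc/queue/read_ahead_kb
  echo deadline > /sys/block/sdc/queue/scheduler      # HDD

  # HDD virtual disk (VD 1 here): write-back, adaptive read-ahead, cached IO,
  # plus the drives' own cache ("all caches enabled")
  megacli -LDSetProp -WB     -Immediate -L1 -a0
  megacli -LDSetProp -ADRA   -Immediate -L1 -a0
  megacli -LDSetProp -Cached -Immediate -L1 -a0
  megacli -LDSetProp -EnDskCache -L1 -a0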
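P.P.S. Regarding the tcmalloc thread cache change you mention above: I assume that was done by raising TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES in the Ceph environment file and restarting the OSDs? Something like the sketch below is what I would try here; the 128 MB value is only an illustration, not the figure you used, and the file path depends on the distribution:

  # /etc/sysconfig/ceph (RHEL/CentOS) or /etc/default/ceph (Debian/Ubuntu)
  TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=134217728   # 128 MB, example value

  # Restart the OSDs so the new thread cache size takes effect
  # (adjust for your init system if not using systemd)
  systemctl restart ceph-osd.target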
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com