ST240FN0021, connected via a SAS2x36 expander to an LSI 9207-8i. By "fixed" - do you mean you replaced the SSDs?

Thanks,
Dinu
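The 4k sync-write check Dinu describes below can be reproduced with a single fio run along these lines (the device name is a placeholder, and writing to a raw device destroys its contents, so only point it at a drive you can re-provision):

# fio --name=journal-test --filename=/dev/sdX --direct=1 --sync=1 --rw=write --bs=4k --numjobs=1 --iodepth=1 --runtime=60 --time_based

A journal-class SSD should report well above the ~500-600 IOPS seen here; the ~8k IOPS Dinu measured on an Intel 530 is closer to what a journal device needs.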
On Nov 6, 2013, at 10:25 PM, Mike Dawson <mike.daw...@cloudapt.com> wrote:

> We just fixed a performance issue on our cluster related to spikes of high latency on some of our SSDs used for osd journals. In our case, the slow SSDs showed spikes of 100x higher latency than expected.
>
> What SSDs were you using that were so slow?
>
> Cheers,
> Mike
>
> On 11/6/2013 12:39 PM, Dinu Vlad wrote:
>> I'm using the latest 3.8.0 branch from raring. Is there a more recent/better kernel recommended?
>>
>> Meanwhile, I think I might have identified the culprit - my SSD drives are extremely slow on sync writes, doing 500-600 IOPS max with a 4k blocksize. By comparison, an Intel 530 in another server (also installed behind a SAS expander) is doing the same test at ~8k IOPS. I guess I should replace them.
>>
>> Removing the SSD drives from the setup and re-testing with ceph => 595 MB/s throughput under the same conditions (only mechanical drives, journal on a separate partition on each one, 8 rados bench processes, 16 threads each).
>>
>> On Nov 5, 2013, at 4:38 PM, Mark Nelson <mark.nel...@inktank.com> wrote:
>>
>>> Ok, some more thoughts:
>>>
>>> 1) What kernel are you using?
>>>
>>> 2) Mixing SATA and SAS on an expander backplane can sometimes have bad effects. We don't really know how bad this is and in what circumstances, but the Nexenta folks have seen problems with ZFS on Solaris and it's not impossible Linux may suffer too:
>>>
>>> http://gdamore.blogspot.com/2010/08/why-sas-sata-is-not-such-great-idea.html
>>>
>>> 3) If you are doing tests and look at disk throughput with something like "collectl -sD -oT", do the writes look balanced across the spinning disks? Do any devices have really high service times or queue times?
>>>
>>> 4) Also, after the test is done, you can try:
>>>
>>> find /var/run/ceph/*.asok -maxdepth 1 -exec sudo ceph --admin-daemon {} dump_historic_ops \; > foo
>>>
>>> and then grep for "duration" in foo. You'll get a list of the slowest operations over the last 10 minutes from every osd on the node. Once you identify a slow duration, you can go back and in an editor search for that duration and look at where in the OSD it hung up. That might tell us more about slow/latent operations.
>>>
>>> 5) Something interesting here is that I've heard from another party that in a 36-drive Supermicro SC847E16 chassis they had 30 7.2K RPM disks and 6 SSDs on a SAS9207-8i controller and were pushing significantly faster throughput than you are seeing (even given the greater number of drives). So it's very interesting to me that you are pushing so much less. The 36-drive Supermicro chassis I have, with no expanders and 30 drives plus 6 SSDs, can push about 2100MB/s with a bunch of 9207-8i controllers and XFS (no replication).
>>>
>>> Mark
>>>
>>> On 11/05/2013 05:15 AM, Dinu Vlad wrote:
>>>> Ok, so after tweaking the deadline scheduler and the filestore_wbthrottle* ceph settings I was able to get 440 MB/s from 8 rados bench instances, over a single osd node (pool pg_num = 1800, size = 1).
>>>>
>>>> This still looks awfully slow to me - fio throughput across all disks reaches 2.8 GB/s!!
>>>>
>>>> I'd appreciate any suggestion on where to look for the issue. Thanks!
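The deadline scheduler and filestore_wbthrottle* changes mentioned above would be applied roughly as follows; Dinu's actual values are not quoted in the thread, so the numbers here are placeholders only (option names as shipped in dumpling/0.67):

# echo deadline > /sys/block/sdX/queue/scheduler    (repeat for every OSD disk and journal SSD)

[osd]
# placeholder values - tune against your own hardware
filestore wbthrottle xfs ios start flusher = 500
filestore wbthrottle xfs bytes start flusher = 41943040
filestore wbthrottle xfs inodes start flusher = 500

The OSDs pick up the ceph.conf values on restart, or they can be injected at runtime with "ceph tell osd.* injectargs".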
>>>>
>>>> On Oct 31, 2013, at 6:35 PM, Dinu Vlad <dinuvla...@gmail.com> wrote:
>>>>
>>>>> I tested the osd performance from a single node. For this purpose I deployed a new cluster (using ceph-deploy, as before) on fresh/repartitioned drives. I created a single pool with 1800 pgs. I ran rados bench both on the osd server and on a remote one. Cluster configuration stayed "default", with the same additions for the XFS mount & mkfs.xfs options as before.
>>>>>
>>>>> With a single host, the pgs were "stuck unclean" (active only, not active+clean):
>>>>>
>>>>> # ceph -s
>>>>>   cluster ffd16afa-6348-4877-b6bc-d7f9d82a4062
>>>>>    health HEALTH_WARN 1800 pgs stuck unclean
>>>>>    monmap e1: 3 mons at {cephmon1=10.4.0.250:6789/0,cephmon2=10.4.0.251:6789/0,cephmon3=10.4.0.252:6789/0}, election epoch 4, quorum 0,1,2 cephmon1,cephmon2,cephmon3
>>>>>    osdmap e101: 18 osds: 18 up, 18 in
>>>>>     pgmap v1055: 1800 pgs: 1800 active; 0 bytes data, 732 MB used, 16758 GB / 16759 GB avail
>>>>>    mdsmap e1: 0/0/1 up
>>>>>
>>>>> Test results:
>>>>>   Local test, 1 process, 16 threads: 241.7 MB/s
>>>>>   Local test, 8 processes, 128 threads: 374.8 MB/s
>>>>>   Remote test, 1 process, 16 threads: 231.8 MB/s
>>>>>   Remote test, 8 processes, 128 threads: 366.1 MB/s
>>>>>
>>>>> Maybe it's just me, but it seems on the low side too.
>>>>>
>>>>> Thanks,
>>>>> Dinu
>>>>>
>>>>> On Oct 30, 2013, at 8:59 PM, Mark Nelson <mark.nel...@inktank.com> wrote:
>>>>>
>>>>>> On 10/30/2013 01:51 PM, Dinu Vlad wrote:
>>>>>>> Mark,
>>>>>>>
>>>>>>> The SSDs are http://www.seagate.com/internal-hard-drives/enterprise-hard-drives/ssd/enterprise-sata-ssd/?sku=ST240FN0021 and the HDDs are http://www.seagate.com/internal-hard-drives/enterprise-hard-drives/hdd/constellation/?sku=ST91000640SS.
>>>>>>>
>>>>>>> The chassis is a "SiliconMechanics C602" - but I don't have the exact model. It's based on Supermicro, has 24 slots in front and 2 in the back, and a SAS expander.
>>>>>>>
>>>>>>> I did a fio test (raw partitions, 4M blocksize, IO queue maxed out according to what the driver reports in dmesg). Here are the results (filtered):
>>>>>>>
>>>>>>> Sequential:
>>>>>>> Run status group 0 (all jobs):
>>>>>>>   WRITE: io=176952MB, aggrb=2879.0MB/s, minb=106306KB/s, maxb=191165KB/s, mint=60444msec, maxt=61463msec
>>>>>>>
>>>>>>> Individually, the HDDs had best:worst 103:109 MB/s while the SSDs gave 153:189 MB/s.
>>>>>>
>>>>>> Ok, that looks like what I'd expect to see given the controller being used. SSDs are probably limited by total aggregate throughput.
>>>>>>
>>>>>>> Random:
>>>>>>> Run status group 0 (all jobs):
>>>>>>>   WRITE: io=106868MB, aggrb=1727.2MB/s, minb=67674KB/s, maxb=106493KB/s, mint=60404msec, maxt=61875msec
>>>>>>>
>>>>>>> Individually (best:worst) HDD 71:73 MB/s, SSD 68:101 MB/s (with only one out of 6 doing 101).
>>>>>>>
>>>>>>> This is on just one of the osd servers.
>>>>>>
>>>>>> Were the ceph tests to one OSD server or across all servers? It might be worth trying tests against a single server with no replication using multiple rados bench instances and just seeing what happens.
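Mark's suggestion of several concurrent rados bench writers against a replication-free pool looks roughly like this (the pool name and counts are only examples, following the 1800 pgs used elsewhere in the thread):

# ceph osd pool create bench 1800 1800
# ceph osd pool set bench size 1
# for i in $(seq 8); do rados -p bench bench 60 write -t 16 & done; wait

Each instance prints its own bandwidth summary; adding them up gives the aggregate the node sustains.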
>>>>>>
>>>>>>> Thanks,
>>>>>>> Dinu
>>>>>>>
>>>>>>> On Oct 30, 2013, at 6:38 PM, Mark Nelson <mark.nel...@inktank.com> wrote:
>>>>>>>
>>>>>>>> On 10/30/2013 09:05 AM, Dinu Vlad wrote:
>>>>>>>>> Hello,
>>>>>>>>>
>>>>>>>>> I've been doing some tests on a newly installed ceph cluster:
>>>>>>>>>
>>>>>>>>> # ceph osd pool create bench1 2048 2048
>>>>>>>>> # ceph osd pool create bench2 2048 2048
>>>>>>>>> # rbd -p bench1 create test
>>>>>>>>> # rbd -p bench1 bench-write test --io-pattern rand
>>>>>>>>> elapsed: 483  ops: 396579  ops/sec: 820.23  bytes/sec: 2220781.36
>>>>>>>>>
>>>>>>>>> # rados -p bench2 bench 300 write --show-time
>>>>>>>>> # (run 1)
>>>>>>>>> Total writes made:      20665
>>>>>>>>> Write size:             4194304
>>>>>>>>> Bandwidth (MB/sec):     274.923
>>>>>>>>> Stddev Bandwidth:       96.3316
>>>>>>>>> Max bandwidth (MB/sec): 748
>>>>>>>>> Min bandwidth (MB/sec): 0
>>>>>>>>> Average Latency:        0.23273
>>>>>>>>> Stddev Latency:         0.262043
>>>>>>>>> Max latency:            1.69475
>>>>>>>>> Min latency:            0.057293
>>>>>>>>>
>>>>>>>>> These results seem to be quite poor for the configuration:
>>>>>>>>>
>>>>>>>>> MON: dual-cpu Xeon E5-2407 2.2 GHz, 48 GB RAM, 2xSSD for OS
>>>>>>>>> OSD: dual-cpu Xeon E5-2620 2.0 GHz, 64 GB RAM, 2xSSD for OS (on-board controller), 18 HDD 1TB 7.2K rpm SAS for OSD drives and 6 SSDs (SATA) for journals, attached to an LSI 9207-8i controller.
>>>>>>>>>
>>>>>>>>> All servers have dual 10GE network cards, connected to a pair of dedicated switches. Each SSD has 3 10 GB partitions for journals.
>>>>>>>>
>>>>>>>> Agreed, you should see much higher throughput with that kind of storage setup. What brand/model SSDs are these? Also, what brand and model of chassis? With 24 drives and 8 SSDs I could push 2GB/s (no replication though) with a couple of concurrent rados bench processes going on our SC847A chassis, so ~550MB/s aggregate throughput for 18 drives and 6 SSDs is definitely on the low side.
>>>>>>>>
>>>>>>>> I'm actually not too familiar with what the RBD benchmarking commands are doing behind the scenes. Typically I've tested fio on top of a filesystem on RBD.
>>>>>>>>
>>>>>>>>> Using ubuntu 13.04, ceph 0.67.4, XFS for backend storage. Cluster was installed using ceph-deploy. ceph.conf is pretty much out of the box (diff from default follows):
>>>>>>>>>
>>>>>>>>> osd_journal_size = 10240
>>>>>>>>> osd mount options xfs = "rw,noatime,nobarrier,inode64"
>>>>>>>>> osd mkfs options xfs = "-f -i size=2048"
>>>>>>>>>
>>>>>>>>> [osd]
>>>>>>>>> public network = 10.4.0.0/24
>>>>>>>>> cluster network = 10.254.254.0/24
>>>>>>>>>
>>>>>>>>> All tests were run from a server outside the cluster, connected to the storage network with 2x 10GE NICs.
>>>>>>>>>
>>>>>>>>> I've done a few other tests of the individual components:
>>>>>>>>> - network: avg. 7.6 Gbit/s (iperf, mtu=1500), 9.6 Gbit/s (mtu=9000)
>>>>>>>>> - md raid0 write across all 18 HDDs: 1.4 GB/s sustained throughput
>>>>>>>>> - fio SSD write (xfs, 4k blocks, direct I/O): ~250 MB/s, ~55K IOPS
>>>>>>>>
>>>>>>>> What you might want to try doing is 4M direct IO writes using libaio and a high iodepth to all drives (spinning disks and SSDs) concurrently, and see what the per-drive and aggregate throughput looks like.
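A fio job file for that kind of concurrent 4M libaio test might look like the sketch below; the device names are placeholders and writing to raw devices is destructive, so only use drives that will be re-provisioned afterwards:

[global]
ioengine=libaio
direct=1
rw=write
bs=4m
iodepth=64
runtime=60
time_based

[hdd-sdb]
filename=/dev/sdb

[ssd-sdt]
filename=/dev/sdt

# ...one job section per spinning disk and journal SSD

Without group_reporting, fio prints a result per job (per drive) plus the "Run status group 0 (all jobs)" aggregate line, which makes it easy to spot a single slow device.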
>>>>>>>>
>>>>>>>> With just SSDs, I've been able to push the 9207-8i up to around 3GB/s with Ceph writes (1.5GB/s if you don't count journal writes), but perhaps there is something interesting about the way the hardware is set up on your system.
>>>>>>>>
>>>>>>>>> I'd appreciate any suggestion that might help improve the performance or identify a bottleneck.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Dinu

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com