ST240FN0021, connected via a SAS2x36 expander to an LSI 9207-8i. By "fixed" - do you mean you replaced the SSDs?

Thanks,
Dinu
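The 4k sync-write check Dinu describes below can be reproduced with a single fio run along these lines (the device name is a placeholder, and writing to a raw device destroys its contents, so only point it at a drive you can re-provision):

# fio --name=journal-test --filename=/dev/sdX --direct=1 --sync=1 --rw=write --bs=4k --numjobs=1 --iodepth=1 --runtime=60 --time_based

A journal-class SSD should report well above the ~500-600 IOPS seen here; the ~8k IOPS Dinu measured on an Intel 530 is closer to what a journal device needs.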
On Nov 6, 2013, at 10:25 PM, Mike Dawson <mike.daw...@cloudapt.com> wrote:

> We just fixed a performance issue on our cluster related to spikes of high latency on some of our SSDs used for osd journals. In our case, the slow SSDs showed spikes of 100x higher latency than expected.
>
> What SSDs were you using that were so slow?
>
> Cheers,
> Mike
>
> On 11/6/2013 12:39 PM, Dinu Vlad wrote:
>> I'm using the latest 3.8.0 branch from raring. Is there a more recent/better kernel recommended?
>>
>> Meanwhile, I think I might have identified the culprit - my SSD drives are extremely slow on sync writes, doing 500-600 IOPS max with a 4k blocksize. By comparison, an Intel 530 in another server (also installed behind a SAS expander) is doing the same test at ~8k IOPS. I guess I should replace them.
>>
>> Removing the SSD drives from the setup and re-testing with ceph => 595 MB/s throughput under the same conditions (only mechanical drives, journal on a separate partition on each one, 8 rados bench processes, 16 threads each).
>>
>> On Nov 5, 2013, at 4:38 PM, Mark Nelson <mark.nel...@inktank.com> wrote:
>>
>>> Ok, some more thoughts:
>>>
>>> 1) What kernel are you using?
>>>
>>> 2) Mixing SATA and SAS on an expander backplane can sometimes have bad effects. We don't really know how bad this is and in what circumstances, but the Nexenta folks have seen problems with ZFS on Solaris and it's not impossible Linux may suffer too:
>>>
>>> http://gdamore.blogspot.com/2010/08/why-sas-sata-is-not-such-great-idea.html
>>>
>>> 3) If you are doing tests and look at disk throughput with something like "collectl -sD -oT", do the writes look balanced across the spinning disks? Do any devices have really high service times or queue times?
>>>
>>> 4) Also, after the test is done, you can try:
>>>
>>> find /var/run/ceph/*.asok -maxdepth 1 -exec sudo ceph --admin-daemon {} dump_historic_ops \; > foo
>>>
>>> and then grep for "duration" in foo. You'll get a list of the slowest operations over the last 10 minutes from every osd on the node. Once you identify a slow duration, you can go back and in an editor search for that duration and look at where in the OSD it hung up. That might tell us more about slow/latent operations.
>>>
>>> 5) Something interesting here is that I've heard from another party that in a 36-drive Supermicro SC847E16 chassis they had 30 7.2K RPM disks and 6 SSDs on a SAS9207-8i controller and were pushing significantly faster throughput than you are seeing (even given the greater number of drives). So it's very interesting to me that you are pushing so much less. The 36-drive Supermicro chassis I have, with no expanders and 30 drives plus 6 SSDs, can push about 2100MB/s with a bunch of 9207-8i controllers and XFS (no replication).
>>>
>>> Mark
>>>
>>> On 11/05/2013 05:15 AM, Dinu Vlad wrote:
>>>> Ok, so after tweaking the deadline scheduler and the filestore_wbthrottle* ceph settings I was able to get 440 MB/s from 8 rados bench instances, over a single osd node (pool pg_num = 1800, size = 1).
>>>>
>>>> This still looks awfully slow to me - fio throughput across all disks reaches 2.8 GB/s!!
>>>>
>>>> I'd appreciate any suggestion on where to look for the issue. Thanks!
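The deadline scheduler and filestore_wbthrottle* changes mentioned above would be applied roughly as follows; Dinu's actual values are not quoted in the thread, so the numbers here are placeholders only (option names as shipped in dumpling/0.67):

# echo deadline > /sys/block/sdX/queue/scheduler    (repeat for every OSD disk and journal SSD)

[osd]
# placeholder values - tune against your own hardware
filestore wbthrottle xfs ios start flusher = 500
filestore wbthrottle xfs bytes start flusher = 41943040
filestore wbthrottle xfs inodes start flusher = 500

The OSDs pick up the ceph.conf values on restart, or they can be injected at runtime with "ceph tell osd.* injectargs".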
>>>>
>>>> On Oct 31, 2013, at 6:35 PM, Dinu Vlad <dinuvla...@gmail.com> wrote:
>>>>
>>>>> I tested the osd performance from a single node. For this purpose I deployed a new cluster (using ceph-deploy, as before) on fresh/repartitioned drives. I created a single pool with 1800 pgs. I ran rados bench both on the osd server and on a remote one. Cluster configuration stayed "default", with the same additions for the XFS mount & mkfs.xfs options as before.
>>>>>
>>>>> With a single host, the pgs were "stuck unclean" (active only, not active+clean):
>>>>>
>>>>> # ceph -s
>>>>>   cluster ffd16afa-6348-4877-b6bc-d7f9d82a4062
>>>>>    health HEALTH_WARN 1800 pgs stuck unclean
>>>>>    monmap e1: 3 mons at {cephmon1=10.4.0.250:6789/0,cephmon2=10.4.0.251:6789/0,cephmon3=10.4.0.252:6789/0}, election epoch 4, quorum 0,1,2 cephmon1,cephmon2,cephmon3
>>>>>    osdmap e101: 18 osds: 18 up, 18 in
>>>>>     pgmap v1055: 1800 pgs: 1800 active; 0 bytes data, 732 MB used, 16758 GB / 16759 GB avail
>>>>>    mdsmap e1: 0/0/1 up
>>>>>
>>>>> Test results:
>>>>>   Local test, 1 process, 16 threads: 241.7 MB/s
>>>>>   Local test, 8 processes, 128 threads: 374.8 MB/s
>>>>>   Remote test, 1 process, 16 threads: 231.8 MB/s
>>>>>   Remote test, 8 processes, 128 threads: 366.1 MB/s
>>>>>
>>>>> Maybe it's just me, but it seems on the low side too.
>>>>>
>>>>> Thanks,
>>>>> Dinu
>>>>>
>>>>> On Oct 30, 2013, at 8:59 PM, Mark Nelson <mark.nel...@inktank.com> wrote:
>>>>>
>>>>>> On 10/30/2013 01:51 PM, Dinu Vlad wrote:
>>>>>>> Mark,
>>>>>>>
>>>>>>> The SSDs are http://www.seagate.com/internal-hard-drives/enterprise-hard-drives/ssd/enterprise-sata-ssd/?sku=ST240FN0021 and the HDDs are http://www.seagate.com/internal-hard-drives/enterprise-hard-drives/hdd/constellation/?sku=ST91000640SS.
>>>>>>>
>>>>>>> The chassis is a "SiliconMechanics C602" - but I don't have the exact model. It's based on Supermicro, has 24 slots in front and 2 in the back, and a SAS expander.
>>>>>>>
>>>>>>> I did a fio test (raw partitions, 4M blocksize, IO queue maxed out according to what the driver reports in dmesg). Here are the results (filtered):
>>>>>>>
>>>>>>> Sequential:
>>>>>>> Run status group 0 (all jobs):
>>>>>>>   WRITE: io=176952MB, aggrb=2879.0MB/s, minb=106306KB/s, maxb=191165KB/s, mint=60444msec, maxt=61463msec
>>>>>>>
>>>>>>> Individually, the HDDs had best:worst 103:109 MB/s while the SSDs gave 153:189 MB/s.
>>>>>>
>>>>>> Ok, that looks like what I'd expect to see given the controller being used. SSDs are probably limited by total aggregate throughput.
>>>>>>
>>>>>>> Random:
>>>>>>> Run status group 0 (all jobs):
>>>>>>>   WRITE: io=106868MB, aggrb=1727.2MB/s, minb=67674KB/s, maxb=106493KB/s, mint=60404msec, maxt=61875msec
>>>>>>>
>>>>>>> Individually (best:worst) HDD 71:73 MB/s, SSD 68:101 MB/s (with only one out of 6 doing 101).
>>>>>>>
>>>>>>> This is on just one of the osd servers.
>>>>>>
>>>>>> Were the ceph tests to one OSD server or across all servers? It might be worth trying tests against a single server with no replication using multiple rados bench instances and just seeing what happens.
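Mark's suggestion of several concurrent rados bench writers against a replication-free pool looks roughly like this (the pool name and counts are only examples, following the 1800 pgs used elsewhere in the thread):

# ceph osd pool create bench 1800 1800
# ceph osd pool set bench size 1
# for i in $(seq 8); do rados -p bench bench 60 write -t 16 & done; wait

Each instance prints its own bandwidth summary; adding them up gives the aggregate the node sustains.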
>>>>>>
>>>>>>> Thanks,
>>>>>>> Dinu
>>>>>>>
>>>>>>> On Oct 30, 2013, at 6:38 PM, Mark Nelson <mark.nel...@inktank.com> wrote:
>>>>>>>
>>>>>>>> On 10/30/2013 09:05 AM, Dinu Vlad wrote:
>>>>>>>>> Hello,
>>>>>>>>>
>>>>>>>>> I've been doing some tests on a newly installed ceph cluster:
>>>>>>>>>
>>>>>>>>> # ceph osd pool create bench1 2048 2048
>>>>>>>>> # ceph osd pool create bench2 2048 2048
>>>>>>>>> # rbd -p bench1 create test
>>>>>>>>> # rbd -p bench1 bench-write test --io-pattern rand
>>>>>>>>> elapsed: 483  ops: 396579  ops/sec: 820.23  bytes/sec: 2220781.36
>>>>>>>>>
>>>>>>>>> # rados -p bench2 bench 300 write --show-time
>>>>>>>>> # (run 1)
>>>>>>>>> Total writes made:      20665
>>>>>>>>> Write size:             4194304
>>>>>>>>> Bandwidth (MB/sec):     274.923
>>>>>>>>> Stddev Bandwidth:       96.3316
>>>>>>>>> Max bandwidth (MB/sec): 748
>>>>>>>>> Min bandwidth (MB/sec): 0
>>>>>>>>> Average Latency:        0.23273
>>>>>>>>> Stddev Latency:         0.262043
>>>>>>>>> Max latency:            1.69475
>>>>>>>>> Min latency:            0.057293
>>>>>>>>>
>>>>>>>>> These results seem to be quite poor for the configuration:
>>>>>>>>>
>>>>>>>>> MON: dual-cpu Xeon E5-2407 2.2 GHz, 48 GB RAM, 2xSSD for OS
>>>>>>>>> OSD: dual-cpu Xeon E5-2620 2.0 GHz, 64 GB RAM, 2xSSD for OS (on-board controller), 18 HDD 1TB 7.2K rpm SAS for OSD drives and 6 SSDs (SATA) for journals, attached to an LSI 9207-8i controller.
>>>>>>>>>
>>>>>>>>> All servers have dual 10GE network cards, connected to a pair of dedicated switches. Each SSD has 3 10 GB partitions for journals.
>>>>>>>>
>>>>>>>> Agreed, you should see much higher throughput with that kind of storage setup. What brand/model SSDs are these? Also, what brand and model of chassis? With 24 drives and 8 SSDs I could push 2GB/s (no replication though) with a couple of concurrent rados bench processes going on our SC847A chassis, so ~550MB/s aggregate throughput for 18 drives and 6 SSDs is definitely on the low side.
>>>>>>>>
>>>>>>>> I'm actually not too familiar with what the RBD benchmarking commands are doing behind the scenes. Typically I've tested fio on top of a filesystem on RBD.
>>>>>>>>
>>>>>>>>> Using ubuntu 13.04, ceph 0.67.4, XFS for backend storage. Cluster was installed using ceph-deploy. ceph.conf is pretty much out of the box (diff from default follows):
>>>>>>>>>
>>>>>>>>> osd_journal_size = 10240
>>>>>>>>> osd mount options xfs = "rw,noatime,nobarrier,inode64"
>>>>>>>>> osd mkfs options xfs = "-f -i size=2048"
>>>>>>>>>
>>>>>>>>> [osd]
>>>>>>>>> public network = 10.4.0.0/24
>>>>>>>>> cluster network = 10.254.254.0/24
>>>>>>>>>
>>>>>>>>> All tests were run from a server outside the cluster, connected to the storage network with 2x 10GE NICs.
>>>>>>>>>
>>>>>>>>> I've done a few other tests of the individual components:
>>>>>>>>> - network: avg. 7.6 Gbit/s (iperf, mtu=1500), 9.6 Gbit/s (mtu=9000)
>>>>>>>>> - md raid0 write across all 18 HDDs: 1.4 GB/s sustained throughput
>>>>>>>>> - fio SSD write (xfs, 4k blocks, direct I/O): ~250 MB/s, ~55K IOPS
>>>>>>>>
>>>>>>>> What you might want to try doing is 4M direct IO writes using libaio and a high iodepth to all drives (spinning disks and SSDs) concurrently, and see what the per-drive and aggregate throughput looks like.
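A fio job file for that kind of concurrent 4M libaio test might look like the sketch below; the device names are placeholders and writing to raw devices is destructive, so only use drives that will be re-provisioned afterwards:

[global]
ioengine=libaio
direct=1
rw=write
bs=4m
iodepth=64
runtime=60
time_based

[hdd-sdb]
filename=/dev/sdb

[ssd-sdt]
filename=/dev/sdt

# ...one job section per spinning disk and journal SSD

Without group_reporting, fio prints a result per job (per drive) plus the "Run status group 0 (all jobs)" aggregate line, which makes it easy to spot a single slow device.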
>>>>>>>>
>>>>>>>> With just SSDs, I've been able to push the 9207-8i up to around 3GB/s with Ceph writes (1.5GB/s if you don't count journal writes), but perhaps there is something interesting about the way the hardware is set up on your system.
>>>>>>>>
>>>>>>>>> I'd appreciate any suggestion that might help improve the performance or identify a bottleneck.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Dinu

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com