I would be interested in seeing the results from the post mentioned by an earlier contributor:
https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/

Test an "old" M500 and a "new" M500 and see whether the performance is (a) acceptable and (b) comparable. If it is acceptable on one but differs between the two, check for hardware or firmware revision differences. If the "old" device doesn't test well in fio/dd testing, then the drives are (as expected) not a great choice for journals, and you might want to look at hardware/backplane/RAID configuration differences that are somehow allowing the older systems to perform adequately.

On Fri, Oct 27, 2017 at 12:36 PM, Russell Glaue <rgl...@cait.org> wrote:
> Yes, all the M500s we use hold both journal and OSD, even the older ones. We have a 3-year lifecycle and move older nodes from one ceph cluster to another. On old systems with 3-year-old M500s, they run as RAID0 and run faster than our current problem system with 1-year-old M500s, which run as non-RAID pass-through on the controller.
>
> All disks are SATA and are connected to a SAS controller. We were wondering if the SAS/SATA conversion is an issue. Yet the older systems don't exhibit a problem.
>
> I found out from a colleague that when the current ceph cluster was put together, the SSDs tested at 300+ MB/s, yet the ceph cluster writes at 30 MB/s.
>
> Using SMART tools, the reserved cells in all drives are at nearly 100%.
>
> Restarting the OSDs improved performance slightly. Still betting on hardware issues that a firmware upgrade may resolve.
>
> -RG
>
> On Oct 27, 2017 1:14 PM, "Brian Andrus" <brian.and...@dreamhost.com> wrote:
>
> @Russell, are your "older Crucial M500"s being used as journals?
>
> Crucial M500s are not to be used as a Ceph journal, in my experience with them. They make good OSDs with an NVMe in front of them, perhaps, but not much else.
>
> Ceph uses O_DSYNC for journal writes, and these drives do not handle them as expected. It's been many years since I dealt with the M500s specifically, but it has to do with the capacitor/power-save feature and how the drive handles those types of writes. I'm sorry I don't have the emails with the specifics around anymore, but as far as I remember this was a hardware issue and could not be resolved with firmware.
>
> Paging Kyle Bader...
>
> On Fri, Oct 27, 2017 at 9:24 AM, Russell Glaue <rgl...@cait.org> wrote:
>
>> We have older Crucial M500 disks operating without such problems, so I have to believe it is a hardware or firmware issue. And it's peculiar to see performance boost slightly, even 24 hours later, when I stop and then start the OSDs.
>>
>> Our actual writes are low, as most of our Ceph-cluster-based images are low-write, high-memory. So a 20GB/day write endurance is a non-issue for us; only write speed is the concern. Our write-intensive images are locked on non-ceph disks.
>>
>> What are others using for SSD drives in their Ceph clusters? At 0.50+ DWPD (Drive Writes Per Day), the Kingston SEDC400S37 models seem to be the best for the price today.
>>
>> On Fri, Oct 27, 2017 at 6:34 AM, Maged Mokhtar <mmokh...@petasan.org> wrote:
>>
>>> It is quite likely related; things are pointing to bad disks. Probably the best thing is to plan for disk replacement, the sooner the better, as it could get worse.
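For reference, the journal test in that blog post boils down to a single-job, queue-depth-1 O_DSYNC write against the raw device. Roughly, with /dev/sdX as a placeholder for a spare or blank SSD (the test is destructive to data on that device):

  # quick dd variant: direct, synchronous 4k writes
  dd if=/dev/zero of=/dev/sdX bs=4k count=100000 oflag=direct,dsync

  # fio variant: sustained 4k sync writes for 60 seconds
  fio --filename=/dev/sdX --direct=1 --sync=1 --rw=write --bs=4k --numjobs=1 --iodepth=1 --runtime=60 --time_based --group_reporting --name=journal-test

Journal-class SSDs typically sustain thousands or more of these sync-write IOPS; consumer drives that cannot honor O_DSYNC cheaply often collapse to a few hundred.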
>>>
>>> On 2017-10-27 02:22, Christian Wuerdig wrote:
>>>
>>> Hm, not necessarily directly related to your performance problem, however: these SSDs have a listed endurance of 72TB total data written. Over a 5-year period that's 40GB a day, or approximately 0.04 DWPD. Given that you run the journal for each OSD on the same disk, that's effectively at most 0.02 DWPD (about 20GB per day per disk). I don't know many who'd run a cluster on disks like those. It also means these are pure consumer drives, which have a habit of exhibiting erratic performance at times (based on unquantified anecdotal personal experience with other consumer-model SSDs). I wouldn't touch these with a long stick for anything but small toy-test clusters.
>>>
>>> On Fri, Oct 27, 2017 at 3:44 AM, Russell Glaue <rgl...@cait.org> wrote:
>>>
>>> On Wed, Oct 25, 2017 at 7:09 PM, Maged Mokhtar <mmokh...@petasan.org> wrote:
>>>
>>> It depends on what stage you are in: in production, probably the best thing is to set up a monitoring tool (collectd/graphite/prometheus/grafana) to monitor both ceph stats as well as resource load. This will, among other things, show you if you have slowing disks.
>>>
>>> I am monitoring Ceph performance with ceph-dash (http://cephdash.crapworks.de/), which is why I knew to look into the slow writes issue. And I am using Monitorix (http://www.monitorix.org/) to monitor system resources, including disk I/O.
>>>
>>> However, though I can monitor individual disk performance at the system level, it seems Ceph does not tax any disk more than the worst disk, so in my monitoring charts all disks show the same performance. All four nodes baseline at 50 writes/sec during the cluster's normal load, with the non-problem hosts spiking up to 150 and the problem host only spiking up to 100. But during the window of time I took the problem host's OSDs down to run the bench tests, the OSDs on the other nodes increased to 300-500 writes/sec. Otherwise, the chart looks the same for all disks on all ceph nodes/hosts.
>>>
>>> Before production you should first make sure your SSDs are suitable for Ceph, either by being recommended by other Ceph users or by testing them yourself for sync-write performance using the fio tool as outlined earlier. Then, after you build your cluster, you can use rados and/or rbd benchmark tests to benchmark your cluster and find bottlenecks using atop/sar/collectl, which will help you tune your cluster.
>>>
>>> All 36 OSDs are: Crucial_CT960M500SSD1
>>>
>>> Rados bench tests were done at the beginning. The speed was much faster than it is now. I cannot recall the test results; someone else on my team ran them. Until recently, before I posted here, I had thought the slow disk problem was a configuration issue with Ceph. Now we are hoping it may be resolved with a firmware update. (If it is firmware related, rebooting the problem node may temporarily resolve this.)
>>>
>>> Though you did see better improvements, your cluster with 27 SSDs should give much higher numbers than 3k iops. If you are running rados bench while you have other client IOs, then obviously the number reported by the tool will be less than what the cluster is actually delivering, which you can find out via the ceph status command; it will print the total cluster throughput and iops.
>>> If the total is still low I would recommend running the fio raw disk test; maybe the disks are not suitable. When you removed your 9 bad disks from the 36 and your performance doubled, you still had 2 other disks slowing you down, meaning near 100% busy? It makes me feel the disk type used is not good. For these near-100%-busy disks, can you also measure their raw disk iops at that load? (I am not sure atop shows this; if not, use sar/sysstat/iostat/collectl.)
>>>
>>> I ran another bench test today with all 36 OSDs up. The overall performance was improved slightly compared to the original tests. Only 3 OSDs on the problem host were increasing to 101% disk busy. The iops reported from ceph status during this bench test ranged from 1.6k to 3.3k, the test yielding 4k iops.
>>>
>>> Yes, the two other OSDs/disks that were the bottleneck were at 101% disk busy. The other OSD disks on the same host were sailing along at around 50-60% busy.
>>>
>>> All 36 OSD disks are exactly the same disk. They were all purchased at the same time, and all were installed at the same time. I cannot believe it is a problem with the disk model. A failed/bad disk, perhaps, is possible. But the disk model itself cannot be the problem based on what I am seeing. If I am seeing bad performance on all disks on one ceph node/host, but not on another ceph node with these same disks, it has to be some other factor. This is why I am now guessing a firmware upgrade is needed.
>>>
>>> Also, as I alluded to here earlier, I took down all 9 OSDs in the problem host yesterday to run the bench test. Today, with those 9 OSDs back online, I reran the bench test and am seeing 2-3 OSD disks at 101% busy on the problem host, while the other disks are lower than 80%. So, for whatever reason, shutting down the OSDs and starting them back up allowed many (not all) of the OSDs' performance to improve on the problem host.
>>>
>>> Maged
>>>
>>> On 2017-10-25 23:44, Russell Glaue wrote:
>>>
>>> Thanks to all. I took the OSDs down on the problem host, without shutting down the machine. As predicted, our MB/s roughly doubled. Using this bench/atop procedure, I found two other OSDs on another host that are the next bottlenecks.
>>>
>>> Is this the only good way to really test the performance of the drives as OSDs? Is there any other way?
>>>
>>> While running the bench on all 36 OSDs, the 9 problem OSDs stuck out. But the two new problem OSDs I just discovered in this recent test of 27 OSDs did not stick out at all, because the ceph bench distributes the load, so only the very worst performers show up in atop. So ceph is as slow as your slowest drive.
>>>
>>> It would be really great if I could run the bench test and somehow get the bench to use only certain OSDs during the test. Then I could run the test avoiding the OSDs that I already know are a problem, so I can find the next worst OSD.
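One way to get a per-OSD number without involving the rest of the cluster is the OSD's built-in bench command, which writes only to that OSD's backing store and reports throughput. A minimal sketch (osd.12 is a placeholder id, and this does add write load to that one disk):

  # default run: roughly 1 GiB written in 4 MiB blocks to a single OSD
  ceph tell osd.12 bench

Running it against a known-good OSD and a suspect OSD and comparing the reported throughput won't exercise the full replicated write path, but it does isolate the drive and controller behind a single OSD.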
>>>
>>> [ the bench test ]
>>> rados bench -p scbench -b 4096 30 write -t 32
>>>
>>> [ original results with all 36 OSDs ]
>>> Total time run:         30.822350
>>> Total writes made:      31032
>>> Write size:             4096
>>> Object size:            4096
>>> Bandwidth (MB/sec):     3.93282
>>> Stddev Bandwidth:       3.66265
>>> Max bandwidth (MB/sec): 13.668
>>> Min bandwidth (MB/sec): 0
>>> Average IOPS:           1006
>>> Stddev IOPS:            937
>>> Max IOPS:               3499
>>> Min IOPS:               0
>>> Average Latency(s):     0.0317779
>>> Stddev Latency(s):      0.164076
>>> Max latency(s):         2.27707
>>> Min latency(s):         0.0013848
>>> Cleaning up (deleting benchmark objects)
>>> Clean up completed and total clean up time: 20.166559
>>>
>>> [ after stopping all 9 OSDs on the problem host ]
>>> Total time run:         32.586830
>>> Total writes made:      59491
>>> Write size:             4096
>>> Object size:            4096
>>> Bandwidth (MB/sec):     7.13131
>>> Stddev Bandwidth:       9.78725
>>> Max bandwidth (MB/sec): 29.168
>>> Min bandwidth (MB/sec): 0
>>> Average IOPS:           1825
>>> Stddev IOPS:            2505
>>> Max IOPS:               7467
>>> Min IOPS:               0
>>> Average Latency(s):     0.0173691
>>> Stddev Latency(s):      0.21634
>>> Max latency(s):         6.71283
>>> Min latency(s):         0.00107473
>>> Cleaning up (deleting benchmark objects)
>>> Clean up completed and total clean up time: 16.269393
>>>
>>> On Fri, Oct 20, 2017 at 1:35 PM, Russell Glaue <rgl...@cait.org> wrote:
>>>
>>> On the machine in question, the 2nd newest, we are using the LSI MegaRAID SAS-3 3008 [Fury], which gives us a "Non-RAID" option and has no battery. The older two use the LSI MegaRAID SAS 2208 [Thunderbolt] I reported earlier, with each single drive configured as RAID0.
>>>
>>> Thanks for everyone's help. I am going to run a 32-thread bench test after taking the 2nd machine out of the cluster with noout. Once it is out of the cluster, I expect the slow write issue will not surface.
>>>
>>> On Fri, Oct 20, 2017 at 5:27 AM, David Turner <drakonst...@gmail.com> wrote:
>>>
>>> I can attest that the battery in the raid controller is a thing. I'm used to using LSI controllers, but my current position has HP raid controllers, and the 10 nodes we just tracked down with near-constant >100ms await turned out to be the only 10 nodes in the cluster with failed batteries on their raid controllers.
>>>
>>> On Thu, Oct 19, 2017, 8:15 PM Christian Balzer <ch...@gol.com> wrote:
>>>
>>> Hello,
>>>
>>> On Thu, 19 Oct 2017 17:14:17 -0500 Russell Glaue wrote:
>>>
>>> That is a good idea. However, a previous rebalancing process brought performance of our guest VMs to a slow drag.
>>>
>>> Never mind that I'm not sure these SSDs are particularly well suited for Ceph; your problem is clearly located on that one node.
>>>
>>> Not that I think it's the case, but make sure your PG distribution is not skewed, with many more PGs per OSD on that node.
>>>
>>> Once you rule that out, my first guess is the RAID controller. You're running the SSDs as single RAID0s, I presume? If so, either a configuration difference or a failed BBU on the controller could result in the writeback cache being disabled, which would explain things beautifully.
>>>
>>> As for a temporary test/fix (with reduced redundancy, of course), set noout (or mon_osd_down_out_subtree_limit accordingly) and turn the slow host off.
>>>
>>> This should result in much better performance than you have now and of course be the final confirmation of that host being the culprit.
>>>
>>> Christian
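The two checks Christian describes above map to a couple of quick commands; nothing here is specific to this cluster:

  # per-OSD utilization and PG counts - look for a host whose OSDs carry noticeably more PGs
  ceph osd df tree

  # keep CRUSH from marking the stopped OSDs out and rebalancing while the slow host is down
  ceph osd set noout
  # ... stop the OSDs on the suspect host and re-run the benchmark ...
  ceph osd unset noout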
>>> On Thu, Oct 19, 2017 at 3:55 PM, Jean-Charles Lopez <jelo...@redhat.com> wrote:
>>>
>>> Hi Russell,
>>>
>>> As you have 4 servers, and assuming you are not doing EC pools, just stop all the OSDs on the second, questionable server, mark the OSDs on that server as out, let the cluster rebalance, and when all PGs are active+clean just replay the test.
>>>
>>> All IOs should then go only to the other 3 servers.
>>>
>>> JC
>>>
>>> On Oct 19, 2017, at 13:49, Russell Glaue <rgl...@cait.org> wrote:
>>>
>>> No, I have not ruled out the disk controller and backplane making the disks slower. Is there a way I could test that theory, other than swapping out hardware?
>>> -RG
>>>
>>> On Thu, Oct 19, 2017 at 3:44 PM, David Turner <drakonst...@gmail.com> wrote:
>>>
>>> Have you ruled out the disk controller and backplane in the server running slower?
>>>
>>> On Thu, Oct 19, 2017 at 4:42 PM Russell Glaue <rgl...@cait.org> wrote:
>>>
>>> I ran the test on the Ceph pool, and ran atop on all 4 storage servers, as suggested.
>>>
>>> Out of the 4 servers: 3 of them performed with 17% to 30% disk %busy and 11% CPU wait, momentarily spiking up to 50% on one server and 80% on another. The 2nd newest server was averaging almost 90% disk %busy and 150% CPU wait, and more than momentarily spiking to 101% disk busy and 250% CPU wait. For this 2nd newest server, these were the statistics for about 8 of 9 disks, with the 9th disk not far behind the others.
>>>
>>> I cannot believe all 9 disks are bad. They are the same disks as in the newest, 1st server, Crucial_CT960M500SSD1, and the same exact server hardware too. They were purchased at the same time in the same purchase order and arrived at the same time. So I cannot believe I just happened to put 9 bad disks in one server and 9 good ones in the other.
>>>
>>> I know I have Ceph configured exactly the same on all servers, and I am sure I have the hardware settings configured exactly the same on the 1st and 2nd servers. So if I were someone else, I would say it may be bad hardware on the 2nd server. But the 2nd server is running very well without any hint of a problem.
>>>
>>> Any other ideas or suggestions?
>>>
>>> -RG
>>>
>>> On Wed, Oct 18, 2017 at 3:40 PM, Maged Mokhtar <mmokh...@petasan.org> wrote:
>>>
>>> Just run the same 32-threaded rados test as you did before, and this time run atop while the test is running, looking for %busy of cpu/disks. It should give an idea of whether there is a bottleneck in them.
>>>
>>> On 2017-10-18 21:35, Russell Glaue wrote:
>>>
>>> I cannot run the write test reviewed at the ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device blog; the tests write directly to the raw disk device. Reading an infile (created with urandom) on one SSD and writing the outfile to another SSD yields about 17MB/s. But isn't this write speed limited by the speed at which the dd infile can be read? And I assume the best test should be run with no other load.
>>>
>>> How does one run the rados bench "as stress"?
>>>
>>> -RG
>>>
>>> On Wed, Oct 18, 2017 at 1:33 PM, Maged Mokhtar <mmokh...@petasan.org> wrote:
>>>
>>> Measuring resource load as outlined earlier will show whether the drives are performing well or not. Also, how many osds do you have?
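Measuring that resource load during the 32-thread rados bench can be as simple as watching extended device stats on each OSD host while the test runs (intervals and tools here are just examples):

  # refresh extended per-device statistics every 5 seconds; %util pinned near 100 marks a saturated disk
  iostat -x 5

  # or interactively, with atop highlighting busy disks and CPU wait
  atop 5

One or two devices sitting at ~100% busy while their peers on the same host idle along at much lower utilization is usually the signature of a failing or misbehaving drive rather than a cluster-wide configuration problem.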
>>>
>>> On 2017-10-18 19:26, Russell Glaue wrote:
>>>
>>> The SSD drives are Crucial M500. A Ceph user did some benchmarks and found it had good performance:
>>> https://forum.proxmox.com/threads/ceph-bad-performance-in-qemu-guests.21551/
>>>
>>> However, a user comment from 3 years ago on the blog post you linked to says to avoid the Crucial M500.
>>>
>>> Yet this performance posting says that the Crucial M500 is good:
>>> https://inside.servers.com/ssd-performance-2017-c4307a92dea
>>>
>>> On Wed, Oct 18, 2017 at 11:53 AM, Maged Mokhtar <mmokh...@petasan.org> wrote:
>>>
>>> Check out the following link: some SSDs perform badly in Ceph due to sync writes to the journal:
>>> https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
>>>
>>> Another thing that can help is to re-run the rados 32-thread test as stress and view resource usage using atop (or collectl/sar) to check %busy for cpu and disks, to give you an idea of what is holding down your cluster. For example: if cpu/disk % are all low, then check your network/switches. If disk %busy is high (90%) for all disks, then your disks are the bottleneck, which either means you have SSDs that are not suitable for Ceph or you have too few disks (which I doubt is the case). If only 1 disk's %busy is high, there may be something wrong with that disk and it should be removed.
>>>
>>> Maged
>>>
>>> On 2017-10-18 18:13, Russell Glaue wrote:
>>>
>>> In my previous post, in one of my points, I was wondering if the request size would increase if I enabled jumbo packets; currently they are disabled.
>>>
>>> @jdillama: The qemu settings for both of these two guest machines, with RAID/LVM and Ceph/rbd images, are the same. I am not thinking that changing the qemu settings of "min_io_size=<limited to 16bits>,opt_io_size=<RBD image object size>" will directly address the issue.
>>>
>>> @mmokhtar: OK. So you suggest the request size is the result of the problem and not the cause of the problem, meaning I should go after a different issue.
>>>
>>> I have been trying to get write speeds up to what people on this mail list are discussing. It seems that for our configuration, as it matches others, we should be getting about 70MB/s write speed. But we are not getting that. Single writes to disk are lucky to get 5MB/s to 6MB/s, and are typically 1MB/s to 2MB/s. Monitoring the entire Ceph cluster (using http://cephdash.crapworks.de/), I have seen very rare momentary spikes up to 30MB/s.
>>>
>>> My storage network is connected via a 10Gb switch.
>>> I have 4 storage servers with an LSI Logic MegaRAID SAS 2208 controller.
>>> Each storage server has 9 1TB SSD drives, each drive as 1 osd (no RAID).
>>> Each drive is one LVM group with two volumes - one volume for the osd, one volume for the journal.
>>> Each osd is formatted with xfs.
>>> The crush map is simple: default->rack->[host[1..4]->osd] with an evenly distributed weight.
>>> The redundancy is triple replication.
>>>
>>> While I have read comments that having the osd and journal on the same disk decreases write speed, I have also read that once past 8 OSDs per node this is the recommended configuration; however, this is also the reason why SSD drives are used exclusively for OSDs in the storage nodes.
>>> None-the-less, I was still expecting write speeds to be above >>> 30MB/s, >>> not below 6MB/s. >>> Even at 12x slower than the RAID, using my previously posted >>> iostat >>> data set, I should be seeing write speeds that average 10MB/s, >>> not 2MB/s. >>> >>> In regards to the rados benchmark tests you asked me to run, >>> here is >>> the output: >>> >>> [centos7]# rados bench -p scbench -b 4096 30 write -t 1 >>> Maintaining 1 concurrent writes of 4096 bytes to objects of >>> size 4096 >>> for up to 30 seconds or 0 objects >>> Object prefix: benchmark_data_hamms.sys.cu.cait.org_85049 >>> sec Cur ops started finished avg MB/s cur MB/s last >>> lat(s) >>> avg lat(s) >>> 0 0 0 0 0 0 >>> - >>> 0 >>> 1 1 201 200 0.78356 0.78125 >>> 0.00522307 >>> 0.00496574 >>> 2 1 469 468 0.915303 1.04688 >>> 0.00437497 >>> 0.00426141 >>> 3 1 741 740 0.964371 1.0625 >>> 0.00512853 >>> 0.0040434 >>> 4 1 888 887 0.866739 0.574219 >>> 0.00307699 >>> 0.00450177 >>> 5 1 1147 1146 0.895725 1.01172 >>> 0.00376454 >>> 0.0043559 >>> 6 1 1325 1324 0.862293 0.695312 >>> 0.00459443 >>> 0.004525 >>> 7 1 1494 1493 0.83339 0.660156 >>> 0.00461002 >>> 0.00458452 >>> 8 1 1736 1735 0.847369 0.945312 >>> 0.00253971 >>> 0.00460458 >>> 9 1 1998 1997 0.866922 1.02344 >>> 0.00236573 >>> 0.00450172 >>> 10 1 2260 2259 0.882563 1.02344 >>> 0.00262179 >>> 0.00442152 >>> 11 1 2526 2525 0.896775 1.03906 >>> 0.00336914 >>> 0.00435092 >>> 12 1 2760 2759 0.898203 0.914062 >>> 0.00351827 >>> 0.00434491 >>> 13 1 3016 3015 0.906025 1 >>> 0.00335703 >>> 0.00430691 >>> 14 1 3257 3256 0.908545 0.941406 >>> 0.00332344 >>> 0.00429495 >>> 15 1 3490 3489 0.908644 0.910156 >>> 0.00318815 >>> 0.00426387 >>> 16 1 3728 3727 0.909952 0.929688 >>> 0.0032881 >>> 0.00428895 >>> 17 1 3986 3985 0.915703 1.00781 >>> 0.00274809 >>> 0.0042614 >>> 18 1 4250 4249 0.922116 1.03125 >>> 0.00287411 >>> 0.00423214 >>> 19 1 4505 4504 0.926003 0.996094 >>> 0.00375435 >>> 0.00421442 >>> 2017-10-18 10:56:31.267173 min lat: 0.00181259 max lat: >>> 0.270553 avg >>> lat: 0.00420118 >>> sec Cur ops started finished avg MB/s cur MB/s last >>> lat(s) >>> avg lat(s) >>> 20 1 4757 4756 0.928915 0.984375 >>> 0.00463972 >>> 0.00420118 >>> 21 1 5009 5008 0.93155 0.984375 >>> 0.00360065 >>> 0.00418937 >>> 22 1 5235 5234 0.929329 0.882812 >>> 0.00626214 >>> 0.004199 >>> 23 1 5500 5499 0.933925 1.03516 >>> 0.00466584 >>> 0.00417836 >>> 24 1 5708 5707 0.928861 0.8125 >>> 0.00285727 >>> 0.00420146 >>> 25 0 5964 5964 0.931858 1.00391 >>> 0.00417383 >>> 0.0041881 >>> 26 1 6216 6215 0.933722 0.980469 >>> 0.0041009 >>> 0.00417915 >>> 27 1 6481 6480 0.937474 1.03516 >>> 0.00307484 >>> 0.00416118 >>> 28 1 6745 6744 0.940819 1.03125 >>> 0.00266329 >>> 0.00414777 >>> 29 1 7003 7002 0.943124 1.00781 >>> 0.00305905 >>> 0.00413758 >>> 30 1 7271 7270 0.946578 1.04688 >>> 0.00391017 >>> 0.00412238 >>> Total time run: 30.006060 >>> Total writes made: 7272 >>> Write size: 4096 >>> Object size: 4096 >>> Bandwidth (MB/sec): 0.946684 >>> Stddev Bandwidth: 0.123762 >>> Max bandwidth (MB/sec): 1.0625 >>> Min bandwidth (MB/sec): 0.574219 >>> Average IOPS: 242 >>> Stddev IOPS: 31 >>> Max IOPS: 272 >>> Min IOPS: 147 >>> Average Latency(s): 0.00412247 >>> Stddev Latency(s): 0.00648437 >>> Max latency(s): 0.270553 >>> Min latency(s): 0.00175318 >>> Cleaning up (deleting benchmark objects) >>> Clean up completed and total clean up time :29.069423 >>> >>> [centos7]# rados bench -p scbench -b 4096 30 write -t 32 >>> Maintaining 32 concurrent writes of 4096 bytes to objects of >>> size >>> 4096 for up to 30 
seconds or 0 objects >>> Object prefix: benchmark_data_hamms.sys.cu.cait.org_86076 >>> sec Cur ops started finished avg MB/s cur MB/s last >>> lat(s) >>> avg lat(s) >>> 0 0 0 0 0 0 >>> - >>> 0 >>> 1 32 3013 2981 11.6438 11.6445 >>> 0.00247906 >>> 0.00572026 >>> 2 32 5349 5317 10.3834 9.125 >>> 0.00246662 >>> 0.00932016 >>> 3 32 5707 5675 7.3883 1.39844 >>> 0.00389774 >>> 0.0156726 >>> 4 32 5895 5863 5.72481 0.734375 >>> 1.13137 >>> 0.0167946 >>> 5 32 6869 6837 5.34068 3.80469 >>> 0.0027652 >>> 0.0226577 >>> 6 32 8901 8869 5.77306 7.9375 >>> 0.0053211 >>> 0.0216259 >>> 7 32 10800 10768 6.00785 7.41797 >>> 0.00358187 >>> 0.0207418 >>> 8 32 11825 11793 5.75728 4.00391 >>> 0.00217575 >>> 0.0215494 >>> 9 32 12941 12909 5.6019 4.35938 >>> 0.00278512 >>> 0.0220567 >>> 10 32 13317 13285 5.18849 1.46875 >>> 0.0034973 >>> 0.0240665 >>> 11 32 16189 16157 5.73653 11.2188 >>> 0.00255841 >>> 0.0212708 >>> 12 32 16749 16717 5.44077 2.1875 >>> 0.00330334 >>> 0.0215915 >>> 13 32 16756 16724 5.02436 0.0273438 >>> 0.00338994 >>> 0.021849 >>> 14 32 17908 17876 4.98686 4.5 >>> 0.00402598 >>> 0.0244568 >>> 15 32 17936 17904 4.66171 0.109375 >>> 0.00375799 >>> 0.0245545 >>> 16 32 18279 18247 4.45409 1.33984 >>> 0.00483873 >>> 0.0267929 >>> 17 32 18372 18340 4.21346 0.363281 >>> 0.00505187 >>> 0.0275887 >>> 18 32 19403 19371 4.20309 4.02734 >>> 0.00545154 >>> 0.029348 >>> 19 31 19845 19814 4.07295 1.73047 >>> 0.00254726 >>> 0.0306775 >>> 2017-10-18 10:57:58.160536 min lat: 0.0015005 max lat: 2.27707 >>> avg >>> lat: 0.0307559 >>> sec Cur ops started finished avg MB/s cur MB/s last >>> lat(s) >>> avg lat(s) >>> 20 31 20401 20370 3.97788 2.17188 >>> 0.00307238 >>> 0.0307559 >>> 21 32 21338 21306 3.96254 3.65625 >>> 0.00464563 >>> 0.0312288 >>> 22 32 23057 23025 4.0876 6.71484 >>> 0.00296295 >>> 0.0299267 >>> 23 32 23057 23025 3.90988 0 >>> - >>> 0.0299267 >>> 24 32 23803 23771 3.86837 1.45703 >>> 0.00301471 >>> 0.0312804 >>> 25 32 24112 24080 3.76191 1.20703 >>> 0.00191063 >>> 0.0331462 >>> 26 31 25303 25272 3.79629 4.65625 >>> 0.00794399 >>> 0.0329129 >>> 27 32 28803 28771 4.16183 13.668 >>> 0.0109817 >>> 0.0297469 >>> 28 32 29592 29560 4.12325 3.08203 >>> 0.00188185 >>> 0.0301911 >>> 29 32 30595 30563 4.11616 3.91797 >>> 0.00379099 >>> 0.0296794 >>> 30 32 31031 30999 4.03572 1.70312 >>> 0.00283347 >>> 0.0302411 >>> Total time run: 30.822350 >>> Total writes made: 31032 >>> Write size: 4096 >>> Object size: 4096 >>> Bandwidth (MB/sec): 3.93282 >>> Stddev Bandwidth: 3.66265 >>> Max bandwidth (MB/sec): 13.668 >>> Min bandwidth (MB/sec): 0 >>> Average IOPS: 1006 >>> Stddev IOPS: 937 >>> Max IOPS: 3499 >>> Min IOPS: 0 >>> Average Latency(s): 0.0317779 >>> Stddev Latency(s): 0.164076 >>> Max latency(s): 2.27707 >>> Min latency(s): 0.0013848 >>> Cleaning up (deleting benchmark objects) >>> Clean up completed and total clean up time :20.166559 >>> >>> >>> >>> >>> On Wed, Oct 18, 2017 at 8:51 AM, Maged Mokhtar >>> <mmokh...@petasan.org> >>> wrote: >>> >>> First a general comment: local RAID will be faster than Ceph >>> for a >>> single threaded (queue depth=1) io operation test. A single >>> thread Ceph >>> client will see at best same disk speed for reads and for >>> writes 4-6 times >>> slower than single disk. Not to mention the latency of local >>> disks will >>> much better. Where Ceph shines is when you have many >>> concurrent ios, it >>> scales whereas RAID will decrease speed per client as you add >>> more. 
>>>
>>> Having said that, I would recommend running rados/rbd bench-write and measuring 4k iops at 1 and 32 threads to get a better idea of how your cluster performs:
>>>
>>> ceph osd pool create testpool 256 256
>>> rados bench -p testpool -b 4096 30 write -t 1
>>> rados bench -p testpool -b 4096 30 write -t 32
>>> ceph osd pool delete testpool testpool --yes-i-really-really-mean-it
>>>
>>> rbd bench-write test-image --io-threads=1 --io-size 4096 --io-pattern rand --rbd_cache=false
>>> rbd bench-write test-image --io-threads=32 --io-size 4096 --io-pattern rand --rbd_cache=false
>>>
>>> I think the request size difference you see is due to the io scheduler: in the case of local disks it has more ios to re-group, so it has a better chance of generating larger requests. Depending on your kernel, the io scheduler may be different for rbd (blk-mq) vs sdX (cfq), but again I would think the request size is a result, not a cause.
>>>
>>> Maged
>>>
>>> On 2017-10-17 23:12, Russell Glaue wrote:
>>>
>>> I am running ceph jewel on 5 nodes with SSD OSDs.
>>> I have an LVM image on a local RAID of spinning disks.
>>> I have an RBD image in a pool of SSD disks.
>>> Both disks are used to run an almost identical CentOS 7 system. Both systems were installed with the same kickstart, though the disk partitioning is different.
>>>
>>> I want to make writes on the ceph image faster. For example, lots of writes to MySQL (via MySQL replication) on a ceph SSD image are about 10x slower than on a spindle RAID disk image. The MySQL server on the ceph rbd image has a hard time keeping up in replication.
>>>
>>> So I wanted to test writes on these two systems. I have a 10GB compressed (gzip) file on both servers. I simply gunzip the file on both systems while running iostat.
>>>
>>> The primary difference I see in the results is the average size of the request to the disk. CentOS7-lvm-raid-sata writes a lot faster to disk, and the size of the request is about 40x larger, but the number of writes per second is about the same. This makes me want to conclude that the smaller request size on the CentOS7-ceph-rbd-ssd system is the cause of it being slow.
>>>
>>> How can I make the size of the request larger for ceph rbd images, so I can increase the write throughput? Would this be related to having jumbo packets enabled in my ceph storage network?
>>>
>>> Here is a sample of the results:
>>>
>>> [CentOS7-lvm-raid-sata]
>>> $ gunzip large10gFile.gz &
>>> $ iostat -x vg_root-lv_var -d 5 -m -N
>>> Device:         rrqm/s  wrqm/s     r/s     w/s   rMB/s   wMB/s avgrq-sz avgqu-sz  await r_await w_await  svctm  %util
>>> ...
>>> vg_root-lv_var    0.00    0.00   30.60  452.20   13.60  222.15  1000.04     8.69  14.05    0.99   14.93   2.07 100.04
>>> vg_root-lv_var    0.00    0.00   88.20  182.00   39.20   89.43   974.95     4.65   9.82    0.99   14.10   3.70 100.00
>>> vg_root-lv_var    0.00    0.00   75.45  278.24   33.53  136.70   985.73     4.36  33.26    1.34   41.91   0.59  20.84
>>> vg_root-lv_var    0.00    0.00  111.60  181.80   49.60   89.34   969.84     2.60   8.87    0.81   13.81   0.13   3.90
>>> vg_root-lv_var    0.00    0.00   68.40  109.60   30.40   53.63   966.87     1.51   8.46    0.84   13.22   0.80  14.16
>>> ...
>>>
>>> [CentOS7-ceph-rbd-ssd]
>>> $ gunzip large10gFile.gz &
>>> $ iostat -x vg_root-lv_data -d 5 -m -N
>>> Device:         rrqm/s  wrqm/s     r/s     w/s   rMB/s   wMB/s avgrq-sz avgqu-sz  await r_await w_await  svctm  %util
>>> ...
>>> vg_root-lv_data   0.00    0.00   46.40  167.80    0.88    1.46    22.36     1.23   5.66    2.47    6.54   4.52  96.82
>>> vg_root-lv_data   0.00    0.00   16.60   55.20    0.36    0.14    14.44     0.99  13.91    9.12   15.36  13.71  98.46
>>> vg_root-lv_data   0.00    0.00   69.00  173.80    1.34    1.32    22.48     1.25   5.19    3.77    5.75   3.94  95.68
>>> vg_root-lv_data   0.00    0.00   74.40  293.40    1.37    1.47    15.83     1.22   3.31    2.06    3.63   2.54  93.26
>>> vg_root-lv_data   0.00    0.00   90.80  359.00    1.96    3.41    24.45     1.63   3.63    1.94    4.05   2.10  94.38
>>> ...
>>>
>>> [iostat key]
>>> w/s == The number (after merges) of write requests completed per second for the device.
>>> wMB/s == The number of megabytes written to the device per second.
>>> avgrq-sz == The average size (in sectors) of the requests that were issued to the device.
>>> avgqu-sz == The average queue length of the requests that were issued to the device.
>>>
>>> _______________________________________________
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>> --
>>> Christian Balzer        Network/Systems Engineer
>>> ch...@gol.com           Rakuten Communications

--
Brian Andrus | Cloud Systems Engineer | DreamHost
brian.and...@dreamhost.com | www.dreamhost.com
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com