It is quite likely related; things are pointing to bad disks. Probably
the best thing is to plan for disk replacement, the sooner the better,
as it could get worse.
On 2017-10-27 02:22, Christian Wuerdig wrote:
> Hm, not necessarily directly related to your performance problem,
> however: These SSDs have a listed endurance of 72TB total data written
> - over a 5 year period that's 40GB a day or approx 0.04 DWPD. Given
> that you run the journal for each OSD on the same disk, that's
> effectively at most 0.02 DWPD (about 20GB per day per disk). I don't
> know many who'd run a cluster on disks like those. Also it means these
> are pure consumer drives which have a habit of exhibiting random
> performance at times (based on unquantified anecdotal personal
> experience with other consumer model SSDs). I wouldn't touch these
> with a long stick for anything but small toy-test clusters.
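>
> (A rough sketch of that endurance math, assuming the 960GB
> Crucial_CT960M500SSD1 drives named later in this thread:
>   72 TB / (5 * 365 days)  ~= 40 GB/day of rated writes per drive ~= 0.04 DWPD
> With the journal co-located on the same disk, every client write is
> written twice, so usable client writes are ~20 GB/day per disk ~= 0.02 DWPD.)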
>
> On Fri, Oct 27, 2017 at 3:44 AM, Russell Glaue <rgl...@cait.org> wrote:
> On Wed, Oct 25, 2017 at 7:09 PM, Maged Mokhtar <mmokh...@petasan.org> wrote:
> It depends on what stage you are in:
> in production, probably the best thing is to set up a monitoring tool
> (collectd/graphite/prometheus/grafana) to monitor both ceph stats as
> well as resource load. This will, among other things, show you if you
> have slowing disks.
> I am monitoring Ceph performance with ceph-dash
> (http://cephdash.crapworks.de/), that is why I knew to look into the slow
> writes issue. And I am using Monitorix (http://www.monitorix.org/) to
> monitor system resources, including Disk I/O.
>
> However, though I can monitor individual disk performance at the system
> level, it seems Ceph does not tax any disk more than the worst disk. So in
> my monitoring charts, all disks have the same performance.
> All four nodes are base-lining at 50 writes/sec during the cluster's normal
> load, with the non-problem hosts spiking up to 150, and the problem host
> only spikes up to 100.
> But during the window of time I took the problem host OSDs down to run the
> bench tests, the OSDs on the other nodes increased to 300-500 writes/sec.
> Otherwise, the chart looks the same for all disks on all ceph nodes/hosts.
>
> Before production you should first make sure your SSDs are suitable for
> Ceph, either by being recommended by other Ceph users or by testing them
> yourself for sync-write performance using the fio tool as outlined earlier.
> Then, after you build your cluster, you can use rados and/or rbd benchmark
> tests to benchmark your cluster and find bottlenecks using atop/sar/collectl,
> which will help you tune your cluster.
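>
> (As a hedged example of such a sync-write test, along the lines of the
> blog post linked later in this thread -- /dev/sdX is a placeholder and
> this writes to the raw device, so only run it on a disk holding no data:
>
> fio --filename=/dev/sdX --direct=1 --sync=1 --rw=write --bs=4k --numjobs=1 --iodepth=1 --runtime=60 --time_based --group_reporting --name=journal-test
>
> A journal-class SSD should sustain thousands of 4k sync-write iops here;
> consumer drives often drop to a few hundred.)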
> All 36 OSDs are: Crucial_CT960M500SSD1
>
> Rados bench tests were done at the beginning. The speed was much faster than
> it is now. I cannot recall the test results, someone else on my team ran
> them. Recently, I had thought the slow disk problem was a configuration
> issue with Ceph - before I posted here. Now we are hoping it may be resolved
> with a firmware update. (If it is firmware related, rebooting the problem
> node may temporarily resolve this)
>
> Though you did see better improvements, your cluster with 27 SSDs should
> give much higher numbers than 3k iops. If you are running rados bench while
> you have other client ios, then obviously the number reported by the tool
> will be less than what the cluster is actually delivering... which you can
> find out via the ceph status command; it will print the total cluster
> throughput and iops. If the total is still low I would recommend running the
> fio raw disk test; maybe the disks are not suitable. When you removed your 9
> bad disks from 36 and your performance doubled, you still had 2 other disks
> slowing you, meaning near 100% busy? That makes me feel the disk type used
> is not good. For those near-100%-busy disks, can you also measure their raw
> disk iops at that load (I am not sure atop shows this; if not, use
> sar/sysstat/iostat/collectl).
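>
> (For instance, something like the following on each node while the bench
> is running:
>
> iostat -x -d 5      # w/s column = raw write iops per device, %util = busy
> sar -d -p 5         # the same counters via sysstat
> )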
> I ran another bench test today with all 36 OSDs up. The overall performance
> was improved slightly compared to the original tests. Only 3 OSDs on the
> problem host were increasing to 101% disk busy.
> The iops reported from ceph status during this bench test ranged from 1.6k
> to 3.3k, the test yielding 4k iops.
>
> Yes, the two other OSDs/disks that were the bottleneck were at 101% disk
> busy. The other OSD disks on the same host were sailing along at like 50-60%
> busy.
>
> All 36 OSD disks are exactly the same disk. They were all purchased at the
> same time. All were installed at the same time.
> I cannot believe it is a problem with the disk model. A failed/bad disk,
> perhaps is possible. But the disk model itself cannot be the problem based
> on what I am seeing. If I am seeing bad performance on all disks on one ceph
> node/host, but not on another ceph node with these same disks, it has to be
> some other factor. This is why I am now guessing a firmware upgrade is
> needed.
>
> Also, as I alluded to earlier, I took down all 9 OSDs in the problem host
> yesterday to run the bench test.
> Today, with those 9 OSDs back online, I reran the bench test and am seeing
> 2-3 OSD disks at 101% busy on the problem host, while the other disks are
> below 80%. So, for whatever reason, shutting down the OSDs and starting
> them back up allowed the performance of many (not all) of the OSDs on the
> problem host to improve.
>
> Maged
>
> On 2017-10-25 23:44, Russell Glaue wrote:
>
> Thanks to all.
> I took the OSDs down in the problem host, without shutting down the
> machine.
> As predicted, our MB/s about doubled.
> Using this bench/atop procedure, I found two other OSDs on another host
> that are the next bottlenecks.
>
> Is this the only good way to really test the performance of the drives as
> OSDs? Is there any other way?
>
> While running the bench on all 36 OSDs, the 9 problem OSDs stuck out. But
> the two new problem OSDs I just discovered in this recent test of 27 OSDs
> did not stick out at all, because the bench distributes the load so that
> only the very worst performers show up in atop. So Ceph is as slow as your
> slowest drive.
>
> It would be really great if I could run the bench test and somehow get
> the bench to use only certain OSDs during the test. Then I could run the
> test avoiding the OSDs that I already know are a problem, so I can find
> the next worst OSD.
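>
> (One way to do that, as a hedged sketch and not something tested here:
> build a second CRUSH rule that only covers the hosts/OSDs you want to
> exercise, point a throwaway pool at it, and bench that pool. The rule and
> pool names below are made up:
>
> ceph osd getcrushmap -o crushmap.bin
> crushtool -d crushmap.bin -o crushmap.txt
> # edit crushmap.txt: add a root bucket containing only the hosts/OSDs to
> # test, plus a replicated rule that takes that root
> crushtool -c crushmap.txt -o crushmap.new
> ceph osd setcrushmap -i crushmap.new
> ceph osd pool create scbench-subset 128 128 replicated subset-rule
> rados bench -p scbench-subset -b 4096 30 write -t 32
>
> Injecting a new CRUSH map can trigger data movement if existing buckets
> are changed, so only add new ones and clean up the pool/rule afterwards.)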
>
> [ the bench test ]
> rados bench -p scbench -b 4096 30 write -t 32
>
> [ original results with all 36 OSDs ]
> Total time run: 30.822350
> Total writes made: 31032
> Write size: 4096
> Object size: 4096
> Bandwidth (MB/sec): 3.93282
> Stddev Bandwidth: 3.66265
> Max bandwidth (MB/sec): 13.668
> Min bandwidth (MB/sec): 0
> Average IOPS: 1006
> Stddev IOPS: 937
> Max IOPS: 3499
> Min IOPS: 0
> Average Latency(s): 0.0317779
> Stddev Latency(s): 0.164076
> Max latency(s): 2.27707
> Min latency(s): 0.0013848
> Cleaning up (deleting benchmark objects)
> Clean up completed and total clean up time :20.166559
>
> [ after stopping all of the OSDs (9) on the problem host ]
> Total time run: 32.586830
> Total writes made: 59491
> Write size: 4096
> Object size: 4096
> Bandwidth (MB/sec): 7.13131
> Stddev Bandwidth: 9.78725
> Max bandwidth (MB/sec): 29.168
> Min bandwidth (MB/sec): 0
> Average IOPS: 1825
> Stddev IOPS: 2505
> Max IOPS: 7467
> Min IOPS: 0
> Average Latency(s): 0.0173691
> Stddev Latency(s): 0.21634
> Max latency(s): 6.71283
> Min latency(s): 0.00107473
> Cleaning up (deleting benchmark objects)
> Clean up completed and total clean up time :16.269393
>
> On Fri, Oct 20, 2017 at 1:35 PM, Russell Glaue <rgl...@cait.org> wrote:
> On the machine in question, the 2nd newest, we are using the LSI MegaRAID
> SAS-3 3008 [Fury], which allows us a "Non-RAID" option, and has no battery.
> The older two use the LSI MegaRAID SAS 2208 [Thunderbolt] I reported
> earlier, each single drive configured as RAID0.
>
> Thanks for everyone's help.
> I am going to run a 32 thread bench test after taking the 2nd machine out
> of the cluster with noout.
> After it is out of the cluster, I am expecting the slow write issue will
> not surface.
>
> On Fri, Oct 20, 2017 at 5:27 AM, David Turner <drakonst...@gmail.com>
> wrote:
> I can attest that the battery in the raid controller is a thing. I'm
> used to using LSI controllers, but my current position has HP raid
> controllers, and we just found that the 10 nodes which pretty much always
> had >100ms await were the only 10 nodes in the cluster with failed
> batteries on their raid controllers.
>
> On Thu, Oct 19, 2017, 8:15 PM Christian Balzer <ch...@gol.com> wrote:
>
> Hello,
>
> On Thu, 19 Oct 2017 17:14:17 -0500 Russell Glaue wrote:
>
> That is a good idea.
> However, a previous rebalancing process brought the performance of our
> Guest VMs to a crawl.
>
> Never mind that I'm not sure these SSDs are particularly well suited
> for Ceph; your problem is clearly located on that one node.
>
> Not that I think it's the case, but make sure your PG distribution is not
> skewed, with many more PGs per OSD on that node.
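>
> (For example:
>
> ceph osd df tree    # per-OSD utilization and PG count, grouped by host
>
> makes such a skew easy to spot.)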
>
> Once you rule that out, my first guess is the RAID controller; you're
> running the SSDs as single RAID0s, I presume?
> If so, either a configuration difference or a failed BBU on the controller
> could result in the writeback cache being disabled, which would explain
> things beautifully.
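>
> (On the LSI controllers something like the following should show it; the
> binary may be MegaCli, MegaCli64 or storcli depending on the install:
>
> MegaCli64 -LDGetProp -Cache -LAll -aAll      # current cache policy per RAID0 LD
> MegaCli64 -AdpBbuCmd -GetBbuStatus -aAll     # BBU state
>
> "WriteThrough" on the logical drives would line up with a dead or missing
> BBU.)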
>
> As for a temporary test/fix (with reduced redundancy of course), set
> noout
> (or mon_osd_down_out_subtree_limit accordingly) and turn the slow host
> off.
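>
> (i.e., something along these lines, with the host/OSD ids being yours:
>
> ceph osd set noout
> # stop the ceph-osd daemons on the slow host (or power it off),
> # re-run the rados bench, then bring the host back and:
> ceph osd unset noout
> )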
>
> This should result in much better performance than you have now and of
> course be the final confirmation of that host being the culprit.
>
> Christian
>
> On Thu, Oct 19, 2017 at 3:55 PM, Jean-Charles Lopez
> <jelo...@redhat.com>
> wrote:
>
> Hi Russell,
>
> as you have 4 servers, and assuming you are not doing EC pools, just stop
> all the OSDs on the second, questionable server, mark the OSDs on that
> server as out, let the cluster rebalance, and when all PGs are
> active+clean just rerun the test.
>
> All IOs should then go only to the other 3 servers.
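>
> (A minimal sketch, assuming systemd OSD units and that you know the OSD
> ids on that server:
>
> systemctl stop ceph-osd@<id>     # for each OSD on the server
> ceph osd out <id>                # for each OSD on the server
> ceph -s                          # wait for all PGs active+clean, then rerun the bench
> )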
>
> JC
>
> On Oct 19, 2017, at 13:49, Russell Glaue <rgl...@cait.org> wrote:
>
> No, I have not ruled out the disk controller and backplane making
> the
> disks slower.
> Is there a way I could test that theory, other than swapping out
> hardware?
> -RG
>
> On Thu, Oct 19, 2017 at 3:44 PM, David Turner
> <drakonst...@gmail.com>
> wrote:
>
> Have you ruled out the disk controller and backplane in the server that
> is running slower?
>
> On Thu, Oct 19, 2017 at 4:42 PM Russell Glaue <rgl...@cait.org>
> wrote:
>
> I ran the test on the Ceph pool, and ran atop on all 4 storage
> servers,
> as suggested.
>
> Out of the 4 servers:
> 3 of them performed with 17% to 30% disk %busy and 11% CPU wait,
> momentarily spiking up to 50% on one server and 80% on another.
> The 2nd newest server was averaging almost 90% disk %busy and 150% CPU
> wait, and more than momentarily spiking to 101% disk busy and 250% CPU
> wait.
> For this 2nd newest server, these were the statistics for about 8 of its
> 9 disks, with the 9th disk not far behind the others.
>
> I cannot believe all 9 disks are bad.
> They are the same disks as in the newest (1st) server,
> Crucial_CT960M500SSD1, and the exact same server hardware too.
> They were purchased at the same time, in the same purchase order, and
> arrived at the same time.
> So I cannot believe I just happened to put 9 bad disks in one server and
> 9 good ones in the other.
>
> I know I have Ceph configured exactly the same on all servers,
> and I am sure I have the hardware settings configured exactly the same
> on the 1st and 2nd servers.
> So if I were someone else, I would say it may be bad hardware on the
> 2nd server.
> But the 2nd server is otherwise running very well, without any hint of a
> problem.
>
> Any other ideas or suggestions?
>
> -RG
>
> On Wed, Oct 18, 2017 at 3:40 PM, Maged Mokhtar
> <mmokh...@petasan.org>
> wrote:
>
> just run the same 32-threaded rados test as you did before, and this
> time run atop while the test is running, looking for %busy of the
> cpu/disks. It should give an idea of whether there is a bottleneck in
> them.
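>
> (e.g., from a client:
>
> rados bench -p scbench -b 4096 30 write -t 32
>
> and on each OSD node, in parallel:
>
> atop 2          # watch the DSK %busy and CPU wait lines
> )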
>
> On 2017-10-18 21:35, Russell Glaue wrote:
>
> I cannot run the write test reviewed at the
> ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device blog. The
> tests write directly to the raw disk device.
> Reading an infile (created with urandom) on one SSD and writing the
> outfile to another SSD yields about 17MB/s.
> But isn't this write speed limited by the speed at which the dd infile
> can be read?
> And I assume the best test should be run with no other load.
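>
> (One way around the read-side limit, as a sketch of the blog's approach:
> use /dev/zero as the input and force direct, synced writes. /dev/sdX is a
> placeholder and this is destructive to whatever is on that device:
>
> dd if=/dev/zero of=/dev/sdX bs=4k count=100000 oflag=direct,dsync
>
> That way the measurement is bounded only by the SSD's sync-write speed,
> not by how fast the infile can be read.)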
>
> How does one run the rados bench "as stress"?
>
> -RG
>
> On Wed, Oct 18, 2017 at 1:33 PM, Maged Mokhtar
> <mmokh...@petasan.org>
> wrote:
>
> measuring resource load as outlined earlier will show whether the drives
> are performing well or not. Also, how many OSDs do you have?
>
> On 2017-10-18 19:26, Russell Glaue wrote:
>
> The SSD drives are Crucial M500.
> A Ceph user did some benchmarks and found it had good performance:
> https://forum.proxmox.com/threads/ceph-bad-performance-in-qemu-guests.21551/
>
> However, a user comment from 3 years ago on the blog post you linked
> to says to avoid the Crucial M500.
>
> Yet this performance posting suggests that the Crucial M500 is good:
> https://inside.servers.com/ssd-performance-2017-c4307a92dea
>
> On Wed, Oct 18, 2017 at 11:53 AM, Maged Mokhtar
> <mmokh...@petasan.org>
> wrote:
>
> Check out the following link: some SSDs perform badly in Ceph due to
> sync writes to the journal:
>
> https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
>
> Another thing that can help is to re-run the rados 32-thread test as
> stress and view resource usage using atop (or collectl/sar) to check
> %busy cpu and %busy disks, to give you an idea of what is holding down
> your cluster. For example: if cpu/disk % are all low, then check your
> network/switches. If disk %busy is high (90%) for all disks, then your
> disks are the bottleneck, which either means you have SSDs that are not
> suitable for Ceph or you have too few disks (which I doubt is the case).
> If only one disk's %busy is high, there may be something wrong with that
> disk and it should be removed.
>
> Maged
>
> On 2017-10-18 18:13, Russell Glaue wrote:
>
> In my previous post, in one of my points I was wondering if the request
> size would increase if I enabled jumbo packets. Currently it is
> disabled.
>
> @jdillama: The qemu settings for both of these two guest machines, with
> RAID/LVM and Ceph/rbd images, are the same. I do not think that changing
> the qemu settings of "min_io_size=<limited to 16bits>,opt_io_size=<RBD
> image object size>" will directly address the issue.
>
> @mmokhtar: OK. So you suggest the request size is the result of the
> problem and not the cause of the problem, meaning I should go after a
> different issue.
>
> I have been trying to get write speeds up to what people on
> this mail
> list are discussing.
> It seems that for our configuration, as it matches others, we
> should
> be getting about 70MB/s write speed.
> But we are not getting that.
> Single writes to disk are lucky to get 5MB/s to 6MB/s, but are
> typically 1MB/s to 2MB/s.
> Monitoring the entire Ceph cluster (using
> http://cephdash.crapworks.de/), I have seen very rare
> momentary
> spikes up to 30MB/s.
>
> My storage network is connected via a 10Gb switch
> I have 4 storage servers with a LSI Logic MegaRAID SAS 2208
> controller
> Each storage server has 9 1TB SSD drives, each drive as 1 osd
> (no
> RAID)
> Each drive is one LVM group, with two volumes - one volume for
> the
> osd, one volume for the journal
> Each osd is formatted with xfs
> The crush map is simple: default->rack->[host[1..4]->osd] with
> an
> evenly distributed weight
> The redundancy is triple replication
>
> While I have read comments that having the OSD and journal on the same
> disk decreases write speed, I have also read that once past 8 OSDs per
> node this is the recommended configuration; it is also the reason why
> SSD drives are used exclusively for OSDs in the storage nodes.
> Nonetheless, I was still expecting write speeds to be above 30MB/s, not
> below 6MB/s.
> Even at 12x slower than the RAID, using my previously posted iostat data
> set, I should be seeing write speeds that average 10MB/s, not 2MB/s.
>
> In regards to the rados benchmark tests you asked me to run,
> here is
> the output:
>
> [centos7]# rados bench -p scbench -b 4096 30 write -t 1
> Maintaining 1 concurrent writes of 4096 bytes to objects of size 4096 for up to 30 seconds or 0 objects
> Object prefix: benchmark_data_hamms.sys.cu.cait.org_85049
> sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat(s)  avg lat(s)
>   0       0         0         0         0         0            -           0
>   1       1       201       200   0.78356   0.78125   0.00522307  0.00496574
>   2       1       469       468  0.915303   1.04688   0.00437497  0.00426141
>   3       1       741       740  0.964371    1.0625   0.00512853   0.0040434
>   4       1       888       887  0.866739  0.574219   0.00307699  0.00450177
>   5       1      1147      1146  0.895725   1.01172   0.00376454   0.0043559
>   6       1      1325      1324  0.862293  0.695312   0.00459443    0.004525
>   7       1      1494      1493   0.83339  0.660156   0.00461002  0.00458452
>   8       1      1736      1735  0.847369  0.945312   0.00253971  0.00460458
>   9       1      1998      1997  0.866922   1.02344   0.00236573  0.00450172
>  10       1      2260      2259  0.882563   1.02344   0.00262179  0.00442152
>  11       1      2526      2525  0.896775   1.03906   0.00336914  0.00435092
>  12       1      2760      2759  0.898203  0.914062   0.00351827  0.00434491
>  13       1      3016      3015  0.906025         1   0.00335703  0.00430691
>  14       1      3257      3256  0.908545  0.941406   0.00332344  0.00429495
>  15       1      3490      3489  0.908644  0.910156   0.00318815  0.00426387
>  16       1      3728      3727  0.909952  0.929688    0.0032881  0.00428895
>  17       1      3986      3985  0.915703   1.00781   0.00274809   0.0042614
>  18       1      4250      4249  0.922116   1.03125   0.00287411  0.00423214
>  19       1      4505      4504  0.926003  0.996094   0.00375435  0.00421442
> 2017-10-18 10:56:31.267173 min lat: 0.00181259 max lat: 0.270553 avg lat: 0.00420118
> sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat(s)  avg lat(s)
>  20       1      4757      4756  0.928915  0.984375   0.00463972  0.00420118
>  21       1      5009      5008   0.93155  0.984375   0.00360065  0.00418937
>  22       1      5235      5234  0.929329  0.882812   0.00626214    0.004199
>  23       1      5500      5499  0.933925   1.03516   0.00466584  0.00417836
>  24       1      5708      5707  0.928861    0.8125   0.00285727  0.00420146
>  25       0      5964      5964  0.931858   1.00391   0.00417383   0.0041881
>  26       1      6216      6215  0.933722  0.980469    0.0041009  0.00417915
>  27       1      6481      6480  0.937474   1.03516   0.00307484  0.00416118
>  28       1      6745      6744  0.940819   1.03125   0.00266329  0.00414777
>  29       1      7003      7002  0.943124   1.00781   0.00305905  0.00413758
>  30       1      7271      7270  0.946578   1.04688   0.00391017  0.00412238
> Total time run: 30.006060
> Total writes made: 7272
> Write size: 4096
> Object size: 4096
> Bandwidth (MB/sec): 0.946684
> Stddev Bandwidth: 0.123762
> Max bandwidth (MB/sec): 1.0625
> Min bandwidth (MB/sec): 0.574219
> Average IOPS: 242
> Stddev IOPS: 31
> Max IOPS: 272
> Min IOPS: 147
> Average Latency(s): 0.00412247
> Stddev Latency(s): 0.00648437
> Max latency(s): 0.270553
> Min latency(s): 0.00175318
> Cleaning up (deleting benchmark objects)
> Clean up completed and total clean up time :29.069423
>
> [centos7]# rados bench -p scbench -b 4096 30 write -t 32
> Maintaining 32 concurrent writes of 4096 bytes to objects of size 4096 for up to 30 seconds or 0 objects
> Object prefix: benchmark_data_hamms.sys.cu.cait.org_86076
> sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat(s)  avg lat(s)
>   0       0         0         0         0         0            -           0
>   1      32      3013      2981   11.6438   11.6445   0.00247906  0.00572026
>   2      32      5349      5317   10.3834     9.125   0.00246662  0.00932016
>   3      32      5707      5675    7.3883   1.39844   0.00389774   0.0156726
>   4      32      5895      5863   5.72481  0.734375      1.13137   0.0167946
>   5      32      6869      6837   5.34068   3.80469    0.0027652   0.0226577
>   6      32      8901      8869   5.77306    7.9375    0.0053211   0.0216259
>   7      32     10800     10768   6.00785   7.41797   0.00358187   0.0207418
>   8      32     11825     11793   5.75728   4.00391   0.00217575   0.0215494
>   9      32     12941     12909    5.6019   4.35938   0.00278512   0.0220567
>  10      32     13317     13285   5.18849   1.46875    0.0034973   0.0240665
>  11      32     16189     16157   5.73653   11.2188   0.00255841   0.0212708
>  12      32     16749     16717   5.44077    2.1875   0.00330334   0.0215915
>  13      32     16756     16724   5.02436 0.0273438   0.00338994    0.021849
>  14      32     17908     17876   4.98686       4.5   0.00402598   0.0244568
>  15      32     17936     17904   4.66171  0.109375   0.00375799   0.0245545
>  16      32     18279     18247   4.45409   1.33984   0.00483873   0.0267929
>  17      32     18372     18340   4.21346  0.363281   0.00505187   0.0275887
>  18      32     19403     19371   4.20309   4.02734   0.00545154    0.029348
>  19      31     19845     19814   4.07295   1.73047   0.00254726   0.0306775
> 2017-10-18 10:57:58.160536 min lat: 0.0015005 max lat: 2.27707 avg lat: 0.0307559
> sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat(s)  avg lat(s)
>  20      31     20401     20370   3.97788   2.17188   0.00307238   0.0307559
>  21      32     21338     21306   3.96254   3.65625   0.00464563   0.0312288
>  22      32     23057     23025    4.0876   6.71484   0.00296295   0.0299267
>  23      32     23057     23025   3.90988         0            -   0.0299267
>  24      32     23803     23771   3.86837   1.45703   0.00301471   0.0312804
>  25      32     24112     24080   3.76191   1.20703   0.00191063   0.0331462
>  26      31     25303     25272   3.79629   4.65625   0.00794399   0.0329129
>  27      32     28803     28771   4.16183    13.668    0.0109817   0.0297469
>  28      32     29592     29560   4.12325   3.08203   0.00188185   0.0301911
>  29      32     30595     30563   4.11616   3.91797   0.00379099   0.0296794
>  30      32     31031     30999   4.03572   1.70312   0.00283347   0.0302411
> Total time run: 30.822350
> Total writes made: 31032
> Write size: 4096
> Object size: 4096
> Bandwidth (MB/sec): 3.93282
> Stddev Bandwidth: 3.66265
> Max bandwidth (MB/sec): 13.668
> Min bandwidth (MB/sec): 0
> Average IOPS: 1006
> Stddev IOPS: 937
> Max IOPS: 3499
> Min IOPS: 0
> Average Latency(s): 0.0317779
> Stddev Latency(s): 0.164076
> Max latency(s): 2.27707
> Min latency(s): 0.0013848
> Cleaning up (deleting benchmark objects)
> Clean up completed and total clean up time :20.166559
>
> On Wed, Oct 18, 2017 at 8:51 AM, Maged Mokhtar
> <mmokh...@petasan.org>
> wrote:
>
> First a general comment: local RAID will be faster than Ceph for a
> single-threaded (queue depth=1) io operation test. A single-threaded Ceph
> client will see at best the same speed as a single disk for reads, and
> writes 4-6 times slower than a single disk. Not to mention the latency of
> local disks will be much better. Where Ceph shines is when you have many
> concurrent ios: it scales, whereas RAID will decrease speed per client as
> you add more clients.
>
> Having said that, I would recommend running rados/rbd bench-write and
> measuring 4k iops at 1 and 32 threads to get a better idea of how your
> cluster performs:
>
> ceph osd pool create testpool 256 256
> rados bench -p testpool -b 4096 30 write -t 1
> rados bench -p testpool -b 4096 30 write -t 32
> ceph osd pool delete testpool testpool --yes-i-really-really-mean-it
>
> rbd bench-write test-image --io-threads=1 --io-size 4096 --io-pattern rand --rbd_cache=false
> rbd bench-write test-image --io-threads=32 --io-size 4096 --io-pattern rand --rbd_cache=false
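>
> (The rbd bench-write lines assume a test image already exists; as a
> hedged example, with the size given in MB:
>
> rbd create testpool/test-image --size 10240
>
> and then pass "-p testpool" to the bench-write commands, or create the
> image in the default rbd pool.)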
>
> I think the request size difference you see is due to the io scheduler:
> in the case of local disks it has more ios to re-group, so it has a
> better chance of generating larger requests. Depending on your kernel,
> the io scheduler may be different for rbd (blk-mq) vs sdX (cfq), but
> again I would think the request size is a result, not a cause.
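>
> (You can see which scheduler each device uses; the device names here are
> just examples:
>
> cat /sys/block/rbd0/queue/scheduler
> cat /sys/block/sda/queue/scheduler
>
> A blk-mq device typically reports "none" here, while cfq/deadline show up
> for the traditional queue.)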
>
> Maged
>
> On 2017-10-17 23:12, Russell Glaue wrote:
>
> I am running ceph jewel on 5 nodes with SSD OSDs.
> I have an LVM image on a local RAID of spinning disks.
> I have an RBD image in a pool of SSD disks.
> Both disks are used to run an almost identical CentOS 7 system.
> Both systems were installed with the same kickstart, though the disk
> partitioning is different.
>
> I want to make writes on the ceph image faster. For example, lots of
> writes to MySQL (via MySQL replication) on a ceph SSD image are about 10x
> slower than on a spindle RAID disk image. The MySQL server on the ceph
> rbd image has a hard time keeping up in replication.
>
> So I wanted to test writes on these two systems
> I have a 10GB compressed (gzip) file on both servers.
> I simply gunzip the file on both systems, while running
> iostat.
>
> The primary difference I see in the results is the average size of the
> requests issued to the disk.
> CentOS7-lvm-raid-sata writes a lot faster to disk, and the request size
> is about 40x larger, but the number of writes per second is about the
> same.
> This makes me want to conclude that the smaller request size on the
> CentOS7-ceph-rbd-ssd system is the cause of it being slow.
>
> How can I make the size of the request larger for ceph rbd
> images,
> so I can increase the write throughput?
> Would this be related to having jumbo packets enabled in my
> ceph
> storage network?
>
> Here is a sample of the results:
>
> [CentOS7-lvm-raid-sata]
> $ gunzip large10gFile.gz &
> $ iostat -x vg_root-lv_var -d 5 -m -N
> Device:          rrqm/s  wrqm/s     r/s     w/s   rMB/s   wMB/s avgrq-sz avgqu-sz  await r_await w_await  svctm  %util
> ...
> vg_root-lv_var     0.00    0.00   30.60  452.20   13.60  222.15  1000.04     8.69  14.05    0.99   14.93   2.07 100.04
> vg_root-lv_var     0.00    0.00   88.20  182.00   39.20   89.43   974.95     4.65   9.82    0.99   14.10   3.70 100.00
> vg_root-lv_var     0.00    0.00   75.45  278.24   33.53  136.70   985.73     4.36  33.26    1.34   41.91   0.59  20.84
> vg_root-lv_var     0.00    0.00  111.60  181.80   49.60   89.34   969.84     2.60   8.87    0.81   13.81   0.13   3.90
> vg_root-lv_var     0.00    0.00   68.40  109.60   30.40   53.63   966.87     1.51   8.46    0.84   13.22   0.80  14.16
> ...
>
> [CentOS7-ceph-rbd-ssd]
> $ gunzip large10gFile.gz &
> $ iostat -x vg_root-lv_data -d 5 -m -N
> Device:          rrqm/s  wrqm/s     r/s     w/s   rMB/s   wMB/s avgrq-sz avgqu-sz  await r_await w_await  svctm  %util
> ...
> vg_root-lv_data    0.00    0.00   46.40  167.80    0.88    1.46    22.36     1.23   5.66    2.47    6.54   4.52  96.82
> vg_root-lv_data    0.00    0.00   16.60   55.20    0.36    0.14    14.44     0.99  13.91    9.12   15.36  13.71  98.46
> vg_root-lv_data    0.00    0.00   69.00  173.80    1.34    1.32    22.48     1.25   5.19    3.77    5.75   3.94  95.68
> vg_root-lv_data    0.00    0.00   74.40  293.40    1.37    1.47    15.83     1.22   3.31    2.06    3.63   2.54  93.26
> vg_root-lv_data    0.00    0.00   90.80  359.00    1.96    3.41    24.45     1.63   3.63    1.94    4.05   2.10  94.38
> ...
>
> [iostat key]
> w/s == The number (after merges) of write requests completed per second
> for the device.
> wMB/s == The number of megabytes written to the device per second.
> avgrq-sz == The average size (in sectors) of the requests that were
> issued to the device.
> avgqu-sz == The average queue length of the requests that were issued to
> the device.
>
--
Christian Balzer Network/Systems Engineer
ch...@gol.com Rakuten Communications
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com