I would be interested in seeing the results from the post mentioned by an earlier contributor:
https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/

Test an "old" M500 and a "new" M500 and see whether the performance is (a) acceptable and (b) comparable. If it is acceptable on one but differs between the two, check for hardware or firmware revision differences. If the "old" device doesn't test well in fio/dd testing, then the drives are (as expected) not a great choice for journals, and you might want to look at hardware/backplane/RAID configuration differences that are somehow allowing the older systems to perform adequately.

On Fri, Oct 27, 2017 at 12:36 PM, Russell Glaue <rgl...@cait.org> wrote:
> Yes, all the M500s we use hold both journal and OSD, even the older ones. We have a 3-year lifecycle and move older nodes from one ceph cluster to another. On old systems with 3-year-old M500s, they run as RAID0 and run faster than our current problem system with 1-year-old M500s, which run as non-RAID pass-through on the controller.
>
> All disks are SATA and are connected to a SAS controller. We were wondering if the SAS/SATA conversion is an issue. Yet the older systems don't exhibit a problem.
>
> I found out from a colleague that when the current ceph cluster was put together, the SSDs tested at 300+ MB/s, yet the ceph cluster writes at 30 MB/s.
>
> Using SMART tools, the reserved cells in all drives are at nearly 100%.
>
> Restarting the OSDs improved performance slightly. Still betting on hardware issues that a firmware upgrade may resolve.
>
> -RG
>
> On Oct 27, 2017 1:14 PM, "Brian Andrus" <brian.and...@dreamhost.com> wrote:
>
> @Russell, are your "older Crucial M500"s being used as journals?
>
> Crucial M500s are not to be used as a Ceph journal, in my experience with them. They make good OSDs with an NVMe in front of them, perhaps, but not much else.
>
> Ceph uses O_DSYNC for journal writes, and these drives do not handle them as expected. It's been many years since I dealt with the M500s specifically, but it has to do with the capacitor/power-save feature and how the drive handles those types of writes. I'm sorry I don't have the emails with the specifics around anymore, but as far as I remember this was a hardware issue and could not be resolved with firmware.
>
> Paging Kyle Bader...
>
> On Fri, Oct 27, 2017 at 9:24 AM, Russell Glaue <rgl...@cait.org> wrote:
>
>> We have older Crucial M500 disks operating without such problems, so I have to believe it is a hardware or firmware issue. And it's peculiar to see performance boost slightly, even 24 hours later, when I stop and then start the OSDs.
>>
>> Our actual writes are low, as most of our Ceph-cluster-based images are low-write, high-memory. So a 20GB/day write endurance is a non-issue for us; only write speed is the concern. Our write-intensive images are locked on non-ceph disks.
>>
>> What are others using for SSD drives in their Ceph clusters? At 0.50+ DWPD (Drive Writes Per Day), the Kingston SEDC400S37 models seem to be the best for the price today.
>>
>> On Fri, Oct 27, 2017 at 6:34 AM, Maged Mokhtar <mmokh...@petasan.org> wrote:
>>
>>> It is quite likely related; things are pointing to bad disks. Probably the best thing is to plan for disk replacement, the sooner the better, as it could get worse.
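For reference, the journal test in that blog post boils down to a single-job, queue-depth-1 O_DSYNC write against the raw device. Roughly, with /dev/sdX as a placeholder for a spare or blank SSD (the test is destructive to data on that device):

  # quick dd variant: direct, synchronous 4k writes
  dd if=/dev/zero of=/dev/sdX bs=4k count=100000 oflag=direct,dsync

  # fio variant: sustained 4k sync writes for 60 seconds
  fio --filename=/dev/sdX --direct=1 --sync=1 --rw=write --bs=4k --numjobs=1 --iodepth=1 --runtime=60 --time_based --group_reporting --name=journal-test

Journal-class SSDs typically sustain thousands or more of these sync-write IOPS; consumer drives that cannot honor O_DSYNC cheaply often collapse to a few hundred.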
>>>
>>> On 2017-10-27 02:22, Christian Wuerdig wrote:
>>>
>>> Hm, not necessarily directly related to your performance problem, however: these SSDs have a listed endurance of 72TB total data written. Over a 5-year period that's 40GB a day, or approximately 0.04 DWPD. Given that you run the journal for each OSD on the same disk, that's effectively at most 0.02 DWPD (about 20GB per day per disk). I don't know many who'd run a cluster on disks like those. It also means these are pure consumer drives, which have a habit of exhibiting erratic performance at times (based on unquantified anecdotal personal experience with other consumer-model SSDs). I wouldn't touch these with a long stick for anything but small toy-test clusters.
>>>
>>> On Fri, Oct 27, 2017 at 3:44 AM, Russell Glaue <rgl...@cait.org> wrote:
>>>
>>> On Wed, Oct 25, 2017 at 7:09 PM, Maged Mokhtar <mmokh...@petasan.org> wrote:
>>>
>>> It depends on what stage you are in: in production, probably the best thing is to set up a monitoring tool (collectd/graphite/prometheus/grafana) to monitor both ceph stats as well as resource load. This will, among other things, show you if you have slowing disks.
>>>
>>> I am monitoring Ceph performance with ceph-dash (http://cephdash.crapworks.de/), which is why I knew to look into the slow writes issue. And I am using Monitorix (http://www.monitorix.org/) to monitor system resources, including disk I/O.
>>>
>>> However, though I can monitor individual disk performance at the system level, it seems Ceph does not tax any disk more than the worst disk, so in my monitoring charts all disks show the same performance. All four nodes baseline at 50 writes/sec during the cluster's normal load, with the non-problem hosts spiking up to 150 and the problem host only spiking up to 100. But during the window of time I took the problem host's OSDs down to run the bench tests, the OSDs on the other nodes increased to 300-500 writes/sec. Otherwise, the chart looks the same for all disks on all ceph nodes/hosts.
>>>
>>> Before production you should first make sure your SSDs are suitable for Ceph, either by being recommended by other Ceph users or by testing them yourself for sync-write performance using the fio tool as outlined earlier. Then, after you build your cluster, you can use rados and/or rbd benchmark tests to benchmark your cluster and find bottlenecks using atop/sar/collectl, which will help you tune your cluster.
>>>
>>> All 36 OSDs are: Crucial_CT960M500SSD1
>>>
>>> Rados bench tests were done at the beginning. The speed was much faster than it is now. I cannot recall the test results; someone else on my team ran them. Until recently, before I posted here, I had thought the slow disk problem was a configuration issue with Ceph. Now we are hoping it may be resolved with a firmware update. (If it is firmware related, rebooting the problem node may temporarily resolve this.)
>>>
>>> Though you did see better improvements, your cluster with 27 SSDs should give much higher numbers than 3k iops. If you are running rados bench while you have other client IOs, then obviously the number reported by the tool will be less than what the cluster is actually delivering, which you can find out via the ceph status command; it will print the total cluster throughput and iops.
>>> If the total is still low I would recommend running the fio raw disk test; maybe the disks are not suitable. When you removed your 9 bad disks from the 36 and your performance doubled, you still had 2 other disks slowing you down, meaning near 100% busy? It makes me feel the disk type used is not good. For these near-100%-busy disks, can you also measure their raw disk iops at that load? (I am not sure atop shows this; if not, use sar/sysstat/iostat/collectl.)
>>>
>>> I ran another bench test today with all 36 OSDs up. The overall performance was improved slightly compared to the original tests. Only 3 OSDs on the problem host were increasing to 101% disk busy. The iops reported from ceph status during this bench test ranged from 1.6k to 3.3k, the test yielding 4k iops.
>>>
>>> Yes, the two other OSDs/disks that were the bottleneck were at 101% disk busy. The other OSD disks on the same host were sailing along at around 50-60% busy.
>>>
>>> All 36 OSD disks are exactly the same disk. They were all purchased at the same time, and all were installed at the same time. I cannot believe it is a problem with the disk model. A failed/bad disk, perhaps, is possible. But the disk model itself cannot be the problem based on what I am seeing. If I am seeing bad performance on all disks on one ceph node/host, but not on another ceph node with these same disks, it has to be some other factor. This is why I am now guessing a firmware upgrade is needed.
>>>
>>> Also, as I alluded to here earlier, I took down all 9 OSDs in the problem host yesterday to run the bench test. Today, with those 9 OSDs back online, I reran the bench test and am seeing 2-3 OSD disks at 101% busy on the problem host, while the other disks are lower than 80%. So, for whatever reason, shutting down the OSDs and starting them back up allowed many (not all) of the OSDs' performance to improve on the problem host.
>>>
>>> Maged
>>>
>>> On 2017-10-25 23:44, Russell Glaue wrote:
>>>
>>> Thanks to all. I took the OSDs down on the problem host, without shutting down the machine. As predicted, our MB/s roughly doubled. Using this bench/atop procedure, I found two other OSDs on another host that are the next bottlenecks.
>>>
>>> Is this the only good way to really test the performance of the drives as OSDs? Is there any other way?
>>>
>>> While running the bench on all 36 OSDs, the 9 problem OSDs stuck out. But the two new problem OSDs I just discovered in this recent test of 27 OSDs did not stick out at all, because the ceph bench distributes the load, so only the very worst performers show up in atop. So ceph is as slow as your slowest drive.
>>>
>>> It would be really great if I could run the bench test and somehow get the bench to use only certain OSDs during the test. Then I could run the test avoiding the OSDs that I already know are a problem, so I can find the next worst OSD.
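One way to get a per-OSD number without involving the rest of the cluster is the OSD's built-in bench command, which writes only to that OSD's backing store and reports throughput. A minimal sketch (osd.12 is a placeholder id, and this does add write load to that one disk):

  # default run: roughly 1 GiB written in 4 MiB blocks to a single OSD
  ceph tell osd.12 bench

Running it against a known-good OSD and a suspect OSD and comparing the reported throughput won't exercise the full replicated write path, but it does isolate the drive and controller behind a single OSD.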
>>>
>>> [ the bench test ]
>>> rados bench -p scbench -b 4096 30 write -t 32
>>>
>>> [ original results with all 36 OSDs ]
>>> Total time run:         30.822350
>>> Total writes made:      31032
>>> Write size:             4096
>>> Object size:            4096
>>> Bandwidth (MB/sec):     3.93282
>>> Stddev Bandwidth:       3.66265
>>> Max bandwidth (MB/sec): 13.668
>>> Min bandwidth (MB/sec): 0
>>> Average IOPS:           1006
>>> Stddev IOPS:            937
>>> Max IOPS:               3499
>>> Min IOPS:               0
>>> Average Latency(s):     0.0317779
>>> Stddev Latency(s):      0.164076
>>> Max latency(s):         2.27707
>>> Min latency(s):         0.0013848
>>> Cleaning up (deleting benchmark objects)
>>> Clean up completed and total clean up time: 20.166559
>>>
>>> [ after stopping all 9 OSDs on the problem host ]
>>> Total time run:         32.586830
>>> Total writes made:      59491
>>> Write size:             4096
>>> Object size:            4096
>>> Bandwidth (MB/sec):     7.13131
>>> Stddev Bandwidth:       9.78725
>>> Max bandwidth (MB/sec): 29.168
>>> Min bandwidth (MB/sec): 0
>>> Average IOPS:           1825
>>> Stddev IOPS:            2505
>>> Max IOPS:               7467
>>> Min IOPS:               0
>>> Average Latency(s):     0.0173691
>>> Stddev Latency(s):      0.21634
>>> Max latency(s):         6.71283
>>> Min latency(s):         0.00107473
>>> Cleaning up (deleting benchmark objects)
>>> Clean up completed and total clean up time: 16.269393
>>>
>>> On Fri, Oct 20, 2017 at 1:35 PM, Russell Glaue <rgl...@cait.org> wrote:
>>>
>>> On the machine in question, the 2nd newest, we are using the LSI MegaRAID SAS-3 3008 [Fury], which gives us a "Non-RAID" option and has no battery. The older two use the LSI MegaRAID SAS 2208 [Thunderbolt] I reported earlier, with each single drive configured as RAID0.
>>>
>>> Thanks for everyone's help. I am going to run a 32-thread bench test after taking the 2nd machine out of the cluster with noout. Once it is out of the cluster, I expect the slow write issue will not surface.
>>>
>>> On Fri, Oct 20, 2017 at 5:27 AM, David Turner <drakonst...@gmail.com> wrote:
>>>
>>> I can attest that the battery in the raid controller is a thing. I'm used to using LSI controllers, but my current position has HP raid controllers, and the 10 nodes we just tracked down with near-constant >100ms await turned out to be the only 10 nodes in the cluster with failed batteries on their raid controllers.
>>>
>>> On Thu, Oct 19, 2017, 8:15 PM Christian Balzer <ch...@gol.com> wrote:
>>>
>>> Hello,
>>>
>>> On Thu, 19 Oct 2017 17:14:17 -0500 Russell Glaue wrote:
>>>
>>> That is a good idea. However, a previous rebalancing process brought performance of our guest VMs to a slow drag.
>>>
>>> Never mind that I'm not sure these SSDs are particularly well suited for Ceph; your problem is clearly located on that one node.
>>>
>>> Not that I think it's the case, but make sure your PG distribution is not skewed, with many more PGs per OSD on that node.
>>>
>>> Once you rule that out, my first guess is the RAID controller. You're running the SSDs as single RAID0s, I presume? If so, either a configuration difference or a failed BBU on the controller could result in the writeback cache being disabled, which would explain things beautifully.
>>>
>>> As for a temporary test/fix (with reduced redundancy, of course), set noout (or mon_osd_down_out_subtree_limit accordingly) and turn the slow host off.
>>>
>>> This should result in much better performance than you have now and of course be the final confirmation of that host being the culprit.
>>>
>>> Christian
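The two checks Christian describes above map to a couple of quick commands; nothing here is specific to this cluster:

  # per-OSD utilization and PG counts - look for a host whose OSDs carry noticeably more PGs
  ceph osd df tree

  # keep CRUSH from marking the stopped OSDs out and rebalancing while the slow host is down
  ceph osd set noout
  # ... stop the OSDs on the suspect host and re-run the benchmark ...
  ceph osd unset noout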
>>> On Thu, Oct 19, 2017 at 3:55 PM, Jean-Charles Lopez <jelo...@redhat.com> wrote:
>>>
>>> Hi Russell,
>>>
>>> As you have 4 servers, and assuming you are not doing EC pools, just stop all the OSDs on the second, questionable server, mark the OSDs on that server as out, let the cluster rebalance, and when all PGs are active+clean just replay the test.
>>>
>>> All IOs should then go only to the other 3 servers.
>>>
>>> JC
>>>
>>> On Oct 19, 2017, at 13:49, Russell Glaue <rgl...@cait.org> wrote:
>>>
>>> No, I have not ruled out the disk controller and backplane making the disks slower. Is there a way I could test that theory, other than swapping out hardware?
>>> -RG
>>>
>>> On Thu, Oct 19, 2017 at 3:44 PM, David Turner <drakonst...@gmail.com> wrote:
>>>
>>> Have you ruled out the disk controller and backplane in the server running slower?
>>>
>>> On Thu, Oct 19, 2017 at 4:42 PM Russell Glaue <rgl...@cait.org> wrote:
>>>
>>> I ran the test on the Ceph pool, and ran atop on all 4 storage servers, as suggested.
>>>
>>> Out of the 4 servers: 3 of them performed with 17% to 30% disk %busy and 11% CPU wait, momentarily spiking up to 50% on one server and 80% on another. The 2nd newest server was averaging almost 90% disk %busy and 150% CPU wait, and more than momentarily spiking to 101% disk busy and 250% CPU wait. For this 2nd newest server, these were the statistics for about 8 of 9 disks, with the 9th disk not far behind the others.
>>>
>>> I cannot believe all 9 disks are bad. They are the same disks as in the newest, 1st server, Crucial_CT960M500SSD1, and the same exact server hardware too. They were purchased at the same time in the same purchase order and arrived at the same time. So I cannot believe I just happened to put 9 bad disks in one server and 9 good ones in the other.
>>>
>>> I know I have Ceph configured exactly the same on all servers, and I am sure I have the hardware settings configured exactly the same on the 1st and 2nd servers. So if I were someone else, I would say it may be bad hardware on the 2nd server. But the 2nd server is running very well without any hint of a problem.
>>>
>>> Any other ideas or suggestions?
>>>
>>> -RG
>>>
>>> On Wed, Oct 18, 2017 at 3:40 PM, Maged Mokhtar <mmokh...@petasan.org> wrote:
>>>
>>> Just run the same 32-threaded rados test as you did before, and this time run atop while the test is running, looking for %busy of cpu/disks. It should give an idea of whether there is a bottleneck in them.
>>>
>>> On 2017-10-18 21:35, Russell Glaue wrote:
>>>
>>> I cannot run the write test reviewed at the ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device blog; the tests write directly to the raw disk device. Reading an infile (created with urandom) on one SSD and writing the outfile to another SSD yields about 17MB/s. But isn't this write speed limited by the speed at which the dd infile can be read? And I assume the best test should be run with no other load.
>>>
>>> How does one run the rados bench "as stress"?
>>>
>>> -RG
>>>
>>> On Wed, Oct 18, 2017 at 1:33 PM, Maged Mokhtar <mmokh...@petasan.org> wrote:
>>>
>>> Measuring resource load as outlined earlier will show whether the drives are performing well or not. Also, how many osds do you have?
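Measuring that resource load during the 32-thread rados bench can be as simple as watching extended device stats on each OSD host while the test runs (intervals and tools here are just examples):

  # refresh extended per-device statistics every 5 seconds; %util pinned near 100 marks a saturated disk
  iostat -x 5

  # or interactively, with atop highlighting busy disks and CPU wait
  atop 5

One or two devices sitting at ~100% busy while their peers on the same host idle along at much lower utilization is usually the signature of a failing or misbehaving drive rather than a cluster-wide configuration problem.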
>>>
>>> On 2017-10-18 19:26, Russell Glaue wrote:
>>>
>>> The SSD drives are Crucial M500. A Ceph user did some benchmarks and found it had good performance:
>>> https://forum.proxmox.com/threads/ceph-bad-performance-in-qemu-guests.21551/
>>>
>>> However, a user comment from 3 years ago on the blog post you linked to says to avoid the Crucial M500.
>>>
>>> Yet this performance posting says that the Crucial M500 is good:
>>> https://inside.servers.com/ssd-performance-2017-c4307a92dea
>>>
>>> On Wed, Oct 18, 2017 at 11:53 AM, Maged Mokhtar <mmokh...@petasan.org> wrote:
>>>
>>> Check out the following link: some SSDs perform badly in Ceph due to sync writes to the journal:
>>> https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
>>>
>>> Another thing that can help is to re-run the rados 32-thread test as stress and view resource usage using atop (or collectl/sar) to check %busy for cpu and disks, to give you an idea of what is holding down your cluster. For example: if cpu/disk % are all low, then check your network/switches. If disk %busy is high (90%) for all disks, then your disks are the bottleneck, which either means you have SSDs that are not suitable for Ceph or you have too few disks (which I doubt is the case). If only 1 disk's %busy is high, there may be something wrong with that disk and it should be removed.
>>>
>>> Maged
>>>
>>> On 2017-10-18 18:13, Russell Glaue wrote:
>>>
>>> In my previous post, in one of my points, I was wondering if the request size would increase if I enabled jumbo packets; currently they are disabled.
>>>
>>> @jdillama: The qemu settings for both of these two guest machines, with RAID/LVM and Ceph/rbd images, are the same. I am not thinking that changing the qemu settings of "min_io_size=<limited to 16bits>,opt_io_size=<RBD image object size>" will directly address the issue.
>>>
>>> @mmokhtar: OK. So you suggest the request size is the result of the problem and not the cause of the problem, meaning I should go after a different issue.
>>>
>>> I have been trying to get write speeds up to what people on this mail list are discussing. It seems that for our configuration, as it matches others, we should be getting about 70MB/s write speed. But we are not getting that. Single writes to disk are lucky to get 5MB/s to 6MB/s, and are typically 1MB/s to 2MB/s. Monitoring the entire Ceph cluster (using http://cephdash.crapworks.de/), I have seen very rare momentary spikes up to 30MB/s.
>>>
>>> My storage network is connected via a 10Gb switch.
>>> I have 4 storage servers with an LSI Logic MegaRAID SAS 2208 controller.
>>> Each storage server has 9 1TB SSD drives, each drive as 1 osd (no RAID).
>>> Each drive is one LVM group with two volumes - one volume for the osd, one volume for the journal.
>>> Each osd is formatted with xfs.
>>> The crush map is simple: default->rack->[host[1..4]->osd] with an evenly distributed weight.
>>> The redundancy is triple replication.
>>>
>>> While I have read comments that having the osd and journal on the same disk decreases write speed, I have also read that once past 8 OSDs per node this is the recommended configuration; however, this is also the reason why SSD drives are used exclusively for OSDs in the storage nodes.
>>> None-the-less, I was still expecting write speeds to be above >>> 30MB/s, >>> not below 6MB/s. >>> Even at 12x slower than the RAID, using my previously posted >>> iostat >>> data set, I should be seeing write speeds that average 10MB/s, >>> not 2MB/s. >>> >>> In regards to the rados benchmark tests you asked me to run, >>> here is >>> the output: >>> >>> [centos7]# rados bench -p scbench -b 4096 30 write -t 1 >>> Maintaining 1 concurrent writes of 4096 bytes to objects of >>> size 4096 >>> for up to 30 seconds or 0 objects >>> Object prefix: benchmark_data_hamms.sys.cu.cait.org_85049 >>> sec Cur ops started finished avg MB/s cur MB/s last >>> lat(s) >>> avg lat(s) >>> 0 0 0 0 0 0 >>> - >>> 0 >>> 1 1 201 200 0.78356 0.78125 >>> 0.00522307 >>> 0.00496574 >>> 2 1 469 468 0.915303 1.04688 >>> 0.00437497 >>> 0.00426141 >>> 3 1 741 740 0.964371 1.0625 >>> 0.00512853 >>> 0.0040434 >>> 4 1 888 887 0.866739 0.574219 >>> 0.00307699 >>> 0.00450177 >>> 5 1 1147 1146 0.895725 1.01172 >>> 0.00376454 >>> 0.0043559 >>> 6 1 1325 1324 0.862293 0.695312 >>> 0.00459443 >>> 0.004525 >>> 7 1 1494 1493 0.83339 0.660156 >>> 0.00461002 >>> 0.00458452 >>> 8 1 1736 1735 0.847369 0.945312 >>> 0.00253971 >>> 0.00460458 >>> 9 1 1998 1997 0.866922 1.02344 >>> 0.00236573 >>> 0.00450172 >>> 10 1 2260 2259 0.882563 1.02344 >>> 0.00262179 >>> 0.00442152 >>> 11 1 2526 2525 0.896775 1.03906 >>> 0.00336914 >>> 0.00435092 >>> 12 1 2760 2759 0.898203 0.914062 >>> 0.00351827 >>> 0.00434491 >>> 13 1 3016 3015 0.906025 1 >>> 0.00335703 >>> 0.00430691 >>> 14 1 3257 3256 0.908545 0.941406 >>> 0.00332344 >>> 0.00429495 >>> 15 1 3490 3489 0.908644 0.910156 >>> 0.00318815 >>> 0.00426387 >>> 16 1 3728 3727 0.909952 0.929688 >>> 0.0032881 >>> 0.00428895 >>> 17 1 3986 3985 0.915703 1.00781 >>> 0.00274809 >>> 0.0042614 >>> 18 1 4250 4249 0.922116 1.03125 >>> 0.00287411 >>> 0.00423214 >>> 19 1 4505 4504 0.926003 0.996094 >>> 0.00375435 >>> 0.00421442 >>> 2017-10-18 10:56:31.267173 min lat: 0.00181259 max lat: >>> 0.270553 avg >>> lat: 0.00420118 >>> sec Cur ops started finished avg MB/s cur MB/s last >>> lat(s) >>> avg lat(s) >>> 20 1 4757 4756 0.928915 0.984375 >>> 0.00463972 >>> 0.00420118 >>> 21 1 5009 5008 0.93155 0.984375 >>> 0.00360065 >>> 0.00418937 >>> 22 1 5235 5234 0.929329 0.882812 >>> 0.00626214 >>> 0.004199 >>> 23 1 5500 5499 0.933925 1.03516 >>> 0.00466584 >>> 0.00417836 >>> 24 1 5708 5707 0.928861 0.8125 >>> 0.00285727 >>> 0.00420146 >>> 25 0 5964 5964 0.931858 1.00391 >>> 0.00417383 >>> 0.0041881 >>> 26 1 6216 6215 0.933722 0.980469 >>> 0.0041009 >>> 0.00417915 >>> 27 1 6481 6480 0.937474 1.03516 >>> 0.00307484 >>> 0.00416118 >>> 28 1 6745 6744 0.940819 1.03125 >>> 0.00266329 >>> 0.00414777 >>> 29 1 7003 7002 0.943124 1.00781 >>> 0.00305905 >>> 0.00413758 >>> 30 1 7271 7270 0.946578 1.04688 >>> 0.00391017 >>> 0.00412238 >>> Total time run: 30.006060 >>> Total writes made: 7272 >>> Write size: 4096 >>> Object size: 4096 >>> Bandwidth (MB/sec): 0.946684 >>> Stddev Bandwidth: 0.123762 >>> Max bandwidth (MB/sec): 1.0625 >>> Min bandwidth (MB/sec): 0.574219 >>> Average IOPS: 242 >>> Stddev IOPS: 31 >>> Max IOPS: 272 >>> Min IOPS: 147 >>> Average Latency(s): 0.00412247 >>> Stddev Latency(s): 0.00648437 >>> Max latency(s): 0.270553 >>> Min latency(s): 0.00175318 >>> Cleaning up (deleting benchmark objects) >>> Clean up completed and total clean up time :29.069423 >>> >>> [centos7]# rados bench -p scbench -b 4096 30 write -t 32 >>> Maintaining 32 concurrent writes of 4096 bytes to objects of >>> size >>> 4096 for up to 30 
seconds or 0 objects >>> Object prefix: benchmark_data_hamms.sys.cu.cait.org_86076 >>> sec Cur ops started finished avg MB/s cur MB/s last >>> lat(s) >>> avg lat(s) >>> 0 0 0 0 0 0 >>> - >>> 0 >>> 1 32 3013 2981 11.6438 11.6445 >>> 0.00247906 >>> 0.00572026 >>> 2 32 5349 5317 10.3834 9.125 >>> 0.00246662 >>> 0.00932016 >>> 3 32 5707 5675 7.3883 1.39844 >>> 0.00389774 >>> 0.0156726 >>> 4 32 5895 5863 5.72481 0.734375 >>> 1.13137 >>> 0.0167946 >>> 5 32 6869 6837 5.34068 3.80469 >>> 0.0027652 >>> 0.0226577 >>> 6 32 8901 8869 5.77306 7.9375 >>> 0.0053211 >>> 0.0216259 >>> 7 32 10800 10768 6.00785 7.41797 >>> 0.00358187 >>> 0.0207418 >>> 8 32 11825 11793 5.75728 4.00391 >>> 0.00217575 >>> 0.0215494 >>> 9 32 12941 12909 5.6019 4.35938 >>> 0.00278512 >>> 0.0220567 >>> 10 32 13317 13285 5.18849 1.46875 >>> 0.0034973 >>> 0.0240665 >>> 11 32 16189 16157 5.73653 11.2188 >>> 0.00255841 >>> 0.0212708 >>> 12 32 16749 16717 5.44077 2.1875 >>> 0.00330334 >>> 0.0215915 >>> 13 32 16756 16724 5.02436 0.0273438 >>> 0.00338994 >>> 0.021849 >>> 14 32 17908 17876 4.98686 4.5 >>> 0.00402598 >>> 0.0244568 >>> 15 32 17936 17904 4.66171 0.109375 >>> 0.00375799 >>> 0.0245545 >>> 16 32 18279 18247 4.45409 1.33984 >>> 0.00483873 >>> 0.0267929 >>> 17 32 18372 18340 4.21346 0.363281 >>> 0.00505187 >>> 0.0275887 >>> 18 32 19403 19371 4.20309 4.02734 >>> 0.00545154 >>> 0.029348 >>> 19 31 19845 19814 4.07295 1.73047 >>> 0.00254726 >>> 0.0306775 >>> 2017-10-18 10:57:58.160536 min lat: 0.0015005 max lat: 2.27707 >>> avg >>> lat: 0.0307559 >>> sec Cur ops started finished avg MB/s cur MB/s last >>> lat(s) >>> avg lat(s) >>> 20 31 20401 20370 3.97788 2.17188 >>> 0.00307238 >>> 0.0307559 >>> 21 32 21338 21306 3.96254 3.65625 >>> 0.00464563 >>> 0.0312288 >>> 22 32 23057 23025 4.0876 6.71484 >>> 0.00296295 >>> 0.0299267 >>> 23 32 23057 23025 3.90988 0 >>> - >>> 0.0299267 >>> 24 32 23803 23771 3.86837 1.45703 >>> 0.00301471 >>> 0.0312804 >>> 25 32 24112 24080 3.76191 1.20703 >>> 0.00191063 >>> 0.0331462 >>> 26 31 25303 25272 3.79629 4.65625 >>> 0.00794399 >>> 0.0329129 >>> 27 32 28803 28771 4.16183 13.668 >>> 0.0109817 >>> 0.0297469 >>> 28 32 29592 29560 4.12325 3.08203 >>> 0.00188185 >>> 0.0301911 >>> 29 32 30595 30563 4.11616 3.91797 >>> 0.00379099 >>> 0.0296794 >>> 30 32 31031 30999 4.03572 1.70312 >>> 0.00283347 >>> 0.0302411 >>> Total time run: 30.822350 >>> Total writes made: 31032 >>> Write size: 4096 >>> Object size: 4096 >>> Bandwidth (MB/sec): 3.93282 >>> Stddev Bandwidth: 3.66265 >>> Max bandwidth (MB/sec): 13.668 >>> Min bandwidth (MB/sec): 0 >>> Average IOPS: 1006 >>> Stddev IOPS: 937 >>> Max IOPS: 3499 >>> Min IOPS: 0 >>> Average Latency(s): 0.0317779 >>> Stddev Latency(s): 0.164076 >>> Max latency(s): 2.27707 >>> Min latency(s): 0.0013848 >>> Cleaning up (deleting benchmark objects) >>> Clean up completed and total clean up time :20.166559 >>> >>> >>> >>> >>> On Wed, Oct 18, 2017 at 8:51 AM, Maged Mokhtar >>> <mmokh...@petasan.org> >>> wrote: >>> >>> First a general comment: local RAID will be faster than Ceph >>> for a >>> single threaded (queue depth=1) io operation test. A single >>> thread Ceph >>> client will see at best same disk speed for reads and for >>> writes 4-6 times >>> slower than single disk. Not to mention the latency of local >>> disks will >>> much better. Where Ceph shines is when you have many >>> concurrent ios, it >>> scales whereas RAID will decrease speed per client as you add >>> more. 
>>>
>>> Having said that, I would recommend running rados/rbd bench-write and measuring 4k iops at 1 and 32 threads to get a better idea of how your cluster performs:
>>>
>>> ceph osd pool create testpool 256 256
>>> rados bench -p testpool -b 4096 30 write -t 1
>>> rados bench -p testpool -b 4096 30 write -t 32
>>> ceph osd pool delete testpool testpool --yes-i-really-really-mean-it
>>>
>>> rbd bench-write test-image --io-threads=1 --io-size 4096 --io-pattern rand --rbd_cache=false
>>> rbd bench-write test-image --io-threads=32 --io-size 4096 --io-pattern rand --rbd_cache=false
>>>
>>> I think the request size difference you see is due to the io scheduler: in the case of local disks it has more ios to re-group, so it has a better chance of generating larger requests. Depending on your kernel, the io scheduler may be different for rbd (blk-mq) vs sdX (cfq), but again I would think the request size is a result, not a cause.
>>>
>>> Maged
>>>
>>> On 2017-10-17 23:12, Russell Glaue wrote:
>>>
>>> I am running ceph jewel on 5 nodes with SSD OSDs.
>>> I have an LVM image on a local RAID of spinning disks.
>>> I have an RBD image in a pool of SSD disks.
>>> Both disks are used to run an almost identical CentOS 7 system. Both systems were installed with the same kickstart, though the disk partitioning is different.
>>>
>>> I want to make writes on the ceph image faster. For example, lots of writes to MySQL (via MySQL replication) on a ceph SSD image are about 10x slower than on a spindle RAID disk image. The MySQL server on the ceph rbd image has a hard time keeping up in replication.
>>>
>>> So I wanted to test writes on these two systems. I have a 10GB compressed (gzip) file on both servers. I simply gunzip the file on both systems while running iostat.
>>>
>>> The primary difference I see in the results is the average size of the request to the disk. CentOS7-lvm-raid-sata writes a lot faster to disk, and the size of the request is about 40x larger, but the number of writes per second is about the same. This makes me want to conclude that the smaller request size on the CentOS7-ceph-rbd-ssd system is the cause of it being slow.
>>>
>>> How can I make the size of the request larger for ceph rbd images, so I can increase the write throughput? Would this be related to having jumbo packets enabled in my ceph storage network?
>>>
>>> Here is a sample of the results:
>>>
>>> [CentOS7-lvm-raid-sata]
>>> $ gunzip large10gFile.gz &
>>> $ iostat -x vg_root-lv_var -d 5 -m -N
>>> Device:         rrqm/s  wrqm/s     r/s     w/s   rMB/s   wMB/s avgrq-sz avgqu-sz  await r_await w_await  svctm  %util
>>> ...
>>> vg_root-lv_var    0.00    0.00   30.60  452.20   13.60  222.15  1000.04     8.69  14.05    0.99   14.93   2.07 100.04
>>> vg_root-lv_var    0.00    0.00   88.20  182.00   39.20   89.43   974.95     4.65   9.82    0.99   14.10   3.70 100.00
>>> vg_root-lv_var    0.00    0.00   75.45  278.24   33.53  136.70   985.73     4.36  33.26    1.34   41.91   0.59  20.84
>>> vg_root-lv_var    0.00    0.00  111.60  181.80   49.60   89.34   969.84     2.60   8.87    0.81   13.81   0.13   3.90
>>> vg_root-lv_var    0.00    0.00   68.40  109.60   30.40   53.63   966.87     1.51   8.46    0.84   13.22   0.80  14.16
>>> ...
>>>
>>> [CentOS7-ceph-rbd-ssd]
>>> $ gunzip large10gFile.gz &
>>> $ iostat -x vg_root-lv_data -d 5 -m -N
>>> Device:         rrqm/s  wrqm/s     r/s     w/s   rMB/s   wMB/s avgrq-sz avgqu-sz  await r_await w_await  svctm  %util
>>> ...
>>> vg_root-lv_data   0.00    0.00   46.40  167.80    0.88    1.46    22.36     1.23   5.66    2.47    6.54   4.52  96.82
>>> vg_root-lv_data   0.00    0.00   16.60   55.20    0.36    0.14    14.44     0.99  13.91    9.12   15.36  13.71  98.46
>>> vg_root-lv_data   0.00    0.00   69.00  173.80    1.34    1.32    22.48     1.25   5.19    3.77    5.75   3.94  95.68
>>> vg_root-lv_data   0.00    0.00   74.40  293.40    1.37    1.47    15.83     1.22   3.31    2.06    3.63   2.54  93.26
>>> vg_root-lv_data   0.00    0.00   90.80  359.00    1.96    3.41    24.45     1.63   3.63    1.94    4.05   2.10  94.38
>>> ...
>>>
>>> [iostat key]
>>> w/s == The number (after merges) of write requests completed per second for the device.
>>> wMB/s == The number of megabytes written to the device per second.
>>> avgrq-sz == The average size (in sectors) of the requests that were issued to the device.
>>> avgqu-sz == The average queue length of the requests that were issued to the device.
>>>
>>> _______________________________________________
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>> --
>>> Christian Balzer        Network/Systems Engineer
>>> ch...@gol.com           Rakuten Communications

--
Brian Andrus | Cloud Systems Engineer | DreamHost
brian.and...@dreamhost.com | www.dreamhost.com
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com