On Wed, 23 Apr 2014 12:39:20 +0800 Indra Pramana wrote:

> Hi Christian,
> 
> Good day to you, and thank you for your reply.
> 
> On Tue, Apr 22, 2014 at 12:53 PM, Christian Balzer <ch...@gol.com> wrote:
> 
> > On Tue, 22 Apr 2014 02:45:24 +0800 Indra Pramana wrote:
> >
> > > Hi Christian,
> > >
> > > Good day to you, and thank you for your reply. :)  See my reply
> > > inline.
> > >
> > > On Mon, Apr 21, 2014 at 10:20 PM, Christian Balzer <ch...@gol.com>
> > wrote:
> > >
> > > >
> > > > Hello,
> > > >
> > > > On Mon, 21 Apr 2014 20:47:21 +0800 Indra Pramana wrote:
> > > >
> > > > > Dear all,
> > > > >
> > > > > I have a Ceph RBD cluster with around 31 OSDs running SSD drives,
> > > > > and I tried to use the benchmark tools recommended by Sebastien
> > > > > on his blog here:
> > > > >
> > > > How many OSDs per storage node and what is in those storage nodes
> > > > in terms of controller, CPU, RAM?
> > > >
> > >
> > > Each storage node has 4 OSDs in general, although I have one node
> > > with 6. Each OSD is on a 480 GB / 500 GB SSD drive (depending on the
> > > brand).
> > >
> > So I make that 7 or 8 nodes then?
> >
> 
> Sorry, I miscalculated earlier. I have a total of 26 OSDs in 6 hosts. All
> hosts have 4 OSDs in general, except one host which has 6 OSDs.
> 
Good to know.
That pretty much limits your cluster to about 3GB/s.
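
Back-of-the-envelope, assuming the ~250MB/s per SSD on SATA-2 I mention
below and the journal on the same SSD eating half of that:
---
26 OSDs x 250MB/s                 = ~6.5GB/s raw
halved by the co-located journals = ~3.25GB/s
---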

> 
> > > Each node has mainly SATA 2.0 controllers (newer ones use SATA 3.0),
> > > 4-core 3.3 GHz CPU, 16 GB of RAM.
> > >
> > That sounds good enough as far as memory and CPU are concerned.
> > The SATA-2 speed will limit you; I have some journal SSDs hanging off
> > SATA-2 and they can't get over 250MB/s, while they can get to 350MB/s
> > on SATA-3.
> >
> > >
> > > > > http://www.sebastien-han.fr/blog/2012/08/26/ceph-benchmarks/
> > > > >
> > > > Sebastien has done a great job with those; however, with Ceph being
> > > > such a fast-moving target, quite a bit of that information is
> > > > somewhat dated.
> > > >
> > > > > Our configuration:
> > > > >
> > > > > - Ceph version 0.67.7
> > > > That's also a bit dated.
> > > >
> > >
> > > Yes, decided to stick with the latest stable version of dumpling. Do
> > > you think upgrading to Emperor might help to improve performance?
> > >
> > Given that older versions of Ceph tend to get little support (bug fixes
> > backported) and that Firefly is around the corner I would suggest
> > moving to Emperor to rule out any problems with Dumpling, get
> > experience with inevitable cluster upgrades and have a smoother path
> > to Firefly when it comes out.
> >
> 
> Noted -- will consider upgrading to Emperor.
> 
> > > > > - 31 OSDs of 500 GB SSD drives each
> > > > > - Journal for each OSD is configured on the same SSD drive itself
> > > > > - Journal size 10 GB
> > > > >
> > > > > After doing some tests recommended on the article, I find out
> > > > > that generally:
> > > > >
> > > > > - Local disk benchmark tests using dd are fast, around 245 MB/s,
> > > > > since we are using SSDs.
> > > > > - Network benchmark tests using iperf and netcat is also fast, I
> > > > > can get around 9.9 Mbit/sec since we are using 10G network.
> > > >
> > > > I think you mean 9.9Gb/s there. ^o^
> > > >
> > >
> > > Yes, I meant 9.9 Gbit/sec. Sorry for the typo.
> > >
> > > > How many network ports per node, cluster network or not?
> > > >
> > >
> > > Each OSD node has 2 x 10 Gbps connections to our 10-gigabit switch,
> > > one for the client network and the other for the replication network
> > > between OSDs.
> > >
> > All very good and by the book.
> >
> > >
> > > > > However:
> > > > >
> > > > > - RADOS bench test (rados bench -p my_pool 300 write) on the
> > > > > whole cluster is slow, averaging around 112 MB/s for write.
> > > >
> > > > That command fires off a single thread, which is unlikely to be
> > > > able to saturate things.
> > > >
> > > > Try that with a "-t 32" before the time (300) and if that improves
> > > > things increase that value until it doesn't (probably around 128).
> > > >
> > >
> > > Using 32 concurrent writes, the result is below. The speed really
> > > fluctuates.
> > >
> > > Total time run:         64.317049
> > > Total writes made:      1095
> > > Write size:             4194304
> > > Bandwidth (MB/sec):     68.100
> > >
> > > Stddev Bandwidth:       44.6773
> > > Max bandwidth (MB/sec): 184
> > > Min bandwidth (MB/sec): 0
> > > Average Latency:        1.87761
> > > Stddev Latency:         1.90906
> > > Max latency:            9.99347
> > > Min latency:            0.075849
> > >
> > That is really weird, it should get faster, not slower. ^o^
> > I assume you've run this a number of times?
> >
> > Also my apologies, the default is 16 threads, not 1, but that still
> > isn't enough to get my cluster to full speed:
> > ---
> > Bandwidth (MB/sec):     349.044
> >
> > Stddev Bandwidth:       107.582
> > Max bandwidth (MB/sec): 408
> > ---
> > at 64 threads it will ramp up from a slow start to:
> > ---
> > Bandwidth (MB/sec):     406.967
> >
> > Stddev Bandwidth:       114.015
> > Max bandwidth (MB/sec): 452
> > ---
> >
> > But what stands out is your latency. I don't have a 10GBE network to
> > compare with, but my Infiniband-based cluster (going through at least
> > one switch) gives me values like this:
> > ---
> > Average Latency:        0.335519
> > Stddev Latency:         0.177663
> > Max latency:            1.37517
> > Min latency:            0.1017
> > ---
> >
> > Of course that latency is not just the network.
> >
> 
> What else can contribute to this latency? Storage node load, disk speed,
> anything else?
> 
That and the network itself are pretty much it; you should know once
you've run those tests with atop or iostat on the storage nodes.
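
A quick way to separate the two: check the raw network round trip while
the cluster is busy, as a rough sketch (payload size just an example):
---
ping -c 100 -s 4096 <storage-node>
---
If that stays low, the bulk of the latency is more likely coming from the
OSDs/journals than from the network itself.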

> 
> > I would suggest running atop (gives you more information at one
> > glance) or "iostat -x 3" on all your storage nodes during these tests
> > to identify any node or OSD that is overloaded in some way.
> >
> 
> Will try.
> 
Do that and let us know about the results.
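
Roughly something like this, as a sketch (pool name and thread count taken
from the runs above, exact flags up to you):
---
# on the client:
rados bench -p my_pool -t 64 300 write

# on each storage node while the bench runs (watch %util and await):
iostat -x 3
# or simply run atop
---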

> 
> > > > Are you testing this from just one client?
> > > >
> > >
> > > Yes. One KVM hypervisor host.
> > >
> > > > How is that client connected to the Ceph network?
> > > >
> > >
> > > It's connected through the same 10Gb network. iperf result shows no
> > > issue on the bandwidth between the client and the MONs/OSDs.
> > >
> > >
> > > > Another thing comes to mind: what are the pg_num and pgp_num values
> > > > of your "my_pool"?
> > > > You could have some quite unevenly distributed data.
> > > >
> > >
> > > pg_num/pgp_num for the pool is currently set to 850.
> > >
> > If this isn't production yet, I would strongly suggest upping that to
> > 2048 for a much smoother distribution and adhering to the recommended
> > values for this.
> >
> 
> That's the problem -- it's already in production. Any advice on how I can
> increase PGs without causing inconvenience to the users? Can I increase
> PGs one step at a time to prevent excessive I/O load and slow requests,
> e.g. increase 100 at a time?
> 
If you look into the archives of this ML you will find other people
suggesting just that, increasing the PGs gradually.

On the other hand, if you have a prolonged period of low activity (night,
weekend) and not that much data, just go for it. ^o^
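
If you do go gradually, a minimal sketch (the step sizes here are just
examples; wait for the cluster to settle before the next step):
---
ceph osd pool set my_pool pg_num 1024
ceph osd pool set my_pool pgp_num 1024
ceph -s      # wait for the splitting/backfilling to finish, then repeat
ceph osd pool set my_pool pg_num 2048
ceph osd pool set my_pool pgp_num 2048
---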

> With 26 OSDs, the recommended value would be 1300 PGs, correct? 2048 will
> be too high?
>  
Too high is VERY relative with PGs. Going from 8192 to 1048576 is likely
to be too high (using plenty of resources without smoothing things out
noticeably).

In your case 1024 would be too low, so the next power of two is 2048.
And if you know that your cluster will grow further, use the estimated
number for that.
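
For reference, the usual rule of thumb behind these numbers (with your 2
replicas):
---
(26 OSDs x 100) / 2 replicas        = 1300
rounded up to the next power of two = 2048
---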

PGs don't come for free, but they're not prohibitively expensive either,
especially when you're battling uneven distribution.

Regards.

Christian

> 
> > > > > - Individual tests using "ceph tell osd.X bench" give different
> > > > > results per OSD, but also average around 110-130 MB/s only.
> > > > >
> > > > That at least is easily explained by what I'm mentioning below
> > > > about the remaining performance of your SSD when journal and OSD
> > > > data are on it at the same time.
> > > > > Can anyone advise what could be the reason why our RADOS/Ceph
> > > > > benchmark results are slow compared to a direct physical drive
> > > > > test on the OSDs? Anything in the Ceph configuration that we
> > > > > need to optimise further?
> > > > >
> > > > For starters, since your journals (I frequently wonder if journals
> > > > ought to be something that can be turned off) are on the same
> > > > device as the OSD data, the total throughput and IOPS of that
> > > > device have now been halved.
> > > >
> > > > And what replication level are you using? That again will cut into
> > > > your cluster-wide throughput and IOPS.
> > > >
> > >
> > > I maintain 2 replicas on the pool.
> > >
> >
> > So to simplify things I will assume 8 nodes with 4 OSDs each and all
> > SSDs on SATA-2, giving a raw speed of 250MB/s per SSD.
> > The speed per OSD will be just half that, though, since it has to share
> > the SSD with the journal.
> > So that's just 500MB/s of potential speed per node, or 4GB/s for the
> > whole cluster.
> >
> > Now here is where it gets tricky.
> > With just one thread and one client you will write to one PG: first to
> > the journal of the primary OSD, then that will be written to the
> > journal of the secondary OSD (on another node), and your transaction
> > will be ACK'ed. This of course doesn't take any advantage of the
> > parallelism of Ceph and will never get close to achieving maximum
> > bandwidth per client. But it also won't be impacted by which OSDs the
> > PGs reside on, as there is no competition from other clients/threads.
> >
> > With 16 threads (and more) the PG distribution becomes crucial.
> > Ideally each thread would be writing to a different primary OSD, and
> > all the secondary OSDs would be ones that aren't primaries (the 32
> > assumed OSDs split 16/16).
> >
> > But if the PGs are clumpy and, for example, osd.0 happens to be the
> > primary for one PG being written to by one thread and the secondary for
> > another thread at the same time, its bandwidth just dropped again.
> >
> 
> Noted, thanks for this.
> 
> Cheers.
> 
> 
> 
> 
> >
> > Regards,
> >
> > Christian
> > >
> > > >
> > > > I've read a number of times that Ceph will in general be half as
> > > > fast as the speed you'd expect from the cluster hardware you're
> > > > deploying, but that of course is something based on many factors
> > > > and needs verification in each specific case.
> > > >
> > > > For me, I have OSDs (11-disk RAID6 on an Areca 1882 with 1GB
> > > > cache, 2 OSDs each on 2 nodes total) that can handle the fio run
> > > > below directly on the OSD at 37k IOPS (since it fits into the
> > > > cache nicely).
> > > > ---
> > > > fio --size=400m --ioengine=libaio --invalidate=1 --direct=1
> > > > --numjobs=1 --rw=randwrite --name=fiojob --blocksize_range=4k-4K
> > > > --iodepth=16
> > > > ---
> > > > The journal SSD is about the same.
> > > >
> > > > However, that same benchmark delivers a mere 3100 IOPS when run
> > > > from a VM (userspace RBD, caching enabled, but that makes no
> > > > difference at all), and the journal SSDs are busier (25%) than the
> > > > actual OSDs (5%), but still nowhere near their capacity.
> > > > This leads me to believe that, aside from network latencies (4x QDR
> > > > Infiniband here, which has less latency than 10GBE), there is a
> > > > lot of room for improvement when it comes to how Ceph handles
> > > > things (bottlenecks in the code) and tuning in general.
> > > >
> > >
> > > Thanks for sharing.
> > >
> > > Any further tuning configuration which can be suggested is greatly
> > > appreciated.
> > >
> > > Cheers.
> >
> >
> > --
> > Christian Balzer        Network/Systems Engineer
> > ch...@gol.com           Global OnLine Japan/Fusion Communications
> > http://www.gol.com/
> >
> 


-- 
Christian Balzer        Network/Systems Engineer                
ch...@gol.com           Global OnLine Japan/Fusion Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
