Hi James, Robert, Craig,
Thank you for those informative answers! You all pointed out
interesting issues.
I know losing 1 SAS disk in RAID0 means losing all journals, but this is
for testing so I do not care.
I do not think sequential write speed to the RAID0 array is the
bottleneck (I benchmarked it at more than 500MB/s). However, I failed to
realize that the synchronous writes of several OSDs would become random
instead of sequential; thank you for explaining that.
I want to try this setup with several journals on a single partition (to
mitigate seek time), and I also want to try replacing my 9 OSDs (per
node) with a big RAID0 array of 9 disks --- leaving replication to Ceph.
But first I wanted to get an idea of SSD performance, so I created a 1GB
RAMdisk for every OSD journal.
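For reference, the RAMdisk journals were set up roughly like this, per OSD
(tmpfs here; the mount points and OSD id are only examples, sysvinit-style
commands):

    # example for osd.0: a 1GB tmpfs mount to hold its journal
    mkdir -p /mnt/ramjournal-0
    mount -t tmpfs -o size=1G tmpfs /mnt/ramjournal-0

    service ceph stop osd.0
    ceph-osd -i 0 --flush-journal        # flush the old journal first
    # then point ceph.conf at the new location:
    # [osd.0]
    #     osd journal = /mnt/ramjournal-0/journal
    #     osd journal size = 1024
    ceph-osd -i 0 --mkjournal
    service ceph start osd.0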
Shockingly, even with every journal on a dedicated RAMdisk, I still
witnessed less than 100MB/s sequential writes with 4MB blocks. This is
writing to an RBD image, independently of the format, the size, the
striping pattern, or whether the image is mounted (with XFS on it) or
directly accessed.
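(The numbers above come from plain dd runs along these lines; the device and
mount point names are only examples:)

    # through the filesystem on the mounted image
    dd if=/dev/zero of=/mnt/rbd-xfs/bench.bin bs=4M count=1024 oflag=direct
    # directly against the mapped block device
    dd if=/dev/zero of=/dev/rbd0 bs=4M count=1024 oflag=direct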
So, maybe my journal setup is not satisfactory, but the bottleneck seems
to be somewhere else. Any idea at all about striping? Or maybe pool/PG
config? (I blindly followed the PG ratios indicated in the docs).
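(By "followed the docs" I mean the usual rule of thumb of roughly 100 PGs per
OSD divided by the replica count, rounded up to a power of two; with 54 OSDs
and the default 3 replicas that gives something like:

    # 54 * 100 / 3 = 1800 -> next power of two is 2048
    ceph osd pool create rbd-bench 2048 2048

where the pool name is only an example.)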
Thank you all for your help. Best regards,
Nicolas Canceill
Scalable Storage Systems
SURFsara (Amsterdam, NL)
On 12/06/2013 07:31 PM, Robert van Leeuwen wrote:
If I understand correctly, you have one SAS disk as a journal for multiple OSDs.
If you do small synchronous writes it will become an IO bottleneck pretty
quickly:
Due to multiple journals on the same disk, it will no longer be sequential
writes to one journal but 4k writes to x journals, making it fully
random.
I would expect a performance of 100 to 200 IOPS max.
Doing an iostat -x or atop should show this bottleneck immediately.
This is also the reason to go with SSDs: they have reasonable random IO
performance.
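For example, watching the devices during a benchmark with something like the
following should make it obvious (look at the row for the journal device;
%util close to 100 and a high await while write throughput stays low means
the journals are the limit):

    iostat -x 1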
Cheers,
Robert van Leeuwen
Sent from my iPad
On 6 dec. 2013, at 17:05, "nicolasc" <nicolas.cance...@surfsara.nl> wrote:
Hi James,
Thank you for this clarification. I am quite aware of that, which is why the
journals are on SAS disks in RAID0 (SSDs out of scope).
I still have trouble believing that fast-but-not-super-fast journals are the
main reason for the poor performance observed. Maybe I am mistaken?
Best regards,
Nicolas Canceill
Scalable Storage Systems
SURFsara (Amsterdam, NL)
On 12/03/2013 03:01 PM, James Pearce wrote:
I would really appreciate it if someone could:
- explain why the journal setup is way more important than striping settings;
I'm not sure if it's what you're asking, but any write must be physically
written to the journal before the operation is acknowledged. So the overall
cluster performance (or rather write latency) is always governed by the speed
of those journals. Data is then gathered up into (hopefully) larger blocks and
committed to OSDs later.
On 12/11/2013 12:51 AM, Craig Lewis wrote:
A general rule of thumb for separate journal devices is to use 1 SSD
for every 4 OSDs. Since SSDs have no seek penalty, 4 partitions are
fine. Going much above the 1:4 ratio can saturate the SSD.
On your SAS journal device, by creating 9 partitions, you're forcing
head seeks for every journal write (assuming all 9 OSDs are writing).
Try using the SAS device with a single partition and 9 journals. That
gives you a chance to get sequential IO. For an anecdote of this
effect, check out http://thedailywtf.com/Articles/The-Certified-DBA.aspx.
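Concretely, that would be something along these lines (the device and paths
are only examples):

    # one filesystem on the SAS RAID0 volume, nine journal files on it
    mkfs.xfs /dev/md0
    mount /dev/md0 /var/lib/ceph/journals
    # then in ceph.conf, per OSD:
    # [osd.0]
    #     osd journal = /var/lib/ceph/journals/osd.0.journal
    # ...and so on for osd.1 through osd.8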
Even then, I suspect you'll saturate the RAID0'ed SAS devices as they
generally have lower sequential throughput than SSDs.
I assume that you're aware that by using RAID0 for the journals, a
single SAS disk failure will take down all 9 OSDs.
*Craig Lewis*
Senior Systems Engineer
Office +1.714.602.1309
Email cle...@centraldesktop.com
On 11/29/13 05:58, nicolasc wrote:
Hi James,
Unfortunately, SSDs are out of budget. Currently there are 2 SAS
disks in RAID0 on each node, split into 9 partitions: one for each
OSD journal on the node. I benchmarked the RAID0 volumes at around
500MB/s in sequential sustained write, so that's not bad --- maybe
access latency is also an issue?
This journal problem is a bit of wizardry to me; I even had weird
intermittent issues with OSDs not starting because the journal was
not found. So please do not hesitate to suggest a better journal setup.
I will try to look into this issue of device cache flush. Do you have
a tracker link for the bug?
Last question (for everyone): which of the journal config and the
striping config has, in your opinion, the most influence on my
"performance decreases with small blocks" problem?
Best regards,
Nicolas Canceill
Scalable Storage Systems
SURFsara (Amsterdam, NL)
On 11/29/2013 02:06 PM, James Pearce wrote:
Did you try moving the journals to separate SSDs?
It was recently discovered that, due to a kernel bug/design, the
journal writes are translated into device cache flush commands. Thinking
about that, I wonder whether there would also be a performance
improvement from implementing the workaround in the case where the
journal and the OSD are on the same physical drive, since currently the
system is presumably hitting spindle latency for every write?
On 2013-11-29 12:46, nicolasc wrote:
Hi every one,
I am currently testing a use-case with large rbd images (several TB),
each containing an XFS filesystem, which I mount on local clients. I
have been testing the throughput when writing to a single file in the XFS
mount, using "dd oflag=direct", for various block sizes.
With a default config, the "XFS writes with dd" show very good
performance for 1GB blocks, but this drops down to average HDD
performance for 4MB blocks, and to only a few MB/s for 4kB blocks.
Changing the XFS block size did not help, so I tried fancy striping ---
max block size is 256kB in XFS anyway.
First, using 4kB rados objects to store the 4kB stripes was awful,
because rados does not like small objects. Then, I used fancy striping
to store several 4kB stripes into a single 4MB object, but it hardly
improved the performance with 4kB blocks, while drastically degrading
the performance for large blocks.
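For reference, the striping attempt looked something like this (the image
name and the exact stripe count are only examples; size is in MB, and order
22 gives 4MB objects):

    rbd create bigimage --size 2097152 --image-format 2 \
        --order 22 --stripe-unit 4096 --stripe-count 1024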
Given my use-case, the block size of writes cannot exceed 4MB. I do
not know a lot of applications that write to disk in 1GB blocks.
Currently, on a 6-node, 54-OSD cluster, with journals on dedicated
SAS disks and a dedicated 10GbE uplink, I am getting performance
equivalent to a basic local disk.
So I am wondering: is it possible to get good performance with XFS
on rbd images, using a reasonable block size?
In case you think the answer is "yes", I would greatly appreciate it
if you could give me a clue about the striping magic involved.
Best regards,
Nicolas Canceill
Scalable Storage Systems
SURFsara (Amsterdam, NL)
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com