I also tested striping with stripe units other than 1 and found that 
non-trivial striping should be avoided with EC pools. Firstly, EC is already a 
striped format; secondly, striping on top of that with stripe_unit>1 makes 
every write an ec_overwrite, because shards are then rarely, if ever, written 
as a whole.
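
To illustrate what I mean (a minimal sketch with assumed example values, not 
output from my tests): a write can only be applied as a full-stripe write if 
it starts and ends on an EC stripe boundary, i.e. on a multiple of 
stripe_width = k*chunk_size; anything else has to go through a read-modify-write 
of the affected shards.

    # Sketch only, with assumed values (chunk size 4096, hypothetical write sizes):
    # a write that does not cover whole EC stripes needs a read-modify-write
    # (ec_overwrite) of the touched shards.
    def covers_whole_stripes(offset, length, k, chunk=4096):
        stripe_width = k * chunk   # bytes per full EC stripe
        return offset % stripe_width == 0 and length % stripe_width == 0

    print(covers_whole_stripes(0, 64 * 1024, k=6))   # small striped piece -> False
    print(covers_whole_stripes(0, 8 << 20, k=8))     # whole 8M object     -> True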

The native striping in EC pools comes from k: data is striped over k disks. The 
higher k, the more throughput, at the expense of CPU and network load.

In my long list, this should actually be point

6) Use stripe_unit=1 (default).

To get back to your question, this is another argument for k=power-of-two. 
Object sizes in Ceph are always powers of 2 and stripe widths contain k as a 
factor. Hence, any prime factor of k other than 2 implies a mismatch. How badly 
such a mismatch affects performance needs to be tested.

Example: on our 6+2 EC pool I have stripe_width 24576, which has 3 as a 
factor. The 3 comes from k=6=3*2 and will always be there. This implies a 
misalignment, and some writes will have to be split/padded in the middle. This 
does not happen too often per object, so 6+2 performance is good, but not as 
good as 8+2 performance.
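
A quick way to see the mismatch (a sketch, assuming the default 4K chunk size; 
only the stripe_width of 24576 for 6+2 above is from our actual pool):

    # How an 8 MiB (power-of-two) object divides into EC stripes of width k*chunk.
    object_size = 8 * 1024 * 1024    # 8 MiB rbd object size
    chunk = 4096                     # assumed chunk size, gives 24576 for k=6
    for k in (5, 6, 8, 10):
        stripe_width = k * chunk
        full, leftover = divmod(object_size, stripe_width)
        print(f"k={k:2d}  stripe_width={stripe_width:6d}  "
              f"full stripes={full:4d}  leftover bytes={leftover}")

Only k=8 comes out even; the other profiles always leave a partial stripe at 
the end of each object.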

Some numbers:

1) rbd object size 8MB, 4 servers writing with 1 process each (=4 workers):
EC profile     4K random write        sequential write (8M write size)
               (IOP/s, aggregated)    (MB/s, aggregated)
 5+2              802.30                1156.05
 6+2             1188.26                1873.67
 8+2             1210.27                2510.78
10+4              421.80                 681.22

2) rbd object size 8MB, 4 servers writing with 4 processes each (=16 workers):
EC profile     4K random write        sequential write (8M write size)
               (IOP/s, aggregated)    (MB/s, aggregated)
 6+2             1384.43                3139.14
 8+2             1343.34                4069.27

The EC profiles with a factor of 5 performed so badly that I didn't repeat the 
multi-process tests (2) with them. I had limited time and went for a 
discard-early strategy to find suitable parameters.

The roughly 25% lower throughput of 6+2 versus 8+2 in test (2) is probably due 
to the fact that data is striped over 6 instead of 8 disks. There might be some 
impact of the factor 3 somewhere as well, but it seems negligible in the 
scenario I tested.
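
A quick sanity check of that interpretation against the sequential numbers from 
test (2):

    # Disk-count ratio vs. measured sequential throughput ratio (test 2):
    print(6 / 8)               # 0.75 -> expect roughly 25% less throughput
    print(3139.14 / 4069.27)   # ~0.77, close to the disk-count ratio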

Results with non-trivial striping (stripe_unit>1) were so poor that I did not 
even include them in my report.

We use the 8+2 pool for CephFS, where throughput is important. The 6+2 pool is 
used for VMs (RBD images), where IOP/s are more important; it also offers a 
higher level of redundancy. It's an acceptable compromise for us.

Note that the numbers will vary depending on hardware, OSD config, kernel 
parameters, etc. One needs to test what one has.

Best regards,

=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: ceph-users <ceph-users-boun...@lists.ceph.com> on behalf of Lars 
Marowsky-Bree <l...@suse.com>
Sent: 11 July 2019 10:14:04
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] What's the best practice for Erasure Coding

On 2019-07-09T07:27:28, Frank Schilder <fr...@dtu.dk> wrote:

> Small addition:
>
> This result holds for rbd bench. It seems to imply good performance for 
> large-file IO on cephfs, since cephfs will split large files into many 
> objects of size object_size. Small-file IO is a different story.
>
> The formula should be N*alloc_size=object_size/k, where N is some integer. 
> alloc_size should be an integer multiple of object_size/k.

If using rbd striping, I'd also assume that making rbd's stripe_unit be
equal to, or at least a multiple of, the stripe_width of the EC pool is
sensible.

(Similar for CephFS's layout.)

Does this hold in your environment?


--
SUSE Linux GmbH, GF: Felix Imendörffer, Mary Higgins, Sri Rasiah, HRB 21284 (AG 
Nürnberg)
"Architects should open possibilities and not determine everything." (Ueli 
Zbinden)
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com