Re: [ceph-users] What's the best practice for Erasure Coding

Nathan Fish Mon, 08 Jul 2019 09:08:05 -0700

This is very interesting, thank you. I'm curious, what is the reason
for avoiding k's with large prime factors? If I set k=5, what happens?


On Mon, Jul 8, 2019 at 8:56 AM Lei Liu <liul.st...@gmail.com> wrote:
>
> Hi Frank,
>
> Thanks for sharing valuable experience.
>
> On Mon, Jul 8, 2019 at 4:36 PM Frank Schilder <fr...@dtu.dk> wrote:
>>
>> Hi David,
>>
>> I'm running a cluster with bluestore on raw devices (no lvm) and all 
>> journals collocated on the same disk with the data. Disks are spinning 
>> NL-SAS. Our goal was to build storage at lowest cost, therefore all data on 
>> HDD only. I have a few SSDs that I'm using for FS and RBD metadata. All
>> large pools are EC on spinning disk.
>>
>> I spent at least one month running detailed benchmarks (rbd bench), varying
>> the EC profile, object size, write size, etc. Results varied a lot. My
>> advice would be to run benchmarks with your hardware. If there was a single 
>> perfect choice, there wouldn't be so many options. For example, my tests 
>> will not be valid when using separate fast disks for WAL and DB.
>>
>> There are some results though that might be valid in general:
>>
>> 1) EC pools have high throughput but low IOP/s compared with replicated pools
>>
>> I see single-thread write speeds of up to 1.2GB (gigabyte) per second, which 
>> is probably the network limit and not the disk limit. IOP/s get better with 
>> more disks, but are way lower than what replicated pools can provide. On a 
>> CephFS with an EC data pool, small-file IO will be comparatively slow and eat a
>> lot of resources.
>>
>> 2) I observe massive network traffic amplification on small IO sizes, which 
>> is due to the way EC overwrites are handled. This is one bottleneck for 
>> IOP/s. We have 10G infrastructure and use a 2x10G client network and a 4x10G
>> OSD network. OSD network bandwidth should be at least 2x the client network,
>> better 4x or more.
>>
>> 3) k should only have small prime factors, power of 2 if possible
>>
>> I tested k=5,6,8,10,12. Best results in decreasing order: k=8, k=6. All 
>> other choices were poor. The value of m does not seem relevant for performance.
>> Larger k will require more failure domains (more hardware).
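
To make the "small prime factors" rule concrete, here is a minimal sketch that lists the largest prime factor of each k value tested above. The helper name largest_prime_factor is hypothetical and not part of any Ceph tooling; it only illustrates the arithmetic.

```python
def largest_prime_factor(n: int) -> int:
    """Return the largest prime factor of n (n >= 2) using trial division."""
    largest = 1
    factor = 2
    while factor * factor <= n:
        while n % factor == 0:
            largest = factor
            n //= factor
        factor += 1
    if n > 1:
        # Whatever remains after trial division is itself prime.
        largest = n
    return largest


# The k values reported as tested in this thread.
tested_k = [5, 6, 8, 10, 12]
factors = {k: largest_prime_factor(k) for k in tested_k}

for k, p in factors.items():
    print(f"k={k:2d}  largest prime factor={p}")
```

Note that k=12 has the same largest prime factor as k=6 (namely 3) but was reported as a poor choice, so this number alone does not predict the benchmark results.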
>>
>> 4) object size matters
>>
>> I see the best throughput (1M write size) with object sizes of 4MB or 8MB.
>> IOP/s get somewhat better with smaller object sizes, but throughput drops
>> fast. I use the default of 4MB in production. Works well for us.
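
As a rough illustration of how object size and k interact, the sketch below estimates the per-OSD shard size and padding for one full object. It rests on two assumptions that are not stated in this thread: a stripe unit of 4096 bytes, and objects being padded up to a whole number of stripes. The function name shard_layout is hypothetical.

```python
OBJECT_SIZE = 4 * 1024 * 1024  # the 4MB object size from the thread, read as 4 MiB
STRIPE_UNIT = 4096             # ASSUMPTION: bytes written per data chunk per stripe


def shard_layout(k, object_size=OBJECT_SIZE, stripe_unit=STRIPE_UNIT):
    """Return (stripes, shard_bytes, padding_bytes) for one full object.

    ASSUMPTION: the object is split into stripes of k * stripe_unit bytes
    and padded up to a whole number of stripes.
    """
    stripe_width = k * stripe_unit
    stripes = -(-object_size // stripe_width)  # ceiling division
    shard_bytes = stripes * stripe_unit        # data held by each of the k data shards
    padding = shard_bytes * k - object_size    # bytes added to fill the last stripe
    return stripes, shard_bytes, padding


for k in (5, 6, 8, 10, 12):
    stripes, shard_bytes, padding = shard_layout(k)
    print(f"k={k:2d}  stripes={stripes}  shard={shard_bytes} B  padding={padding} B")
```

Under these assumptions only k=8 divides the object without padding. Since k=6 also benchmarked well, alignment by itself is probably not the full explanation.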
>>
>> 5) jerasure is quite good and seems most flexible
>>
>> jerasure is quite CPU efficient and can handle smaller chunk sizes than 
>> other plugins, which is preferable for IOP/s. However, CPU usage can become
>> a problem and a plugin optimized for specific values of k and m might help 
>> here. Under usual circumstances I see very low load on all OSD hosts, even 
>> under rebalancing. However, I remember that once I needed to rebuild 
>> something on all OSDs (I don't remember what it was, sorry). In this 
>> situation, CPU load went up to 30-50% (meaning up to half the cores were at 
>> 100%), which is really high considering that each server has only 16 disks 
>> at the moment and is sized to handle up to 100. CPU power could become a
>> bottleneck for us in the future.
>>
>> These are some general observations and do not replace benchmarks for 
>> specific use cases. I was hunting for a specific performance pattern, which 
>> might not be what you want to optimize for. I would recommend running
>> extensive benchmarks if you have to live with a configuration for a long 
>> time - EC profiles cannot be changed.
>>
>> We settled on 8+2 and 6+2 pools with jerasure and object size 4M. We also 
>> use bluestore compression. All metadata pools are on SSD; only very little
>> SSD space is required. This choice works well for the majority of our use 
>> cases. We can still build small expensive pools to accommodate special 
>> performance requests.
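
For comparing the two profiles mentioned above, a small sketch of the space and hardware arithmetic follows. The function name ec_profile_summary is hypothetical, and the failure-domain count assumes every chunk is placed in a distinct failure domain.

```python
def ec_profile_summary(k: int, m: int) -> dict:
    """Basic arithmetic for a k+m erasure-coding profile."""
    return {
        # ASSUMPTION: one failure domain per chunk, so k+m domains are needed.
        "chunks": k + m,
        # Fraction of raw capacity that holds user data.
        "usable_fraction": k / (k + m),
        # Raw bytes consumed per byte of user data.
        "raw_per_usable": (k + m) / k,
        # Number of chunks that can be lost without losing data.
        "tolerated_failures": m,
    }


for k, m in [(8, 2), (6, 2)]:
    s = ec_profile_summary(k, m)
    print(f"{k}+{m}: chunks={s['chunks']}  usable={s['usable_fraction']:.2f}  "
          f"raw/usable={s['raw_per_usable']:.3f}  tolerates={s['tolerated_failures']}")
```

With these formulas, 8+2 needs 10 chunks and stores 80% of raw capacity as data, while 6+2 needs 8 chunks and stores 75%.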
>>
>> Best regards,
>>
>> =================
>> Frank Schilder
>> AIT Risø Campus
>> Bygning 109, rum S14
>>
>> ________________________________________
>> From: ceph-users <ceph-users-boun...@lists.ceph.com> on behalf of David 
>> <xiaomajia...@gmail.com>
>> Sent: 07 July 2019 20:01:18
>> To: ceph-users@lists.ceph.com
>> Subject: [ceph-users]  What's the best practice for Erasure Coding
>>
>> Hi Ceph-Users,
>>
>> I'm working with a Ceph cluster (about 50TB, 28 OSDs, all Bluestore on lvm).
>> Recently, I have been trying to use erasure-coded pools.
>> My question is "what's the best practice for using EC pools?".
>> More specifically, which plugin (jerasure, isa, lrc, shec or clay) should I
>> adopt, and how should I choose the combination of (k,m) (e.g. (k=3,m=2),
>> (k=6,m=3))?
>>
>> Could anyone share their experience?
>>
>> Thanks for any help.
>>
>> Regards,
>> David
>>
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
