Hi Nathan,

it's just a hypothesis. I did not check what the algorithm does.
The reasoning is this. Bluestore and modern disks have preferred read/write sizes that are quite large for large drives. These are usually powers of 2. If you use a k+m EC profile, any read/write is split into k fragments. What I observe is that throughput seems best if these fragments are multiples of the preferred read/write sizes. Any prime factor of k other than 2 implies splits that don't fit perfectly. The mismatch tends to be worse the larger the prime factor and the smaller the object size. At least this is a correlation I observed in benchmarks. Since correlation does not mean causation, I will not claim that my hypothesis explains the observation.

Nevertheless, bluestore has default alloc sizes, and just for storage efficiency I would aim for alloc_size=object_size/k. Coincidentally, for spinning disks this also seems to imply best performance.

If this is wrong, maybe a disk IO expert can provide a better explanation as a guide for EC profile choices?

Best regards,

=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: ceph-users <ceph-users-boun...@lists.ceph.com> on behalf of Nathan Fish <lordci...@gmail.com>
Sent: 08 July 2019 18:07:25
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] What's the best practice for Erasure Coding

This is very interesting, thank you. I'm curious, what is the reason for avoiding k's with large prime factors? If I set k=5, what happens?

On Mon, Jul 8, 2019 at 8:56 AM Lei Liu <liul.st...@gmail.com> wrote:
>
> Hi Frank,
>
> Thanks for sharing valuable experience.
>
> On Mon, Jul 8, 2019 at 4:36 PM Frank Schilder <fr...@dtu.dk> wrote:
>>
>> Hi David,
>>
>> I'm running a cluster with bluestore on raw devices (no lvm) and all
>> journals collocated on the same disk with the data. Disks are spinning
>> NL-SAS. Our goal was to build storage at lowest cost, therefore all data on
>> HDD only. I got a few SSDs that I'm using for FS and RBD metadata.
>> All large pools are EC on spinning disk.
>>
>> I spent at least one month running detailed benchmarks (rbd bench) depending
>> on EC profile, object size, write size, etc. Results varied a lot. My
>> advice would be to run benchmarks with your hardware. If there was a single
>> perfect choice, there wouldn't be so many options. For example, my tests
>> will not be valid when using separate fast disks for WAL and DB.
>>
>> There are some results though that might be valid in general:
>>
>> 1) EC pools have high throughput but low IOP/s compared with replicated pools
>>
>> I see single-thread write speeds of up to 1.2GB (gigabyte) per second, which
>> is probably the network limit and not the disk limit. IOP/s get better with
>> more disks, but are way lower than what replicated pools can provide. On a
>> cephfs with EC data pool, small-file IO will be comparatively slow and eat a
>> lot of resources.
>>
>> 2) I observe massive network traffic amplification on small IO sizes, which
>> is due to the way EC overwrites are handled. This is one bottleneck for
>> IOP/s. We have 10G infrastructure and use 2x10G client and 4x10G OSD
>> network. OSD bandwidth should be at least 2x client network, better 4x or
>> more.
>>
>> 3) k should only have small prime factors, power of 2 if possible
>>
>> I tested k=5,6,8,10,12. Best results in decreasing order: k=8, k=6. All
>> other choices were poor. The value of m seems not relevant for performance.
>> Larger k will require more failure domains (more hardware).
>>
>> 4) object size matters
>>
>> The best throughput (1M write size) I see with object sizes of 4MB or 8MB,
>> with IOP/s getting somewhat better with smaller object sizes but throughput
>> dropping fast. I use the default of 4MB in production. Works well for us.
>>
>> 5) jerasure is quite good and seems most flexible
>>
>> jerasure is quite CPU efficient and can handle smaller chunk sizes than
>> other plugins, which is preferable for IOP/s.
>> However, CPU usage can become a problem, and a plugin optimized for
>> specific values of k and m might help here. Under usual circumstances I see
>> very low load on all OSD hosts, even under rebalancing. However, I remember
>> that once I needed to rebuild something on all OSDs (I don't remember what
>> it was, sorry). In this situation, CPU load went up to 30-50% (meaning up
>> to half the cores were at 100%), which is really high considering that each
>> server has only 16 disks at the moment and is sized to handle up to 100.
>> CPU power could become a bottleneck for us in the future.
>>
>> These are some general observations and do not replace benchmarks for
>> specific use cases. I was hunting for a specific performance pattern, which
>> might not be what you want to optimize for. I would recommend running
>> extensive benchmarks if you have to live with a configuration for a long
>> time - EC profiles cannot be changed.
>>
>> We settled on 8+2 and 6+2 pools with jerasure and object size 4M. We also
>> use bluestore compression. All metadata pools are on SSD, only very little
>> SSD space is required. This choice works well for the majority of our use
>> cases. We can still build small expensive pools to accommodate special
>> performance requests.
>>
>> Best regards,
>>
>> =================
>> Frank Schilder
>> AIT Risø Campus
>> Bygning 109, rum S14
>>
>> ________________________________________
>> From: ceph-users <ceph-users-boun...@lists.ceph.com> on behalf of David
>> <xiaomajia...@gmail.com>
>> Sent: 07 July 2019 20:01:18
>> To: ceph-users@lists.ceph.com
>> Subject: [ceph-users] What's the best practice for Erasure Coding
>>
>> Hi Ceph-Users,
>>
>> I'm working with a Ceph cluster (about 50TB, 28 OSDs, all Bluestore on lvm).
>> Recently, I've been trying to use an Erasure Code pool.
>> My question is "what's the best practice for using EC pools?".
>> More specifically, which plugin (jerasure, isa, lrc, shec or clay) should I
>> adopt, and how should I choose the combination of (k,m) (e.g. (k=3,m=2),
>> (k=6,m=3))?
>>
>> Does anyone have experience to share?
>>
>> Thanks for any help.
>>
>> Regards,
>> David
>>
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
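
A minimal Python sketch of the alignment arithmetic behind the alloc_size=object_size/k suggestion at the top of this thread. It assumes an object is split evenly into k fragments, as the reply describes, and it uses a 64 KiB allocation unit purely as an example value; neither assumption was checked against what Ceph actually does. The function name fragment_report is hypothetical. The sketch only shows whether one fragment fills whole allocation units. It says nothing about how a given profile performs.

```python
def fragment_report(object_size, k, alloc_size):
    """Return (fragment_bytes, allocated_bytes, wasted_bytes) for one of k fragments.

    Assumption: the object is split evenly into k fragments (rounded up),
    and each fragment is stored in whole allocation units of alloc_size bytes.
    """
    # Size of one fragment, rounded up to a whole byte.
    fragment = -(-object_size // k)
    # Number of allocation units needed to hold the fragment, rounded up.
    units = -(-fragment // alloc_size)
    allocated = units * alloc_size
    return fragment, allocated, allocated - fragment


if __name__ == "__main__":
    OBJECT_SIZE = 4 * 1024 * 1024  # 4 MiB object size, as mentioned in the thread
    ALLOC_SIZE = 64 * 1024         # assumed example allocation unit, not a verified default
    for k in (5, 6, 8, 10, 12):    # the k values tested in the thread
        frag, alloc, waste = fragment_report(OBJECT_SIZE, k, ALLOC_SIZE)
        status = "aligned" if waste == 0 else "not aligned"
        print(f"k={k:2d} fragment={frag} allocated={alloc} wasted={waste} ({status})")
```

With these example values only a power-of-2 k such as 8 yields a fragment that is an exact multiple of the allocation unit. For the real numbers, substitute the allocation size configured on your own OSDs.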