Hi Frank,

Thanks for sharing your valuable experience.

Frank Schilder <fr...@dtu.dk> wrote on Mon, Jul 8, 2019 at 4:36 PM:

> Hi David,
>
> I'm running a cluster with bluestore on raw devices (no lvm) and all
> journals collocated on the same disk with the data. Disks are spinning
> NL-SAS. Our goal was to build storage at the lowest cost, therefore all
> data is on HDD only. I got a few SSDs that I'm using for FS and RBD
> metadata. All large pools are EC on spinning disk.
>
> I spent at least one month running detailed benchmarks (rbd bench)
> depending on EC profile, object size, write size, etc. Results varied
> a lot. My advice would be to run benchmarks with your hardware. If there
> was a single perfect choice, there wouldn't be so many options. For
> example, my tests will not be valid when using separate fast disks for WAL
> and DB.
>
> There are some results though that might be valid in general:
>
> 1) EC pools have high throughput but low IOP/s compared with replicated
> pools
>
> I see single-thread write speeds of up to 1.2GB (gigabyte) per second,
> which is probably the network limit and not the disk limit. IOP/s get
> better with more disks, but are way lower than what replicated pools can
> provide. On a cephfs with an EC data pool, small-file IO will be
> comparatively slow and eat a lot of resources.
>
> 2) I observe massive network traffic amplification on small IO sizes,
> which is due to the way EC overwrites are handled. This is one bottleneck
> for IOP/s. We have 10G infrastructure and use 2x10G client and 4x10G OSD
> network. OSD network bandwidth should be at least 2x the client network,
> better 4x or more.
>
> 3) k should only have small prime factors, power of 2 if possible
>
> I tested k=5,6,8,10,12. Best results in decreasing order: k=8, k=6. All
> other choices were poor. The value of m does not seem relevant for
> performance. Larger k will require more failure domains (more hardware).
>
> 4) object size matters
>
> The best throughput (1M write size) I see with object sizes of 4MB or 8MB,
> with IOP/s getting somewhat better with smaller object sizes but throughput
> dropping fast. I use the default of 4MB in production. Works well for us.
>
> 5) jerasure is quite good and seems most flexible
>
> jerasure is quite CPU efficient and can handle smaller chunk sizes than
> other plugins, which is preferable for IOP/s. However, CPU usage can
> become a problem and a plugin optimized for specific values of k and m
> might help here. Under usual circumstances I see very low load on all OSD
> hosts, even under rebalancing. However, I remember that once I needed to
> rebuild something on all OSDs (I don't remember what it was, sorry). In
> this situation, CPU load went up to 30-50% (meaning up to half the cores
> were at 100%), which is really high considering that each server has only
> 16 disks at the moment and is sized to handle up to 100. CPU power could
> become a bottleneck for us in the future.
>
> These are some general observations and do not replace benchmarks for
> specific use cases. I was hunting for a specific performance pattern, which
> might not be what you want to optimize for. I would recommend running
> extensive benchmarks if you have to live with a configuration for a long
> time - EC profiles cannot be changed.
>
> We settled on 8+2 and 6+2 pools with jerasure and object size 4M. We also
> use bluestore compression. All metadata pools are on SSD; only very little
> SSD space is required. This choice works well for the majority of our use
> cases. We can still build small expensive pools to accommodate special
> performance requests.
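
A rough sketch of the arithmetic behind points 1) and 3) and the 8+2 / 6+2 choice above, in case it helps when comparing candidate profiles. It is not taken from Frank's benchmarks. The names `ECProfile`, `prime_factors` and `gbyte_per_s_to_gbit_per_s` are made up for illustration, and the failure-domain count assumes one chunk per failure domain.

```python
from dataclasses import dataclass


def prime_factors(n: int) -> list[int]:
    """Return the prime factors of n in ascending order, with multiplicity."""
    factors = []
    d = 2
    while d * d <= n:
        while n % d == 0:
            factors.append(d)
            n //= d
        d += 1
    if n > 1:
        factors.append(n)
    return factors


@dataclass(frozen=True)
class ECProfile:
    """Hypothetical helper describing a k+m erasure-code profile."""
    k: int  # number of data chunks
    m: int  # number of coding chunks

    @property
    def failure_domains(self) -> int:
        # Assumption: each of the k+m chunks lands in its own failure domain.
        return self.k + self.m

    @property
    def raw_overhead(self) -> float:
        # Raw bytes written per byte of user data.
        return (self.k + self.m) / self.k

    @property
    def k_is_power_of_two(self) -> bool:
        return (self.k & (self.k - 1)) == 0


def gbyte_per_s_to_gbit_per_s(gbyte_per_s: float) -> float:
    """Convert gigabytes per second to gigabits per second."""
    return gbyte_per_s * 8


if __name__ == "__main__":
    # Frank's production profiles followed by the two from David's question.
    for k, m in [(8, 2), (6, 2), (3, 2), (6, 3)]:
        p = ECProfile(k, m)
        print(
            f"k={k} m={m}: failure domains={p.failure_domains}, "
            f"raw overhead={p.raw_overhead:.2f}x, "
            f"prime factors of k={prime_factors(k)}, "
            f"power of two={p.k_is_power_of_two}"
        )
    # The single-thread write speed from point 1), expressed as a line rate.
    print(f"1.2 GB/s is {gbyte_per_s_to_gbit_per_s(1.2):.1f} Gbit/s")
```

The last line converts the 1.2GB per second from point 1) into Gbit/s so it can be compared with the 10G links. The sketch only tabulates space overhead and hardware requirements, it says nothing about performance, so benchmarks on the actual hardware are still needed.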
>
> Best regards,
>
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: ceph-users <ceph-users-boun...@lists.ceph.com> on behalf of David
> <xiaomajia...@gmail.com>
> Sent: 07 July 2019 20:01:18
> To: ceph-users@lists.ceph.com
> Subject: [ceph-users] What's the best practice for Erasure Coding
>
> Hi Ceph-Users,
>
> I'm working with a Ceph cluster (about 50TB, 28 OSDs, all Bluestore on
> lvm).
> Recently, I have been trying to use an erasure-coded pool.
> My question is "what's the best practice for using EC pools?".
> More specifically, which plugin (jerasure, isa, lrc, shec or clay) should
> I adopt, and how should I choose the combination of (k,m) (e.g. (k=3,m=2),
> (k=6,m=3))?
>
> Could anyone share some experience?
>
> Thanks for any help.
>
> Regards,
> David
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com