Hi Greg,

On 10/17/2017 11:49 PM, Gregory Farnum wrote:
> On Tue, Oct 17, 2017 at 12:42 PM Jiri Horky <jiri.ho...@gmail.com> wrote:
>
>     Hi list,
>
>     we are thinking of building a relatively big Ceph-based object
>     storage for our sample files - we have about 700M files, ranging
>     from very small (1-4 KiB) to pretty big ones (several GiB). The
>     median file size is 64 KiB. Since the required space is fairly
>     large (1 PiB of usable storage), we are considering erasure
>     coding for this case. On the other hand, we need to achieve at
>     least 1200 MiB/s of read throughput. The working assumption is
>     4+2 EC (thus 50% overhead).
>
>     Since EC is applied per object, small objects get striped into
>     even smaller pieces. With 4+2 EC, one needs (at least) 4 I/Os to
>     read a single object in this scenario, so the number of IOPS
>     required with EC is relatively high. Some vendors (such as
>     Hitachi, but I believe EMC as well) instead do offline EC with a
>     predefined chunk size. The idea is to first write objects with a
>     replication factor of 3, wait for enough objects to fill 4x
>     64 MiB chunks, and only then do EC on those. This not only makes
>     the EC less computationally intensive and repairs much faster,
>     but it also allows reading the majority of the small objects
>     directly, with a single read of part of one chunk (assuming a
>     non-degraded state) - one chunk contains the whole object.
>
>     I wonder if something similar is already possible with Ceph
>     and/or is planned. For our use case of very small objects, it
>     would mean a near 3-4x reduction in the required IOPS.
>
>     Another way out of this situation would be the ability to specify
>     different storage pools/policies based on file size - i.e. to do
>     3x replication of the very small files and only use EC for the
>     bigger files, where the performance hit of 4x the IOPS won't be
>     that painful. But I am afraid this is not possible...
>
> Unfortunately any logic like this would need to be handled in your
> application layer. Raw RADOS does not do object sharding or
> aggregation on its own.
> CERN did contribute the libradosstriper, which will break down your
> multi-gigabyte objects into more typical sizes, but a generic system
> for packing many small objects into larger ones is tough - the
> choices depend so much on likely access patterns and such.
>
> I would definitely recommend working out something like that, though!
> -Greg

This is unfortunate. I believe that for storage of small objects, this
would be a deal breaker. Hitachi claims it can do 20+6 erasure coding
when using predefined-size EC, which is hardly imaginable with the
current Ceph implementation. For us, I am afraid the lack of this
feature means we will end up buying an object store instead of
building one on open source technology :-/
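Just to illustrate what we would have to build in the application
layer ourselves, here is a rough sketch of the packing scheme,
assuming the python-rados bindings. The pool name, chunk size, object
naming and the in-memory index are all invented for the example; a
real implementation would also need a durable index and the replicated
staging step described above.

#!/usr/bin/env python3
# Rough sketch only, assuming python-rados. Pool name, chunk size and
# naming scheme are invented; a real version would need a durable index
# and would stage incoming objects in a 3x replicated pool first.
import rados

CHUNK_SIZE = 64 * 1024 * 1024  # pack into 64 MiB objects before EC

class Packer:
    """Buffers small objects and writes them out as one big RADOS object."""

    def __init__(self, ioctx):
        self.ioctx = ioctx       # ioctx of the EC pool
        self.buf = bytearray()
        self.index = {}          # object name -> (chunk seq, offset, length)
        self.seq = 0

    def put(self, name, data):
        # Remember where this object will live inside the current chunk.
        self.index[name] = (self.seq, len(self.buf), len(data))
        self.buf += data
        if len(self.buf) >= CHUNK_SIZE:
            self.flush()

    def flush(self):
        # One large sequential write instead of many tiny EC writes.
        if self.buf:
            self.ioctx.write_full("chunk.%08d" % self.seq, bytes(self.buf))
            self.buf = bytearray()
            self.seq += 1

    def get(self, name):
        # A small object normally lies within a single EC data chunk,
        # so this range read touches far fewer OSDs than per-object EC.
        seq, off, length = self.index[name]
        return self.ioctx.read("chunk.%08d" % seq, length, off)

cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
ioctx = cluster.open_ioctx("ec42-pool")  # hypothetical 4+2 EC pool
packer = Packer(ioctx)
packer.put("sample-0001", b"small sample payload")
packer.flush()                           # force out the partial chunk
print(packer.get("sample-0001"))
ioctx.close()
cluster.shutdown()

The point being: reads of small objects become single range reads
inside one big object, which is exactly the IOPS saving the offline EC
schemes get.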
From the technical side, I don't see why the access pattern of such
objects would change the storage strategy. If the bulk block size were
left configurable, it should be enough, shouldn't it?
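And the size-based pool split would likewise have to live entirely in
our client code - roughly something like the following, again assuming
python-rados, with invented pool names and an invented threshold:

# Sketch of client-side routing by object size; pool names and the
# threshold are made up for illustration.
import rados

SMALL_LIMIT = 256 * 1024  # 256 KiB cut-off, to be tuned to the histogram

cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
rep3 = cluster.open_ioctx("rep3-pool")  # 3x replicated pool
ec42 = cluster.open_ioctx("ec42-pool")  # 4+2 EC pool

def put(name, data):
    # Small objects go to replication (one read I/O each); big ones go
    # to EC, where the 4-way read fan-out is amortized over more bytes.
    ioctx = rep3 if len(data) < SMALL_LIMIT else ec42
    ioctx.write_full(name, data)

def get(name, size):
    # The application must remember which pool an object went to; here
    # the stored size stands in for that piece of metadata.
    ioctx = rep3 if size < SMALL_LIMIT else ec42
    return ioctx.read(name, size, 0)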
Regards
Jiri Horky

>     Any other hint is sincerely welcome.
>
>     Thank you
>     Jiri Horky

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com