Hi Greg,

On 10/17/2017 11:49 PM, Gregory Farnum wrote:
> On Tue, Oct 17, 2017 at 12:42 PM Jiri Horky <jiri.ho...@gmail.com> wrote:
>
>     Hi list,
>
>     we are thinking of building a relatively big Ceph-based object
>     store for our sample files - we have about 700M files, ranging
>     from very small (1-4 KiB) to pretty big (several GiB) ones. The
>     median file size is 64 KiB. Since the required space is relatively
>     large (1 PiB usable), we are thinking of utilizing erasure coding
>     for this case. On the other hand, we need to achieve at least
>     1200 MiB/s throughput on reads. The working assumption is 4+2 EC
>     (thus 50% overhead).
>
>     Since the EC is per-object, small objects will be striped into even
>     smaller pieces. With 4+2 EC, one needs (at least) 4 IOs to read a
>     single object in this scenario, so the number of required IOPS when
>     using EC is relatively high. Some vendors (such as Hitachi, but I
>     believe EMC as well) do offline, predefined-chunk-size EC instead.
>     The idea is to first write objects with a replication factor of 3,
>     wait for enough objects to fill 4x 64 MiB chunks, and only do EC on
>     those. This not only makes the EC less computationally intensive
>     and repairs much faster, but it also allows reading the majority of
>     the small objects directly, by reading just part of one chunk
>     (assuming a non-degraded state) - one chunk actually contains the
>     whole object.
>     I wonder if something similar is already possible with Ceph and/or
>     is planned. For our use case of very small objects, it would mean a
>     near 3-4x reduction in the required IOPS.
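>
>     To put rough numbers on that (a back-of-the-envelope sketch in
>     Python; the figures are just our targets from above):
>
>         k, m = 4, 2                  # 4+2 erasure coding
>         reads_per_object_ec = k      # each data shard sits on a different OSD
>         reads_per_object_packed = 1  # whole object inside one 64 MiB chunk
>
>         target_mib_per_s = 1200
>         median_object_kib = 64
>         objects_per_s = target_mib_per_s * 1024 // median_object_kib
>         print(objects_per_s * reads_per_object_ec)      # 76800 IOPS with 4+2 EC
>         print(objects_per_s * reads_per_object_packed)  # 19200 IOPS with packing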
>
>     Another way out of this situation would be the ability to specify
>     different storage pools/policies based on file size - i.e., to do 3x
>     replication of the very small files and only use EC for the bigger
>     files, where the performance hit of 4x IOPS won't be that painful.
>     But I am afraid this is not possible...
>
>
> Unfortunately any logic like this would need to be handled in your
> application layer. Raw RADOS does not do object sharding or
> aggregation on its own.
> CERN did contribute the libradosstriper, which will break down your
> multi-gigabyte objects into more typical sizes, but a generic system
> for packing many small objects into larger ones is tough; the choices
> depend so much on likely access patterns and such.
>
> I would definitely recommend working out something like that, though!
> -Greg
this is unfortunate. I believe that for storage of small objects, this
is a deal breaker. Hitachi claims they can do 20+6 erasure coding when
using predefined-size EC, which is hardly imaginable with the current
Ceph implementation. Actually, I am afraid the lack of this feature
means we would buy an object store instead of building it on open
source technology :-/

From the technical side, I don't see why the access pattern of such
objects would change the storage strategy. If you left the bulk
blocksize configurable, that should be enough, shouldn't it?
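
For illustration, the bookkeeping I have in mind is nothing more than
this (a very rough sketch, not Ceph internals; the chunk size is the
configurable in question):

    CHUNK_SIZE = 64 * 1024 * 1024  # the configurable bulk blocksize

    class ChunkPacker:
        """Pack small objects into fixed-size chunks; EC the chunks."""
        def __init__(self):
            self.buf = bytearray()
            self.index = {}   # object name -> (chunk_id, offset, length)
            self.chunk_id = 0

        def put(self, name, data):
            self.index[name] = (self.chunk_id, len(self.buf), len(data))
            self.buf += data
            if len(self.buf) >= CHUNK_SIZE:
                self.seal()

        def seal(self):
            # here the sealed chunk would be erasure coded as a whole
            # (e.g. 4+2); until then, objects stay 3x replicated
            self.chunk_id += 1
            self.buf = bytearray()

A read then needs a single IO: look up (chunk_id, offset, length) in
the index and read just that range from one chunk.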

Regards
Jiri Horky



>  
>
>     Any other hint is sincerely welcome.
>
>     Thank you
>     Jiri Horky
>

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
