On Mon, Oct 23, 2017 at 9:37 AM Jiri Horky <jiri.ho...@gmail.com> wrote:

> Hi Greg,
>
>
> On 10/17/2017 11:49 PM, Gregory Farnum wrote:
>
> On Tue, Oct 17, 2017 at 12:42 PM Jiri Horky <jiri.ho...@gmail.com> wrote:
>
>> Hi list,
>>
>> we are thinking of building a relatively big CEPH-based object store for
>> our sample files - we have about 700M files, ranging from very small
>> (1-4KiB) files to pretty big ones (several GiB). The median file size is
>> 64KiB. Since the required space is relatively large (1PiB of usable
>> storage), we are thinking of utilizing erasure coding for this case. On
>> the other hand, we need to achieve at least 1200MiB/s throughput on
>> reads. The working assumption is 4+2 EC (thus 50% overhead).
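A quick back-of-the-envelope check of those capacity figures (a sketch in
Python; only the 1PiB usable target and the 4+2 profile are taken from the
message, the rest is straightforward arithmetic):

    # Raw capacity and overhead implied by a k=4, m=2 erasure-code profile.
    k, m = 4, 2                       # data chunks, coding chunks
    usable_pib = 1.0                  # required usable capacity (PiB)
    raw_pib = usable_pib * (k + m) / k
    overhead = m / k                  # extra raw space relative to usable data
    print(f"raw capacity needed: {raw_pib:.2f} PiB")   # 1.50 PiB
    print(f"space overhead:      {overhead:.0%}")      # 50%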
>>
>> Since the EC is per-object, the small objects will be striped into even
>> smaller pieces. With 4+2 EC, one needs (at least) 4 IOs to read a single
>> object in this scenario, so the number of IOPS required when using EC is
>> relatively high. Some vendors (such as Hitachi, but I believe EMC as
>> well) do offline, predefined-chunk-size EC instead. The idea is to first
>> write objects with a replication factor of 3, wait for enough objects to
>> fill 4x 64MiB chunks, and only then erasure-code that bulk. This not only
>> makes the EC less computationally intensive and repairs much faster, but
>> it also allows reading the majority of the small objects directly, by
>> reading just part of one of the chunks (assuming a non-degraded state) -
>> one chunk actually contains the whole object.
>> I wonder if something similar is already possible with CEPH and/or is
>> planned. For our use case of very small objects, it would mean a nearly
>> 3-4x reduction in required IOPS.
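To put rough numbers on that IOPS argument (a sketch; the 64KiB median and
the 1200MiB/s target come from the message above, everything else - no
caching, whole objects read each time - is a simplifying assumption):

    # Backend read IOPS needed for 1200 MiB/s of 64 KiB-object reads,
    # comparing per-object 4+2 striping against packed 64 MiB chunks.
    k = 4                                # data chunks touched per striped read
    obj_kib = 64                         # median object size
    throughput_mib_s = 1200
    objects_per_s = throughput_mib_s * 1024 / obj_kib   # ~19,200 objects/s

    iops_striped = objects_per_s * k     # every object read hits k OSDs
    iops_packed = objects_per_s * 1      # object usually sits in a single chunk
    print(int(iops_striped), int(iops_packed))           # 76800 vs 19200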
>>
>> Another way out of this situation would be the ability to specify
>> different storage pools/policies based on file size - i.e. to do 3x
>> replication of the very small files and only use EC for bigger files,
>> where the 4x IOPS penalty won't be that painful. But I am afraid this is
>> not possible...
>>
>>
> Unfortunately any logic like this would need to be handled in your
> application layer. Raw RADOS does not do object sharding or aggregation on
> its own.
> CERN did contribute the libradosstriper, which will break down your
> multi-gigabyte objects into more typical sizes, but a generic system for
> packing many small objects into larger ones is tough — the choices depend
> so much on likely access patterns and such.
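A minimal sketch of what that application-layer routing by size could look
like with the python-rados bindings (the pool names and the 1MiB threshold
below are illustrative assumptions, not something from the thread):

    # Route small writes to a replicated pool and large ones to an EC pool.
    import rados

    SMALL_LIMIT = 1 * 1024 * 1024            # assumed cut-off: 1 MiB

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()

    def put_sample(name, data):
        pool = 'samples-rep3' if len(data) < SMALL_LIMIT else 'samples-ec42'
        ioctx = cluster.open_ioctx(pool)
        try:
            ioctx.write_full(name, data)     # one RADOS object per sample file
        finally:
            ioctx.close()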
>
> I would definitely recommend working out something like that, though!
> -Greg
>
> this is unfortunate. I believe that for storage of small objects, this
> would be a deal breaker. Hitachi claims they can do 20+6 erasure coding
> when using predefined-size EC, which is hardly imaginable with the current
> CEPH implementation. Actually, for us, I am afraid that the lack of this
> feature means we would buy an object store instead of building it on open
> source technology :-/
>
> From the technical side, I don't see why the access pattern of such
> objects would change the storage strategy. If you left the bulk blocksize
> configurable, that should be enough, shouldn't it?
>

Well, there are two different things. If you're doing replicated writes and
then erasure coding the data, you assume the data changes slowly enough for
that to work, or at least that the cost of erasure coding it is worthwhile.

That's not a bad bet, but the RADOS architecture simply doesn't support
doing anything like that internally; all decisions about replication versus
erasure coding and data placement happen at the level of a pool, not of the
objects inside it. So bulk packing of objects isn't really possible for
RADOS to do on its own, and the application has to drive any data movement.
That requires understanding access patterns to select the right coding
chunks (so that objects tend to exist in one chunk), knowing when is a good
time to physically read and write the data, etc.
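A rough sketch of what such application-driven packing might look like with
the python-rados bindings (the 64MiB target, the in-memory index, and the
assumption that the ioctx points at an EC pool are all simplifications for
illustration):

    # Buffer small objects, flush them as one large packed object into an
    # EC pool, and remember (bulk_name, offset, length) so a later read
    # needs only a single ranged read from a single packed object.
    import uuid
    import rados

    BULK_TARGET = 64 * 1024 * 1024          # flush once ~64 MiB accumulated

    class Packer:
        def __init__(self, ioctx):
            self.ioctx = ioctx               # ioctx opened on the EC pool
            self.buf = bytearray()
            self.pending = []                # (name, offset, length) awaiting flush
            self.index = {}                  # name -> (bulk_name, offset, length)

        def put(self, name, data):
            self.pending.append((name, len(self.buf), len(data)))
            self.buf += data
            if len(self.buf) >= BULK_TARGET:
                self.flush()

        def flush(self):
            if not self.buf:
                return
            bulk_name = 'bulk-' + uuid.uuid4().hex
            self.ioctx.write_full(bulk_name, bytes(self.buf))
            for name, off, length in self.pending:
                self.index[name] = (bulk_name, off, length)
            self.buf = bytearray()
            self.pending = []

        def get(self, name):
            bulk_name, off, length = self.index[name]
            return self.ioctx.read(bulk_name, length, off)   # one ranged read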

This use case you're describing is certainly useful, but so far as I know
it's not implemented in any open-source storage solutions because it's
pretty specialized and requires a lot of backend investment that doesn't
pay off incrementally.
-Greg


>
> Regards
>
> Jiri Horky
>
>
>
>
>
>
>> Any other hint is sincerely welcome.
>>
>> Thank you
>> Jiri Horky
>>
>
>
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
