Hi Nathan,

Good thinking :-) The names of the objects are indeed the SHA256 of their 
content, which provides deduplication.
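
For anyone following along, here is a minimal sketch of what content addressing means here. It assumes an in-memory dict as a stand-in for the object store; the `put` and `get` helpers are hypothetical names for illustration, not the actual Software Heritage code:

```python
import hashlib

# Hypothetical in-memory stand-in for the object store.
store = {}


def put(content: bytes) -> str:
    """Store content under the hex SHA256 of its bytes and return that name."""
    name = hashlib.sha256(content).hexdigest()
    # Identical content always hashes to the same name, so a second write
    # of the same bytes does not create a second object.
    store.setdefault(name, content)
    return name


def get(name: str) -> bytes:
    """Return the content stored under the given SHA256 name."""
    return store[name]
```

Calling `put` twice with the same bytes returns the same name and leaves a single entry in `store`, which is the object-level dedup Nathan mentions.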

Cheers

On 17/02/2021 18:04, Nathan Fish wrote:
> I'm not much of a programmer, but as soon as I hear "immutable
> objects" I think "content-addressed". I don't know if you have many
> duplicate objects in this set, but content-addressing gives you
> object-level dedup for free. Do you have to preserve some meaningful
> object names from the original dataset, or do you just need some
> kind of ID?
>
> On Wed, Feb 17, 2021 at 11:37 AM Loïc Dachary <l...@dachary.org> wrote:
>> Bonjour,
>>
>> TL;DR: Is it more advisable to work on Ceph internals to make it friendly to 
>> this particular workload or write something similar to EOS[0] (i.e. RocksDB + 
>> Xrootd + RBD)?
>>
>> This is a followup to two previous mails[1] sent while researching this 
>> topic. In a nutshell, the Software Heritage project[1] currently has ~750TB 
>> and 10 billion objects, 75% of which are smaller than 16KB and 50% of 
>> which are smaller than 4KB. But they only account for ~5% of the 750TB: 
>> the 25% of objects larger than 16KB total ~700TB. The objects can be 
>> compressed by ~50%, so the 750TB only needs 350TB of actual storage (if 
>> you're interested in the details, see [2]).
>>
>> Let's say those 10 billion objects are stored in a single 4+2 erasure coded 
>> pool with bluestore compression set for objects that have a size > 32KB and 
>> the smallest allocation size for bluestore set to 4KB[3]. The 750TB won't 
>> use the expected 350TB but about 30% more, i.e. ~450TB (see [4] for the 
>> maths). This space amplification is because storing a 1 byte object uses the 
>> same space as storing a 16KB object (see [5] to repeat the experiment at 
>> home). In a 4+2 erasure coded pool, each of the 6 chunks will use no less 
>> than 4KB because that's the smallest allocation size for bluestore. That's 4 
>> * 4KB = 16KB for the 4 data chunks, even when all that is needed is 1 byte.
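
The arithmetic above can be sketched as follows. This is a simplified model under stated assumptions: every one of the 4+2 chunks is rounded up to the 4KB minimum allocation, and compression, stripe unit and metadata overhead are ignored. The `allocated_bytes` helper is a hypothetical name, not a Ceph API:

```python
import math

MIN_ALLOC = 4 * 1024  # smallest bluestore allocation size assumed above (4KB)
K, M = 4, 2           # 4+2 erasure coding: 4 data chunks, 2 coding chunks


def allocated_bytes(object_size: int) -> tuple:
    """Return (data_bytes, raw_bytes) allocated for one object.

    Simplified model (an assumption): the object is split evenly across
    the K data chunks and every chunk, data or coding, is rounded up to
    a multiple of MIN_ALLOC.
    """
    chunk = max(1, math.ceil(object_size / K))
    per_chunk = math.ceil(chunk / MIN_ALLOC) * MIN_ALLOC
    return K * per_chunk, (K + M) * per_chunk
```

With this model, `allocated_bytes(1)` and `allocated_bytes(16 * 1024)` both return `(16384, 24576)`: 16KB across the data chunks, 24KB once the two coding chunks are counted.
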
>>
>> It was suggested[6] to have two different pools: one with a 4+2 erasure pool 
>> and compression for all objects with a size > 32KB that are expected to 
>> compress to 16KB. And another with 3 replicas for the smaller objects to 
>> reduce space amplification to a minimum without compromising on durability. 
>> A client looking for the object could make two simultaneous requests to the 
>> two pools. They would get 404 from one of them and the object from the other.
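
A rough sketch of that two-pool lookup, assuming two dicts as stand-ins for the pools. The names `ec_pool`, `replica_pool`, `write` and `read` are hypothetical; real code would go through RADOS instead:

```python
from concurrent.futures import ThreadPoolExecutor

THRESHOLD = 32 * 1024  # size cut-off between the two pools, per the suggestion above

# Hypothetical stand-ins for the two pools.
ec_pool = {}       # 4+2 erasure coded with compression, objects > 32KB
replica_pool = {}  # 3 replicas, smaller objects


def write(name, content):
    """Route the object to a pool based on its size."""
    pool = ec_pool if len(content) > THRESHOLD else replica_pool
    pool[name] = content


def read(name):
    """Ask both pools at once; only one is expected to have the object."""
    with ThreadPoolExecutor(max_workers=2) as executor:
        results = list(
            executor.map(lambda pool: pool.get(name), (ec_pool, replica_pool))
        )
    for content in results:
        if content is not None:
            return content
    raise KeyError(name)  # neither pool has it: the "404 from both" case
```

The reader does not need to know the object size in advance, at the cost of one wasted request per read.
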
>>
>> Another workaround is best described in the "Finding a needle in Haystack: 
>> Facebook’s photo storage"[9] paper and essentially boils down to using a 
>> database to store a map between the object name and its location. That does 
>> not scale out (writing the database index is the bottleneck) but it's simple 
>> enough and is successfully implemented in EOS[0] with >200PB worth of data 
>> and in seaweedfs[10], another promising object store software based on the 
>> same idea.
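
The idea can be sketched in a few lines, assuming an in-memory buffer as a stand-in for the large volume (a file or an RBD image) and a dict as a stand-in for the database. The `Volume` class is hypothetical and only illustrates the name-to-location map, not how EOS or seaweedfs actually implement it:

```python
import io


class Volume:
    """Append-only volume packing many small objects, plus a name -> location index."""

    def __init__(self):
        self.data = io.BytesIO()  # stand-in for a large file or RBD image
        self.index = {}           # stand-in for the database (name -> (offset, length))

    def put(self, name, content):
        """Append content to the volume and record where it landed."""
        if name in self.index:
            return  # objects are immutable: an existing name is never rewritten
        offset = self.data.seek(0, io.SEEK_END)
        self.data.write(content)
        self.index[name] = (offset, len(content))

    def get(self, name):
        """Look up the location in the index and read the bytes back."""
        offset, length = self.index[name]
        self.data.seek(offset)
        return self.data.read(length)
```

Small objects are packed back to back, so none of them pays the 4KB minimum allocation on its own; every write has to update the index, which is the bottleneck mentioned above.
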
>>
>> Instead of working around the problem, maybe Ceph could be modified to make 
>> better use of the immutability of these objects[7], a hint that is 
>> apparently only used to figure out how to best compress it and for checksum 
>> calculation[8]. I honestly have no clue how difficult it would be. All I 
>> know is that it's not easy, otherwise it would have been done already: there 
>> seems to be a general need for efficiently (space-wise and performance-wise) 
>> storing large quantities of objects smaller than 4KB.
>>
>> Is it more advisable to:
>>
>>   * work on Ceph internals to make it friendly to this particular workload 
>> or,
>>   * write another implementation of "Finding a needle in Haystack: 
>> Facebook’s photo storage"[9] based on RBD[11]?
>>
>> I'm currently leaning toward working on Ceph internals but there are pros 
>> and cons to both approaches[12]. And since all this is still very new to me, 
>> there also is the possibility that I'm missing something. Maybe it's *super* 
>> difficult to improve Ceph in this way. I should try to figure that out 
>> sooner rather than later.
>>
>> I realize it's a lot to take in and unless you're facing the exact same 
>> problem there is very little chance you read that far :-) But if you did... 
>> I'm *really* interested to hear what you think. In any case I'll report 
>> back to this thread once a decision has been made.
>>
>> Cheers
>>
>> [0] https://eos-web.web.cern.ch/eos-web/
>> [1] 
>> https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/AEMW6O7WVJFMUIX7QGI2KM7HKDSTNIYT/
>>  
>> https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/RHQ5ZCHJISXIXOJSH3TU7DLYVYHRGTAT/
>> [2] https://forge.softwareheritage.org/T3054
>> [3] 
>> https://github.com/ceph/ceph/blob/3f5e778ad6f055296022e8edabf701b6958fb602/src/common/options.cc#L4326-L4330
>> [4] https://forge.softwareheritage.org/T3052#58864
>> [5] https://forge.softwareheritage.org/T3052#58917
>> [6] https://forge.softwareheritage.org/T3052#58876
>> [7] 
>> https://docs.ceph.com/en/latest/rados/api/librados/#c.@3.LIBRADOS_ALLOC_HINT_FLAG_IMMUTABLE
>> [8] https://forge.softwareheritage.org/T3055
>> [9] https://www.usenix.org/legacy/event/osdi10/tech/full_papers/Beaver.pdf
>> [10] https://github.com/chrislusf/seaweedfs/wiki/Components
>> [11] https://forge.softwareheritage.org/T3049
>> [12] https://forge.softwareheritage.org/T3054#58977
>>
>> --
>> Loïc Dachary, Artisan Logiciel Libre
>>
>>
>> _______________________________________________
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io

-- 
Loïc Dachary, Artisan Logiciel Libre

