I did not know it was possible to buy such a machine; very impressive.
How much does it cost? I thought 18TB was the current maximum size for
HDDs :-P

On 18/02/2021 04:37, Serkan Çoban wrote:
> I still prefer the simplest solution. There are 4U servers with
> 110 x 20TB disks on the market. After RAID you get 1.5PiB per server,
> which is 30 months of data. Two such servers will hold 5 years of
> data with minimal problems. If you need backups, buy two more sets
> and just send zfs snapshot diffs to them.
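Just to sanity-check these figures, a quick back-of-the-envelope in
Python (the 20% RAID overhead is an assumption chosen for the sketch,
not something stated in the thread):

    # Rough check of the capacity claim above: 110 x 20TB drives per
    # 4U server, some capacity lost to RAID parity/spares, and ~50TB
    # of new data per month.
    TB = 10**12
    PIB = 2**50

    drives, drive_size = 110, 20 * TB
    raid_overhead = 0.20               # assumed parity + spare fraction
    monthly_growth = 50 * TB

    raw = drives * drive_size          # 2.2PB raw
    usable = raw * (1 - raid_overhead)
    months = usable / monthly_growth

    print(f"usable: {usable / PIB:.2f} PiB, ~{months:.0f} months of growth")
    # -> ~1.56 PiB usable and ~35 months per server with these
    #    assumptions, in the same ballpark as the figures quoted above.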
>
> On Wed, Feb 17, 2021 at 11:15 PM Loïc Dachary <l...@dachary.org> wrote:
>>
>> On 17/02/2021 18:27, Serkan Çoban wrote:
>>> Why not put all the data in a zfs pool with a 3-4 level deep
>>> directory structure, each directory named with one byte (two hex
>>> characters) in the range 00-FF? Four levels deep you get
>>> 256^4 ≈ 4B folders with 3-4 objects per folder; three levels deep
>>> you get 256^3 ≈ 16M folders with ~1000 objects each.
>> It is more or less the current setup :-) I should have mentioned
>> that there currently are ~750TB and 10 billion objects. But the
>> dataset grows by 50TB every month and will keep growing
>> indefinitely, which is why a solution that scales out is desirable.
>>> On Wed, Feb 17, 2021 at 8:14 PM Loïc Dachary <l...@dachary.org> wrote:
>>>> Hi Nathan,
>>>>
>>>> Good thinking :-) The names of the objects are indeed the SHA256
>>>> of their content, which provides deduplication.
>>>>
>>>> Cheers
>>>>
>>>> On 17/02/2021 18:04, Nathan Fish wrote:
>>>>> I'm not much of a programmer, but as soon as I hear "immutable
>>>>> objects" I think "content-addressed". I don't know if you have
>>>>> many duplicate objects in this set, but content-addressing gives
>>>>> you object-level dedup for free. Do you have to preserve some
>>>>> meaningful object names from the original dataset, or do you
>>>>> just need some kind of ID?
>>>>>
>>>>> On Wed, Feb 17, 2021 at 11:37 AM Loïc Dachary <l...@dachary.org> wrote:
>>>>>> Bonjour,
>>>>>>
>>>>>> TL;DR: Is it more advisable to work on Ceph internals to make
>>>>>> it friendly to this particular workload, or to write something
>>>>>> similar to EOS[0] (i.e. RocksDB + XRootD + RBD)?
>>>>>>
>>>>>> This is a follow-up to two previous mails[1] sent while
>>>>>> researching this topic. In a nutshell, the Software Heritage
>>>>>> project[1] currently has ~750TB and 10 billion objects. 75% of
>>>>>> the objects are smaller than 16KB and 50% are smaller than 4KB,
>>>>>> but they only account for ~5% of the 750TB: the 25% of objects
>>>>>> larger than 16KB total ~700TB. The objects compress by ~50%, so
>>>>>> the 750TB only needs ~350TB of actual storage (if you're
>>>>>> interested in the details see [2]).
>>>>>>
>>>>>> Let's say those 10 billion objects are stored in a single 4+2
>>>>>> erasure coded pool, with bluestore compression enabled for
>>>>>> objects larger than 32KB and the smallest allocation size for
>>>>>> bluestore set to 4KB[3]. The 750TB won't use the expected 350TB
>>>>>> but about 30% more, i.e. ~450TB (see [4] for the maths). This
>>>>>> space amplification happens because storing a 1 byte object
>>>>>> uses the same space as storing a 16KB object (see [5] to repeat
>>>>>> the experiment at home). In a 4+2 erasure coded pool, each of
>>>>>> the 6 chunks (4 data + 2 parity) uses no less than 4KB because
>>>>>> that's the smallest allocation size for bluestore: the data
>>>>>> chunks alone occupy 4 * 4KB = 16KB even when all that is needed
>>>>>> is 1 byte.
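To make the amplification arithmetic concrete, here is a rough sketch
in plain Python (not Ceph code; the constants are the ones discussed
above, and per-object metadata and compression are ignored):

    # Raw space consumed by one object in a k+m erasure coded pool when
    # bluestore allocates at least min_alloc_size bytes per chunk.
    K, M = 4, 2                  # erasure code profile (4 data + 2 parity)
    MIN_ALLOC = 4 * 1024         # bluestore min_alloc_size (4KB)

    def raw_bytes(object_size: int, k: int = K, m: int = M,
                  min_alloc: int = MIN_ALLOC) -> int:
        chunk = max(1, -(-object_size // k))             # ceil(size / k)
        per_chunk = -(-chunk // min_alloc) * min_alloc   # round up to 4KB
        return per_chunk * (k + m)

    for size in (1, 4 * 1024, 16 * 1024, 1024 * 1024):
        print(f"{size:>8} byte object -> {raw_bytes(size):>8} raw bytes")
    # A 1 byte object and a 16KB object both end up at 6 * 4KB = 24KB
    # of raw space, which is where the ~30% overall amplification on a
    # dataset dominated by small objects comes from.

Whether a real pool pads every object to a full stripe exactly like
this depends on the EC profile and bluestore version, so treat it as
an upper-bound illustration rather than an exact model.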
>>>>>> It was suggested[6] to have two different pools: a 4+2 erasure
>>>>>> coded pool with compression for all objects larger than 32KB,
>>>>>> which are expected to compress down to 16KB, and another pool
>>>>>> with 3 replicas for the smaller objects, to reduce space
>>>>>> amplification to a minimum without compromising on durability.
>>>>>> A client looking for an object could make two simultaneous
>>>>>> requests to the two pools: it would get a 404 from one of them
>>>>>> and the object from the other.
>>>>>>
>>>>>> Another workaround is best described in the "Finding a needle in
>>>>>> Haystack: Facebook’s photo storage"[9] paper and essentially
>>>>>> boils down to using a database to store a map between the object
>>>>>> name and its location. That does not scale out (writing the
>>>>>> database index is the bottleneck) but it is simple enough and is
>>>>>> successfully implemented in EOS[0], which holds >200PB worth of
>>>>>> data, and in seaweedfs[10], another promising object store based
>>>>>> on the same idea.
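For readers unfamiliar with the paper, a toy sketch of the idea: object
bytes are appended to a large volume (a file here, but it could be an
RBD image) and a small index maps the SHA256 name to an (offset,
length) pair. The class and file names are made up for the example;
EOS and seaweedfs implement the same idea with persistent indexes and
many volumes.

    import hashlib

    class TinyHaystack:
        """Append-only volume + in-memory index; a sketch, not EOS code."""
        def __init__(self, volume_path: str):
            self.volume = open(volume_path, "a+b")
            self.index = {}               # sha256 hex -> (offset, length)

        def put(self, data: bytes) -> str:
            name = hashlib.sha256(data).hexdigest()
            if name not in self.index:    # content-addressed: dedup for free
                self.volume.seek(0, 2)    # find current end of the volume
                offset = self.volume.tell()
                self.volume.write(data)   # appends (file opened in "a" mode)
                self.volume.flush()
                self.index[name] = (offset, len(data))
            return name

        def get(self, name: str) -> bytes:
            offset, length = self.index[name]
            self.volume.seek(offset)
            return self.volume.read(length)

    store = TinyHaystack("/tmp/volume-0000")
    oid = store.put(b"hello world")
    assert store.get(oid) == b"hello world"

In the real thing the index would live in something like RocksDB and
the volumes could be RBD images, which is roughly what the EOS-style
option in the TL;DR above amounts to; the index write path is then the
part that does not scale out.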
>>>>>> Instead of working around the problem, maybe Ceph could be
>>>>>> modified to make better use of the immutability of these
>>>>>> objects[7], a hint that is apparently only used to decide how
>>>>>> best to compress them and for checksum calculation[8]. I
>>>>>> honestly have no clue how difficult that would be. All I know
>>>>>> is that it's not easy, otherwise it would have been done
>>>>>> already: there seems to be a general need for efficiently
>>>>>> (space-wise and performance-wise) storing large quantities of
>>>>>> objects smaller than 4KB.
>>>>>>
>>>>>> Is it more advisable to:
>>>>>>
>>>>>> * work on Ceph internals to make it friendly to this particular
>>>>>>   workload or,
>>>>>> * write another implementation of "Finding a needle in Haystack:
>>>>>>   Facebook’s photo storage"[9] based on RBD[11]?
>>>>>>
>>>>>> I'm currently leaning toward working on Ceph internals but there
>>>>>> are pros and cons to both approaches[12]. And since all this is
>>>>>> still very new to me, there is also the possibility that I'm
>>>>>> missing something. Maybe it's *super* difficult to improve Ceph
>>>>>> in this way. I should try to figure that out sooner rather than
>>>>>> later.
>>>>>>
>>>>>> I realize it's a lot to take in, and unless you're facing the
>>>>>> exact same problem there is very little chance you read this
>>>>>> far :-) But if you did... I'm *really* interested to hear what
>>>>>> you think. In any case I'll report back to this thread once a
>>>>>> decision has been made.
>>>>>>
>>>>>> Cheers
>>>>>>
>>>>>> [0] https://eos-web.web.cern.ch/eos-web/
>>>>>> [1] https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/AEMW6O7WVJFMUIX7QGI2KM7HKDSTNIYT/
>>>>>>     https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/RHQ5ZCHJISXIXOJSH3TU7DLYVYHRGTAT/
>>>>>> [2] https://forge.softwareheritage.org/T3054
>>>>>> [3] https://github.com/ceph/ceph/blob/3f5e778ad6f055296022e8edabf701b6958fb602/src/common/options.cc#L4326-L4330
>>>>>> [4] https://forge.softwareheritage.org/T3052#58864
>>>>>> [5] https://forge.softwareheritage.org/T3052#58917
>>>>>> [6] https://forge.softwareheritage.org/T3052#58876
>>>>>> [7] https://docs.ceph.com/en/latest/rados/api/librados/#c.@3.LIBRADOS_ALLOC_HINT_FLAG_IMMUTABLE
>>>>>> [8] https://forge.softwareheritage.org/T3055
>>>>>> [9] https://www.usenix.org/legacy/event/osdi10/tech/full_papers/Beaver.pdf
>>>>>> [10] https://github.com/chrislusf/seaweedfs/wiki/Components
>>>>>> [11] https://forge.softwareheritage.org/T3049
>>>>>> [12] https://forge.softwareheritage.org/T3054#58977
>>>>>>
>>>>>> --
>>>>>> Loïc Dachary, Artisan Logiciel Libre

--
Loïc Dachary, Artisan Logiciel Libre
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io