On Wed, Jun 27, 2018 at 2:32 AM Nicolas Dandrimont <ol...@softwareheritage.org> wrote:
> Hi,
>
> I would like to use ceph to store a lot of small objects. Our current
> usage pattern is 4.5 billion unique objects, ranging from 0 to 100MB,
> with a median size of 3-4kB. Overall, that's around 350 TB of raw data
> to store, which isn't much, but that's across a *lot* of tiny files.
>
> We expect a growth pattern of around a third per year, and the object
> size distribution to stay roughly the same (it's been stable for the
> past three years, and we don't see that changing).
>
> Our object access pattern is a very simple key -> value store, where
> the key happens to be the sha1 of the content we're storing. Any
> metadata are stored externally and we really only need a dumb object
> storage.
>
> Our redundancy requirement is to be able to withstand the loss of 2
> OSDs.
>
> After looking at our options for storage in Ceph, I dismissed (perhaps
> hastily) RGW for its metadata overhead, and went straight to plain
> RADOS. I've set up an erasure-coded storage pool, with default
> settings, with k=5 and m=2 (expecting a 40% increase in storage use
> over plain contents).
>
> After storing objects in the pool, I see a storage usage of 700%
> instead of 140%. My understanding of the erasure code profile docs[1]
> is that objects that are below the stripe width (k * stripe_unit, which
> in my case is 20KB) can't be chunked for erasure coding, which makes
> RADOS fall back to plain object copying, with k+m copies.
>
> [1] http://docs.ceph.com/docs/master/rados/operations/erasure-code-profile/
>
> Is my understanding correct? Does anyone have experience with this kind
> of storage workload in Ceph?

That's close but not *quite* right. It's not that Ceph will explicitly
"fall back" to replication. With most (though perhaps not all) erasure
codes, what you'll see is full-sized parity blocks, a full store of the
data (with the default Reed-Solomon code that will just be full-sized
chunks, up to however many are needed to hold the object in a single
copy), and the remaining data chunks (out of the k) will contain no
data. *But* Ceph keeps the "object info" metadata in each shard, so all
the OSDs in a PG will still witness all the writes.

> If my understanding is correct, I'll end up adding size tiering on my
> object storage layer, shuffling objects into two pools with different
> settings according to their size. That's not too bad, but I'd like to
> make sure I'm not completely misunderstanding something.

That's probably a reasonable response, especially if you are already
maintaining an index for other purposes!
-Greg

> Thanks!
> --
> Nicolas Dandrimont
> Backend Engineer, Software Heritage
>
> BOFH excuse #170:
> popper unable to process jumbo kernel
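
P.S. For anyone who wants to prototype the two-pool size tiering, a
minimal sketch with python-rados could look like the following. The pool
names, the 20 KB cutoff (k * stripe_unit = 5 * 4096 bytes with the
profile above), and the store() helper are illustrative assumptions, not
anything Ceph ships:

    import hashlib
    import rados

    # Assumed cutoff: the stripe width of a k=5 profile with the default
    # 4096-byte stripe_unit; objects below it can't be chunked usefully.
    STRIPE_WIDTH = 5 * 4096  # 20480 bytes

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()

    # Hypothetical pool names: a replicated pool for small objects and an
    # erasure-coded (k=5, m=2) pool for everything at or above the cutoff.
    small_ioctx = cluster.open_ioctx('objects-small-replicated')
    big_ioctx = cluster.open_ioctx('objects-ec-5-2')

    def store(content):
        """Write content under its sha1, picking the pool by size."""
        key = hashlib.sha1(content).hexdigest()
        ioctx = small_ioctx if len(content) < STRIPE_WIDTH else big_ioctx
        ioctx.write_full(key, content)
        return key

    store(b'example content')

    small_ioctx.close()
    big_ioctx.close()
    cluster.shutdown()

Reads would need the same size-independent lookup, which is easy if the
external index already records each object's size alongside its sha1.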