>> 
>> Which version of Ceph was this deployed on? Did you use the default
>> value for bluestore_min_alloc_size(_hdd)? If it's before Pacific and
>> you used the default, then the min alloc size is 64KiB for HDDs, which
>> could be causing quite a bit of usage inflation depending on the sizes
>> of objects involved.
>> 
> 
> Is it recommended that if you have a pre-pacific cluster you change this now 
> before upgrading?

It’s baked into a given OSD at creation time.  Changing the setting after 
the fact has no effect unless you rebuild the affected OSDs.
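
On recent releases, ISTR the baked-in value is reported in the OSD 
metadata, so something like the below should show it (osd.12 is just an 
example id; check whether your release includes the field):

    ceph osd metadata 12 | grep -i alloc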

As noted above, significant space amplification can happen with RGW when a 
large fraction of the stored objects are relatively small.

This sheet quantifies and visualizes this phenomenon nicely: 

https://docs.google.com/spreadsheets/d/1rpGfScgG-GLoIGMJWDixEkqs-On9w8nAUToPQjN8bDI

If your OSDs were deployed with bluestore_min_alloc_size=16KB, S3/Swift objects 
that aren’t an exact multiple of 16KB in size will strand unused space.  Think 
remainder in a modulus operation.  E.g., if you write a 1KB object, BlueStore 
will allocate 16KB and you’ll waste 15KB.  If you write a 15KB object, 
BlueStore still allocates 16KB, but the wasted percentage is much lower.  If 
you write a 17KB object, the absolute waste ratchets back up: 17 mod 16 leaves 
a 1KB remainder in a second 16KB allocation unit, so 15KB of that unit is 
stranded, but since you’ve also stored a full 17KB of data the _percentage_ of 
stranded space is still lower than in the 1KB case.  This rapidly becomes 
insignificant as S3 object size increases.
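
To make the rounding arithmetic concrete, a throwaway loop (the 16KB 
min_alloc matches the example above; the object sizes are arbitrary):

    min_alloc=16
    for kb in 1 15 17 100; do
      # stranded KB = the unused tail of the last allocation unit
      echo "${kb}KB object strands $(( (min_alloc - kb % min_alloc) % min_alloc ))KB"
    done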

Note that this is multiplied by replication.  With 3R, the total stranded space 
will be 3x the remainder.  With EC, depending on K and M, the total is 
potentially much larger, since the client object is sharded over a larger 
number of RADOS objects and thus OSDs, each of which rounds up to 
min_alloc_size independently.
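
Back of the envelope for a 1KB client object, again assuming 16KB 
min_alloc (the EC 4+2 profile is likewise just an example):

    # 3R: each of the three copies strands the same 15KB remainder
    echo $(( 3 * 15 ))              # 45KB stranded in total
    # EC 4+2: even a tiny object allocates one unit on each of the 6 shards
    k=4; m=2; min_alloc=16
    echo $(( (k + m) * min_alloc )) # 96KB raw allocated for 1KB of data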

There is a doc PR already in progress that explains this phenomenon.

If your object size distribution is rich in relatively small objects, you can 
reclaim space by iteratively destroying and redeploying the OSDs that were 
created with the larger value.
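
A rough sketch of the per-OSD cycle for a manually deployed cluster 
(osd.12 and /dev/sdX are placeholders; if you run cephadm or Rook, let 
the orchestrator rebuild the OSD instead):

    # make sure the smaller value will be picked up by newly created OSDs
    ceph config set osd bluestore_min_alloc_size_hdd 4096

    ceph osd out 12
    # wait for data to migrate away, then confirm:
    ceph osd safe-to-destroy osd.12

    systemctl stop ceph-osd@12         # on the OSD host
    ceph osd destroy 12 --yes-i-really-mean-it
    ceph-volume lvm zap --destroy /dev/sdX
    ceph-volume lvm create --osd-id 12 --data /dev/sdX

    # let the cluster return to HEALTH_OK before moving to the next one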

RBD volumes are striped into RADOS objects that are much larger than 
min_alloc_size (4MB by default), so this phenomenon is generally not 
significant for RBD pools.
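
You can confirm the object size of a given image with rbd info (pool and 
image names are placeholders):

    rbd info mypool/myimage    # look for the "order 22 (4 MiB objects)" line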


Other factors that may be at play here:

* Your OSDs at 600MB are small by Ceph standards; we’ve seen in the past that 
this can result in a relatively large ratio of overhead to raw / payload 
capacity.

* ISTR having read that versioned objects / buckets and resharding operations 
can in some situations leave orphaned RADOS objects (see the rgw-orphan-list 
note after this list).

* If you’ve ever run `rados bench` against any of your pools, there may be a 
bunch of leftover RADOS objects lying around taking up space.  By default 
something like `rados ls -p mypool | egrep '^bench'` will show these.  Note 
that this may take a long time to run, and if the `rados bench` invocation 
specified a non-default run name the pattern may be different.
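
If you do find bench leftovers, `rados` has a built-in cleanup for its 
own objects (this assumes the default run name; there’s a --run-name 
option otherwise):

    rados -p mypool cleanup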
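
For the orphan case above, recent Ceph ships an experimental 
rgw-orphan-list script that scans an RGW data pool for RADOS objects no 
longer referenced by RGW; the pool name below is the stock default, and 
you should read the tool’s caveats before deleting anything it reports:

    rgw-orphan-list default.rgw.buckets.data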

— aad
