Thanks for all this information.

We are running version 16.2.7, but we also had this issue before upgrading to 
Pacific.

We are using the default value for bluestore_min_alloc_size(_hdd) and are 
currently redeploying every OSD with a new 4 TB HDD.

The confusing part is that the space was already used before we actively 
started using the S3 service.

> * If you’ve ever run `rados bench` against any of your pools, there may be a 
> bunch of leftover RADOS objects lying around taking up space.  By default 
> something like `rados ls -p <pool> | egrep '^bench.*$'` will show these.  Note 
> that this may take a long time to run, and if the `rados bench` invocation 
> specified a non-default job name the pattern may be different.


I did run rados bench in the past but cannot find any leftovers. Back then I 
changed many things while playing around with the cluster.

Wouldn’t all of the issues you describe show up as usage in `ceph df`? I 
currently have 20 TiB used, but all pools combined account for only a little 
more than 16 TiB.
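
For context, the numbers come from something like the following (exact column 
names vary a bit between releases):

  ceph df detail    # RAW USED at the top vs. per-pool STORED/USED below
  ceph osd df tree  # per-OSD raw utilisation, to spot overhead outside the pools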

Thanks,

Hendrik

> On 10. Apr 2022, at 10:17, Anthony D'Atri <anthony.da...@gmail.com> wrote:
> 
>>> 
>>> Which version of Ceph was this deployed on? Did you use the default
>>> value for bluestore_min_alloc_size(_hdd)? If it's before Pacific and
>>> you used the default, then the min alloc size is 64KiB for HDDs, which
>>> could be causing quite a bit of usage inflation depending on the sizes
>>> of objects involved.
>>> 
>> 
>> Is it recommended that, if you have a pre-Pacific cluster, you change this 
>> now before upgrading?
> 
> It’s baked into a given OSD at creation time.  Changing it after the fact has 
> no effect unless you rebuild the affected OSDs.
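> 
> If you want to confirm what a given OSD was actually created with, something 
> like this may show it (assuming your release exposes the value in the OSD 
> metadata; not every build reports that field):
> 
>   ceph osd metadata <osd-id> | grep -i alloc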
> 
> As noted above, significant space amplification can happen with RGW when 
> storing a significant fraction of relatively small objects.
> 
> This sheet quantifies and visualizes this phenomenon nicely: 
> 
> https://docs.google.com/spreadsheets/d/1rpGfScgG-GLoIGMJWDixEkqs-On9w8nAUToPQjN8bDI
> 
> If your OSDs were deployed with bluestore_min_alloc_size=16KB, S3/Swift 
> objects that aren’t roughly an even multiple of 16KB in size will allocate 
> unused space.  Think remainder in a modulus operation.  E.g., if you write a 
> 1KB object, BlueStore will allocate 16KB and you’ll waste 15KB.  If you write 
> a 15KB object, only 1KB is wasted, so the percentage is much lower.  If you 
> write a 17KB object, it occupies two allocation units (32KB) and 15KB is again 
> stranded, but since you’ve also stored a full 17KB of data the _percentage_ of 
> stranded space is lower.  This rapidly becomes insignificant as S3 object size 
> increases.
> 
> Note that this is multiplied by replication.  With 3R, the total stranded 
> space will be 3x the per-copy waste.  With EC, depending on K and M, the total 
> is potentially much larger since the client object is spread across a larger 
> number of RADOS objects and thus OSDs.
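> 
> As a rough worked example with the pre-Pacific 64KiB HDD default: a 1KiB S3 
> object strands 63KiB per copy, so about 3 x 63KiB = 189KiB of raw space with 
> 3R.  With EC 4+2 each of the 6 shards still occupies at least one allocation 
> unit, so the same 1KiB object can consume on the order of 6 x 64KiB = 384KiB 
> raw.  (Illustrative numbers only; exact figures depend on the pool layout.)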
> 
> There is a doc PR already in progress that explains this phenomenon.
> 
> If your population / distribution of objects is rich in relatively small 
> objects, you can reclaim space by iteratively destroying and redeploying OSDs 
> that were created with the larger value.
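> 
> One possible per-OSD sequence, as a rough sketch only (adapt to your 
> deployment tooling):
> 
>   ceph osd out <id>                            # let data drain off the OSD
>   ceph osd safe-to-destroy <id>                # repeat until it reports safe
>   ceph osd purge <id> --yes-i-really-mean-it
>   # then zap and redeploy the device with ceph-volume / cephadm, which will
>   # pick up the current min_alloc_size default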
> 
> RBD volumes tend to be much larger than min_alloc_size*, so this phenomenon 
> is generally not significant for RBD pools.
> 
> 
> Other factors that may be at play here:
> 
> * Your OSDs at 600MB are small by Ceph standards; we’ve seen in the past that 
> this can result in a relatively large ratio of overhead to raw / payload 
> capacity.
> 
> * ISTR having read that versioned objects / buckets and resharding operations 
> can in some situations leave orphaned RADOS objects behind.
> 
> * If you’ve ever run `rados bench` against any of your pools, there may be a 
> bunch of leftover RADOS objects lying around taking up space.  By default 
> something like `rados ls -p <pool> | egrep '^bench.*$'` will show these.  Note 
> that this may take a long time to run, and if the `rados bench` invocation 
> specified a non-default job name the pattern may be different.
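> 
> A rough way to scan every pool for them (the prefix below is the rados bench 
> default; a custom run name would change it):
> 
>   for p in $(rados lspools); do
>     echo "== $p =="
>     rados -p "$p" ls | grep -c '^benchmark_data'
>   done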
> 
> — aad
> 

_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
