Hi Frank,
Thank you for the incredibly detailed reply! Will respond inline.
On 8/17/22 7:06 AM, Frank Schilder wrote:
Hi Mark,
please find below a detailed report with data and observations from our
production system. The ceph version is mimic-latest, and some ways of
configuring compression or interpreting settings may or may not have changed
since. As far as I know it's still pretty much the same.
Ok, this is good to know. While the compression settings haven't
changed much, there definitely have been changes regarding min_alloc
size and allocator code.
First thing I would like to mention is that you will not be able to get rid of
OSD blob compression. This is the only way to preserve reasonable performance
for applications like CephFS and RBD. If you are thinking about
application-level full-file or full-object compression, this would probably be
acceptable for upload/download-only applications. For anything else, something
like this would degrade performance to unacceptable levels.
I'm curious how much we actually gain by compressing at the blob level
vs object level in practice. Obviously it could mean a lot more work
when doing small IOs, but I'm curious how much latency it actually adds
when only compressing the blob vs the whole object. Also for something
like RBD I wonder if a simplified blob structure with smaller
(compressed) objects might be a bigger win? CephFS is a good point
though. Right now there is likely some advantage by compressing at the
blob level, especially if we are talking about small writes to huge objects.
Second thing is that the current bluestore compression could be much more
effective if the problem of small objects were addressed. This might happen at
the application level, for example by implementing tail-merging for CephFS.
This would bring dramatic improvements, because the largest amount of
over-allocation does not come from uncompressed data but from many small
objects (even if compressed). I mentioned this already in our earlier
conversation and I will include you in a new thread specifically about my
observations in this direction.
Indeed. No argument on the impact here.
As one of the important indicators of ineffective compression and huge
over-allocation due to small objects on CephFS that you asked for, please see
the output of ceph df below. The pool con-fs2-meta2 is the primary data pool of
an FS where the root of the file system is assigned to another data pool,
con-fs2-data2 - the so-called 3-pool FS layout. As you can see, con-fs2-meta2
contains 50% of all FS objects, yet they are all of size 0. One could say
"perfectly compressed", but each of them still requires a
min_alloc_size*replication_factor allocation on disk (in our case, 16K*4 on the
meta2 pool and 64K*11 on the data2 pool!). Together with the hundreds of
millions of small files on the file system, which each require such a minimum
allocation as well, a huge waste of raw capacity results. I'm just lucky I
don't have the con-fs2-meta2 objects in the main pool. It's also a huge pain
for recovery.
I take it those size-0 objects are just metadata? It's pretty
unfortunate if we end up allocating min_alloc just for the header/footer
on all EC shards. At least in more modern versions of ceph the
min_alloc size is 4K in all cases, so this gets better but doesn't
totally go away.
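To put a rough number on the over-allocation described above, here is a
back-of-envelope sketch in Python, following Frank's accounting (the pool
parameters are the ones quoted in this thread; the helper name is mine):

```python
# Even a size-0 object costs one min_alloc_size allocation per
# replica/EC shard, by the accounting described in this thread.

def raw_per_empty_object(min_alloc_size, copies):
    """Raw bytes allocated for a single size-0 object."""
    return min_alloc_size * copies

# con-fs2-meta2: replicated size 4 on SSD (min_alloc 16 KiB)
meta2 = raw_per_empty_object(16 * 1024, 4)    # 64 KiB per empty object
# con-fs2-data2: EC 8+3 on HDD (min_alloc 64 KiB, 11 shards)
data2 = raw_per_empty_object(64 * 1024, 11)   # 704 KiB per empty object

# ~408 million size-0 objects on con-fs2-meta2 (from the ceph df below):
total = 408_055_310 * meta2
print(f"{total / 2**40:.1f} TiB raw for size-0 objects alone")  # ~24.3 TiB
```

If this reading of the allocation rules is right, the meta2 pool's "0 B used"
hides tens of TiB of raw allocation.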
I'm pretty sure the same holds for RGW with small objects. The only application
that does not have this problem is RBD with its fixed uniform object size.
This will not change with application-level compression; it requires merging
small objects into large ones. I consider this to be currently the major factor
for excess raw usage, and compression improvements of a few percent will have
only very small effects on a global scale. Looking at the stat numbers below
from a real-life HPC system, you can estimate how much one could at best get
out of more/better compression.
Sounds like you have a lot of small objects? Back when EC was
implemented I recall we were really thinking about it in terms of large
object use cases (and primarily for RGW). Over time we've gotten a lot
more interest from people wanting to use EC with CephFS and RBD, and
also with smaller object sizes. It's definitely a lot trickier getting
those right imho.
For example, on our main bulk data pool, compressed allocated is only 13%. Even
if compression could compress this to size 0, the overall gain would at best be
13%. On the other hand, the actual compression rate of 2.9 is actually quite
good. If *all* data were merged into blobs of a minimum size that allowed
saving this amount of allocation by compression, one could improve storage
capacity by a factor of about 2.5 (250%!), even with the current implementation
of compression.
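The 13% figure can be reproduced from the OSD counters quoted further down
(first HDD OSD in the stats section; a quick Python check, variable names
are mine):

```python
# Reproducing the "compressed allocated is only 13%" claim from the
# per-OSD counters pasted later in this mail.

allocated            = 5_433_725_812_736   # bluestore_allocated
compressed_allocated =   699_385_708_544   # bluestore_compressed_allocated
compressed_original  = 1_510_040_834_048   # bluestore_compressed_original

# Share of total allocation occupied by compressed data; even perfect
# compression of that share caps the possible overall gain there.
share = compressed_allocated / allocated
print(f"compressed share of allocation: {share:.0%}")   # 13%

# Effective ratio after min_alloc_size rounding (original vs. allocated):
ratio = compressed_original / compressed_allocated
print(f"effective compression ratio: {ratio:.2f}")      # ~2.16
```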
Consequently, my personal opinion is that it is not worth spending much time on
better compression if the small-object min-allocation problem is not addressed
first. A simplification of the interplay of the current parameters and removal
of essentially redundant ones could be interesting just for the purpose of
making configuration of compression easier. As you can see in this report, I
also got it wrong. The current way is a bit too complicated.
Yeah, as we talk about this it's clear that this is not nearly as
straightforward to configure as it should be.
Observations on our production cluster
======================================
It seems that the way compression is configured is a bit more complicated/messy
than I thought. In our earlier conversation I gave a matrix using all
combinations of three bluestore- and pool-level compression_mode options: none,
passive and aggressive. Apparently, the possibility "not set" adds yet another
row+column to the table. I thought "not set" was equal to "none", but it isn't
- with counter-intuitive results. Due to this, I have a pool compressed that I
didn't want compressed. Well, I can fix that. There is a very strange
observation about raw usage on this pool though, reported at the very end of
the stats report below.
A second strange observation is that even though
bluestore_compression_min_blob_size_ssd is smaller than
bluestore_min_alloc_size_ssd, data is compressed on SSD OSDs. According to the
info I got, such a setting should result in no compression at all (well, no
allocation savings), because the compressed blob size is always smaller than
min_alloc_size and will cause a full allocation of min_alloc_size. Yet, there
is a tiny amount less allocated than stored reported, and I'm wondering what is
happening here.
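The interaction, as I understand it, works like this (a sketch of the rounding
logic only, not a quote of the BlueStore source):

```python
# Compressed blobs are still stored in whole min_alloc_size units, so
# compression only saves space when the compressed size rounds up to
# fewer units than the uncompressed size would. This is my reading of
# the behavior, not the actual BlueStore code.

def alloc_units(size_bytes, min_alloc):
    """Round a byte size up to whole min_alloc_size units."""
    return -(-size_bytes // min_alloc) * min_alloc  # ceiling division

min_alloc = 16 * 1024   # bluestore_min_alloc_size_ssd

# An 8 KiB blob compressing to 4 KiB still allocates a full 16 KiB unit:
print(alloc_units(4 * 1024, min_alloc))    # 16384 -> no saving
# A 64 KiB blob compressing to 20 KiB allocates 32 KiB instead of 64:
print(alloc_units(20 * 1024, min_alloc))   # 32768 -> saves 32 KiB
```

By this reading, blobs below min_alloc_size can never save allocation, which
is exactly why the tiny savings observed on the SSD OSDs are surprising.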
Pools
=====
POOLS:
    NAME                   ID  USED     %USED  MAX AVAIL  OBJECTS
    sr-rbd-meta-one         1   90 GiB   0.45     20 TiB      33332
    sr-rbd-data-one         2   71 TiB  55.52     57 TiB   25603605
    sr-rbd-one-stretch      3  222 GiB   1.09     20 TiB      68813
    con-rbd-meta-hpc-one    7   52 KiB      0    1.1 TiB         61
    con-rbd-data-hpc-one    8   36 GiB      0    4.9 PiB       9418
    sr-rbd-data-one-hdd    11  121 TiB  42.07    167 TiB   32346929
    con-fs2-meta1          12  463 MiB   0.05    854 GiB   40314740
    con-fs2-meta2          13      0 B      0    854 GiB  408055310
    con-fs2-data           14  1.1 PiB  17.79    4.9 PiB  407608732
    con-fs2-data-ec-ssd    17  274 GiB   9.10    2.7 TiB    3649114
    ms-rbd-one             18  378 GiB   1.85     20 TiB     159631
    con-fs2-data2          19  1.3 PiB  23.18    2.9 TiB     806440
For the effect of compression, the ceph fs layout is important:
+---------------------+----------+-------+-------+
| Pool | type | used | avail |
+---------------------+----------+-------+-------+
| con-fs2-meta1 | metadata | 492M | 853G |
| con-fs2-meta2 | data | 0 | 853G |
| con-fs2-data | data | 1086T | 5021T |
| con-fs2-data-ec-ssd | data | 273G | 2731G |
| con-fs2-data2 | data | 1377T | 4565T |
+---------------------+----------+-------+-------+
We have both the fs metadata pool and the primary data pool on replicated pools
on SSD. The data pool con-fs2-data2 is attached to the root of the file system.
The data pool con-fs2-data used to be the root and is not attached to any fs
path. We changed the bulk data pool from EC 8+2 (con-fs2-data) to 8+3
(con-fs2-data2), and con-fs2-data contains all "old" files on the 8+2 pool. The
small pool con-fs2-data-ec-ssd is attached to an apps path for heavily accessed
small files.
Whether or not the primary data pool of an FS is a separate data pool will have
a large influence on how effective compression can be. It's a huge number of
small objects that will never be compressed due to their size of 0.
In general, due to the absence of tail merging, file systems with many small
files will suffer from massive over-allocation as well as from many blobs being
too small for compression.
Pools with compression enabled
==============================
Keys in output below
n : pool_name
cm : options.compression_mode
sz : size
msz : min_size
ec : erasure_code_profile
EC RBD data pools
-----------------
{"n":"sr-rbd-data-one","cm":"aggressive","sz":8,"msz":6,"ec":"sr-ec-6-2-hdd"}
{"n":"con-rbd-data-hpc-one","cm":"aggressive","sz":10,"msz":9,"ec":"con-ec-8-2-hdd"}
{"n":"sr-rbd-data-one-hdd","cm":"aggressive","sz":8,"msz":7,"ec":"sr-ec-6-2-hdd"}
Replicated RBD pools
--------------------
{"n":"sr-rbd-one-stretch","cm":"aggressive","sz":3,"msz":2,"ec":""}
{"n":"ms-rbd-one","cm":"aggressive","sz":3,"msz":2,"ec":""}
EC FS data pools
----------------
{"n":"con-fs2-data","cm":"aggressive","sz":10,"msz":9,"ec":"con-ec-8-2-hdd"}
{"n":"con-fs2-data-ec-ssd","cm":"aggressive","sz":10,"msz":9,"ec":"con-ec-8-2-ssd"}
{"n":"con-fs2-data2","cm":"aggressive","sz":11,"msz":9,"ec":"con-ec-8-3-hdd"}
Relevant OSD settings
=====================
bluestore_compression_mode = aggressive
bluestore_compression_min_blob_size_hdd = 262144
bluestore_min_alloc_size_hdd = 65536
bluestore_compression_min_blob_size_ssd = 8192 *** Dang! ***
bluestore_min_alloc_size_ssd = 16384
Just noticed that I forgot to set bluestore_compression_min_blob_size_ssd to a
value that is a multiple of bluestore_min_alloc_size_ssd. I wanted to use
65536; the expected result on SSD pools is now no compression at all :(
There was a ticket on these defaults and they were set to useful values
starting with nautilus. Will look into that at some point.
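For reference, the fix would presumably be something along these lines (ceph
config database syntax, available since mimic; whether a running OSD picks the
change up without a restart is something I'd verify first):

```shell
# Raise the SSD compression blob floor to a multiple of min_alloc_size.
ceph config set osd bluestore_compression_min_blob_size_ssd 65536

# Verify what a running OSD actually uses:
ceph config show osd.0 | grep compression_min_blob
```

Note that min_alloc_size itself is baked in when the OSD is created, so only
the compression settings can be adjusted this way.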
Some compression stats
======================
HDD OSDs in the FS bulk data pool(s)
------------------------------------
These are the most relevant for us; they contain the bulk data. I picked stats
from 2 hosts, 1 OSD each, which should be representative for all OSDs. The
disks are 18TB with 160 and 153 PGs:
"compress_success_count": 72044,
"compress_rejected_count": 231282,
"bluestore_allocated": 5433725812736,
"bluestore_stored": 5788240735652,
"bluestore_compressed": 483906706661,
"bluestore_compressed_allocated": 699385708544,
"bluestore_compressed_original": 1510040834048,
"bluestore_extent_compress": 125924,
"compress_success_count": 68618,
"compress_rejected_count": 221980,
"bluestore_allocated": 5101829226496,
"bluestore_stored": 5427515391325,
"bluestore_compressed": 451595891951,
"bluestore_compressed_allocated": 652442533888,
"bluestore_compressed_original": 1407811862528,
"bluestore_extent_compress": 121594,
The success rate is not very high, almost certainly due to the many small files
on the system. Some people are also using compressed data stores, which will
also lead to rejected blobs. Tail-merging could probably improve on that a lot.
Also, not having the FS-backtrace objects on this pool prevents a lot of
(near-)empty allocations.
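The counters above translate to roughly the following (quick Python check on
the first OSD's numbers, as pasted):

```python
# Success rate and net saving implied by the first HDD OSD's counters.
success  = 72_044    # compress_success_count
rejected = 231_282   # compress_rejected_count
rate = success / (success + rejected)
print(f"compression attempts succeeding: {rate:.0%}")   # 24%

# Net allocation saving across the whole OSD:
allocated = 5_433_725_812_736   # bluestore_allocated
stored    = 5_788_240_735_652   # bluestore_stored
saving = 1 - allocated / stored
print(f"net space saved vs. stored: {saving:.1%}")      # 6.1%
```

So only about a quarter of attempts succeed, and the OSD-wide saving is in the
single-digit percent range, consistent with the small-object argument above.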
SSD OSDs in the small SSD FS data pool
--------------------------------------
Hmm, contrary to expectation, compression actually seems to happen. Maybe my
interpretation of min_alloc_size and compression_min_blob_size is wrong? Might
be another reason to simplify the compression parameters and explain better how
they actually work. Again OSDs picked from 2 hosts, 1 OSD each, 87 and 118 PGs:
"compress_success_count": 374638,
"compress_rejected_count": 19386,
"bluestore_allocated": 6909100032,
"bluestore_stored": 4087158731,
"bluestore_compressed": 306412527,
"bluestore_compressed_allocated": 468615168,
"bluestore_compressed_original": 937230336,
"bluestore_extent_compress": 952258,
"compress_success_count": 387489,
"compress_rejected_count": 21764,
"bluestore_allocated": 11573510144,
"bluestore_stored": 6847593045,
"bluestore_compressed": 552832088,
"bluestore_compressed_allocated": 844693504,
"bluestore_compressed_original": 1689387008,
"bluestore_extent_compress": 950922,
SSD OSDs in the RBD pool, rep and EC are collocated on the same OSDs
--------------------------------------------------------------------
OSDs picked from 2 hosts, 1 OSD each, 161 and 178 PGs:
"compress_success_count": 38835730,
"compress_rejected_count": 58506800,
"bluestore_allocated": 1064052097024,
"bluestore_stored": 1322947371131,
"bluestore_compressed": 68165358846,
"bluestore_compressed_allocated": 114503401472,
"bluestore_compressed_original": 289775203840,
"bluestore_extent_compress": 61265761,
"compress_success_count": 76647709,
"compress_rejected_count": 85273926,
"bluestore_allocated": 1081196380160,
"bluestore_stored": 1399985201821,
"bluestore_compressed": 83058256649,
"bluestore_compressed_allocated": 139784241152,
"bluestore_compressed_original": 350485362688,
"bluestore_extent_compress": 86168422,
SSD OSD in a rep RBD pool without compression
---------------------------------------------
This is a pool with accidentally enabled compression. This pool has SSDs in a
special device class exclusively to itself; hence, collective OSD compression
matches pool data compression 1:1. The stats are:
"compress_success_count": 41071482,
"compress_rejected_count": 1895058,
"bluestore_allocated": 171709562880,
"bluestore_stored": 304405529094,
"bluestore_compressed": 30506295169,
"bluestore_compressed_allocated": 132702666752,
"bluestore_compressed_original": 265405333504,
"bluestore_extent_compress": 48908699,
Compression mode (cm) is unset (NULL):
{"n":"sr-rbd-data-one-perf","cm":null,"sz":3,"msz":2,"ec":""}
What is really strange here is that raw allocation matches the size of the
uncompressed data times the replication factor. With the quite high resulting
compression rate, raw used should be much smaller than that. What is
going on here?
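A quick cross-check of the counters above, as pasted (variable names are mine):

```python
# Cross-checking the accidentally-compressed pool's OSD counters.
allocated            = 171_709_562_880   # bluestore_allocated
stored               = 304_405_529_094   # bluestore_stored
compressed_allocated = 132_702_666_752   # bluestore_compressed_allocated
compressed_original  = 265_405_333_504   # bluestore_compressed_original

# At the OSD level the savings *are* visible ...
print(f"allocated/stored: {allocated / stored:.2f}")   # 0.56
# ... and compressed data allocates exactly half its original size:
print(compressed_original / compressed_allocated)      # 2.0
```

The per-OSD counters look internally consistent, which makes the raw number
reported by ceph df all the stranger.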
I need to read through all of this a couple of more times (read through
it once so far), but I've always been a little suspicious of some of our
space accounting (especially in older versions like mimic!). I'm hoping
this might be a display issue, but can't dig into it right now. Adam
and I might be able to discuss next week though. In any event, it's
really interesting to see a real use case here of CephFS + EC +
compression, especially on an older release like mimic. That's exactly
the kind of real-world example I was looking for so we can get a sense
of what kinds of impacts making changes here might have.
Mark
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
________________________________________
From: Frank Schilder
Sent: 12 August 2022 14:06
To: Mark Nelson; ceph-users@ceph.io
Subject: Re: [ceph-users] Request for Info: bluestore_compression_mode?
Hi Mark,
ha ha ha, this is a brilliant misunderstanding :) I was under the impression
that since mimic all ceph developers were instructed never to mention the
ceph.conf file again and only ever talk about the ceph config data base
instead. The only allowed options in a config file are the monitor addresses
(well, a few more, but the idea is a minimal config file). And that's what my
config file looks like.
OK, I think we do mean the same thing. There are currently 2 sets of
compression options, the bluestore and the pool options. All have 3 values,
and depending on which combination of values is active for a PG, a certain
result becomes the effective one. I believe the actual compression option
applied to data is defined by a matrix like this (n=none, p=passive,
a=aggressive):

              | pool option
bluestore opt | n | p | a
            n | n | n | n
            p | n | p | p
            a | n | p | a
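Ignoring the extra "not set" row+column for the moment, the 3x3 part of the
matrix is just a minimum. A sketch of my reading (not the OSD source):

```python
# Effective compression mode as the weaker of the OSD-level and
# pool-level settings -- this reproduces the 3x3 matrix above.
# ("not set" behaves differently and is deliberately not modeled here.)

def effective_mode(bluestore_mode, pool_mode):
    order = {"none": 0, "passive": 1, "aggressive": 2}
    return min(bluestore_mode, pool_mode, key=order.get)

print(effective_mode("aggressive", "passive"))   # passive
print(effective_mode("passive", "none"))         # none
```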
I think the bluestore option is redundant, I set these on OSD level:
bluestore_compression_min_blob_size_hdd 262144
bluestore_compression_mode aggressive
I honestly don't see any use for the bluestore options, and neither do I see a
use case for mode=passive. Simplifying this matrix to a simple per-pool
compression on/off flag, with an option to choose the algorithm per pool as
well, seems a good idea and might even be a low-hanging fruit.
I need to collect some performance data from our OSDs for answering your
questions about higher-level compression possibilities. I was a bit busy today
with other stuff. I should have something for you next week.
Best regards and a nice weekend,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
________________________________________
From: Mark Nelson <mnel...@redhat.com>
Sent: 11 August 2022 16:51:03
To: Frank Schilder; ceph-users@ceph.io
Subject: Re: [ceph-users] Request for Info: bluestore_compression_mode?
On 8/11/22 03:20, Frank Schilder wrote:
Hi Mark,
I'm preparing a response with some data from our production system and will
also open a new thread on the tail merging topic. Both topics are quite large
in themselves. Just a quick question for understanding:
I was in fact referring to the yaml config and pool options ...
I don't know of a yaml file in ceph. Do you mean the ceph-adm spec file?
oh! We switched the conf parsing over to using yaml templates:
https://github.com/ceph/ceph/blob/main/src/common/options/global.yaml.in#L4428-L4452
sorry for being unclear here, I just meant the
bluestore_compression_mode option you specify in the ceph.conf file.
Mark
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
________________________________________
From: Mark Nelson <mnel...@redhat.com>
Sent: 10 August 2022 22:28
To: Frank Schilder; ceph-users@ceph.io
Subject: Re: [ceph-users] Request for Info: bluestore_compression_mode?
On 8/10/22 10:08, Frank Schilder wrote:
Hi Mark.
I actually had no idea that you needed both the yaml option
and the pool option configured
[...]
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io