On 3/12/19 7:24 AM, Benjamin Zapiec wrote:
> Hello,
>
> I was wondering why my ceph block.db is nearly empty, so I started
> to investigate.
>
> The recommendation from Ceph is that block.db should be at least
> 4% of the size of block. My OSD configuration looks like this:
>
> wal.db   - not explicitly specified
> block.db - 250GB of SSD storage
> block    - 6TB
By default we currently use four 256MB WAL buffers, so 2GB should be
enough, though in most cases you are better off just leaving it on
block.db as you did below.
> Since the WAL is written to block.db if no separate WAL device is
> given, I didn't configure one. With a size of 250GB we are slightly
> above 4%.
The WAL will only use about 1GB of that, FWIW.
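For reference, the sizing above can be checked with quick arithmetic (a sketch using the numbers from this thread; the 4% figure is the usual rule of thumb from the Ceph docs):

```python
# Rough check of the "block.db >= 4% of block" rule of thumb,
# using the sizes from this thread.
block_gb = 6 * 1024          # 6TB block device
block_db_gb = 250            # SSD partition for block.db

recommended_gb = 0.04 * block_gb
print(f"recommended: {recommended_gb:.1f}GB, actual: {block_db_gb}GB")
print("meets 4% recommendation:", block_db_gb >= recommended_gb)
```

This confirms that 250GB sits slightly above the 245.76GB that 4% of 6TB works out to.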
> So everything should be "fine". But block.db only contains
> about 10GB of data.
If this is an RBD workload, that's quite possible as RBD tends to use
far less metadata than RGW.
> I figured out that an object in block.db gets "amplified", so
> the space consumption is much higher than what the object itself
> would need.
Data in the DB in general will suffer space amplification, and it gets
worse the more levels in rocksdb you have, as multiple levels may hold
copies of the same data from different points in time. The bigger issue
is that currently an entire level has to fit on the DB device. I.e., if
level 0 takes 1GB, level 1 takes 10GB, level 2 takes 100GB, and level 3
takes 1000GB, you will only get levels 0, 1 and 2 on a 250GB block.db.
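That whole-level constraint can be sketched as a cumulative fit check (a sketch assuming the illustrative 1/10/100/1000GB level sizes above, and the simplification that a level lives entirely on block.db or not at all):

```python
# Which rocksdb levels land on a block.db of a given size, assuming
# a level only fits if it fits together with all lower levels.
level_sizes_gb = [1, 10, 100, 1000]   # levels 0..3, example sizes
db_size_gb = 250

used_gb = 0
on_db = []
for level, size in enumerate(level_sizes_gb):
    if used_gb + size > db_size_gb:
        break                 # this and all higher levels spill to slow storage
    on_db.append(level)
    used_gb += size

print(f"levels on block.db: {on_db}, using {used_gb}GB of {db_size_gb}GB")
```

With 250GB only levels 0-2 fit (111GB in this example), which is also one reason a block.db can look mostly empty even when it is sized correctly.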
> I'm using Ceph as the storage backend for OpenStack, and raw images
> of 10GB and more are common. So if I understand this correctly,
> I have to consider that a 10GB image may consume 100GB of block.db.
The DB holds metadata for the images (and some metadata for bluestore).
This is going to be a very small fraction of the overall data size but
is really important. Whenever we do a write to an object we first try
to read some metadata about it (if it exists). Having those read
attempts happen quickly is really important to make sure that the write
happens quickly.
> Besides the fact that an image may have a size of 100GB and is
> only used for initial reads until all changed blocks get written
> to an SSD-only pool, I was asking myself whether I need a block.db
> at all, or whether it would be better to save the SSD space used
> for block.db and just create a 10GB wal.db?
See above. Also, rocksdb periodically has to compact data and with lots
of metadata (and as a result lots of levels) it can get pretty slow.
Having rocksdb on fast storage helps speed that process up and avoid
write stalls due to level0 compaction (higher level compaction can
happen in alternate threads).
> Has anyone done this before? Anyone who had sufficient SSD space
> but stuck with wal.db to save SSD space?
>
> If I'm correct, block.db will never be used for huge images. And
> even if it is used for one or two images, does this make sense?
> The images are used initially to read all unchanged blocks from
> them. After a while, each VM should access the images pool less and
> less due to the changes made in the VM.
The DB is there primarily to store metadata. RBD doesn't use a lot of
space but may do a lot of reads from the DB if it can't keep all of the
bluestore onodes in its own in-memory cache (the kv cache). RGW uses
the DB much more heavily, and in some cases you may see 40-50% space
usage if you have tiny RGW objects (~4KB). See this spreadsheet for
more info:
https://drive.google.com/file/d/1Ews2WR-y5k3TMToAm0ZDsm7Gf_fwvyFw/view?usp=sharing
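The tiny-object effect falls out of simple arithmetic: with a roughly fixed per-object metadata footprint, the DB's share of total space grows as objects shrink (a sketch; the 3KB per-object metadata figure is an assumed illustrative value, not a measurement):

```python
# DB space fraction vs. object size, assuming a fixed per-object
# metadata footprint (the 3KB is hypothetical, for illustration only).
meta_kb = 3.0

for obj_kb in (4, 64, 4096):          # 4KB, 64KB, 4MB objects
    frac = meta_kb / (meta_kb + obj_kb)
    print(f"{obj_kb:>5}KB object -> DB fraction {frac:.1%}")
```

At ~4KB objects this lands in the 40-50% range quoted above, while at RBD-style multi-megabyte objects the fraction is negligible.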
Mark
> Any thoughts about this?
>
> Best regards
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com