Yep, we've been using RocksDB compression with Pacific for a few months now. It helped a lot.
Since we're talking spillover... Despite using bluestore_volume_selection_policy=use_some_extra with resharded RocksDB databases, we can still observe many OSDs spilling over from time to time (approximately every month and a half). When this happens:

- Almost all OSDs spill over, one after the other, over 2-3 days. They all get detected and compacted thanks to a cron job, then it's completely quiet again for another month and a half before it comes back. This phenomenon repeats cyclically.
- 'ceph health detail' shows figures similar to those reported in [1], which [2] is supposed to have fixed (if I'm not mistaken):

=== Full health status ===
[WARN] BLUEFS_SPILLOVER: 8 OSD(s) experiencing BlueFS spillover
    osd.337 spilled over 12 GiB metadata from 'db' device (12 GiB used of 124 GiB) to slow device
    osd.352 spilled over 12 GiB metadata from 'db' device (687 MiB used of 124 GiB) to slow device
    osd.353 spilled over 12 GiB metadata from 'db' device (152 MiB used of 124 GiB) to slow device
    osd.357 spilled over 12 GiB metadata from 'db' device (960 MiB used of 124 GiB) to slow device
    osd.359 spilled over 1.9 GiB metadata from 'db' device (12 GiB used of 124 GiB) to slow device

Has anyone ever experienced this?

Cheers,
Frédéric.

[1] https://tracker.ceph.com/issues/38745
[2] https://github.com/ceph/ceph/pull/29687

----- Le 12 Nov 24, à 17:36, Anthony D'Atri <anthony.da...@gmail.com> a écrit :

> Yes, it improves the dynamic where only ~3, 30, 300, etc. GB of DB space can be
> used, and thus mitigates spillover. Previously a, say, 29GB DB device/partition
> would be like 85% unused.
> With recent releases one can also turn on DB compression, which should have a
> similar benefit.
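The detect-and-compact cron job described above could be sketched roughly like this (a minimal sketch, not Frédéric's actual script; it assumes the `checks`/`detail` JSON layout that `ceph health detail --format json` emits on recent releases):

```python
import json
import re
import subprocess

def spillover_osds(health: dict) -> list:
    """Extract OSD ids from the BLUEFS_SPILLOVER health check, if present."""
    check = health.get("checks", {}).get("BLUEFS_SPILLOVER")
    if not check:
        return []
    osds = []
    for item in check.get("detail", []):
        m = re.match(r"osd\.(\d+) spilled over", item.get("message", ""))
        if m:
            osds.append(int(m.group(1)))
    return osds

def compact_spilled_osds() -> None:
    """Compact every OSD currently flagged as spilling over (online compaction)."""
    health = json.loads(
        subprocess.check_output(["ceph", "health", "detail", "--format", "json"])
    )
    for osd in spillover_osds(health):
        subprocess.run(["ceph", "tell", "osd.%d" % osd, "compact"], check=True)
```

Compaction clears the warning for a while, as observed above, but does not address the root cause of the cyclic spillover.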
>> On Nov 12, 2024, at 11:25 AM, Frédéric Nass <frederic.n...@univ-lorraine.fr> wrote:
>>
>> Hi Anthony,
>>
>> Did the RocksDB sharding end up improving the spillover situation related to
>> the level thresholds? I had only anticipated that it would reduce the impact
>> of compaction.
>>
>> We resharded our OSDs' RocksDBs a long time ago (after upgrading to Pacific
>> IIRC) and I think we could still observe spillover at the level boundaries
>> sometimes, if I'm not mistaken.
>>
>> Cheers,
>> Frédéric.
>>
>> PS: It seems that the document you referred to is not accessible from the
>> Internet.
>>
>> ----- Le 12 Nov 24, à 15:11, Anthony D'Atri <anthony.da...@gmail.com> a écrit :
>>
>>> RocksDB column sharding came a while ago. It should be enabled on your OSDs,
>>> provided they weren’t built on a much older release. If they were, you can
>>> update them.
>>>
>>> rocksdb_in_ceph (PDF Document · 512 KB):
>>> https://cf2.cloudferro.com:8080/swift/v1/AUTH_5e376cddf8a94f9294259b5f48d7b2cd/ceph/rocksdb_in_ceph.pdf
>>>
>>> IBM Storage Ceph – Administration, Resharding the RocksDB database:
>>> https://www.ibm.com/docs/en/storage-ceph/7.1?topic=bluestore-resharding-rocksdb-database
>>>
>>>> On Nov 12, 2024, at 8:02 AM, Alexander Patrakov <patra...@gmail.com> wrote:
>>>>
>>>> Yes, that is correct.
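For reference, resharding an existing (stopped) OSD's RocksDB is done with ceph-bluestore-tool, as per the IBM/Ceph docs linked above. A hedged sketch of the invocation; the sharding spec below is my understanding of the Pacific-era default, and the OSD path layout is the usual non-containerized one:

```python
import subprocess

# Sharding spec used by default for newly created Pacific+ OSDs
# (see the resharding documentation linked in the thread).
DEFAULT_SHARDING = "m(3) p(3,0-12) O(3,0-13)=block_cache={type=binned_lru} L P"

def reshard_args(osd_path: str, sharding: str = DEFAULT_SHARDING) -> list:
    """Build the ceph-bluestore-tool command line; the OSD must be stopped first."""
    return [
        "ceph-bluestore-tool",
        "--path", osd_path,
        "--sharding", sharding,
        "reshard",
    ]

def reshard(osd_id: int) -> None:
    subprocess.run(reshard_args("/var/lib/ceph/osd/ceph-%d" % osd_id), check=True)
```

Note that, as the thread goes on to discuss, resharding alone did not eliminate the cyclic spillover Frédéric reports.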
>>>> On Tue, Nov 12, 2024 at 8:51 PM Frédéric Nass
>>>> <frederic.n...@univ-lorraine.fr> wrote:
>>>>>
>>>>> Hello Alexander,
>>>>>
>>>>> Thank you for clarifying this point. The documentation was not very clear
>>>>> about the 'improvements'.
>>>>>
>>>>> Does that mean that in the latest releases spillover no longer occurs
>>>>> between the two thresholds of 30GB and 300GB? Meaning block.db can be
>>>>> 80GB in size without spilling over, for example?
>>>>>
>>>>> Cheers,
>>>>> Frédéric.
>>>>>
>>>>> ----- Le 12 Nov 24, à 13:32, Alexander Patrakov <patra...@gmail.com> a écrit :
>>>>>>
>>>>>> Hello Frédéric,
>>>>>>
>>>>>> The advice regarding 30/300 GB DB sizes is no longer valid. Since Ceph
>>>>>> 15.2.8, due to the new default (bluestore_volume_selection_policy =
>>>>>> use_some_extra), it no longer wastes the extra capacity of the DB device.
>>>>>>
>>>>>> On Tue, Nov 12, 2024 at 5:52 PM Frédéric Nass
>>>>>> <frederic.n...@univ-lorraine.fr> wrote:
>>>>>>>
>>>>>>> ----- Le 12 Nov 24, à 8:51, Roland Giesler <rol...@giesler.za.net> a écrit :
>>>>>>>>
>>>>>>>> On 2024/11/12 04:54, Alwin Antreich wrote:
>>>>>>>>>
>>>>>>>>> Hi Roland,
>>>>>>>>>
>>>>>>>>> On Mon, Nov 11, 2024, 20:16 Roland Giesler <rol...@giesler.za.net> wrote:
>>>>>>>>>>
>>>>>>>>>> I have Ceph 17.2.6 on a Proxmox cluster and want to replace some SSDs
>>>>>>>>>> that are end of life. I have some spinners that have their journals
>>>>>>>>>> on SSD. Each spinner has a 50GB SSD LVM partition and I want to move
>>>>>>>>>> each of those to new corresponding partitions.
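The 30/300 GB thresholds being discussed come from RocksDB's level sizing: with BlueStore's historical defaults (max_bytes_for_level_base = 256 MB, multiplier = 10), a level was only usable if it fit on the DB device together with all smaller levels, so usable capacity jumped in steps of roughly 3, 30, 300 GB. A simplified model of those cumulative boundaries, ignoring L0 and compaction headroom:

```python
def level_capacities(base_gb: float = 0.25, multiplier: int = 10, levels: int = 4) -> list:
    """Cumulative RocksDB level sizes (GB) with BlueStore's historical
    defaults: max_bytes_for_level_base = 256 MB, level multiplier = 10."""
    sizes, cumulative, level = [], 0.0, base_gb
    for _ in range(levels):
        cumulative += level
        sizes.append(round(cumulative, 2))
        level *= multiplier
    return sizes

# Cumulative boundaries: [0.25, 2.75, 27.75, 277.75] -- roughly the
# folklore 3 / 30 / 300 GB figures once compaction headroom is added.
```

This is why, before use_some_extra, a 124 GB DB volume behaved much like a ~30 GB one: the next whole level (~250 GB) could never fit.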
>>>>>>>>>> The new 4TB SSDs I have split into volumes with:
>>>>>>>>>>
>>>>>>>>>> # lvcreate -n NodeA-nvme-LV-RocksDB1 -L 47.69g NodeA-nvme0
>>>>>>>>>> # lvcreate -n NodeA-nvme-LV-RocksDB2 -L 47.69g NodeA-nvme0
>>>>>>>>>> # lvcreate -n NodeA-nvme-LV-RocksDB3 -L 47.69g NodeA-nvme0
>>>>>>>>>> # lvcreate -n NodeA-nvme-LV-RocksDB4 -L 47.69g NodeA-nvme0
>>>>>>>>>> # lvcreate -n NodeA-nvme-LV-data -l 100%FREE NodeA-nvme1
>>>>>>>>>> # lvcreate -n NodeA-nvme-LV-data -l 100%FREE NodeA-nvme0
>>>>>>>>>
>>>>>>>>> I caution against mixing DB/WAL partitions with other applications. The
>>>>>>>>> performance profile may not be suited for shared use. And depending on
>>>>>>>>> the use case, ~48GB might not be big enough to prevent DB spillover.
>>>>>>>>> See the current size when querying the OSD.
>>>>>>>>
>>>>>>>> I see a relatively small RocksDB and no WAL?
>>>>>>>>
>>>>>>>> ceph daemon osd.4 perf dump
>>>>>>>> <snip>
>>>>>>>> "bluefs": {
>>>>>>>>     "db_total_bytes": 45025845248,
>>>>>>>>     "db_used_bytes": 2131755008,
>>>>>>>>     "wal_total_bytes": 0,
>>>>>>>>     "wal_used_bytes": 0,
>>>>>>>> </snip>
>>>>>>>>
>>>>>>>> I have been led to understand that 4% is the high end and that it is
>>>>>>>> only ever reached, if at all, on very busy systems?
>>>>>>>
>>>>>>> Hi Roland,
>>>>>>>
>>>>>>> This is generally true but it depends on what your cluster is used for.
>>>>>>> If your cluster is used for block (RBD) storage then 1%-2% should be
>>>>>>> enough. If your cluster is used for file (CephFS) and S3 (RGW) storage
>>>>>>> then you'd rather stay on the safe side and respect the 4% recommendation,
>>>>>>> as these workloads make heavy use of block.db to store metadata.
>>>>>>>
>>>>>>> Now the percentage is one thing; level size is another. To avoid spillover
>>>>>>> when block.db usage approaches 30GB, you'd better choose a block.db size
>>>>>>> of 300GB+, whatever percentage of the block size that is, if you don't
>>>>>>> want to play with the RocksDB level size and multiplier, which you
>>>>>>> probably don't.
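The bluefs counters Roland quotes can be summarized programmatically; a small sketch (the `slow_used_bytes` counter, part of the same `bluefs` section, is what goes non-zero when metadata spills to the slow device):

```python
def bluefs_usage(perf: dict) -> dict:
    """Summarize BlueFS DB usage from 'ceph daemon osd.N perf dump' output.
    A non-zero slow_used_bytes means metadata spilled onto the main device."""
    b = perf["bluefs"]
    return {
        "db_used_pct": round(100.0 * b["db_used_bytes"] / b["db_total_bytes"], 1),
        "spillover_bytes": b.get("slow_used_bytes", 0),
    }

# Roland's osd.4 figures from the message above:
sample = {"bluefs": {"db_total_bytes": 45025845248,
                     "db_used_bytes": 2131755008,
                     "slow_used_bytes": 0}}
# -> about 4.7% of the ~42 GiB DB volume in use, no spillover.
```

So Roland's DB volumes are currently nearly empty, which is consistent with Alwin's point that actual usage, not just the 4% rule of thumb, should drive the sizing.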
>>>>>>> Regards,
>>>>>>> Frédéric.
>>>>>>>
>>>>>>> [1] https://docs.ceph.com/en/latest/rados/configuration/bluestore-config-ref/#sizing
>>>>>>> [2] https://www.ibm.com/docs/en/storage-ceph/7.1?topic=bluestore-sizing-considerations
>>>>>>> [3] https://github.com/facebook/rocksdb/wiki/RocksDB-Tuning-Guide
>>>>>>>
>>>>>>>>>> What am I missing to get these changes to be permanent?
>>>>>>>>>
>>>>>>>>> Likely just an issue with the order of execution. But there is an
>>>>>>>>> easier way to do the move. See:
>>>>>>>>> https://docs.ceph.com/en/quincy/ceph-volume/lvm/migrate/
>>>>>>>>
>>>>>>>> Ah, excellent! I didn't find that in my searches. Will try that now.
>>>>>>>>
>>>>>>>> regards
>>>>>>>> Roland
>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>> Alwin
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Alwin Antreich
>>>>>>>>> Head of Training and Proxmox Services
>>>>>>>>> croit GmbH, Freseniusstr. 31h, 81247 Munich
>>>>>>>>> CEO: Martin Verges, Andy Muthmann - VAT-ID: DE310638492
>>>>>>>>> Com. register: Amtsgericht Munich HRB 231263
>>>>>>>>> Web: https://croit.io/
>>>>>>
>>>>>> --
>>>>>> Alexander Patrakov
>>>>
>>>> --
>>>> Alexander Patrakov

_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
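The ceph-volume migrate path Alwin links to can be scripted; a hedged sketch (OSD id, fsid, and target LV below are placeholders, and the OSD must be stopped for the duration of the move):

```python
import subprocess

def migrate_db_args(osd_id: int, osd_fsid: str, target_lv: str) -> list:
    """Build a 'ceph-volume lvm migrate' call that moves the DB
    (including anything that spilled over) to a new vg/lv target."""
    return [
        "ceph-volume", "lvm", "migrate",
        "--osd-id", str(osd_id),
        "--osd-fsid", osd_fsid,
        "--from", "db",
        "--target", target_lv,
    ]

def migrate_db(osd_id: int, osd_fsid: str, target_lv: str) -> None:
    subprocess.run(migrate_db_args(osd_id, osd_fsid, target_lv), check=True)
```

Using Roland's naming, the target would be something like `NodeA-nvme0/NodeA-nvme-LV-RocksDB1`; unlike the manual bluefs-bdev approach, ceph-volume also updates the LV tags, which is what makes the change survive a reboot.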