Yes, it improves the dynamic where only ~3, 30, 300, etc. GB of DB space can be 
used, and thus mitigates spillover.  Previously a, say, 29GB DB 
device/partition would be roughly 85% unused.

With recent releases one can also turn on DB compression, which should have a 
similar benefit.
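
For reference, a quick way to check whether any OSDs are currently spilling
over, plus a sketch of turning on RocksDB compression.  The option name
bluestore_rocksdb_options_annex below is from memory, so verify it with
"ceph config help" on your release before applying anything:

# BLUEFS_SPILLOVER is raised when block.db overflows onto the slow device
ceph health detail | grep -i spillover

# Sketch: append LZ4 compression to the RocksDB options on all OSDs.
# Takes effect after an OSD restart; existing SSTs get compressed as they
# are rewritten during compaction.
ceph config help bluestore_rocksdb_options_annex
ceph config set osd bluestore_rocksdb_options_annex 'compression=kLZ4Compression'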

> On Nov 12, 2024, at 11:25 AM, Frédéric Nass <frederic.n...@univ-lorraine.fr> 
> wrote:
> 
> Hi Anthony,
> 
> Did the RocksDB sharding end up improving the overspilling situation related 
> to the level thresholds? I had only anticipated that it would reduce the 
> impact of compaction.
> 
> We resharded our OSDs' RocksDBs a long time ago (after upgrading to Pacific, 
> IIRC) and I think we could still occasionally observe overspilling at the 
> level thresholds, if I'm not mistaken.
> 
> Cheers,
> Frédéric.
> 
> PS: It seems that the document you referred to is not accessible from the 
> Internet.
> 
> ----- On Nov 12, 2024, at 15:11, Anthony D'Atri <anthony.da...@gmail.com> wrote:
> RocksDB column sharding came a while ago.  It should be enabled on your OSDs, 
> provided they weren’t built on a much older release.  If they were, you can 
> reshard them in place (sketch below).
> 

> rocksdb_in_ceph (PDF, 512 KB): 
> <https://cf2.cloudferro.com:8080/swift/v1/AUTH_5e376cddf8a94f9294259b5f48d7b2cd/ceph/rocksdb_in_ceph.pdf>
> 
> 
> IBM Storage Ceph 7.1 – Administration, "Resharding the RocksDB database": 
> <https://www.ibm.com/docs/en/storage-ceph/7.1?topic=bluestore-resharding-rocksdb-database>
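> 
> For completeness, a sketch of checking and applying sharding on an existing 
> OSD with ceph-bluestore-tool.  Paths assume a traditional (non-cephadm) 
> deployment and osd.4 is just an example; the OSD must be stopped first:
> 
> systemctl stop ceph-osd@4
> 
> # is this OSD's RocksDB already sharded?
> ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-4 show-sharding
> 
> # pull the current default sharding definition from the cluster config
> # and reshard the OSD to it
> SHARDING="$(ceph config get osd bluestore_rocksdb_cfs)"
> ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-4 --sharding="$SHARDING" reshard
> 
> systemctl start ceph-osd@4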
> 
> 
> 
> 
> On Nov 12, 2024, at 8:02 AM, Alexander Patrakov <patra...@gmail.com> wrote:
> 
> Yes, that is correct.
> 
> On Tue, Nov 12, 2024 at 8:51 PM Frédéric Nass
> <frederic.n...@univ-lorraine.fr> wrote:
> 
> Hello Alexander,
> 
> Thank you for clarifying this point. The documentation was not very clear 
> about the 'improvements'.
> 
> Does that mean that in the latest releases overspilling no longer occurs 
> between the two thresholds of 30GB and 300GB? Meaning block.db can be 80GB in 
> size without overspilling, for example?
> 
> Cheers,
> Frédéric.
> 
> ----- On Nov 12, 2024, at 13:32, Alexander Patrakov patra...@gmail.com wrote:
> 
> Hello Frédéric,
> 
> The advice regarding 30/300 GB DB sizes is no longer valid. Since Ceph
> 15.2.8, due to the new default (bluestore_volume_selection_policy =
> use_some_extra), it no longer wastes the extra capacity of the DB
> device.
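> 
> To verify what an OSD is actually running with (osd.4 is just an example id):
> 
> ceph config show osd.4 bluestore_volume_selection_policy
> # or the cluster-wide default:
> ceph config get osd bluestore_volume_selection_policy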
> 
> On Tue, Nov 12, 2024 at 5:52 PM Frédéric Nass
> <frederic.n...@univ-lorraine.fr> wrote:
> 
> 
> 
> ----- On Nov 12, 2024, at 8:51, Roland Giesler rol...@giesler.za.net wrote:
> 
> On 2024/11/12 04:54, Alwin Antreich wrote:
> Hi Roland,
> 
> On Mon, Nov 11, 2024, 20:16 Roland Giesler <rol...@giesler.za.net> wrote:
> 
> I have Ceph 17.2.6 on a Proxmox cluster and want to replace some SSDs that 
> are end of life.  I have some spinners that have their journals on SSD.  Each 
> spinner has a 50GB SSD LVM partition, and I want to move each of those to a 
> new corresponding partition.
> 
> The new 4TB SSD's I have split into volumes with:
> 
> # lvcreate -n NodeA-nvme-LV-RocksDB1 -L 47.69g NodeA-nvme0
> # lvcreate -n NodeA-nvme-LV-RocksDB2 -L 47.69g NodeA-nvme0
> # lvcreate -n NodeA-nvme-LV-RocksDB3 -L 47.69g NodeA-nvme0
> # lvcreate -n NodeA-nvme-LV-RocksDB4 -L 47.69g NodeA-nvme0
> # lvcreate -n NodeA-nvme-LV-data -l 100%FREE NodeA-nvme1
> # lvcreate -n NodeA-nvme-LV-data -l 100%FREE NodeA-nvme0
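> 
> (A quick sanity check of what was created, using the VG names above:)
> 
> # lvs -o vg_name,lv_name,lv_size NodeA-nvme0 NodeA-nvme1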
> 
> I'd caution against mixing DB/WAL partitions with other applications on the 
> same device; the performance profile may not be suited for shared use. And 
> depending on the use case, ~48GB might not be big enough to avoid DB 
> spillover. Check the current size by querying the OSD.
> 
> I see a relatively small RocksDB and no separate WAL?
> 
> ceph daemon osd.4 perf dump
> <snip>
>    "bluefs": {
>        "db_total_bytes": 45025845248,
>        "db_used_bytes": 2131755008,
>        "wal_total_bytes": 0,
>        "wal_used_bytes": 0,
> </snip>
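> 
> The same bluefs section also tells you whether an OSD is spilling over onto 
> the slow (main) device, if I recall the counter names correctly; a non-zero 
> slow_used_bytes is the tell-tale:
> 
> ceph daemon osd.4 perf dump bluefs | grep -E 'slow_(total|used)_bytes'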
> 
> I have been led to understand that 4% is the high end, and that only very 
> busy systems ever reach that, if at all?
> 
> Hi Roland,
> 
> This is generally true but it depends on what your cluster is used for.
> 
> If your cluster is used for block (RBD) storage then 1%-2% should be enough. 
> If your cluster is used for file (CephFS) and S3 (RGW) storage then you'd 
> rather stay on the safe side and respect the 4% recommendation, as these 
> workloads make heavy use of block.db to store metadata.
> 
> Now, percentage is one thing; level size is another. If your computed 
> block.db size approaches 30GB, then to avoid overspilling you'd better jump 
> to 300GB+, whatever percentage of the block size that represents, unless you 
> want to play with the RocksDB level size and multiplier, which you probably 
> don't.
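> 
> Roughly, assuming RocksDB's defaults of max_bytes_for_level_base = 256MB and 
> max_bytes_for_level_multiplier = 10 (which, as far as I know, Ceph does not 
> override), the level sizes add up like this:
> 
> L1 ~ 256 MB
> L2 ~ 10 x L1 ~ 2.56 GB
> L3 ~ 10 x L2 ~ 25.6 GB   -> L1+L2+L3 ~ 28.4 GB  (the ~30 GB threshold)
> L4 ~ 10 x L3 ~ 256 GB    -> L1+..+L4 ~ 284 GB   (the ~300 GB threshold)
> 
> With the old volume selection policy, as I understand it, a level that could 
> not fit entirely on the fast device spilled over, so anything between those 
> two sums was wasted.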
> 
> Regards,
> Frédéric.
> 
> [1]
> https://docs.ceph.com/en/latest/rados/configuration/bluestore-config-ref/#sizing
> [2]
> https://www.ibm.com/docs/en/storage-ceph/7.1?topic=bluestore-sizing-considerations
> [3] https://github.com/facebook/rocksdb/wiki/RocksDB-Tuning-Guide
> 
> 
> What am I missing to get these changes to be permanent?
> 
> Likely just an issue with the order of execution. But there is an easier
> way to do the move. See:
> https://docs.ceph.com/en/quincy/ceph-volume/lvm/migrate/
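> 
> A sketch of what that looks like for one OSD (osd.4 and the LV name are 
> placeholders; get the OSD fsid from 'ceph-volume lvm list', and run it with 
> the OSD stopped):
> 
> systemctl stop ceph-osd@4
> 
> # move the existing DB (plus anything that spilled onto the data device)
> # to the new LV on the replacement SSD
> ceph-volume lvm migrate --osd-id 4 --osd-fsid <osd-fsid> \
>     --from data db --target NodeA-nvme0/NodeA-nvme-LV-RocksDB1
> 
> systemctl start ceph-osd@4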
> 
> Ah, excellent!  I didn't find that in my searches.  Will try that now.
> 
> regards
> 
> Roland
> 
> 
> 
> Cheers,
> Alwin
> 
> --
> 
> Alwin Antreich
> Head of Training and Proxmox Services
> 
> croit GmbH, Freseniusstr. 31h, 81247 Munich
> CEO: Martin Verges, Andy Muthmann - VAT-ID: DE310638492
> Com. register: Amtsgericht Munich HRB 231263
> Web: https://croit.io/
> 
> 
> 
> --
> Alexander Patrakov
> 

_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
