Hi,

A few months ago, after seeing some OSDs being OOM-killed, we were pointed to two references [1][2] dating from the last Nautilus version, mentioning that the best configuration for an OSD in modern Ceph versions is bluefs_buffered_io=true (the default since that last Nautilus release) with system swap disabled. We applied this on all our OSD servers and have seen a huge reduction in allocated memory per OSD (a factor of ~2).
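A hedged sketch of how one might verify both settings on an OSD host; the `ceph config` key is the standard one, but adapt the commands to your own deployment:

```shell
# Sketch: check bluefs_buffered_io and swap state on an OSD host.
# Assumes a standard cephadm-style deployment; run against your own cluster.

# Should print "true" with the defaults discussed in [1][2]:
ceph config get osd bluefs_buffered_io

# Empty output means no swap devices are active:
swapon --show

# To disable swap at runtime (also remove/comment the fstab entry to persist):
# swapoff -a
```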

Best regards,

Michel

[1] https://docs.ceph.com/en/latest/releases/nautilus/#notable-changes
[2] https://github.com/ceph/ceph/pull/38044

On 23/05/2025 at 14:52, Anthony D'Atri wrote:

Swap can possibly reduce your cluster's performance, no? OSD processes that
swap data will result in additional and unwanted disk I/O.

Absolutely. Moreover, I've seen modern(ish) Linux systems anomalously using swap
space when there is available physmem, in excess of vm.min_free_kbytes.

I've got 10 OSDs per host and my memory consumption of ceph is typically 70GiB
per host... Each host has about 40GiB available memory, which is sufficient (for
my setup) except one time I ran out of memory deleting old snapshots. But 8GiB
wouldn't have helped...

Exactly. Filesystem swap can be a useful emergency tool, but in 2025 it should
not be routine. In 1985 a diskless Sun 2/50 with 3MB of physmem (yes, MB)
needed swap, and that was SUPER fun over 10Mb Ethernet against a Fuji Eagle.

In years beginning with a 2, DRAM prices are such that if disabling swap causes 
a problem, then that’s a sign that you really, really need more physmem.

Swap is 12% of your virtual memory right now.  If you run hotter than 84% usage,
then you really need more.
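The fraction above is simple arithmetic on the figures from the hosts in this thread (62 GiB physmem plus 8 GiB swap); a quick sketch, with those numbers as illustrative inputs:

```shell
# Sketch: swap as a share of total virtual memory (physmem + swap),
# using the 62 GiB / 8 GiB figures from the free(1) output below.
awk 'BEGIN {
  phys = 62; swap = 8
  printf "swap share of virtual memory: %.1f%%\n", 100 * swap / (phys + swap)
  printf "physmem share:                %.1f%%\n", 100 * phys / (phys + swap)
}'
```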

By default the osd_memory_target autotuner should be enabled; see what values it
is setting.  By default it will divide 70% of physmem by however many OSDs are
placed on a host:

# ceph config dump | grep osd_memory_target
osd   host:x   basic   osd_memory_target   12060218196
osd   host:y   basic   osd_memory_target   12062644163
osd   host:z   basic   osd_memory_target   6614520783

Yes, the above cluster has lots of physmem, which is very very fortunate 
because it’s based on large HDDs and otherwise would have fallen over (if I 
told you the details, you wouldn’t sleep at night).  The money would have been 
better spent on QLC, but I digress.

If autotune isn’t on, the default osd_memory_target is 4GB.  Remember that it’s
a target, not a limit.  The docs advise 20% headroom of available physmem;
having suffered a few things, I like to advise at least 50%, plus margin to run
mons/mgrs/mds/etc.
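A small sketch of that sizing rule, using an illustrative host of 10 OSDs at the default 4 GiB target, comparing the docs' 20% headroom against the more conservative 50%:

```shell
# Hedged sketch: physmem needed for N OSDs at a given osd_memory_target,
# with a headroom percentage on top. Inputs here are illustrative.
headroom_check() {
  osds=$1; target_gib=$2; headroom_pct=$3
  awk -v n="$osds" -v t="$target_gib" -v h="$headroom_pct" \
    'BEGIN { printf "%.1f GiB\n", n * t * (1 + h / 100) }'
}

headroom_check 10 4 20   # docs' 20% headroom
headroom_check 10 4 50   # more conservative 50%
```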

OSD containers consume a reasonable amount of RAM (~2.6GB - ~3.6GB):

Actually that’s another sign that you may be starved, unless these OSDs are
rather idle.  Are you using cephadm, or something else, to manage the
containers?  Is it enforcing an artificial limit on them?

With 64GB of physmem cephadm’s autotuning would assign an osd_memory_target of 
4.5GB.

Memory allocation practice and accounting vary across kernel revisions, which 
may be a factor here.

What model of chassis are these?  Adding even 4x8GB super cheap DIMMs to each
would do you a world of good, with more of course even better.  Be sure not to
mix SKUs within a bank, and populate slots according to your motherboard’s
documentation.




-----Original Message-----
From: Dmitrijs Demidovs <dmitrijs.demid...@carminered.eu>
Sent: Friday, 23 May 2025 10:16
To: ceph-users@ceph.io
Subject: [ceph-users] Re: SWAP usage 100% on OSD hosts after
migration to Rocky Linux 9 (Ceph 16.2.15)

Hi Anthony.

Yes, we have swap enabled. Old Rocky 8 and new Rocky 9 OSD hosts are both
configured with 8G of swap.

I will try to disable swap, but I guess that we will get a lot of Out Of Memory
messages on OSD hosts.



= old:
[root@ceph-osd11 ~]# free -h
               total        used        free      shared  buff/cache   available
Mem:            62Gi        30Gi       1.2Gi       2.1Gi        30Gi        29Gi
Swap:          8.0Gi       2.8Gi       5.2Gi

= new:
[root@ceph-osd17 ~]# free -h
               total        used        free      shared  buff/cache   available
Mem:            62Gi        26Gi       1.0Gi       1.0Gi        36Gi        36Gi
Swap:          8.0Gi       8.0Gi       7.0Mi






OSD containers consume a reasonable amount of RAM (~2.6GB - ~3.6GB):


[root@ceph-osd17 ~]# docker stats --no-stream
CONTAINER ID   NAME                                                                 CPU %    MEM USAGE / LIMIT     MEM %    NET I/O   BLOCK I/O        PIDS
5cc58e4a77b2   ceph-7e8bff5c-2761-11ec-9bb0-000c29ebc936-osd-52                     0.28%    3.576GiB / 62.28GiB   5.74%    0B / 0B   3.9TB / 975GB    62
3a60fecf648d   ceph-7e8bff5c-2761-11ec-9bb0-000c29ebc936-osd-50                     0.28%    2.912GiB / 62.28GiB   4.68%    0B / 0B   100TB / 45.7TB   62
9c20407e79eb   ceph-7e8bff5c-2761-11ec-9bb0-000c29ebc936-osd-49                     0.28%    2.905GiB / 62.28GiB   4.66%    0B / 0B   93TB / 35.8TB    62
9deadafef9dd   ceph-7e8bff5c-2761-11ec-9bb0-000c29ebc936-osd-48                     0.56%    3.624GiB / 62.28GiB   5.82%    0B / 0B   102TB / 39.2TB   62
fcfe62a25fd9   ceph-7e8bff5c-2761-11ec-9bb0-000c29ebc936-osd-55                     0.40%    2.968GiB / 62.28GiB   4.77%    0B / 0B   83.2TB / 34.8TB  62
38d2d96cc491   ceph-7e8bff5c-2761-11ec-9bb0-000c29ebc936-osd-51                     1.42%    2.666GiB / 62.28GiB   4.28%    0B / 0B   105TB / 38.1TB   62
e29c6bbc1ae7   ceph-7e8bff5c-2761-11ec-9bb0-000c29ebc936-osd-54                     2.01%    3.687GiB / 62.28GiB   5.92%    0B / 0B   106TB / 44.6TB   62
40346a7a45ea   ceph-7e8bff5c-2761-11ec-9bb0-000c29ebc936-osd-53                     0.69%    2.748GiB / 62.28GiB   4.41%    0B / 0B   103TB / 41.4TB   62
43c3e3a65531   ceph-7e8bff5c-2761-11ec-9bb0-000c29ebc936-crash-ceph-osd17           0.00%    3.73MiB / 62.28GiB    0.01%    0B / 0B   567MB / 18MB     2
d9e436f9788c   ceph-7e8bff5c-2761-11ec-9bb0-000c29ebc936-node-exporter-ceph-osd17   15.04%   30.25MiB / 62.28GiB   0.05%    0B / 0B   410MB / 14.6MB   61





But they are also the biggest swap consumers:

[root@ceph-osd17 ~]# for file in /proc/*/status; do awk '/VmSwap|Name/{printf $2 " " $3}END{print ""}' $file; done | sort -k 2 -n -r | more
ceph-osd 1553520 kB
ceph-osd 1447728 kB
ceph-osd 1218768 kB
ceph-osd 1117536 kB
ceph-osd 1026548 kB
ceph-osd 641632 kB
ceph-osd 495080 kB
ceph-osd 424392 kB
firewalld 26880 kB
dockerd 20352 kB
containerd 11136 kB
docker 6144 kB
docker 6144 kB
docker 5952 kB
docker 5952 kB
docker 5952 kB
docker 5952 kB
docker 5952 kB
docker 5760 kB
(sd-pam) 5184 kB
ceph-crash 4416 kB
python3 4224 kB
docker 4032 kB
systemd-udevd 3264 kB






On 22.05.2025 18:34, Anthony D'Atri wrote:
Problem:

After migration to Rocky 9 (and a new version of Docker) we see that our
OSD hosts consume 100% of SWAP space! It takes approximately one week
to fill SWAP from 0% to 100%.

Why do you have swap configured at all?  I suggest disabling swap in fstab
and rebooting serially.
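A minimal sketch of that fstab edit, run here against a sample file with made-up UUIDs so it is safe to try; on a real host, back up and edit /etc/fstab itself before each serial reboot:

```shell
# Sketch: comment out the swap entry in fstab so it stays off after reboot.
# /tmp/fstab.sample and its UUIDs are placeholders, not real devices.
cat > /tmp/fstab.sample <<'EOF'
UUID=abcd-1234 /     xfs  defaults 0 0
UUID=ef56-7890 none  swap defaults 0 0
EOF

# Comment any uncommented line whose filesystem type is swap:
sed -i.bak -E 's|^([^#].*[[:space:]]swap[[:space:]].*)|#\1|' /tmp/fstab.sample

grep swap /tmp/fstab.sample   # the swap line is now commented out
```

After editing the real /etc/fstab, `swapoff -a` turns swap off immediately and the reboot confirms the change persists.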

_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
