Hi,
A few months ago, after seeing some OSDs being OOM-killed, we were
pointed to two references [1][2] dating from the last Nautilus release,
mentioning that the best configuration for an OSD in modern Ceph versions
is bluefs_buffered_io=true (the default since that last Nautilus release)
with system swap disabled. We did this on all our OSD servers and have
seen a huge reduction in allocated memory per OSD (a factor of ~2).
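For reference, the setting can be checked and applied with the standard Ceph CLI; a minimal sketch (run from a node with admin access to the cluster):

```shell
# Check the current value (true is the default since late Nautilus)
ceph config get osd bluefs_buffered_io

# Set it explicitly if needed
ceph config set osd bluefs_buffered_io true

# Confirm swap is disabled on each OSD host (no output = no swap)
swapon --show
```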
Best regards,
Michel
[1] https://docs.ceph.com/en/latest/releases/nautilus/#notable-changes
[2] https://github.com/ceph/ceph/pull/38044
On 23/05/2025 at 14:52, Anthony D'Atri wrote:
Swap can possibly reduce your cluster's performance, no? OSD processes
that swap data will cause additional and unwanted disk I/O.
Absolutely. Moreover, I've seen modern(ish) Linux systems anomalously using swap
space even when physmem is available, in excess of vm.min_free_kbytes.
I've got 10 OSDs per host, and my Ceph memory consumption is typically 70GiB
per host... Each host has about 40GiB of available memory, which is sufficient
(for my setup), except that one time I ran out of memory deleting old snapshots.
But 8GiB of swap wouldn't have helped...
Exactly. Swap can be a useful emergency tool, but in 2025 it should
not be routine. In 1985 a diskless Sun 2/50 with 3MB of physmem (yes, MB)
needed swap, and that was SUPER fun over 10Mb/s Ethernet against a Fuji Eagle.
In years beginning with a 2, DRAM prices are such that if disabling swap causes
a problem, that's a sign that you really, really need more physmem.
Swap is 12% of your virtual memory right now. If you run hotter than ~84%
usage, then you really need more physmem.
By default the osd_memory_target autotuner should be enabled; see what values
it is setting. By default it divides 70% of physmem by however many OSDs are
placed on a host:
# ceph config dump | grep osd_memory_target
osd   host:x   basic   osd_memory_target   12060218196
osd   host:y   basic   osd_memory_target   12062644163
osd   host:z   basic   osd_memory_target   6614520783
Yes, the above cluster has lots of physmem, which is very very fortunate
because it’s based on large HDDs and otherwise would have fallen over (if I
told you the details, you wouldn’t sleep at night). The money would have been
better spent on QLC, but I digress.
If autotuning isn't on, the default osd_memory_target is 4GB. Remember that
it's a target, not a limit. The docs advise 20% headroom of available physmem;
having suffered a few things, I like to advise at least 50%, plus margin to
run mons/mgrs/MDS/etc.
OSD containers consume a reasonable amount of RAM (~2.6GB - ~3.6GB):
Actually, that's another sign that you may be memory-starved, unless these
OSDs are rather idle. Are you using cephadm, or something else to manage the
containers? Is it enforcing an artificial limit on them?
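One way to check for a runtime-imposed limit (a sketch assuming Docker-managed containers, as in the stats output quoted below):

```shell
# Print the memory limit Docker enforces on one OSD container;
# 0 means no limit is set by the container runtime.
docker inspect --format '{{.HostConfig.Memory}}' <container-id-or-name>
```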
With 64GB of physmem cephadm’s autotuning would assign an osd_memory_target of
4.5GB.
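Under that default, the per-OSD target is simple integer arithmetic. A hedged sketch (the 0.7 ratio is cephadm's default autotune_memory_target_ratio; 10 OSDs per host is assumed from this thread):

```shell
# Sketch of cephadm's autotune arithmetic (assumes the default
# autotune_memory_target_ratio of 0.7 and 10 OSDs per host).
phys_bytes=$((64 * 1024 * 1024 * 1024))      # 64 GiB of physmem
num_osds=10
target=$((phys_bytes * 7 / 10 / num_osds))   # 0.7 as an integer fraction
echo "$target"                               # 4810363371 bytes, ~4.5 GiB
```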
Memory allocation practice and accounting vary across kernel revisions, which
may be a factor here.
What model of chassis are these? Adding even 4x8GB super-cheap DIMMs to each
would do you a world of good; more would of course be even better. Be sure not
to mix SKUs within a bank, and populate slots according to your motherboard's
documentation.
-----Original message-----
From: Dmitrijs Demidovs <dmitrijs.demid...@carminered.eu>
Sent: Friday, May 23, 2025 10:16
To: ceph-users@ceph.io
Subject: [ceph-users] Re: SWAP usage 100% on OSD hosts after
migration to Rocky Linux 9 (Ceph 16.2.15)
Hi Anthony.
Yes, we have swap enabled. The old Rocky 8 and new Rocky 9 OSD hosts are both
configured with 8G of swap.
I will try disabling swap, but I guess we will then get a lot of Out Of
Memory messages on the OSD hosts.
= old:
[root@ceph-osd11 ~]# free -h
               total        used        free      shared  buff/cache   available
Mem:            62Gi        30Gi       1.2Gi       2.1Gi        30Gi        29Gi
Swap:          8.0Gi       2.8Gi       5.2Gi
= new:
[root@ceph-osd17 ~]# free -h
               total        used        free      shared  buff/cache   available
Mem:            62Gi        26Gi       1.0Gi       1.0Gi        36Gi        36Gi
Swap:          8.0Gi       8.0Gi       7.0Mi
OSD containers consume a reasonable amount of RAM (~2.6GB - ~3.6GB):
[root@ceph-osd17 ~]# docker stats --no-stream
CONTAINER ID   NAME                                                                 CPU %    MEM USAGE / LIMIT     MEM %   NET I/O   BLOCK I/O         PIDS
5cc58e4a77b2   ceph-7e8bff5c-2761-11ec-9bb0-000c29ebc936-osd-52                     0.28%    3.576GiB / 62.28GiB   5.74%   0B / 0B   3.9TB / 975GB     62
3a60fecf648d   ceph-7e8bff5c-2761-11ec-9bb0-000c29ebc936-osd-50                     0.28%    2.912GiB / 62.28GiB   4.68%   0B / 0B   100TB / 45.7TB    62
9c20407e79eb   ceph-7e8bff5c-2761-11ec-9bb0-000c29ebc936-osd-49                     0.28%    2.905GiB / 62.28GiB   4.66%   0B / 0B   93TB / 35.8TB     62
9deadafef9dd   ceph-7e8bff5c-2761-11ec-9bb0-000c29ebc936-osd-48                     0.56%    3.624GiB / 62.28GiB   5.82%   0B / 0B   102TB / 39.2TB    62
fcfe62a25fd9   ceph-7e8bff5c-2761-11ec-9bb0-000c29ebc936-osd-55                     0.40%    2.968GiB / 62.28GiB   4.77%   0B / 0B   83.2TB / 34.8TB   62
38d2d96cc491   ceph-7e8bff5c-2761-11ec-9bb0-000c29ebc936-osd-51                     1.42%    2.666GiB / 62.28GiB   4.28%   0B / 0B   105TB / 38.1TB    62
e29c6bbc1ae7   ceph-7e8bff5c-2761-11ec-9bb0-000c29ebc936-osd-54                     2.01%    3.687GiB / 62.28GiB   5.92%   0B / 0B   106TB / 44.6TB    62
40346a7a45ea   ceph-7e8bff5c-2761-11ec-9bb0-000c29ebc936-osd-53                     0.69%    2.748GiB / 62.28GiB   4.41%   0B / 0B   103TB / 41.4TB    62
43c3e3a65531   ceph-7e8bff5c-2761-11ec-9bb0-000c29ebc936-crash-ceph-osd17           0.00%    3.73MiB / 62.28GiB    0.01%   0B / 0B   567MB / 18MB      2
d9e436f9788c   ceph-7e8bff5c-2761-11ec-9bb0-000c29ebc936-node-exporter-ceph-osd17   15.04%   30.25MiB / 62.28GiB   0.05%   0B / 0B   410MB / 14.6MB    61
But they are also the biggest swap consumers:
[root@ceph-osd17 ~]# for file in /proc/*/status; do \
    awk '/VmSwap|Name/{printf $2 " " $3}END{print ""}' $file; \
  done | sort -k 2 -n -r | more
ceph-osd 1553520 kB
ceph-osd 1447728 kB
ceph-osd 1218768 kB
ceph-osd 1117536 kB
ceph-osd 1026548 kB
ceph-osd 641632 kB
ceph-osd 495080 kB
ceph-osd 424392 kB
firewalld 26880 kB
dockerd 20352 kB
containerd 11136 kB
docker 6144 kB
docker 6144 kB
docker 5952 kB
docker 5952 kB
docker 5952 kB
docker 5952 kB
docker 5952 kB
docker 5760 kB
(sd-pam) 5184 kB
ceph-crash 4416 kB
python3 4224 kB
docker 4032 kB
systemd-udevd 3264 kB
On 22.05.2025 18:34, Anthony D'Atri wrote:
Problem:
After migration to Rocky 9 (and a new version of Docker) we see that our
OSD hosts consume 100% of swap space! It takes approximately one week
to fill the swap from 0% to 100%.
Why do you have swap configured at all? I suggest disabling swap in fstab
and rebooting serially.
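A hedged sketch of doing that on one host at a time (the sed pattern assumes swap entries in /etc/fstab contain a whitespace-delimited "swap" field; adapt to your fstab layout):

```shell
# Comment out swap entries in /etc/fstab so they stay disabled after reboot
sudo sed -i.bak '/\sswap\s/s/^\([^#]\)/#\1/' /etc/fstab

# Turn off swap immediately on this host
sudo swapoff -a

# Verify: no output means swap is fully disabled
swapon --show
```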
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io