>  But checking the top output etc. doesn't confirm those values.

I suspect a startup peak that subsides once the daemon reaches steady state.  I 
observed this with mons back in Luminous: a cluster had been expanded 
considerably without restarting the mons, so when they finally did restart 
there wasn’t enough memory.  The Prometheus process collector might capture 
such a peak, as would frequent sampling of the OSDs’ admin sockets for tcmalloc 
heap stats.  The kernel log should also have the OOM killer’s per-process 
memory summary from the time of the kill.
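
A rough sketch of that sampling, run on the OSD host (with cephadm, inside the 
daemon's container, e.g. via `cephadm shell --name osd.4`; osd.4 is just picked 
from the `ceph orch ps` output below):

`while sleep 60; do date; ceph daemon osd.4 heap stats; ceph daemon osd.4 dump_mempools; done`

And the kernel log keeps the OOM killer's per-process memory table:

`journalctl -k | grep -iA 40 'invoked oom-killer'`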

> I don't really know where they come from, tbh.
> Can you confirm that those are actually OSD processes filling up the RAM?
> 
> Zitat von Jonas Schwab <jonas.sch...@uni-wuerzburg.de>:

Please set your MUA to not wrap.

> 
>> Hello everyone,
>> 
>> I recently have many problems with OSDs using much more memory than they
>> are supposed to (> 10GB), leading to the node running out of memory and
>> killing processes. Does someone have ideas why the daemons seem
>> to completely ignore the set memory limits?

Remember that osd_memory_target is a TARGET, not a LIMIT.  Upstream docs suggest 
an aggregate 20% headroom; personally I like 100% headroom, but that preference 
is informed by prior experiences that are likely much less of a concern these 
days.
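
Rough arithmetic with the numbers below: 11 OSDs x 3310 MiB MEM LIM is ~36 GiB 
of aggregate target, so ~43 GiB with 20% headroom or ~71 GiB with 100%, before 
the mon, node-exporter, and the OS itself get anything.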



>> 
>> See e.g. the following:
>> 
>> $ ceph orch ps ceph2-03
>> NAME                    HOST      PORTS   STATUS          REFRESHED  AGE  MEM USE  MEM LIM  VERSION  IMAGE ID      CONTAINER ID

I had not noticed the MEM LIM column.  Digging through the source, I don’t 
immediately see where it comes from, but I suspect that in a non-Rook 
environment, at least, it reflects osd_memory_target.
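
You can cross-check against a running daemon, e.g. (osd.4 taken from your 
output below):

`ceph config get osd.4 osd_memory_target`
`ceph tell osd.4 config get osd_memory_target`

The first shows the centrally configured (or autotuned) value, the second what 
the running daemon actually has in effect.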

Your nodes each have 11 OSDs, and at least some have a mon or other daemons?

How much RAM do the nodes have?

`ceph osd dump | grep pool`
`ceph status`
`ceph config dump | grep osd_memory_target`
`ceph config dump | grep osd_memory_target_autotune`
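
If the autotuner is on, the ratio it works from is also worth a look (IIRC this 
lives under the cephadm mgr module):

`ceph config get mgr mgr/cephadm/autotune_memory_target_ratio`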

Do the nodes run anything that isn’t Ceph?  Do you have cron jobs or playbook 
runs or something that might cause an ephemeral yet hungry process to run at 
times?

If MEM LIM is indeed the osd_memory_target, your nodes would seem to be light 
on RAM.  The default osd_memory_target is 4GB.

I’m going to SWAG that this node has 52 GB of physmem.  If so, that IMHO is way 
too low; I would suggest at least 128 GB for 11 OSDs.
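
That SWAG is just working backwards from the MEM LIM values, assuming the 
autotuner's default ratio of 0.7 and ignoring the allowances cephadm carves out 
for non-OSD daemons: 11 x 3310 MiB is ~36 GiB, and 36 GiB / 0.7 is ~51 GiB, 
i.e. on the order of 52 GB of physmem.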

Are any of these OSDs legacy Filestore?
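
One quick way to check:

`ceph osd count-metadata osd_objectstore`

If they have all been migrated, that should report only bluestore.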



>> mon.ceph2-03            ceph2-03          running (3h)    1s ago     2y   501M     2048M    19.2.1   f2efb0401a30  d876fc30f741
>> node-exporter.ceph2-03  ceph2-03  *:9100  running (3h)    1s ago     17M  46.5M    -        1.7.0    72c9c2088986  d32ec4d266ea
>> osd.4                   ceph2-03          running (26m)   1s ago     2y   10.2G    3310M    19.2.1   f2efb0401a30  b712a86dacb2
>> osd.11                  ceph2-03          running (5m)    1s ago     2y   3458M    3310M    19.2.1   f2efb0401a30  f3d7705325b4
>> osd.13                  ceph2-03          running (3h)    1s ago     6d   2059M    3310M    19.2.1   f2efb0401a30  980ee7e11252
>> osd.17                  ceph2-03          running (114s)  1s ago     2y   3431M    3310M    19.2.1   f2efb0401a30  be7319fda00b
>> osd.23                  ceph2-03          running (30m)   1s ago     2y   10.4G    3310M    19.2.1   f2efb0401a30  9cfb86c4b34a
>> osd.29                  ceph2-03          running (8m)    1s ago     2y   4923M    3310M    19.2.1   f2efb0401a30  d764930bb557
>> osd.35                  ceph2-03          running (14m)   1s ago     2y   7029M    3310M    19.2.1   f2efb0401a30  6a4113adca65
>> osd.59                  ceph2-03          running (2m)    1s ago     2y   2821M    3310M    19.2.1   f2efb0401a30  8871d6d4f50a
>> osd.61                  ceph2-03          running (49s)   1s ago     2y   1090M    3310M    19.2.1   f2efb0401a30  3f7a0ed17ac2
>> osd.67                  ceph2-03          running (7m)    1s ago     2y   4541M    3310M    19.2.1   f2efb0401a30  eea0a6bcefec
>> osd.75                  ceph2-03          running (3h)    1s ago     2y   1239M    3310M    19.2.1   f2efb0401a30  5a801902340d
>> 
>> Best regards,
>> Jonas
>> 
>> --
>> Jonas Schwab
>> 
>> Research Data Management, Cluster of Excellence ct.qmat
>> https://data.ctqmat.de | datamanagement.ct.q...@listserv.dfn.de
>> Email: jonas.sch...@uni-wuerzburg.de
>> Tel: +49 931 31-84460
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io