Hi Henning,

I think the increasing strays_created is normal. It is a counter that increases monotonically every time a file is deleted, and it is only reset when the MDS restarts. num_strays, on the other hand, is the actual number of strays currently in your system, and those do not necessarily reside in memory.
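If you want to see whether the purge queue is keeping up, you can sample the counters twice and compare the deltas. A minimal sketch, assuming the admin socket for mds.0 is reachable on the local host and that the stray counters are exported under the "mds_cache" section of perf dump (the temp file names are just examples):

  ceph daemon mds.0 perf dump > /tmp/stray.1.json
  sleep 60
  ceph daemon mds.0 perf dump > /tmp/stray.2.json
  python3 - <<'EOF'
  import json
  # Load both samples and print the per-counter change over the 60s window.
  a = json.load(open('/tmp/stray.1.json'))['mds_cache']
  b = json.load(open('/tmp/stray.2.json'))['mds_cache']
  for k in ('num_strays', 'strays_created', 'strays_enqueued', 'strays_reintegrated'):
      print(k, a[k], '->', b[k], 'delta', b[k] - a[k])
  EOF

As long as strays_enqueued grows at roughly the same rate as strays_created and num_strays stays flat, deletions are being handed to the purge queue normally.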
Weiwen Hu

> On 24 May 2023, at 20:22, Henning Achterrath <ach...@uni-bonn.de> wrote:
>
> Hello again,
>
> In two days, the number has increased by about one and a half million, and
> the RAM usage of the MDS remains high at about 50G. We are very unsure
> whether this is normal behavior.
>
> Today:
> "num_strays": 53695,
> "num_strays_delayed": 4,
> "num_strays_enqueuing": 0,
> "strays_created": 3618390,
> "strays_enqueued": 3943542,
> "strays_reintegrated": 144545,
> "strays_migrated": 38,
>
> On 22.05.23:
>
> ceph daemon mds.0 perf dump | grep stray
> "num_strays": 49846,
> "num_strays_delayed": 21,
> "num_strays_enqueuing": 0,
> "strays_created": 2042124,
> "strays_enqueued": 2396076,
> "strays_reintegrated": 44207,
> "strays_migrated": 38,
>
> Maybe someone can explain to us what these counters mean in detail. The
> perf schema is not very revealing.
>
> Our idea is to temporarily add a standby-replay (hot-standby) MDS, to
> ensure the journal is replayable before we resume the upgrade.
>
> I would be grateful for any advice.
>
> best regards
> Henning
>
>> On 23.05.23 17:24, Henning Achterrath wrote:
>> In addition, I would like to mention that the number of "strays_created"
>> also increases after this action, but the number of num_strays is lower
>> now.
>> If desired, we can provide debug logs from the MDS at the time it was in
>> stopping state and we did a systemctl restart of mds1.
>> The only active MDS has a RAM usage of about 50G. The memory limit is
>> 32G, but we get no warnings about that. Maybe the separate purge_queue is
>> consuming a lot of RAM and does not count towards the limit? Usually we
>> get notified when the MDS exceeds its memory limit.
>> thank you
>>> On 22.05.23 15:23, t.kulschew...@uni-bonn.de wrote:
>>> Hi Venky,
>>>
>>> thank you for your help. We managed to shut down mds.1:
>>> We set "ceph fs set max_mds 1" and waited for about 30 minutes. In the
>>> first couple of minutes, strays were migrated from mds.1 to mds.0. After
>>> this, the stray export hung and mds.1 remained in state_stopping. After
>>> about 30 minutes, we restarted mds.1. This resulted in one active MDS
>>> and two standby MDSs. However, we are not sure whether the remaining
>>> strays were migrated.
>>>
>>> When we had a closer look at the perf counters of the MDS, we realized
>>> that the number of strays_enqueued is quite high and constantly
>>> increasing. Is this to be expected? What does the counter
>>> "strays_enqueued" mean in detail?
>>>
>>> ceph daemon mds.0 perf dump | grep stray
>>> "num_strays": 49846,
>>> "num_strays_delayed": 21,
>>> "num_strays_enqueuing": 0,
>>> "strays_created": 2042124,
>>> "strays_enqueued": 2396076,
>>> "strays_reintegrated": 44207,
>>> "strays_migrated": 38,
>>>
>>> Would it be safe to perform "ceph orch upgrade resume" at this point? At
>>> the moment, the MONs and OSDs are running 17.2.6, while the MDSs and
>>> RGWs are running 17.2.5. So we have to upgrade the MDS and RGW
>>> eventually.
>>>
>>> Best, Tobias
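PS: regarding Henning's standby-replay idea, the relevant setting is allow_standby_replay. A minimal sketch, assuming your filesystem is named "cephfs" (substitute your actual fs name):

  ceph fs set cephfs allow_standby_replay true
  ceph fs status cephfs   # one daemon should now appear in standby-replay state
  ceph fs set cephfs allow_standby_replay false   # revert when no longer needed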
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io