In 2017, Oleg gave a talk at ORNL's Lustre conference about LDLM,
including references to ldlm.lock_limit_mb and
ldlm.lock_reclaim_threshold_mb.
https://lustre.ornl.gov/ecosystem-2017/documents/Day-2_Tutorial-4_Drokin.pdf
The apparent defaults back then in Lustre 2.8 for those two parameters
were 30MB and 20MB, respectively. On my 2.15 servers with 256GB of RAM
and no tuning changes on our side, I'm seeing values of 77244MB and
51496MB, respectively. We recently got ourselves into a situation where
a subset of MDTs appeared to be entirely overwhelmed trying to cancel
locks, with ~500K locks in the request queue and request wait times of
around 6000 seconds. So, we're looking at potentially limiting the
number of locks on the servers.
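
For what it's worth, those two values line up with fixed fractions of
node RAM, so my guess (and it is only a guess, not something stated in
the slides) is that the 2.8-era "30MB/20MB" was really 30%/20% of
memory. A quick back-of-the-envelope check in Python against the
numbers above:

    # Quick check: do the observed defaults look like 30% / 20% of MemTotal?
    # The 30%/20% split is my assumption, not something documented here.
    lock_limit_mb = 77244            # observed ldlm.lock_limit_mb
    reclaim_threshold_mb = 51496     # observed ldlm.lock_reclaim_threshold_mb

    print(lock_limit_mb / reclaim_threshold_mb)   # 1.5, i.e. a 30:20 ratio
    print(lock_limit_mb / 0.30 / 1024)            # ~251 GiB implied MemTotal,
                                                  # which fits a "256GB" node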
What's the formula for appropriately sizing ldlm.lock_limit_mb and
ldlm.lock_reclaim_threshold_mb in 2.15 (I don't think node memory
amounts have increased ~2500X in 7 years)?
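
In case it helps frame an answer, here's the rough sizing sketch I've
been playing with. The bytes-per-lock figure and the target lock count
below are placeholders I picked for illustration, not documented Lustre
numbers, and the 2:3 threshold-to-limit ratio just mirrors what the
current defaults show:

    # Hypothetical sizing sketch: pick a tolerable number of granted
    # server-side LDLM locks and turn it into the two _mb tunables.
    EST_BYTES_PER_LOCK = 2 * 1024    # assumed average memory per lock (a guess)
    TARGET_MAX_LOCKS = 20_000_000    # how many granted locks we want to allow

    lock_limit_mb = TARGET_MAX_LOCKS * EST_BYTES_PER_LOCK // (1024 * 1024)
    # keep the stock 2:3 relationship between reclaim threshold and hard limit
    lock_reclaim_threshold_mb = lock_limit_mb * 2 // 3

    print(f"lctl set_param ldlm.lock_limit_mb={lock_limit_mb}")
    print(f"lctl set_param ldlm.lock_reclaim_threshold_mb={lock_reclaim_threshold_mb}")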
Thanks!
Cameron Harr