In 2017, Oleg Drokin gave a talk at ORNL's Lustre Ecosystem workshop about LDLM that referenced ldlm.lock_limit_mb and ldlm.lock_reclaim_threshold_mb: https://lustre.ornl.gov/ecosystem-2017/documents/Day-2_Tutorial-4_Drokin.pdf

The apparent defaults back then in Lustre 2.8 for those two parameters were 30 MB and 20 MB, respectively. On my 2.15 servers with 256 GB of RAM and no changes from us, I'm seeing 77,244 MB and 51,496 MB, respectively. We recently got ourselves into a situation where a subset of MDTs appeared to be entirely overwhelmed trying to cancel locks, with ~500K locks in the request queue and request wait times around 6,000 seconds. So we're looking at potentially limiting the lock counts on the servers.
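
For what it's worth, those values look suspiciously like fixed fractions of total RAM (roughly 30% and 20% on a 256 GB node), though that's just my guess from the numbers, not anything I've found documented. Here's the small sanity-check sketch I've been using to compare; the 30/20 split is an assumption on my part:

    #!/usr/bin/env python3
    # Sanity check (my guess, not a documented formula): compare the ldlm
    # values reported by lctl against fixed fractions of total RAM.
    def meminfo_total_mb():
        """Return MemTotal from /proc/meminfo in MB (the file reports kB)."""
        with open("/proc/meminfo") as f:
            for line in f:
                if line.startswith("MemTotal:"):
                    return int(line.split()[1]) // 1024
        raise RuntimeError("MemTotal not found in /proc/meminfo")

    total_mb = meminfo_total_mb()
    print(f"MemTotal:                        {total_mb} MB")
    print(f"30% of MemTotal (lock_limit?):   {total_mb * 30 // 100} MB")
    print(f"20% of MemTotal (reclaim thr.?): {total_mb * 20 // 100} MB")

On these nodes that prints values in the same ballpark as what lctl reports, which is what makes me suspect a simple percentage-of-RAM default.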

What's the formula for appropriately sizing ldlm.lock_limit_mb and ldlm.lock_reclaim_threshold_mb in 2.15? (I don't think node memory has grown ~2,500X in seven years.)

Thanks!

Cameron Harr

