Hi! We tried that (storing software on Lustre) and found it doesn't really work.
For supplying software installations, CVMFS is a MUCH better choice, at least as long as you have node-local disk. If you need more details about our CVMFS setup, let me know.
________________________________________
From: lustre-discuss <lustre-discuss-boun...@lists.lustre.org> on behalf of Laifer, Roland (SCC) via lustre-discuss <lustre-discuss@lists.lustre.org>
Sent: Thursday, September 12, 2024 12:41
To: lustre-discuss@lists.lustre.org
Subject: [lustre-discuss] Huge amounts of reads caused by shared library access

Dear Lustre admins,

I wanted to share an issue which we have been seeing for about two years. Maybe the issue also exists at your site, or you can provide hints on how it can be alleviated.

The issue is that we see huge amounts of read operations on servers, which appear to be caused by shared libraries stored on Lustre. Apparently the Lustre client cache does not work as expected here for many different applications. Note that we have installed most software packages on Lustre; if you don't do that, you might not be affected.

Of course, we reported the issue to DDN support a long time ago. They found an issue which might be causing it, see https://jira.whamcloud.com/browse/LU-17463. However, the patch has been under development for many months, and I'm not sure if it will really fix the problem.

Some more details: The affected system has nearly 1000 nodes, is used by more than 1000 active users, and many small jobs share the same node. The Lustre version on clients and servers is 2.12.9 with patches from DDN. The issue is currently causing multiple GB/s of throughput and more than 100K IOPS on the affected file system. With Lustre jobstats we saw that some jobs were creating hundreds of millions of read operations. Other, similar jobs did not have the issue, i.e. the problem is not easily reproducible. We have a complicated reproducer which works in most cases, even on our test system.
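In case it helps anyone check for the same symptom: below is a minimal sketch of how per-job read counts can be summed from jobstats output. The job IDs and numbers are made-up sample data (on a live system the input would come from something like `lctl get_param obdfilter.*.job_stats` on the OSS nodes); the script embeds a sample so it runs standalone.

```shell
#!/bin/sh
# Sketch: sum read_bytes samples per job from Lustre job_stats output.
# Sample data below is fabricated; on a real server, replace the heredoc
# input with:  lctl get_param obdfilter.*.job_stats

sample='job_stats:
- job_id:          job_a.1001
  snapshot_time:   1726140000
  read_bytes:      { samples: 250000000, unit: bytes, min: 4096, max: 4194304, sum: 9000000 }
  write_bytes:     { samples: 12, unit: bytes, min: 4096, max: 65536, sum: 300000 }
- job_id:          job_b.1002
  snapshot_time:   1726140000
  read_bytes:      { samples: 1200, unit: bytes, min: 4096, max: 1048576, sum: 48000 }
  write_bytes:     { samples: 5, unit: bytes, min: 4096, max: 65536, sum: 100000 }'

result=$(printf '%s\n' "$sample" | awk '
  /job_id:/     { job = $3 }          # remember the current job id
  /read_bytes:/ {                     # the "samples:" value is field 4
      gsub(",", "", $4)
      reads[job] += $4
  }
  END {
      for (j in reads) printf "%s %d\n", j, reads[j]
  }
' | sort -k2 -rn)

# Jobs with the most read operations come first.
printf '%s\n' "$result"
```

Sorting by the read sample count makes jobs like the ones we saw (hundreds of millions of reads) stand out immediately at the top of the list.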
Several users reported that they were only using software from the affected file system. The command "lctl get_param llite.<fs_name>*.stats" showed huge numbers of page_fault entries, and there were indeed many page faults for shared libraries stored on the affected file system. We also had discussions with another site where Lustre is provided by a different vendor, and they are seeing the same issue.

Regards,
  Roland
_______________________________________________
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org