Hi!

We tried that (storing software on Lustre) and found that it doesn't really work.
For supplying software installations, CVMFS is a MUCH better choice,
at least as long as your nodes have local disk.

If you need more details about our CVMFS setup, let me know.

________________________________________
From: lustre-discuss <lustre-discuss-boun...@lists.lustre.org> on behalf of 
Laifer, Roland (SCC) via lustre-discuss <lustre-discuss@lists.lustre.org>
Sent: Thursday, September 12, 2024 12:41
To: lustre-discuss@lists.lustre.org
Subject: [lustre-discuss] Huge amounts of reads caused by shared library access

Dear Lustre admins,

I wanted to share an issue which we have been seeing for about two years.
Maybe the issue also exists at your site, or you can provide hints on how
it can be alleviated.

The issue is that we have huge amounts of read operations on servers
which seem to be caused by shared libraries stored on Lustre. Apparently
the Lustre client cache does not work here as expected for many
different applications. Note that we have installed most software
packages on Lustre and if you don't do that you might not be affected.

Of course we reported the issue to DDN support a long time ago.
They found an issue which might be causing it, see
https://jira.whamcloud.com/browse/LU-17463. But the patch has been under
development for many months and I'm not sure if it will really fix it.

Some more details:

The affected system has nearly 1000 nodes, is used by more than 1000
active users and there are many small jobs which share the same node.
The Lustre version on clients and servers is 2.12.9 with patches from
DDN. The issue is currently causing multiple GB/s of throughput and more
than 100 K IOPS on the affected file system.

With Lustre jobstats we saw that some jobs were creating hundreds of
millions of read operations. Other similar jobs did not have the issue, i.e.
the problem is not easily reproducible. We have a complicated reproducer
which works in most cases even on our test system.
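
To illustrate how such jobs can be picked out of jobstats, here is a minimal
Python sketch that ranks jobs by their number of read operations. It assumes
the YAML-like text printed by "lctl get_param obdfilter.*.job_stats" on an
OSS, with one "read_bytes: { samples: N, ... }" line per job. The function
names, job IDs and sample numbers are hypothetical, and the exact field
layout may differ between Lustre versions.

```python
import re

# Assumed line shapes in the job_stats output (may vary by Lustre version).
JOB_RE = re.compile(r"^- job_id:\s*(\S+)")
READ_RE = re.compile(r"^\s+read_bytes:\s*\{\s*samples:\s*(\d+)")


def read_ops_per_job(text):
    """Sum the read_bytes sample counts (read operations) per job ID."""
    ops = {}
    job = None
    for line in text.splitlines():
        m = JOB_RE.match(line)
        if m:
            job = m.group(1)
            continue
        m = READ_RE.match(line)
        if m and job is not None:
            # The same job can appear once per OST, so accumulate.
            ops[job] = ops.get(job, 0) + int(m.group(1))
    return ops


def top_readers(ops, n=5):
    """Return the n jobs with the most read operations, largest first."""
    return sorted(ops.items(), key=lambda kv: kv[1], reverse=True)[:n]


# Illustrative input only; these are not measurements from the affected system.
SAMPLE = """\
job_stats:
- job_id:          1001
  snapshot_time:   1726137660
  read_bytes:      { samples: 250000000, unit: bytes, min: 4096, max: 1048576, sum: 1024000000000 }
  write_bytes:     { samples: 10, unit: bytes, min: 4096, max: 4096, sum: 40960 }
- job_id:          1002
  snapshot_time:   1726137660
  read_bytes:      { samples: 1200, unit: bytes, min: 4096, max: 1048576, sum: 98304000 }
  write_bytes:     { samples: 0, unit: bytes, min: 0, max: 0, sum: 0 }
"""

if __name__ == "__main__":
    for job, count in top_readers(read_ops_per_job(SAMPLE)):
        print(job, count)
```

In practice the text would come from the output of the lctl command on each
OSS rather than from an embedded string.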

Several users reported that they were only using software on the
affected file system. The command "lctl get_param
llite.<fs_name>*.stats" showed huge amounts of page_fault entries and
there were indeed many page faults for shared libraries stored on the
affected file system.
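
As a rough sketch of how the page_fault counter can be tracked on a client,
the following Python snippet parses two snapshots of the llite stats output
and reports how much a counter grew in between. It assumes each counter line
has the form "<name> <count> samples [unit] ...". The function names and the
sample values are hypothetical and only serve as an illustration.

```python
def parse_llite_stats(text):
    """Map each counter name to its sample count.

    Lines without the "<name> <count> samples" shape (for example the
    snapshot_time line) are skipped.
    """
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) >= 3 and parts[2] == "samples":
            counters[parts[0]] = int(parts[1])
    return counters


def counter_delta(before, after, name):
    """Growth of one counter between two parsed snapshots."""
    return after.get(name, 0) - before.get(name, 0)


# Illustrative snapshots only; these are not values from the affected system.
SAMPLE_BEFORE = """\
snapshot_time             1726137600.000000000 secs.nsecs
read_bytes                1000 samples [bytes] 4096 1048576 81920000
page_fault                200000 samples [regs]
"""

SAMPLE_AFTER = """\
snapshot_time             1726137660.000000000 secs.nsecs
read_bytes                1200 samples [bytes] 4096 1048576 98304000
page_fault                500000 samples [regs]
"""

if __name__ == "__main__":
    before = parse_llite_stats(SAMPLE_BEFORE)
    after = parse_llite_stats(SAMPLE_AFTER)
    print("page_fault delta:", counter_delta(before, after, "page_fault"))
```

Taking the two snapshots a fixed interval apart (60 seconds is assumed in the
sample) turns the delta into a page fault rate that can be compared across
nodes.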

We also had discussions with another site where Lustre is provided by
another vendor, and they are seeing the same issue.

Regards,
   Roland
_______________________________________________
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
