Hi Thomas,
It sounds like you are running into this issue:
https://jira.whamcloud.com/browse/LU-14121
I think I ran into the same issue as you, or at least something similar,
on our Slurm cluster running Lustre 2.15.x (servers and clients).
As I haven't had the spare cycles or equipment to dig into what was
going on, I have been using admin=1 and the legacy root squash mechanism
for our cluster nodes, as mentioned in the Jira ticket.
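For reference, roughly what I set (a sketch only; the nodemap name
"BatchNodes", the fsname "lustre", and the NID range below are
placeholders for our real ones):

    # Run on the MGS. Give the cluster nodemap admin=1 so root is no
    # longer squashed by the nodemap layer:
    lctl nodemap_modify --name BatchNodes --property admin --value 1
    # Then fall back to the legacy (pre-nodemap) root squash:
    lctl conf_param lustre.mdt.root_squash=99:99
    # Exempt the admin hosts from the legacy squash:
    lctl conf_param lustre.mdt.nosquash_nids="10.0.0.[1-2]@o2ib"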
Thanks,
David
On 1/9/2025 12:58 PM, Thomas Roth wrote:
Yes, yes,
I have an Admin nodemap comprising all Lustre servers and a handful of
administrative clients, and this nodemap has both admin and trusted
set to 1.
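For what it is worth, I check those flags like this (assuming the
nodemap is really named "Admin"; parameter names as exposed by the
nodemap proc interface):

    # On a server, print the effective nodemap flags:
    lctl get_param nodemap.Admin.admin_nodemap nodemap.Admin.trusted_nodemap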
No, by now I rather think it is this: because the Slurm daemon,
slurmstepd, runs as root, it comes in as user 99 on the batch nodes.
When a job then wants to write its output to, say, /lustre/A/B/C/, and
A, B, C are not world-readable (actually octal '5'), slurmstepd cannot
step into the output directory and the job fails.
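You can see the traversal failure without Slurm by impersonating the
squash UID on a client (99 here; the path is the placeholder from
above):

    # root on a squashed client acts as UID 99; reaching C requires
    # execute ('x') permission on every path component:
    sudo -u '#99' ls /lustre/A/B/C   # "Permission denied" if A, B or C lacks o+x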
Regards,
Thomas
On 1/9/25 1:10 PM, Sebastien Buisson wrote:
Hi,
As explained in the Lustre Operations Manual in this section:
https://doc.lustre.org/lustre_manual.xhtml#idm139831573757696
it is required to define a nodemap that matches all server nodes,
with admin and trusted set to 1.
Have you?
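For example, something along these lines on the MGS (the nodemap name
and NID range are placeholders for your server nodes):

    lctl nodemap_add ServerNodes
    lctl nodemap_add_range --name ServerNodes --range 10.0.0.[1-20]@o2ib
    lctl nodemap_modify --name ServerNodes --property admin --value 1
    lctl nodemap_modify --name ServerNodes --property trusted --value 1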
Cheers,
Sebastien.
On 9 Jan 2025, at 13:03, Thomas Roth <[email protected]> wrote:
Hi all,
we have just switched on nodemap on our 2.12 cluster, with all batch
clients having trusted=1 but admin=0, so basically root squash.
The batch system is Slurm.
Now all jobs fail with "permission denied" whenever the user's
directory on Lustre is not world-readable.
Read/write access from an interactive shell is not a problem.
Has any other site running Slurm encountered a similar issue?
Regards,
Thomas
Perhaps I should add that I have used the default nodemap for this,
to avoid having to specify many hundreds of non-contiguous batch node
IP ranges.
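Concretely, that amounted to roughly this on the MGS (a sketch, not a
verbatim transcript):

    # Squash root (admin=0) but keep client-supplied UIDs/GIDs (trusted=1)
    # for everything that matches no explicit nodemap:
    lctl nodemap_modify --name default --property trusted --value 1
    lctl nodemap_modify --name default --property admin --value 0
    # Activate the nodemap feature cluster-wide:
    lctl nodemap_activate 1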
_______________________________________________
lustre-discuss mailing list
[email protected]
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org