Hi Thomas,

It sounds like you are running into this issue:
https://jira.whamcloud.com/browse/LU-14121

I think I ran into the same issue, or at least something similar, on our Slurm cluster running Lustre 2.15.x (servers and clients). Since I haven't had the spare cycles or equipment to dig into what was going on, I have been using admin=1 and the legacy root squash mechanism for our cluster nodes, as mentioned in the Jira ticket.


Thanks,
David


On 1/9/2025 12:58 PM, Thomas Roth wrote:
Yes, yes,
I have an Admin nodemap comprising all Lustre servers and a handful of administrative clients, and this nodemap has both admin and trusted set to 1.

No, by now I rather think the cause is this: the Slurm daemon, slurmstepd, runs as root, so on the batch nodes it comes in as user 99. When the job wants to write output to, say, /lustre/A/B/C/, and A, B, C are not world-readable (actually octal '5'), slurmstepd can't step into the output directory and the job fails.
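
To make that reasoning concrete, here is a minimal Python sketch that lists which directories on a path would stop a squashed user from stepping in. It assumes the squashed identity (user 99 above) is neither the owner nor in the group of any directory and that no ACLs apply, so only the "other" permission bits count. The function name blocked_components and the stop_at parameter are hypothetical, made up for this illustration, and the check only inspects mode bits; it does not ask Lustre or Slurm anything.

```python
import os
import stat


def blocked_components(path, stop_at="/"):
    """Return the directories from stop_at down to path that lack the
    world-execute (search) bit, top-most first.

    Assumption: the caller is treated as "other" on every directory,
    as a root-squashed slurmstepd would be if it owns none of them.
    """
    blocked = []
    current = os.path.abspath(path)
    stop_at = os.path.abspath(stop_at)
    while True:
        mode = os.stat(current).st_mode
        # A directory can only be traversed by "other" if S_IXOTH is set.
        if stat.S_ISDIR(mode) and not mode & stat.S_IXOTH:
            blocked.append(current)
        parent = os.path.dirname(current)
        # Stop at the requested top directory or at the filesystem root.
        if current == stop_at or current == parent:
            break
        current = parent
    return list(reversed(blocked))
```

Run as a user who can stat the directories, for example blocked_components("/lustre/A/B/C", stop_at="/lustre"); an empty list means every component has the world-execute bit set.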


Regards,
Thomas

On 1/9/25 1:10 PM, Sebastien Buisson wrote:
Hi,

As explained in the Lustre Operations Manual in this section:
https://doc.lustre.org/lustre_manual.xhtml#idm139831573757696
it is required to define a nodemap that matches all server nodes, with admin and trusted set to 1.
Have you?

Cheers,
Sebastien.

On 9 Jan 2025, at 13:03, Thomas Roth <[email protected]> wrote:

Hi all,

We have just switched on nodemap on our 2.12 cluster, with all batch clients being trusted=1 but admin=0, so basically root-squashing.

The batch system is Slurm.

Now all jobs fail when the user's directory on Lustre is not world-readable ("permission denied").

Read/write access in the shell is not a problem.



Has any site running Slurm encountered a similar issue?


Regards,
Thomas


Perhaps I should add that I have used the default nodemap for this, to avoid having to specify many hundreds of non-contiguous batch node IP ranges.
_______________________________________________
lustre-discuss mailing list
[email protected]
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
