[slurm-users] selecting job-specific log messages

2024-02-03 Thread urbanjost via slurm-users
The log files use many different strings to identify a job, and some messages do not contain a job ID at all. The closest I can get with a single grep is:

NUMBER=$SLURM_JOBID
egrep "\.\<$NUMBER\>\] |\<$NUMBER\>\.batch|jobid \<$NUMBER\>|JobId=\<$NUMBER\>|job id \<$NUMBER\>|job\.\<$NUMBER\>|job \<$NUMBER\>|jobid \[\<$NUMBER\>\]|task_p_slurmd_batch_request: \<$NUMBER\>" /var/log/slurm*

Even that misses crucial messages that do not contain the job ID at all, for example:

[2024-02-03T11:50:33.052] _get_user_env: get env for user jsu here
[2024-02-03T11:52:33.152] timeout waiting for /bin/su to complete
[2024-02-03T11:52:34.152] error: Failed to load current user environment variables
[2024-02-03T11:52:34.153] error: _get_user_env: Unable to get user's local environment, running only with passed environment

It would be very useful if all messages related to a job carried a consistent string that could be grepped for in the log files; even better would be a command along the lines of "scontrol show jobid=<N> log_messages".

But I could not find what I wanted: an easy way to find all daemon log messages related to a specific job. It would be particularly useful if there were a way to automatically append such information to the job's stdout at job termination, so that users would automatically see information about job failures or warnings.

Is there such a feature available that I have missed?
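
In the meantime, the closest workaround I can think of is an Epilog script along these lines. This is only a rough, untested sketch and makes several assumptions: a shared filesystem so the epilog can reach the job's stdout file, the job record still being queryable with scontrol when the epilog runs, and the node's daemon log living at /var/log/slurmd.log.

#!/bin/bash
# hypothetical Epilog= script (sketch, untested)
# Grep the node's slurmd log for lines mentioning the finishing job and
# append them to the job's stdout file so the user sees the daemon messages.

JOBID="${SLURM_JOB_ID:?must run under the Slurm epilog}"

# Ask the controller for the job's StdOut= path.
STDOUT_PATH=$(scontrol show job "$JOBID" 2>/dev/null |
              grep -o 'StdOut=[^ ]*' | cut -d= -f2)

# Only proceed if we found a writable stdout file.
[ -n "$STDOUT_PATH" ] && [ -w "$STDOUT_PATH" ] || exit 0

{
  echo "=== slurmd log lines mentioning job $JOBID ==="
  grep -E "\<$JOBID\>" /var/log/slurmd.log
} >> "$STDOUT_PATH"

exit 0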



[slurm-users] timeout of get_user_env does not obey time limit

2024-02-03 Thread urbanjost via slurm-users
If I use the sbatch(1) option --export=NONE, wipe the environment with "env -i /usr/bin/sbatch ...", or use --export=NIL, then the environment is not properly constructed and I see the following messages in the /var/log/*slurm* files:

[2024-02-03T11:50:33.052] _get_user_env: get env for user jsu here
[2024-02-03T11:52:33.152] timeout waiting for /bin/su to complete
[2024-02-03T11:52:34.152] error: Failed to load current user environment variables
[2024-02-03T11:52:34.153] error: _get_user_env: Unable to get user's local environment, running only with passed environment

This occurs at 120 seconds no matter whether I add --get-user-env=3600 or adjust any of the many slurm.conf time-related parameters. It is easy to reproduce by adding "sleep 100" to a .cshrc file and submitting the following script with sbatch(1):

#!/bin/csh
#SBATCH --export=NONE --propagate=NONE --get-user-env=3600L
printenv HOME
printenv USER
printenv PATH
env
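
For completeness, the full reproduction looks like this (the script file name is just an example):

# slow down the shell start-up as described
echo "sleep 100" >> ~/.cshrc
# submit the csh script shown above
sbatch env_test.csh
# the timeout messages then show up in the daemon logs after about 120 seconds
grep _get_user_env /var/log/*slurm*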

I have adjusted many time-related limits in the slurm.conf file to no avail. When the system is unresponsive or heavily loaded, or when users have prologues that set up complex environments via module commands (which can be notoriously slow), jobs fail or produce errors.
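
(For what it is worth, I have been checking the effective values against the running daemon rather than just the file, with something like the following:)

# dump the time-related settings slurmctld is actually using;
# changing these has not moved the 120 second cutoff for me
scontrol show config | grep -i time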

If I configure Slurm so that jobs that hit this timeout are requeued instead of run, then a user with a slow login setup can submit a large number of jobs and effectively shut down the cluster, because that option not only requeues the failing jobs but also puts each node it happens on into a DRAIN state.

We see this as very dangerous, since by default jobs proceed to execute even when their environment has not been properly constructed.

I can see that the _get_user_env() code path (which shells out to /bin/su) is involved, and a preliminary scan of the code suggested that the --get-user-env= value is being parsed, but I did not see why the setup always times out at 120 seconds (at least on my system).

Does anyone know how to make the time allowed for building the default user environment honor the value given with the --get-user-env option when no environment is being exported to a job?

This shows up sporadically and causes intermittent failures that are very confusing and unsettling for the users affected.



[slurm-users] SLURM configuration for LDAP users

2024-02-03 Thread Richard Chang via slurm-users

Hi,

I am a little new to this, so please pardon my ignorance.

I have configured Slurm on my cluster and it works fine with local 
users, but I am not able to get it working with LDAP/SSSD authentication.


User logins over ssh work fine. An LDAP user can log in to the 
login, slurmctld and compute nodes, but when they try to submit jobs, 
slurmctld logs an error about an invalid account or partition for the user.
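
(To see whether a given user already has an association in the accounting database, which seems to be what slurmctld is complaining about, I check with something like the following; the user name is just an example:)

# list any associations for the user; for an LDAP user that was never
# added with sacctmgr this comes back empty
sacctmgr show assoc where user=jsmith format=cluster,account,user,partition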


Someone said we need to add each user manually to the accounting database using 
the sacctmgr command, and it does work if we add an LDAP user that way. But I am 
not sure we really need to do this for each and every LDAP user, and I am not 
convinced this manual approach is the right way to do it.
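
(If adding users by hand really is the expected approach, I assume it could at least be scripted along the following lines. This is just a sketch: the account name is made up, and getent will only list LDAP users if SSSD enumeration is enabled or a user list is obtained some other way.)

# one-time: create an account (name is an example) to attach the users to
sacctmgr -i add account ldapusers Description="LDAP users"

# add every user with UID >= 1000 to that account
for u in $(getent passwd | awk -F: '$3 >= 1000 {print $1}'); do
    sacctmgr -i add user "$u" Account=ldapusers
done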


The documentation is not very clear about using LDAP accounts.

I saw somewhere on the list a suggestion to use UsePAM=1 and to copy or symlink 
the Slurm PAM module under /etc/pam.d, but that did not work for me.


I saw somewhere else that we need to specify 
LaunchParameters=enable_nss_slurm in the slurm.conf file and add the slurm 
keyword to the passwd and group entries in /etc/nsswitch.conf. I did both, 
but that did not help either.
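
(Concretely, what I ended up with looks like this, in case I have the syntax wrong; the ordering of the other nsswitch sources is just what we already had:)

# slurm.conf (compute nodes)
LaunchParameters=enable_nss_slurm

# /etc/nsswitch.conf
passwd: files sss slurm
group:  files sss slurm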


I am bereft of ideas at present. If anyone has real world experience and 
can advise, I will be grateful.


Thank you,

Richard
