[slurm-users] scancel gpu jobs when gpu is not requested

2021-08-24 Thread Ratnasamy, Fritz
Hello, I have written a script in my prolog.sh that cancels any Slurm job if the parameter gres=gpu is not present. This is the script I added to my prolog.sh: if [ $SLURM_JOB_PARTITION == "gpu" ]; then if [ ! -z "${GPU_DEVICE_ORDINAL}" ]; then echo "GPU ID used is ID: $GPU
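For context, a minimal sketch of such a prolog check is below. The preview above is cut off after "$GPU", so the echo line and the scancel branch are assumptions rather than the poster's exact script:

    #!/bin/bash
    # prolog.sh sketch: cancel jobs that land on the gpu partition without a
    # GPU allocation. Verify which variables are set in the prolog environment
    # on your Slurm version before deploying.
    if [ "$SLURM_JOB_PARTITION" == "gpu" ]; then
        if [ -n "${GPU_DEVICE_ORDINAL}" ]; then
            echo "GPU ID used is ID: $GPU_DEVICE_ORDINAL"
        else
            echo "No gres=gpu requested on the gpu partition; cancelling job $SLURM_JOB_ID" >&2
            scancel "$SLURM_JOB_ID"
        fi
    fi
    # Exit 0 either way: a non-zero prolog exit status would drain the node.
    exit 0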

Re: [slurm-users] EXTERNAL-Re: [External] scancel gpu jobs when gpu is not requested

2021-08-25 Thread Ratnasamy, Fritz
ially having to wait in the queue if the > cluster is busy and gets around having to cancel a running job. There is a > description and simple example at the bottom of this page: > https://slurm.schedmd.com/resource_limits.html > > > > Mike > > > > *From: *slurm-users

Re: [slurm-users] EXTERNAL-Re: [External] scancel gpu jobs when gpu is not requested

2021-08-30 Thread Ratnasamy, Fritz
your build of Slurm. I agree > that finding complete documentation for this feature is a little difficult. > > > > Mike > > > > *From: *slurm-users on behalf of > Ratnasamy, Fritz > *Date: *Wednesday, August 25, 2021 at 23:13 > *To: *Slurm User Community List &g

[slurm-users] Block jobs on GPU partition when GPU is not specified

2021-09-24 Thread Ratnasamy, Fritz
Hi, I would like to block jobs submitted to our GPU partition when gres=gpu:1 (or any number between 1 and 4) is not specified when submitting a job through sbatch or requesting an interactive session with srun. Currently, /etc/slurm/slurm.conf has JobSubmitPlugins=lua commented out. The liblua.so is n
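The thread converges on a job_submit.lua filter for this. A minimal sketch of that idea follows; field names differ between Slurm versions (older builds expose the request as job_desc.gres, newer ones as job_desc.tres_per_node), so treat it as a starting point and see the gist linked in the replies for a fuller example:

    -- /etc/slurm/job_submit.lua (sketch)
    function slurm_job_submit(job_desc, part_list, submit_uid)
        if job_desc.partition == "gpu" then
            -- Depending on the version, the GRES request may be in either field.
            local gres = job_desc.gres or job_desc.tres_per_node
            if gres == nil or string.match(gres, "gpu") == nil then
                slurm.log_user("Jobs on the gpu partition must request a GPU, e.g. --gres=gpu:1")
                return slurm.ERROR
            end
        end
        return slurm.SUCCESS
    end

    function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
        return slurm.SUCCESS
    end

Enabling it means uncommenting JobSubmitPlugins=lua in slurm.conf and placing the script in the same directory as slurm.conf.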

Re: [slurm-users] EXTERNAL-Re: Block jobs on GPU partition when GPU is not specified

2021-09-27 Thread Ratnasamy, Fritz
slurm/blob/master/contribs/lua/job_submit.lua > > [2] https://gist.github.com/mikerenfro/df89fac5052a45cc2c1651b9a30978e0 > > > > *From: *slurm-users on behalf of > Ratnasamy, Fritz > *Date: *Saturday, September 25, 2021 at 12:23 AM > *To: *Slurm User Community List

Re: [slurm-users] EXTERNAL-Re: Block jobs on GPU partition when GPU is not specified

2021-09-27 Thread Ratnasamy, Fritz
oth School of Business 5807 S. Woodlawn Chicago, Illinois 60637 Phone: +(1) 773-834-4556 On Mon, Sep 27, 2021 at 1:40 PM Renfro, Michael wrote: > Might need a restart of slurmctld at most, I expect. > > > > *From: *slurm-users on behalf of > Ratnasamy, Fritz > *Date: *Monday

[slurm-users] Secondary Unix group id of users not being issued in interactive srun command

2022-01-27 Thread Ratnasamy, Fritz
Hi, I have a similar issue to the one described at the following link ( https://groups.google.com/g/slurm-users/c/6SnwFV-S_Nk ). A machine had some existing local permissions. We have added it as a compute node to our cluster via Slurm. When running an srun interactive session on that server, it would see
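A quick way to narrow this down is to compare what the login node and the job environment report for the same user; node and group names below are placeholders:

    # On the login node:
    id $USER

    # Inside an interactive step on the affected node:
    srun -w <affected-node> --pty id

    # Directly on the node, outside Slurm, to rule out NSS/sssd caching:
    ssh <affected-node> "id $USER; getent group <secondary-group>"

If the groups show up over ssh but are missing only inside srun, one setting sometimes mentioned in this context is LaunchParameters=send_gids in slurm.conf, which in older releases controls whether the controller ships the extended group list to the compute node; check whether it applies to your version.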

Re: [slurm-users] Secondary Unix group id of users not being issued in interactive srun command

2022-01-28 Thread Ratnasamy, Fritz
: +(1) 773-834-4556 On Fri, Jan 28, 2022 at 2:04 AM Rémi Palancher wrote: > Le vendredi 28 janvier 2022 à 06:56, Ratnasamy, Fritz < > fritz.ratnas...@chicagobooth.edu> a écrit : > > > Hi, > > > > I have a similar issue as described on the following link ( > https

Re: [slurm-users] Secondary Unix group id of users not being issued in interactive srun command

2022-01-28 Thread Ratnasamy, Fritz
> Do you see the uid in /sys/fs/cgroup? (i.e. find /sys/fs/cgroup -name > "*71953*"). If not that could point to cgroup config. > > > From: slurm-users on behalf of > Ratnasamy, Fritz > Sent: Friday, January 28, 2022 11:13 AM &

[slurm-users] Allow SFTP on a specific compute node

2022-07-11 Thread Ratnasamy, Fritz
Hello, Currently, our cluster does not allow users to ssh to compute nodes unless they have a running job on that compute node. I believe a system admin has set up a PAM module that does the blocking. When trying to ssh, this is the message returned: Access denied by pam_slurm_adopt: you have no active

Re: [slurm-users] Allow SFTP on a specific compute node

2022-07-13 Thread Ratnasamy, Fritz
cript was started on one of your ‘special’ machines, start the > > second instance of sshd…..if not, do nothing > > > > Hope that helps > > > >> On 12 Jul 2022, at 05:53, Ratnasamy, Fritz > >> wrote: > >> > >>  > >> Hello, > >>
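A sketch of the second-sshd idea quoted above, with the paths, port, and PAM wiring as assumptions to adapt:

    # On the one machine that should accept SFTP, run an extra sshd with its
    # own config (and its own PAM service that does not load pam_slurm_adopt):
    /usr/sbin/sshd -f /etc/ssh/sshd_config_sftp

    # /etc/ssh/sshd_config_sftp (sketch):
    #   Port 2222
    #   PidFile /run/sshd_sftp.pid
    #   Subsystem sftp internal-sftp
    #   ForceCommand internal-sftp

Users then point their SFTP clients at port 2222 on that host, while normal ssh on port 22 stays behind pam_slurm_adopt.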

[slurm-users] node health check

2023-01-30 Thread Ratnasamy, Fritz
Hi, Currently, some of our nodes are overloaded. The NHC we have installed used to check the load and drain a node when it was overloaded. However, for the past few days, it has not been showing the state of the node. When I run /usr/sbin/nhc manually, it says 20230130 21:25:14 [slurm] /usr/libexec/nhc/node-
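For reference, the slurm.conf side of NHC and a couple of load-related checks look roughly like this; the check names follow LBNL NHC and the threshold is an example, so verify both against the version installed on the nodes:

    # slurm.conf
    HealthCheckProgram=/usr/sbin/nhc
    HealthCheckInterval=300

    # /etc/nhc/nhc.conf
    * || check_ps_loadavg 32
    * || check_fs_mount_rw /tmp

Running /usr/sbin/nhc by hand with its debug option (nhc -d) usually shows which check is failing or hanging before it gets to the node-state update.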

[slurm-users] Removing safely a node

2024-05-16 Thread Ratnasamy, Fritz via slurm-users
Hi, What is the "official" process for removing nodes safely? I have drained the nodes so that running jobs could complete, and put them in the down state once they were completely drained. I edited the slurm.conf file to remove the nodes. After some time, I can see that the nodes were removed from the partition with
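For comparison, a typical drain-then-remove sequence looks like this (node names are examples; removing nodes from slurm.conf generally requires restarting the daemons rather than just scontrol reconfigure, at least on releases of that era):

    scontrol update NodeName=node[01-04] State=DRAIN Reason="decommission"
    # ...wait for running jobs to finish...
    scontrol update NodeName=node[01-04] State=DOWN Reason="decommission"

    # Remove the NodeName= and PartitionName= references from slurm.conf on
    # every host, then restart:
    systemctl restart slurmctld           # on the controller
    systemctl restart slurmd              # on the remaining compute nodes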

[slurm-users] Not being able to ssh to node with running job

2024-06-06 Thread Ratnasamy, Fritz via slurm-users
As admins on the cluster, we do not observe any issue on our newly added GPU nodes. Regular users, however, do not see their jobs running on these GPU nodes when running squeue -u (the jobs do show with running status in sacct), and they are not able to ssh to these newly added

[slurm-users] Re: Not being able to ssh to node with running job

2024-06-06 Thread Ratnasamy, Fritz via slurm-users
, *Fritz Ratnasamy* Data Scientist Information Technology On Thu, Jun 6, 2024 at 2:11 PM Ratnasamy, Fritz via slurm-users < slurm-users@lists.schedmd.com> wrote: > As admin on the cluster, we do not observe any issue on our newly added > gpu nodes. > However, for regular users, they

[slurm-users] Suspending jobs and resuming

2024-11-21 Thread Ratnasamy, Fritz via slurm-users
Hi, I am using an old Slurm version, 20.11.8, and we had to reboot our cluster today for maintenance. I suspended all the jobs on it with the command scontrol suspend <list_of_job_ids>, and all the jobs paused and were suspended. However, when I tried to resume them after the reboot, scontrol resume did
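For reference, a common way to script the suspend/resume around a maintenance window (a sketch using standard squeue/scontrol options):

    # Before maintenance: suspend everything currently running
    squeue -h -t RUNNING -o %A | xargs -r -n1 scontrol suspend

    # After maintenance: resume what was suspended
    squeue -h -t SUSPENDED -o %A | xargs -r -n1 scontrol resume

Note that scontrol suspend only SIGSTOPs the job's processes on the compute nodes, so those processes do not survive a node reboot; for reboot-style maintenance, a maintenance reservation or holding/requeueing jobs is usually the safer pattern.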

[slurm-users] New version of slurm

2025-05-16 Thread Ratnasamy, Fritz via slurm-users
Hi, I am trying to install the new version of Slurm. Do you know if there is a way to find out what support is compiled into the executables? For example, Apache has httpd -L, which shows all the loaded modules. See below result: [image: image.png] *Fritz Ratnasamy* Data Scientist Information Te
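There is no single flag equivalent to Apache's module listing, but a few commands give a reasonable picture of what a Slurm build includes (the plugin directory path is an example and varies with the install prefix):

    slurmctld -V                                   # version string
    ldd $(which slurmctld) | egrep 'lua|mysql|hwloc|munge|nvml'
    ls /usr/lib64/slurm/                           # plugins actually installed
    scontrol show config | egrep -i 'accounting|select|gres'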

[slurm-users] sacctmgr: error: Sending PersistInit msg: Connection refused

2025-05-17 Thread Ratnasamy, Fritz via slurm-users
Hi, We are working on a test cluster with slurm 24.11.3 and I am getting this error message from the login or compute nodes (note that the error does not show when run from the controller node): sacctmgr list associations tree format=cluster,account,user,maxnodes sacctmgr: error: _open_persist_co
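Since the command works on the controller but not on the other nodes, the usual suspects are the accounting host setting and reachability of the slurmdbd port (6819 is the default). A sketch of what to check, with hostnames as placeholders:

    # In slurm.conf on the login/compute nodes:
    grep -i '^AccountingStorage' /etc/slurm/slurm.conf
    #   AccountingStorageType=accounting_storage/slurmdbd
    #   AccountingStorageHost=<slurmdbd host reachable from all nodes, not localhost>

    # On the slurmdbd host, confirm it is listening on a reachable address:
    ss -tlnp | grep 6819

    # From a login or compute node:
    nc -vz <slurmdbd-host> 6819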

[slurm-users] Slurm accounts managed via ansible

2025-05-27 Thread Ratnasamy, Fritz via slurm-users
Hi, I was wondering whether there might be built-in support for managing Slurm accounts, users, and associations in Ansible. It would be nice to be able to organize accounts in a YAML-style file and modify account settings via GitLab CI/CD. For example, in a format such as: accounts: - name: "Boss
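As far as I know there is no account-management module for Slurm shipped with Ansible itself, so this would be a custom role; a hypothetical YAML layout and the sacctmgr calls a task could render from it might look like:

    # group_vars/slurm_accounts.yml (hypothetical format)
    accounts:
      - name: "boss_lab"
        description: "Boss research group"
        maxjobs: 50
        users:
          - alice
          - bob

    # Commands a task/template could generate from it:
    #   sacctmgr -i add account boss_lab Description="Boss research group" MaxJobs=50
    #   sacctmgr -i add user alice Account=boss_lab
    #   sacctmgr -i add user bob Account=boss_lab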

[slurm-users] Account getting duplicated

2025-05-27 Thread Ratnasamy, Fritz via slurm-users
Hi, the Slurm DB duplicates all our account associations: one with cluster=cluster and another where cluster=venus (which is our actual cluster). Is that intended? Or should I make any changes? *Fritz Ratnasamy* Data Scientist Information Technology -- slurm-users mailing list -- slurm-users@li
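To see where the extra associations come from, and to remove a stray cluster record if "cluster" turns out to be an unused placeholder name, something like the following can be used (cluster names taken from the message above):

    sacctmgr list cluster
    sacctmgr list associations cluster=cluster format=cluster,account,user
    sacctmgr list associations cluster=venus format=cluster,account,user

    # Only if "cluster" is confirmed to be an unused leftover:
    sacctmgr delete cluster cluster

Duplicated associations like this usually mean accounts were added while two cluster records existed in the database, for example one created from a placeholder ClusterName before slurm.conf was finalised.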

[slurm-users] Re: [EXT] Re: slurm_pam_adopt module not working

2025-06-16 Thread Ratnasamy, Fritz via slurm-users
allow anyone > to log in. > > Sean > ------ > *From:* Ratnasamy, Fritz via slurm-users > *Sent:* Tuesday, 17 June 2025 14:55 > *To:* Kevin Buckley > *Cc:* slurm-users@lists.schedmd.com > *Subject:* [EXT] [slurm-users] Re: slurm_pam_adopt module not working > > * External email:

[slurm-users] Re: slurm_pam_adopt module not working

2025-06-16 Thread Ratnasamy, Fritz via slurm-users
ers@lists.schedmd.com> wrote: > On 2025/06/11 12:46, Ratnasamy, Fritz via slurm-users wrote: > > > > We wanted to block users from ssh to a node where there are no jobs > > running however it looks like users are able to do so. We have installed > > the slurm_pam_adopt_module and

[slurm-users] Transfer from GPFS via slurm

2025-06-05 Thread Ratnasamy, Fritz via slurm-users
Hi, We were told by our hardware provider that large dataset copies from an NFS location to GPFS could be run via Slurm to monitor the transfer. I am not sure how this works, as I could not find much online. Otherwise, apart from Globus and rsync, what would you suggest as a tool/command to cop
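For what it's worth, a plain sbatch wrapper around rsync is usually what "via Slurm" means in this context; it gives a job log and sacct accounting rather than any transfer-specific monitoring. A sketch with placeholder paths and partition name:

    #!/bin/bash
    #SBATCH --job-name=nfs_to_gpfs
    #SBATCH --partition=standard
    #SBATCH --time=24:00:00
    #SBATCH --cpus-per-task=4
    #SBATCH --output=transfer_%j.log

    # rsync preserves permissions/hard links and prints overall progress
    rsync -aH --info=progress2 /nfs/source/dataset/ /gpfs/project/dataset/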

[slurm-users] slurm_pam_adopt module not working

2025-06-10 Thread Ratnasamy, Fritz via slurm-users
Hi, We wanted to block users from ssh'ing to a node where there are no jobs running; however, it looks like users are able to do so. We have installed the pam_slurm_adopt module and set up slurm.conf accordingly (the same way we did on our first cluster, where the PAM module denies ssh access correc
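For comparison, the pieces pam_slurm_adopt normally needs are roughly the following; file names and the exact PAM stack vary by distro, so treat this as a checklist rather than a drop-in config:

    # /etc/pam.d/sshd (account section, typically as the last account rule):
    account    required     pam_slurm_adopt.so

    # /etc/ssh/sshd_config:
    UsePAM yes

    # slurm.conf on the compute nodes:
    PrologFlags=Contain

The pam_slurm_adopt documentation also notes interactions with systemd-logind/pam_systemd that are worth reviewing when behaviour differs between otherwise similar clusters.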

[slurm-users] seff command not found

2025-06-06 Thread Ratnasamy, Fritz via slurm-users
Hi, We installed a new Slurm version and it returns "command not found" for seff. I do not remember doing any manual installation for the previous versions; I thought it came with sacct, sbatch, etc. Any idea how I would need to set it up? I read online that seff is actually a Perl script. Bes
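seff is not installed with the core client commands; it lives in the contribs tree of the Slurm source and is often packaged separately, so a sketch of restoring it would be (paths and package names are examples):

    # From a built Slurm source tree:
    cd slurm-<version>/contribs/seff
    make && make install        # installs the seff Perl script into the Slurm prefix

    # Or via the separate package on RPM-based builds, e.g.:
    #   dnf install slurm-contribs

Since it is a standalone Perl script, copying the seff file from the previous installation onto the new cluster's PATH can also work, as long as the Perl Slurm bindings it uses are present.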