[slurm-users] Re: Running SLURM in a laptop

2025-02-19 Thread John Hearns via slurm-users
How about using cpusets? Create a boot cpuset with the E-cores and start Slurm on the P-cores. Yeah, showing my age by talking about cpusets. On Wed, Feb 19, 2025, 6:05 PM Timo Rothenpieler via slurm-users < slurm-users@lists.schedmd.com> wrote: > On 19.02.2025 14:06, Luke Sudbery via slurm-users w
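A minimal sketch of such a boot cpuset on a cgroup-v1 system, assuming a hypothetical layout of P-cores 0-7 and E-cores 8-15 (check /proc/cpuinfo for the real numbering):

    # confine boot-time tasks to the E-cores, leaving the P-cores for slurmd
    mkdir /sys/fs/cgroup/cpuset/boot
    echo 8-15 > /sys/fs/cgroup/cpuset/boot/cpuset.cpus
    echo 0 > /sys/fs/cgroup/cpuset/boot/cpuset.mems
    # migrate every existing task into the boot set (some kernel threads refuse)
    for pid in $(cat /sys/fs/cgroup/cpuset/tasks); do
        echo "$pid" > /sys/fs/cgroup/cpuset/boot/tasks 2>/dev/null
    done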

[slurm-users] Run only one time on a node

2025-02-18 Thread John Hearns via slurm-users
I am running single node tests on a cluster. I can select named nodes using the -w flag with sbatch. However - if I want to submit perhaps 20 test jobs, is there any smart way to run only once per node? I know I could touch a file with the hostname and test for that file. I am just wondering i
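A minimal sketch of the obvious loop, assuming hypothetical node names and that one job per named node is acceptable:

    # submit exactly one test job to each named node via --nodelist (-w)
    for node in node01 node02 node03; do
        sbatch --nodelist="$node" --job-name="test-$node" test_job.sh
    done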

[slurm-users] Create filenames based on slurm hosts

2025-02-14 Thread John Hearns via slurm-users
I am working on power logging of a GPU cluster. I am running jobs on multiple hosts. I want to create a file, one for each host, which has a unique filename containing the host name. Something like clush -w $SLURM_JOB_NODELIST "touch file$(hostname)" My foo is weak today. Help
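The catch in the command above is the double quotes: $(hostname) expands on the submitting host before clush ever runs. Single quotes defer it to each remote shell; a sketch, with a hypothetical filename prefix:

    # $(hostname) now expands on each remote node, not locally
    clush -w "$SLURM_JOB_NODELIST" 'touch "power_$(hostname)"'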

[slurm-users] Re: jobs getting stuck in CG

2025-02-10 Thread John Hearns via slurm-users
ps -eaf --forest is your friend with Slurm On Mon, Feb 10, 2025, 12:08 PM Michał Kadlof via slurm-users < slurm-users@lists.schedmd.com> wrote: > I observed similar symptoms when we had issues with the shared Lustre file > system. When the file system couldn't complete an I/O operation, the > pro
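A typical invocation, filtering the tree down to the step daemons (slurmstepd is the per-step daemon on a stock Slurm install):

    # show each slurmstepd with its parent and a few lines of its subtree
    ps -eaf --forest | grep -B1 -A5 slurmstepd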

[slurm-users] Re: jobs getting stuck in CG

2025-02-10 Thread John Hearns via slurm-users
Belay that reply. Different issue. In that case salloc works OK but srun says the user has no job on the node. On Mon, Feb 10, 2025, 9:24 AM John Hearns wrote: > I have had something similar. > The fix was to run a > scontrol reconfig > Which causes a reread of the Slurmd config > Give that a try > >

[slurm-users] Re: jobs getting stuck in CG

2025-02-10 Thread John Hearns via slurm-users
I have had something similar. The fix was to run a scontrol reconfig, which causes a reread of the slurmd config. Give that a try. It might be scontrol reconfigure in full; check the manual. On Mon, Feb 10, 2025, 8:32 AM Ricardo Román-Brenes via slurm-users < slurm-users@lists.schedmd.com> wrote: > Hello everyone.
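For reference, the unabbreviated command is:

    # asks slurmctld to re-read slurm.conf and push the change out
    scontrol reconfigure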

[slurm-users] Re: Installing slurm*

2025-02-04 Thread John Hearns via slurm-users
Steven, one tip if you are just starting with Slurm: "Use the logs, Luke. Use the logs." By this I mean tail -f /var/log/slurmctld.log and restart the slurmctld service. On a compute node, tail -f /var/log/slurmd.log Oh, and you probably are going to set up Munge also - which is easy. On Tue, 4 Feb 2025
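A sketch of that workflow; the log paths below are common defaults, but the real locations are set by SlurmctldLogFile and SlurmdLogFile in slurm.conf:

    # on the controller
    tail -f /var/log/slurmctld.log &
    systemctl restart slurmctld

    # on a compute node
    tail -f /var/log/slurmd.log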

[slurm-users] Re: RHEL8.10 V slurmctld

2025-01-30 Thread John Hearns via slurm-users
Have you run id on a compute node? On Wed, Jan 29, 2025, 6:47 PM Steven Jones via slurm-users < slurm-users@lists.schedmd.com> wrote: > I am using Redhat's IdM/IPA for users > > Slurmctld is failing to run jobs and it is getting "invalid user id". > > "2025-01-28T21:48:50.271] sched: Allocate J
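That is, something like the following on a compute node, with a hypothetical username; it should resolve the same uid/gid via IdM/IPA as it does on the controller:

    id someuser
    getent passwd someuser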

[slurm-users] Re: need help with seff script on ubuntu (slurm 21.08)

2025-01-09 Thread John Hearns via slurm-users
To debug shell scripts, try running with the -x flag. On Thu, Jan 9, 2025, 10:51 AM Gérard Henry (AMU) via slurm-users < slurm-users@lists.schedmd.com> wrote: > Hello all and happy new year, > > i have installed slurm 21.08 on ubuntu 22 LTS, and database for > accounting on a remote machine run
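A sketch of the two usual ways to get trace output, with a hypothetical script name:

    # trace the whole script
    bash -x ./myscript.sh

    # or trace just a region, from inside the script
    set -x
    some_command_of_interest    # hypothetical
    set +x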

[slurm-users] Re: launch failed requeued held

2025-01-08 Thread John Hearns via slurm-users
Generally, the troubleshooting steps you should take for Slurm are: squeue to look at the list of running/queued/held jobs; sinfo to show which nodes are idle, busy or down; scontrol show node to get more detailed information on a node. For problem nodes - indeed just log into any node t
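In command form, with a hypothetical node name:

    squeue                      # running/queued/held jobs
    sinfo                       # node states: idle, busy, down, drained
    scontrol show node node01   # detailed state and Reason for one node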

[slurm-users] Re: launch failed requeued held

2025-01-07 Thread John Hearns via slurm-users
You need to find the node which the job started on. Then look at the slurmd log on that node. You may find an indication of the reason for the failure. On Tue, 7 Jan 2025 at 11:30, sportlecon sportlecon via slurm-users < slurm-users@lists.schedmd.com> wrote: > slurm 24.11 - squeue displays reaso
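A sketch of that lookup, with a hypothetical job id, node name and log path:

    # which node(s) did the job run on, and how did it exit?
    sacct -j 12345 --format=JobID,NodeList,State,ExitCode
    # then read the slurmd log on that node
    ssh node01 tail -n 100 /var/log/slurmd.log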

[slurm-users] Re: formatting node names

2025-01-07 Thread John Hearns via slurm-users
Davide, the 'nodeset' command can be used here nodeset -e -S '\n' node[03-04,12-22,27-32,36] On Mon, 6 Jan 2025 at 19:58, Davide DelVento via slurm-users < slurm-users@lists.schedmd.com> wrote: > Hi all, > I remember seeing on this list a slurm command to change a slurm-friendly > list suc
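For example, that expands the folded list to one hostname per line:

    $ nodeset -e -S '\n' node[03-04,12-13]
    node03
    node04
    node12
    node13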

[slurm-users] Re: Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions

2025-01-04 Thread John Hearns via slurm-users
Check the output of sinfo and squeue. Look at the slurmd log on an example node also; tail -f is your friend. On Sat, Jan 4, 2025, 8:13 AM sportlecon sportlecon via slurm-users < slurm-users@lists.schedmd.com> wrote: > JOBID PARTITION NAME USER ST TIME NODES > NODELIST(REASON) >

[slurm-users] Re: Slurm plugin for custom hardware allocation

2024-12-23 Thread John Hearns via slurm-users
I think this was discussed here recently. On Mon, Dec 23, 2024, 12:18 PM Laura Zharmukhametova via slurm-users < slurm-users@lists.schedmd.com> wrote: > Hello, > > Is there an existing Slurm plugin for FPGA allocation? If not, can someone > please point me in the right direction for how to approa

[slurm-users] Re: getting slurm going

2024-12-08 Thread John Hearns via slurm-users
Is your slurm.conf identical on all nodes? On Sun, Dec 8, 2024, 7:42 PM John Hearns wrote: > tail -f on the slurm controller logs > > Log into a compute node and tail -f on the slurmd log as slurmd is > started > Or start slurmd in the foreground and set the debug flag > > On Sun, Dec 8, 2024,
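One quick way to check, reusing clush from elsewhere in this thread; the node list and config path are assumptions:

    # -b folds identical outputs together: one md5 line means all nodes match
    clush -b -w node[01-20] md5sum /etc/slurm/slurm.conf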

[slurm-users] Re: getting slurm going

2024-12-08 Thread John Hearns via slurm-users
tail -f on the slurm controller logs. Log into a compute node and tail -f on the slurmd log as slurmd is started. Or start slurmd in the foreground and set the debug flag. On Sun, Dec 8, 2024, 7:37 PM Steven Jones via slurm-users < slurm-users@lists.schedmd.com> wrote: > Hi, > > I did that, > > [r
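A sketch of the foreground invocation:

    # -D: run in the foreground; -vvv: raise log verbosity to debug
    slurmd -D -vvv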

[slurm-users] Re: Non-Standard Mail Notification in Job

2024-12-05 Thread John Hearns via slurm-users
I used to configure Postfix on the head node. All compute nodes are then configured to use the head node as a relay. On Thu, Dec 5, 2024, 1:14 AM Kevin Buckley via slurm-users < slurm-users@lists.schedmd.com> wrote: > On 2024/12/05 05:37, Daniel Miliate via slurm-users wrote: > > > > I'm trying t
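A minimal sketch of that relay setup, with a hypothetical head-node name:

    # /etc/postfix/main.cf on each compute node
    relayhost = [headnode.cluster.local]

    # apply the change
    postfix reload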

[slurm-users] Re: Access denied by pam_slurm_adopt

2024-11-10 Thread John Hearns via slurm-users
Forget what I just said. slurmctld had not been restarted in a month of Sundays and it was logging mismatches in the slurm.conf. A Slurm reconfig and a restart of all slurmd daemons, and the problem looks fixed. On Sun, 10 Nov 2024 at 14:50, John Hearns wrote: > I have a cluster which uses Slurm 23.11.6 > > When
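That is, roughly, with a hypothetical node list:

    scontrol reconfigure
    clush -w node[01-50] systemctl restart slurmd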

[slurm-users] Access denied by pam_slurm_adopt

2024-11-10 Thread John Hearns via slurm-users
I have a cluster which uses Slurm 23.11.6. When I submit a multi-node job and run something like clush -b -w $SLURM_JOB_NODELIST "date", very often the ssh command fails with: Access denied by pam_slurm_adopt: you have no active jobs on this node. This will happen maybe on 50% of the nodes. There is t