How about using cpusets?
Create a boot cpuset with the E-cores and start Slurm on the P-cores.
Yeah, showing my age by talking about cpusets.
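Roughly what that might look like with a cgroup v1 cpuset mount (just a
sketch; CPU IDs 8-15 stand in for the E-cores, check lscpu on your box):

    # create a "boot" cpuset holding only the E-cores
    mkdir -p /sys/fs/cgroup/cpuset/boot
    echo 8-15 > /sys/fs/cgroup/cpuset/boot/cpuset.cpus
    echo 0 > /sys/fs/cgroup/cpuset/boot/cpuset.mems
    # then move each system/boot PID into it, leaving the P-cores for Slurm
    echo "$pid" > /sys/fs/cgroup/cpuset/boot/tasks

systemd's CPUAffinity= is the more modern way to pin the OS to a subset of
cores, but the idea is the same.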
On Wed, Feb 19, 2025, 6:05 PM Timo Rothenpieler via slurm-users <
slurm-users@lists.schedmd.com> wrote:
> On 19.02.2025 14:06, Luke Sudbery via slurm-users w
I am running single node tests on a cluster.
I can select named nodes using the -w flag with sbatch.
However, if I want to submit perhaps 20 test jobs, is there any smart way
to run only one job per node?
I know I could touch a file with the hostname and test for that file.
I am just wondering i
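One straightforward approach (a sketch; test.sh stands in for the actual
test script) is to loop over the node names and pin each job with -w:

    for node in $(sinfo -N -h -o "%N" | sort -u); do
        sbatch -N1 -w "$node" test.sh
    done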
I am working on power logging for a GPU cluster.
I am running jobs on multiple hosts.
I want to create a file, one for each host, which has a unique filename
containing the host name.
Something like
clush -w $SLURM_JOB_NODELIST "touch file$(hostname)"
My foo is weak today. Help
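For what it's worth, the usual gotcha is that $(hostname) inside double
quotes expands on the submit host; single quotes defer it to each remote
host. A sketch of both that and an srun alternative:

    # single quotes so $(hostname) runs on each target host, not locally
    clush -w "$SLURM_JOB_NODELIST" 'touch "file$(hostname)"'

    # or let Slurm place one task per node instead of using clush
    srun --ntasks-per-node=1 bash -c 'touch "file$(hostname)"'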
ps -eaf --forest is your friend with Slurm
On Mon, Feb 10, 2025, 12:08 PM Michał Kadlof via slurm-users <
slurm-users@lists.schedmd.com> wrote:
> I observed similar symptoms when we had issues with the shared Lustre file
> system. When the file system couldn't complete an I/O operation, the
> pro
Belay that reply. Different issue.
In that case salloc works OK but srun says the user has no job on the node
On Mon, Feb 10, 2025, 9:24 AM John Hearns wrote:
> I have had something similar.
> The fix was to run a
> scontrol reconfig
> Which causes a reread of the Slurmd config
> Give that a try
>
>
I have had something similar.
The fix was to run
scontrol reconfig
which causes the daemons to re-read slurm.conf.
Give that a try.
The full subcommand is scontrol reconfigure; check the man page.
On Mon, Feb 10, 2025, 8:32 AM Ricardo Román-Brenes via slurm-users <
slurm-users@lists.schedmd.com> wrote:
> Hello everyone.
Steven, one tip if you are just starting with Slurm: "Use the logs, Luke,
use the logs."
By this I mean run tail -f on the slurmctld log (e.g. /var/log/slurmctld.log)
and restart the slurmctld service.
On a compute node, tail -f the slurmd log (e.g. /var/log/slurmd.log).
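In command form, assuming the common /var/log locations (yours are set by
SlurmctldLogFile and SlurmdLogFile in slurm.conf):

    # on the controller
    tail -f /var/log/slurmctld.log &
    systemctl restart slurmctld

    # on a compute node
    tail -f /var/log/slurmd.log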
Oh, and you probably are going to set up Munge also - which is easy.
On Tue, 4 Feb 2025
Have you run id on a compute node?
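i.e. something like this, with 'steven' as a placeholder account name:

    id steven               # does the node resolve the uid/gid at all?
    getent passwd steven    # is sssd/IPA actually answering on that node?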
On Wed, Jan 29, 2025, 6:47 PM Steven Jones via slurm-users <
slurm-users@lists.schedmd.com> wrote:
> I am using Redhat's IdM/IPA for users
>
> Slurmctld is failing to run jobs and it is getting "invalid user id".
>
> "2025-01-28T21:48:50.271] sched: Allocate J
To debug shell scripts, try running with the -x flag?
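For example (myscript.sh is just a stand-in):

    bash -x myscript.sh     # trace every command as it executes
    # or, inside the script / batch script:
    set -x                  # tracing on
    set +x                  # tracing off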
On Thu, Jan 9, 2025, 10:51 AM Gérard Henry (AMU) via slurm-users <
slurm-users@lists.schedmd.com> wrote:
> Hello all and happy new year,
>
> i have installed slurm 21.08 on ubuntu 22 LTS, and database for
> accounting on a remote machine run
Generally, the troubleshooting steps which you should take for Slurm are
(commands sketched below):
squeue to look at the list of running/queued or held jobs
sinfo to show which nodes are idle, busy or down
scontrol show node to get more detailed information on a node
For problem nodes - indeed just log into any node t
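Roughly, in command form (nodename is a placeholder):

    squeue                        # running/pending/held jobs and their reasons
    sinfo                         # which nodes are idle, allocated or down
    scontrol show node nodename   # full detail for one node, including Reason=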
You need to find the node which the job started on.
Then look at the slurmd log on that node. You may find an indication of the
reason for the failure.
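One way to find that node after the fact (a sketch; 12345 stands in for
the job ID, and the slurmd log path may differ on your system):

    sacct -j 12345 --format=JobID,NodeList,State,ExitCode
    # then, on the node it ran on:
    tail -n 200 /var/log/slurmd.log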
On Tue, 7 Jan 2025 at 11:30, sportlecon sportlecon via slurm-users <
slurm-users@lists.schedmd.com> wrote:
> slurm 24.11 - squeue displays reaso
Davide, the 'nodeset' command can be used here
nodeset -e -S '\n' node[03-04,12-22,27-32,36]
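nodeset ships with ClusterShell (the same package as clush), and the
reverse direction works too, e.g.:

    # fold a flat host list back into the compact bracketed form
    nodeset -f node03 node04 node12 node13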
On Mon, 6 Jan 2025 at 19:58, Davide DelVento via slurm-users <
slurm-users@lists.schedmd.com> wrote:
> Hi all,
> I remember seeing on this list a slurm command to change a slurm-friendly
> list suc
Please send the output of sinfo and squeue.
Also look at the slurmd log on an example node.
tail -f is your friend.
On Sat, Jan 4, 2025, 8:13 AM sportlecon sportlecon via slurm-users <
slurm-users@lists.schedmd.com> wrote:
> JOBID PARTITION NAME USER ST TIME NODES
> NODELIST(REASON)
>
I think this was discussed here recently.
On Mon, Dec 23, 2024, 12:18 PM Laura Zharmukhametova via slurm-users <
slurm-users@lists.schedmd.com> wrote:
> Hello,
>
> Is there an existing Slurm plugin for FPGA allocation? If not, can someone
> please point me in the right direction for how to approa
Is your slurm.conf identical on all nodes?
On Sun, Dec 8, 2024, 7:42 PM John Hearns wrote:
> tail -f on the Slurm controller log.
>
> Log into a compute node and tail -f the slurmd log as slurmd
> starts.
> Or start slurmd in the foreground and set the debug flag.
>
> On Sun, Dec 8, 2024,
tail -f on the Slurm controller log.
Log into a compute node and tail -f the slurmd log as slurmd
starts.
Or start slurmd in the foreground and set the debug flag.
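For the foreground option, something along these lines (flags are from the
slurmd man page):

    # run slurmd in the foreground with extra verbosity
    slurmd -D -vvv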
On Sun, Dec 8, 2024, 7:37 PM Steven Jones via slurm-users <
slurm-users@lists.schedmd.com> wrote:
> Hi,
>
> I did that,
>
> [r
I used to configure Postfix on the head node.
All compute nodes are then configured to use the head node as a relay.
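The compute-node side can be as small as this (a sketch; 'headnode' is a
placeholder for your head node's hostname):

    # point Postfix on each compute node at the head node relay
    postconf -e 'relayhost = [headnode]'
    systemctl restart postfix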
On Thu, Dec 5, 2024, 1:14 AM Kevin Buckley via slurm-users <
slurm-users@lists.schedmd.com> wrote:
> On 2024/12/05 05:37, Daniel Miliate via slurm-users wrote:
> >
> > I'm trying t
Forget what I just said. slurmctld had not been restarted in a month of
Sundays and it was logging a mismatch in the slurm.conf.
A Slurm reconfig and a restart of all slurmd daemons, and the problem looks
fixed.
On Sun, 10 Nov 2024 at 14:50, John Hearns wrote:
> I have cluster which uses Slurm 23.11.6
>
> When
I have a cluster which uses Slurm 23.11.6
When I submit a multi-node job and run something like
clush -b -w $SLURM_JOB_NODELIST "date"
very often the ssh command fails with:
Access denied by pam_slurm_adopt: you have no active jobs on this node
This will happen maybe on 50% of the nodes
There is t