On 2/10/25 7:05 am, Michał Kadlof via slurm-users wrote:
I observed similar symptoms when we had issues with the shared Lustre
file system. When the file system couldn't complete an I/O operation,
the process in Slurm remained in the CG state until the file system
became responsive again.
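A quick way to spot this is to look for processes in uninterruptible sleep (state D); the wchan column will often point at the filesystem code they are stuck in. A sketch:

ps -eo pid,state,wchan:32,cmd | awk '$2 == "D"'   # D = uninterruptible sleep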
On 2/3/25 2:33 pm, Steven Jones via slurm-users wrote:
Just built 4 x rocky9 nodes and I do not get that error (but I get
another I know how to fix, I think) so holistically I am thinking the
version difference is too large.
Oh I think I missed this - when you say version difference, do you mean between the Slurm versions on your controller and these new nodes?
On 11/27/24 11:38 am, Kent L. Hanson via slurm-users wrote:
I have restarted the slurmctld and slurmd services several times. I
hashed the slurm.conf files. They are the same. I ran “sinfo -a” as root
with the same result.
Are your nodes in the `FUTURE` state perhaps? What does this show?
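One way to check for that (the node name here is just a placeholder):

scontrol show node node01 | grep -i state   # node01 is hypothetical
sinfo --Node --long                         # per-node state listing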
On 10/28/24 10:56 am, Bhaskar Chakraborty via slurm-users wrote:
Is there an option in slurm to launch a custom script at the time of job
submission through sbatch
or salloc? The script should run with the submitting user's permissions, in the submission directory.
I think you are after the cli_filter functionality, which lets you run site-defined code when jobs are submitted.
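A minimal sketch of enabling it, assuming the Lua variant; the script lives as cli_filter.lua alongside slurm.conf and runs in the CLI process, i.e. with the submitting user's privileges:

# slurm.conf
CliFilterPlugins=lua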
Hi Ole,
On 10/22/24 11:04 am, Ole Holm Nielsen via slurm-users wrote:
Some time ago it was recommended that UnkillableStepTimeout values above
127 (or 256?) should not be used, see
https://support.schedmd.com/show_bug.cgi?id=11103. I don't know if this
restriction is still valid with recent Slurm releases.
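For reference it is a plain slurm.conf setting, e.g. keeping to the value discussed above:

# slurm.conf
UnkillableStepTimeout=127   # seconds before a step is declared unkillable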
On 10/21/24 4:35 am, laddaoui--- via slurm-users wrote:
It seems like there's an issue with the termination process on these nodes. Any
thoughts on what could be causing this?
That usually means processes wedged in the kernel for some reason, in an
uninterruptible sleep state. You can define an UnkillableStepProgram to
capture what is going on when Slurm gives up trying to kill a step.
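A minimal sketch of wiring that up (the script path is hypothetical):

# slurm.conf
UnkillableStepProgram=/usr/local/sbin/unkillable.sh

#!/bin/bash
# /usr/local/sbin/unkillable.sh - log any D-state processes to syslog
ps -eo pid,state,wchan:32,comm | awk '$2 == "D"' | logger -t unkillable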
On 8/15/24 7:04 am, jpuerto--- via slurm-users wrote:
I am referring to the REST API. We have had it installed for a few years and have
recently upgraded it so that we can use v0.0.40. But this most recent version is missing
the "get_user_environment" field which existed in previous versions.
G'day Sid,
On 7/31/24 5:02 pm, Sid Young via slurm-users wrote:
I've been waiting for nodes to become idle before upgrading them, however
some jobs take a long time. If I try to remove all the packages I assume
that kills the slurmstepd program and with it the job.
Are you looking to do a Slurm upgrade while jobs are still running?
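If the aim is to stop new work landing on the nodes while running jobs finish, draining them first is the usual approach (node names hypothetical):

scontrol update NodeName=node[01-04] State=DRAIN Reason="slurm upgrade"
sinfo -R   # list drained/down nodes with their reasons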
On 6/21/24 3:50 am, Arnuld via slurm-users wrote:
I have 3500+ GPU cores available. You mean each GPU job requires at
least one CPU? Can't we run a job with just a GPU and no CPUs?
No, Slurm has to launch the batch script on compute node cores, and it
then has the job of launching the user's processes.
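So every GPU job carries at least one CPU for its batch step; a minimal sketch of a submission script (the application name is hypothetical):

#!/bin/bash
#SBATCH --gres=gpu:1        # one GPU
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1   # the minimum - the batch script itself needs a core
srun ./my_gpu_app           # hypothetical GPU application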
On 6/17/24 7:24 am, Bjørn-Helge Mevik via slurm-users wrote:
Also, server must be newer than client.
This is the major issue for the OP - the version rule is:
slurmdbd >= slurmctld >= slurmd and clients
and no more than the permitted skew in versions.
Plus, of course, you have to deal with
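Each daemon and the client tools report their version, which makes checking that ordering straightforward:

slurmdbd -V
slurmctld -V
slurmd -V
sinfo -V    # client side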
On 5/22/24 3:33 pm, Brian Andrus via slurm-users wrote:
A simple example is when you have nodes with and without GPUs.
You can build slurmd packages without GPU support for those nodes and
with it for the ones that have GPUs.
FWIW we have both GPU and non-GPU nodes, but we use the same RPMs we
build on both.
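As a sketch, a single release tarball can be built both ways, assuming a slurm.spec recent enough to expose the nvml bcond (tarball name hypothetical):

rpmbuild -ta slurm-24.05.4.tar.bz2 --with nvml   # GPU-enabled build
rpmbuild -ta slurm-24.05.4.tar.bz2               # plain build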
Hi Jeff!
On 5/15/24 10:35 am, Jeffrey Layton via slurm-users wrote:
I have an Ubuntu 22.04 server where I installed Slurm from the Ubuntu
packages. I now want to install pyxis but it says I need the Slurm
sources. In Ubuntu 22.04, is there a package that has the source code?
How do I download the sources?
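Assuming deb-src entries are enabled in /etc/apt/sources.list, the packaged sources can be fetched with:

sudo apt-get update
apt-get source slurm-wlm           # download and unpack the source package
sudo apt-get build-dep slurm-wlm   # optional: pull in its build dependencies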
On 5/6/24 3:19 pm, Nuno Teixeira via slurm-users wrote:
Fixed with:
[...]
Thanks and sorry for the noise as I really missed this detail :)
So glad it helped! Best of luck with this work.
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
On 5/6/24 6:38 am, Nuno Teixeira via slurm-users wrote:
Any clues about "elf_aarch64" and "aarch64elf" mismatch?
As I mentioned, I think this is coming from the FreeBSD patching that's
being done to the upstream Slurm sources; specifically, it looks like
elf_aarch64 is being injected here:
/
On 5/4/24 4:24 am, Nuno Teixeira via slurm-users wrote:
Any clues?
> ld: error: unknown emulation: elf_aarch64
All I can think is that your ld doesn't like elf_aarch64; from the log
you're posting it looks like that's being injected by the FreeBSD ports
system. Looking at the man page for ld on
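GNU ld will list the emulations it was built with, which is a quick way to see whether elf_aarch64 is even known to the linker being used:

ld -V   # prints the version plus a "Supported emulations:" list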
On 4/10/24 10:41 pm, archisman.pathak--- via slurm-users wrote:
In our case, that node has been removed from the cluster and cannot be
added back right now (it is being used for some other work). What can we
do in such a case?
Mark the node as "DOWN" in Slurm; this is what we do when we get jobs wedged on a node.
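A sketch of that (node name hypothetical):

scontrol update NodeName=node42 State=DOWN Reason="removed from cluster"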
On 3/3/24 23:04, John Joseph via slurm-users wrote:
Is SWAP a mandatory requirement?
All our compute nodes are diskless, so no swap on them.
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
Hi Robert,
On 2/23/24 17:38, Robert Kudyba via slurm-users wrote:
We switched over from using systemctl for tmp.mount and changed to zram,
e.g.:
modprobe zram                           # load the zram module
echo 20GB > /sys/block/zram0/disksize   # size the compressed RAM disk
mkfs.xfs /dev/zram0                     # make a filesystem on it
mount -o discard /dev/zram0 /tmp        # mount it as /tmp
[...]
> [2024-02-23T20:26:15.881] [530.exter
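For what it's worth, a setup like that can be sanity-checked afterwards with standard tooling:

zramctl        # shows the device, compression algorithm and sizes
df -h /tmp     # confirm the mount and its capacity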