[slurm-users] Re: jobs getting stuck in CG

2025-02-10 Thread Christopher Samuel via slurm-users
On 2/10/25 7:05 am, Michał Kadlof via slurm-users wrote: I observed similar symptoms when we had issues with the shared Lustre file system. When the file system couldn't complete an I/O operation, the process in Slurm remained in the CG state until the file system became responsive again. An a

[slurm-users] Re: Fw: Re: RHEL8.10 V slurmctld

2025-02-03 Thread Christopher Samuel via slurm-users
On 2/3/25 2:33 pm, Steven Jones via slurm-users wrote: Just built 4 x rocky9 nodes and I do not get that error (but I get another I know how to fix, I think) so holistically I am thinking the version difference is too large. Oh I think I missed this - when you say version difference do you m

[slurm-users] Re: sinfo not listing any partitions

2024-11-27 Thread Christopher Samuel via slurm-users
On 11/27/24 11:38 am, Kent L. Hanson via slurm-users wrote: I have restarted the slurmctld and slurmd services several times. I hashed the slurm.conf files. They are the same. I ran “sinfo -a” as root with the same result. Are your nodes in the `FUTURE` state perhaps? What does this show? si
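Nodes defined with `State=FUTURE` in slurm.conf are hidden from normal sinfo output. A minimal check and fix, sketched with a hypothetical node name:

```
# List every node and its state, including hidden/FUTURE ones
sinfo -N -a -o "%N %t"

# Once slurmd is running on it, bring a FUTURE node into service
# (node01 is a hypothetical name)
scontrol update NodeName=node01 State=RESUME
```

These are admin commands that need a live cluster, so treat them as a sketch rather than a recipe.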

[slurm-users] Re: Job pre / post submit scripts

2024-10-28 Thread Christopher Samuel via slurm-users
On 10/28/24 10:56 am, Bhaskar Chakraborty via slurm-users wrote: Is there an option in slurm to launch a custom script at the time of job submission through sbatch or salloc? The script should run with submit user permission in submit area. I think you are after the cli_filter functionality w
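For context, the cli_filter hooks run on the submit host, in the submitting user's context, at sbatch/salloc/srun time. A minimal sketch, assuming the Lua plugin was built with your Slurm:

```
# slurm.conf
CliFilterPlugins=lua
```

Slurm then loads a `cli_filter.lua` script (from the same directory as slurm.conf) which can define `slurm_cli_setup_defaults`, `slurm_cli_pre_submit` and `slurm_cli_post_submit` functions to inspect or adjust job options at submission.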

[slurm-users] Re: Randomly draining nodes

2024-10-24 Thread Christopher Samuel via slurm-users
Hi Ole, On 10/22/24 11:04 am, Ole Holm Nielsen via slurm-users wrote: Some time ago it was recommended that UnkillableStepTimeout values above 127 (or 256?) should not be used, see https://support.schedmd.com/show_bug.cgi?id=11103. I don't know if this restriction is still valid with recent
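For reference, the knobs under discussion live in slurm.conf; a sketch, with the value and the program path purely illustrative:

```
# slurm.conf: how long slurmd waits for unkillable processes before
# draining the node (seconds)
UnkillableStepTimeout=180
# Optional: a script run when the timeout fires, useful for capturing
# diagnostics (path is hypothetical)
UnkillableStepProgram=/usr/local/sbin/capture_unkillable_state.sh
```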

[slurm-users] Re: Randomly draining nodes

2024-10-21 Thread Christopher Samuel via slurm-users
On 10/21/24 4:35 am, laddaoui--- via slurm-users wrote: It seems like there's an issue with the termination process on these nodes. Any thoughts on what could be causing this? That usually means processes wedged in the kernel for some reason, in an uninterruptible sleep state. You can define
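Processes wedged in uninterruptible sleep show a `D` in the ps STAT column; a quick way to spot them (standard procps options, nothing Slurm-specific):

```shell
# List any processes stuck in uninterruptible sleep ("D" state);
# these are the ones slurmd cannot kill, which leaves jobs in CG
# and eventually drains the node.
ps -eo pid,stat,wchan:32,comm --no-headers | awk '$2 ~ /^D/'
```

On a healthy node this prints nothing; the `wchan` column shows which kernel function the process is blocked in, which often points at the culprit (e.g. a hung filesystem).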

[slurm-users] Re: REST API - get_user_environment

2024-08-15 Thread Christopher Samuel via slurm-users
On 8/15/24 7:04 am, jpuerto--- via slurm-users wrote: I am referring to the REST API. We have had it installed for a few years and have recently upgraded it so that we can use v0.0.40. But this most recent version is missing the "get_user_environment" field which existed in previous versions.

[slurm-users] Re: Upgrade node while jobs running

2024-08-02 Thread Christopher Samuel via slurm-users
G'day Sid, On 7/31/24 5:02 pm, Sid Young via slurm-users wrote: I've been waiting for node to become idle before upgrading them however some jobs take a long time. If I try to remove all the packages I assume that kills the slurmstep program and with it the job. Are you looking to do a Slurm

[slurm-users] Re: Can Not Use A Single GPU for Multiple Jobs

2024-06-21 Thread Christopher Samuel via slurm-users
On 6/21/24 3:50 am, Arnuld via slurm-users wrote: I have 3500+ GPU cores available. You mean each GPU job requires at least one CPU? Can't we run a job with just GPU without any CPUs? No, Slurm has to launch the batch script on compute node cores and it then has the job of launching the users
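Beyond the CPU point above: if the goal is several jobs per GPU, newer Slurm (22.05+) offers the gres/shard mechanism. A sketch, where the node name, CPU count and shard count are assumptions for illustration:

```
# slurm.conf
GresTypes=gpu,shard
NodeName=gpu01 Gres=gpu:1,shard:8 CPUs=64

# gres.conf on the node
Name=gpu File=/dev/nvidia0
Name=shard Count=8
```

Jobs then request `--gres=shard:1` and up to eight of them can share the one device; note each job still needs at least one CPU core for its batch step, as described above.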

[slurm-users] Re: Unsupported RPC version by slurmctld 19.05.3 from client slurmd 22.05.11

2024-06-17 Thread Christopher Samuel via slurm-users
On 6/17/24 7:24 am, Bjørn-Helge Mevik via slurm-users wrote: Also, server must be newer than client. This is the major issue for the OP - the version rule is: slurmdbd >= slurmctld >= slurmd and clients and no more than the permitted skew in versions. Plus, of course, you have to deal with
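The ordering rule can be sanity-checked with plain `sort -V`; a sketch using the versions from this thread (`ver_le` is a hypothetical helper, not a Slurm tool):

```shell
# ver_le A B: succeeds if version A <= version B (GNU sort -V ordering)
ver_le() { [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" = "$1" ]; }

slurmdbd_v=19.05.3 slurmctld_v=19.05.3 slurmd_v=22.05.11
if ver_le "$slurmd_v" "$slurmctld_v" && ver_le "$slurmctld_v" "$slurmdbd_v"; then
  echo "version order OK"
else
  echo "version order violated"   # a 22.05 slurmd cannot register with a 19.05 slurmctld
fi
```

Here it prints "version order violated", which is exactly the OP's problem: the daemons must satisfy slurmdbd >= slurmctld >= slurmd.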

[slurm-users] Re: Building Slurm debian package vs building from source

2024-05-23 Thread Christopher Samuel via slurm-users
On 5/22/24 3:33 pm, Brian Andrus via slurm-users wrote: A simple example is when you have nodes with and without GPUs. You can build slurmd packages without for those nodes and with for the ones that have them. FWIW we have both GPU and non-GPU nodes but we use the same RPMs we build on both

[slurm-users] Re: Location of Slurm source packages?

2024-05-15 Thread Christopher Samuel via slurm-users
Hi Jeff! On 5/15/24 10:35 am, Jeffrey Layton via slurm-users wrote: I have an Ubuntu 22.04 server where I installed Slurm from the Ubuntu packages. I now want to install pyxis but it says I need the Slurm sources. In Ubuntu 22.04, is there a package that has the source code? How to download t
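On Ubuntu the binary packages come from the `slurm-wlm` source package; a sketch, assuming the `deb-src` lines in /etc/apt/sources.list have been uncommented first:

```
sudo apt-get update
apt-get source slurm-wlm          # downloads and unpacks the matching Slurm sources
sudo apt-get build-dep slurm-wlm  # optional: pull in the build dependencies
```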

[slurm-users] Re: FreeBSD/aarch64: ld: error: unknown emulation: elf_aarch64

2024-05-06 Thread Christopher Samuel via slurm-users
On 5/6/24 3:19 pm, Nuno Teixeira via slurm-users wrote: Fixed with: [...] Thanks and sorry for the noise as I really missed this detail :) So glad it helped! Best of luck with this work. -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA -- slurm-users mailing list -- slu

[slurm-users] Re: FreeBSD/aarch64: ld: error: unknown emulation: elf_aarch64

2024-05-06 Thread Christopher Samuel via slurm-users
On 5/6/24 6:38 am, Nuno Teixeira via slurm-users wrote: Any clues about "elf_aarch64" and "aarch64elf" mismatch? As I mentioned I think this is coming from the FreeBSD patching that's being done to the upstream Slurm sources, specifically it looks like elf_aarch64 is being injected here: /

[slurm-users] Re: FreeBSD/aarch64: ld: error: unknown emulation: elf_aarch64

2024-05-04 Thread Christopher Samuel via slurm-users
On 5/4/24 4:24 am, Nuno Teixeira via slurm-users wrote: Any clues? > ld: error: unknown emulation: elf_aarch64 All I can think is that your ld doesn't like elf_aarch64; from the log you're posting it looks like that's being injected by the FreeBSD ports system. Looking at the man page for ld on

[slurm-users] Re: Jobs of a user are stuck in Completing stage for a long time and cannot cancel them

2024-04-10 Thread Christopher Samuel via slurm-users
On 4/10/24 10:41 pm, archisman.pathak--- via slurm-users wrote: In our case, that node has been removed from the cluster and cannot be added back right now ( is being used for some other work ). What can we do in such a case? Mark the node as "DOWN" in Slurm, this is what we do when we get job
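The admin command for that, sketched with a hypothetical node name:

```
# Marking the vanished node DOWN lets slurmctld clear its CG jobs
scontrol update NodeName=node042 State=DOWN Reason="removed from cluster, clearing CG jobs"
```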

[slurm-users] Re: Is SWAP memory mandatory for SLURM

2024-03-04 Thread Christopher Samuel via slurm-users
On 3/3/24 23:04, John Joseph via slurm-users wrote: Is SWAP a mandatory requirement All our compute nodes are diskless, so no swap on them. -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA -- slurm-users mailing list -- slurm-users@lists.schedmd.com To unsubscribe send an e

[slurm-users] Re: slurm-23.11.3-1 with X11 and zram causing permission errors: error: _forkexec_slurmstepd: slurmstepd failed to send return code got 0: Resource temporarily unavailable; Requeue of Jo

2024-02-23 Thread Christopher Samuel via slurm-users
Hi Robert, On 2/23/24 17:38, Robert Kudyba via slurm-users wrote: We switched over from using systemctl for tmp.mount and change to zram, e.g., modprobe zram echo 20GB > /sys/block/zram0/disksize mkfs.xfs /dev/zram0 mount -o discard /dev/zram0 /tmp [...] > [2024-02-23T20:26:15.881] [530.exter
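For readability, the zram-backed /tmp recipe quoted above, as root, plus one extra line: a common cause of permission errors with a freshly created /tmp is that the new filesystem's root directory defaults to 0755 rather than the sticky world-writable mode, so the final `chmod` here is an assumption about the fix, not part of the original post:

```
modprobe zram
echo 20GB > /sys/block/zram0/disksize
mkfs.xfs /dev/zram0
mount -o discard /dev/zram0 /tmp
chmod 1777 /tmp   # fresh XFS root dir is 0755; job steps need 1777 on /tmp
```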