[slurm-users] Nodes stuck in drain state

2023-05-25 Thread Roger Mason
Hello, slurm 20.02.7 on FreeBSD. I have a couple of nodes stuck in the drain state. I have tried
scontrol update nodename=node012 state=down reason="stuck in drain state"
scontrol update nodename=node012 state=resume
without success. I then tried /usr/local/sbin/slurmctld -c scontrol update
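For reference, a typical cycle for clearing a drained node looks roughly like the following (a minimal sketch; node012 comes from this thread and the steps assume the underlying cause has been fixed first):

```
# Show why the controller drained the node
sinfo -R --nodes=node012

# Return the node to service once the underlying cause is fixed
scontrol update nodename=node012 state=resume
```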

Re: [slurm-users] Nodes stuck in drain state

2023-05-25 Thread Ole Holm Nielsen
On 5/25/23 13:59, Roger Mason wrote: slurm 20.02.7 on FreeBSD. Uh, that's old! I have a couple of nodes stuck in the drain state. I have tried scontrol update nodename=node012 state=down reason="stuck in drain state" scontrol update nodename=node012 state=resume without success. I then tr

Re: [slurm-users] Nodes stuck in drain state

2023-05-25 Thread Doug Meyer
Could also review the node log in /var/log/slurm/. Often sinfo -lR will tell you the cause, for example mem not matching the config. Doug On Thu, May 25, 2023 at 5:32 AM Ole Holm Nielsen wrote: > On 5/25/23 13:59, Roger Mason wrote: > > slurm 20.02.7 on FreeBSD. > > Uh, that's old! > > > I hav
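Doug's suggestion amounts to something like this (a sketch; the log file name and path are site-specific and depend on SlurmdLogFile in slurm.conf):

```
# List down/drained nodes together with the recorded reason
sinfo -lR

# Check the node's slurmd log for details (path is site-specific)
tail -50 /var/log/slurm/slurmd.log
```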

Re: [slurm-users] Nodes stuck in drain state

2023-05-25 Thread Roger Mason
Ole Holm Nielsen writes: > On 5/25/23 13:59, Roger Mason wrote: >> slurm 20.02.7 on FreeBSD. > > Uh, that's old! Yes. It is what is available in ports. > What's the output of "scontrol show node node012"? NodeName=node012 CoresPerSocket=2 CPUAlloc=0 CPUTot=4 CPULoad=N/A AvailableFeat

Re: [slurm-users] Nodes stuck in drain state

2023-05-25 Thread Roger Mason
Hello, Doug Meyer writes: > Could also review the node log in /var/log/slurm/. Often sinfo -lR will tell > you the cause, for example mem not matching the config. > REASON USER TIMESTAMP STATE NODELIST Low RealMemory slurm(468) 2023-05-25T09:26:59 drai

Re: [slurm-users] Nodes stuck in drain state

2023-05-25 Thread Davide DelVento
Can you ssh into the node and check the actual availability of memory? Maybe there is a zombie process (or a healthy one with a memory leak bug) that's hogging all the memory? On Thu, May 25, 2023 at 7:31 AM Roger Mason wrote: > Hello, > > Doug Meyer writes: > > > Could also review the node log

Re: [slurm-users] Nodes stuck in drain state

2023-05-25 Thread Ole Holm Nielsen
On 5/25/23 15:23, Roger Mason wrote: NodeName=node012 CoresPerSocket=2 CPUAlloc=0 CPUTot=4 CPULoad=N/A AvailableFeatures=(null) ActiveFeatures=(null) Gres=(null) NodeAddr=node012 NodeHostName=node012 RealMemory=10193 AllocMem=0 FreeMem=N/A Sockets=2 Boards=1 State=UNKN

Re: [slurm-users] Nodes stuck in drain state

2023-05-25 Thread Roger Mason
Hello, Davide DelVento writes: > Can you ssh into the node and check the actual availability of memory? > Maybe there is a zombie process (or a healthy one with a memory leak > bug) that's hogging all the memory? This is what top shows: last pid: 45688; load averages: 0.00, 0.00, 0.00

Re: [slurm-users] Nodes stuck in drain state

2023-05-25 Thread Roger Mason
Ole Holm Nielsen writes: > 1. Is slurmd running on the node? Yes. > 2. What's the output of "slurmd -C" on the node? NodeName=node012 CPUs=4 Boards=1 SocketsPerBoard=2 CoresPerSocket=2 ThreadsPerCore=1 RealMemory=6097 > 3. Define State=UP in slurm.conf instead of UNKNOWN Will do. > 4. Why h

Re: [slurm-users] Nodes stuck in drain state

2023-05-25 Thread Brian Andrus
That output of slurmd -C is your answer. Slurmd only sees 6GB of memory and you are claiming it has 10GB. I would run some memtests, look at meminfo on the node, etc. Maybe even check that the type/size of memory in there is what you think it is. Brian Andrus On 5/25/2023 7:30 AM, Roger Mas
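Brian's point can be checked directly by comparing what slurmd detects against the static node definition (a sketch; the slurm.conf path varies by install, and the memory values are the ones reported earlier in this thread):

```
# On the compute node: the hardware slurmd actually detects
slurmd -C
# ... RealMemory=6097 ...

# What the controller has been told (config path is site-specific)
grep -i node012 /usr/local/etc/slurm.conf
# NodeName=node012 ... RealMemory=10193 ...
```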

Re: [slurm-users] Nodes stuck in drain state

2023-05-25 Thread Groner, Rob
A quick test to see if it's a configuration error is to set config_overrides in your slurm.conf and see if the node then responds to scontrol update. From: slurm-users on behalf of Brian Andrus Sent: Thursday, May 25, 2023 10:54 AM To: slurm-users@lists.schedmd
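Rob's test corresponds to a slurm.conf setting along these lines (a sketch; in releases around 20.02 this option lives under SlurmdParameters, while older versions used FastSchedule=2 for a similar effect):

```
# slurm.conf: trust the configured node definitions rather than draining
# nodes whose detected resources are smaller than what is configured
SlurmdParameters=config_overrides
```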

Re: [slurm-users] hi-priority partition and preemption

2023-05-25 Thread Reed Dier
After trying to approach this with preempt/partition_prio, we ended up moving to QOS based preemption due to some issues with suspend/requeue, and also wanting to use QOS for quicker/easier tweaks than changing partitions as a whole. > PreemptType=preempt/qos > PreemptMode=SUSPEND,GANG > Partit
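A QOS-based preemption setup along the lines Reed describes might look roughly like this (a sketch, not the poster's exact configuration; the QOS names are illustrative):

```
# slurm.conf
PreemptType=preempt/qos
PreemptMode=SUSPEND,GANG

# Create a high-priority QOS allowed to suspend jobs running under "normal"
sacctmgr add qos hipri Priority=1000 Preempt=normal PreemptMode=suspend
```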

[slurm-users] seff in slurm-23.02

2023-05-25 Thread David Gauchard
Hello, slurm-23.02 on ubuntu-20.04, seff is not working anymore: ``` # ./seff 4911385 Use of uninitialized value $FindBin::Bin in concatenation (.) or string at ./seff line 11. Name "FindBin::Bin" used only once: possible typo at ./seff line 11, line 602. perl: error: slurm_persist_conn_open:

Re: [slurm-users] [EXTERNAL] seff in slurm-23.02

2023-05-25 Thread Mike Robbert
How did you install seff? I don’t know exactly where this happens, but it looks like line 11 in the source file for seff is supposed to get transformed to include an actual path. I am running on CentOS and install Slurm by building the RPMs using the included spec files and here is a diff of th

Re: [slurm-users] seff in slurm-23.02

2023-05-25 Thread David Gauchard
Well, sorry, I indeed ran the raw script for this mail. Running the one installed by `make install`, which sets the line 11 path correctly: use lib qw(/usr/local/slurm-23.02.2/lib/x86_64-linux-gnu/perl/5.30.0); I get: perl: error: slurm_persist_conn_open: Something happened with t

Re: [slurm-users] Nodes stuck in drain state

2023-05-25 Thread Roger Mason
Hello, "Groner, Rob" writes: > A quick test to see if it's a configuration error is to set > config_overrides in your slurm.conf and see if the node then responds > to scontrol update. Thanks to all who helped. It turned out that memory was the issue. I have now reseated the RAM in the offend

[slurm-users] sbatch mem-per-gpu and gres interaction

2023-05-25 Thread christof . koehler
Hello everybody, I am observing an interaction between the --mem-per-gpu, --cpus-per-gpu and --gres settings in sbatch which I do not understand. Basically, if the job is submitted with --gres=gpu:2 the --mem-per-gpu and --cpus-per-gpu settings appear to be observed. If the job is submitted wit
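The submission form in which the per-GPU options appear to be honoured would look roughly like this (a sketch; the job script name and resource amounts are illustrative, not the poster's actual values):

```
# Per-GPU memory and CPU requests alongside a gres GPU request
sbatch --gres=gpu:2 --mem-per-gpu=8G --cpus-per-gpu=4 job.sh
```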

[slurm-users] Temporary Stop User Submission

2023-05-25 Thread Markuske, William
Hello, I have a badly behaving user that I need to speak with and want to temporarily disable their ability to submit jobs. I know I can change their account settings to stop them. Is there another way to set a block on a specific username that I can lift later without removing the user/account

Re: [slurm-users] Temporary Stop User Submission

2023-05-25 Thread Doug Meyer
I always like Sacctmgr update user where user= set grpcpus=0 On Thu, May 25, 2023, 4:19 PM Markuske, William wrote: > Hello, > > I have a badly behaving user that I need to speak with and want to > temporarily disable their ability to submit jobs. I know I can change their > account settings to

Re: [slurm-users] Temporary Stop User Submission

2023-05-25 Thread Sean Crosby
Hi Willy, sacctmgr modify account slurmaccount user=baduser set maxjobs=0 Sean From: slurm-users on behalf of Markuske, William Sent: Friday, 26 May 2023 09:16 To: slurm-users@lists.schedmd.com Subject: [EXT] [slurm-users] Temporary Stop User Submission Ext
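Sean's command, together with the corresponding way to lift the block later, would look roughly like this (a sketch; "slurmaccount" and "baduser" are the placeholders used in the message):

```
# Block new job submissions for one user under a given account
sacctmgr modify account slurmaccount user=baduser set maxjobs=0

# Later, clear the limit again (-1 removes the limit)
sacctmgr modify account slurmaccount user=baduser set maxjobs=-1
```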

Re: [slurm-users] Temporary Stop User Submission

2023-05-25 Thread Ryan Novosielski
I tend not to let them login. It will get their attention, and prevent them from just running their work on the login node when they discover they can’t submit. But appreciate seeing the other options. Sent from my iPhone > On May 25, 2023, at 19:19, Markuske, William wrote: > >  Hello, > >

Re: [slurm-users] Temporary Stop User Submission

2023-05-25 Thread Markuske, William
Sean, I was just about to mention this wasn't working because I had thought of something similar. I tried 'sacctmgr modify user where name= set maxjobs=0' but that was still allowing the user to 'srun --pty bash'. Apparently doing it through modifying the account as you stated does work though

Re: [slurm-users] Temporary Stop User Submission

2023-05-25 Thread Markuske, William
I would but unfortunately they were creating 100s of TBs of data and I need them to log in and delete it but I don't want them creating more in the meantime. Regards, -- Willy Markuske HPC Systems Engineer SDSC - Research Data Services (619) 519-4435 wmarku...@sdsc.edu On May 25, 2023, at 16:

Re: [slurm-users] seff in slurm-23.02

2023-05-25 Thread Angel de Vicente
Hello, David Gauchard writes: > slurm-23.02 on ubuntu-20.04, > > seff is not working anymore: perhaps it is something specific to 20.04? I'm on Ubuntu 22.04 and slurm-23.02.1 here and no problems with seff, except that the memory efficiency part seems broken (I always seem to get 0.00% efficien