Hello,
slurm 20.02.7 on FreeBSD.
I have a couple of nodes stuck in the drain state. I have tried
scontrol update nodename=node012 state=down reason="stuck in drain state"
scontrol update nodename=node012 state=resume
without success.
I then tried
/usr/local/sbin/slurmctld -c
scontrol update
On 5/25/23 13:59, Roger Mason wrote:
> slurm 20.02.7 on FreeBSD.
Uh, that's old!
> I have a couple of nodes stuck in the drain state. I have tried
> scontrol update nodename=node012 state=down reason="stuck in drain state"
> scontrol update nodename=node012 state=resume
> without success.
> I then tr
Could also review the node log in /var/log/slurm/ . Often sinfo -lR will
tell you the cause, for example memory not matching the config.
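For example, something along these lines (standard sinfo options; the slurmd log path is the usual default and may differ on your install):
```
# list drain/down reasons for all nodes
sinfo -lR
# or just the affected node
sinfo -R --nodes=node012
# then check slurmd's own log on that node
tail /var/log/slurm/slurmd.log
```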
Doug
On Thu, May 25, 2023 at 5:32 AM Ole Holm Nielsen wrote:
> On 5/25/23 13:59, Roger Mason wrote:
> > slurm 20.02.7 on FreeBSD.
>
> Uh, that's old!
>
> > I hav
Ole Holm Nielsen writes:
> On 5/25/23 13:59, Roger Mason wrote:
>> slurm 20.02.7 on FreeBSD.
>
> Uh, that's old!
Yes. It is what is available in ports.
> What's the output of "scontrol show node node012"?
NodeName=node012 CoresPerSocket=2
CPUAlloc=0 CPUTot=4 CPULoad=N/A
AvailableFeat
Hello,
Doug Meyer writes:
> Could also review the node log in /var/log/slurm/ . Often sinfo -lR will tell
> you the cause, for example memory not matching the config.
>
REASON USER TIMESTAMP STATE NODELIST
Low RealMemory slurm(468) 2023-05-25T09:26:59 drai
Can you ssh into the node and check the actual availability of memory?
Maybe there is a zombie process (or a healthy one with a memory leak bug)
that's hogging all the memory?
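For example, something like the following (node012 is the node from earlier in the thread; since it runs FreeBSD, the Linux commands are only shown for comparison):
```
# Linux: total/free memory and the biggest memory consumers
free -m
ps aux --sort=-rss | head
# FreeBSD: physical/usable memory and a resident-size-sorted top
sysctl hw.physmem hw.realmem
top -o res
```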
On Thu, May 25, 2023 at 7:31 AM Roger Mason wrote:
> Hello,
>
> Doug Meyer writes:
>
> > Could also review the node log
On 5/25/23 15:23, Roger Mason wrote:
> NodeName=node012 CoresPerSocket=2
> CPUAlloc=0 CPUTot=4 CPULoad=N/A
> AvailableFeatures=(null)
> ActiveFeatures=(null)
> Gres=(null)
> NodeAddr=node012 NodeHostName=node012
> RealMemory=10193 AllocMem=0 FreeMem=N/A Sockets=2 Boards=1
> State=UNKN
Hello,
Davide DelVento writes:
> Can you ssh into the node and check the actual availability of memory?
> Maybe there is a zombie process (or a healthy one with a memory leak
> bug) that's hogging all the memory?
This is what top shows:
last pid: 45688; load averages: 0.00, 0.00, 0.00
Ole Holm Nielsen writes:
> 1. Is slurmd running on the node?
Yes.
> 2. What's the output of "slurmd -C" on the node?
NodeName=node012 CPUs=4 Boards=1 SocketsPerBoard=2 CoresPerSocket=2
ThreadsPerCore=1 RealMemory=6097
> 3. Define State=UP in slurm.conf instead of UNKNOWN
Will do.
> 4. Why h
That output of slurmd -C is your answer.
Slurmd only sees 6GB of memory and you are claiming it has 10GB.
I would run some memtests, look at meminfo on the node, etc.
Maybe even check that the type/size of memory in there is what you think
it is.
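If ~6 GB really is all the node has, a minimal fix (a sketch only, assuming your slurm.conf NodeName line looks roughly like the slurmd -C output quoted above) is to set RealMemory to what slurmd -C reports:
```
# slurm.conf: use the value reported by "slurmd -C" on node012
NodeName=node012 CPUs=4 Boards=1 SocketsPerBoard=2 CoresPerSocket=2 ThreadsPerCore=1 RealMemory=6097
```
and then push it out and clear the drain:
```
scontrol reconfigure
scontrol update nodename=node012 state=resume
```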
Brian Andrus
On 5/25/2023 7:30 AM, Roger Mas
A quick test to see if it's a configuration error is to set config_overrides in
your slurm.conf and see if the node then responds to scontrol update.
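In slurm.conf that is the SlurmdParameters option, something like the sketch below (it replaced the old FastSchedule=2 behaviour, so check that your 20.02 build accepts it), followed by `scontrol reconfigure`:
```
# slurm.conf
SlurmdParameters=config_overrides
```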
From: slurm-users on behalf of Brian Andrus
Sent: Thursday, May 25, 2023 10:54 AM
To: slurm-users@lists.schedmd
After trying to approach this with preempt/partition_prio, we ended up moving
to QOS-based preemption due to some issues with suspend/requeue, and also
because QOS allows quicker/easier tweaks than changing partitions as a
whole.
> PreemptType=preempt/qos
> PreemptMode=SUSPEND,GANG
> Partit
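A sketch of the accounting side that usually goes with that setup (the QOS names 'high' and 'normal' and the user name are placeholders, not from the original configuration):
```
# create a higher-priority QOS that may suspend jobs running under the default QOS
sacctmgr add qos high
sacctmgr modify qos high set priority=100 preempt=normal preemptmode=suspend
# allow selected users to submit with it (someuser is a placeholder)
sacctmgr modify user where name=someuser set qos+=high
```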
Hello,
slurm-23.02 on ubuntu-20.04,
seff is not working anymore:
```
# ./seff 4911385
Use of uninitialized value $FindBin::Bin in concatenation (.) or string at
./seff line 11.
Name "FindBin::Bin" used only once: possible typo at ./seff line 11,
line 602.
perl: error: slurm_persist_conn_open:
```
How did you install seff? I don’t know exactly where this happens, but it
looks like line 11 in the source file for seff is supposed to get transformed
to include an actual path. I am running on CentOS and install Slurm by building
the RPMs using the included spec files and here is a diff of th
Well, sorry, I did indeed run the raw script for this mail.
Running the one installed by `make install`, which sets the line 11
path correctly:
use lib qw(/usr/local/slurm-23.02.2/lib/x86_64-linux-gnu/perl/5.30.0);
I get:
perl: error: slurm_persist_conn_open: Something happened with t
Hello,
"Groner, Rob" writes:
> A quick test to see if it's a configuration error is to set
> config_overrides in your slurm.conf and see if the node then responds
> to scontrol update.
Thanks to all who helped. It turned out that memory was the issue. I
have now reseated the RAM in the offend
Hello everybody,
I am observing an interaction between the --mem-per-gpu, --cpus-per-gpu
and --gres settings in sbatch which I do not understand.
Basically, if the job is submitted with --gres=gpu:2 the --mem-per-gpu
and --cpus-per-gpu settings appear to be observed. If the job is
submitted wit
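Spelled out as a full command, the case where the per-GPU options appear to be honoured looks like this (the 4-CPU/8G values and job.sh are placeholders; the message above is cut off before it says what the other submission variant was):
```
# per-GPU requests appear to be observed when GPUs are requested via --gres
sbatch --gres=gpu:2 --cpus-per-gpu=4 --mem-per-gpu=8G job.sh
```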
Hello,
I have a badly behaving user that I need to speak with and want to temporarily
disable their ability to submit jobs. I know I can change their account
settings to stop them. Is there another way to set a block on a specific
username that I can lift later without removing the user/account
I always like
Sacctmgr update user where user= set grpcpus=0
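(On newer Slurm versions the same idea is usually written with GrpTRES; the username is a placeholder here, just as it is left blank above:)
```
sacctmgr update user where user=<username> set GrpTRES=cpu=0
```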
On Thu, May 25, 2023, 4:19 PM Markuske, William wrote:
> Hello,
>
> I have a badly behaving user that I need to speak with and want to
> temporarily disable their ability to submit jobs. I know I can change their
> account settings to
Hi Willy,
sacctmgr modify account slurmaccount user=baduser set maxjobs=0
Sean
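To lift the block later, setting the limit to -1 clears it again (same example account/user names as above):
```
sacctmgr modify account slurmaccount user=baduser set maxjobs=-1
```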
From: slurm-users on behalf of Markuske, William
Sent: Friday, 26 May 2023 09:16
To: slurm-users@lists.schedmd.com
Subject: [EXT] [slurm-users] Temporary Stop User Submission
Ext
I tend not to let them log in. It will get their attention, and prevent them
from just running their work on the login node when they discover they can’t
submit. But I appreciate seeing the other options.
Sent from my iPhone
> On May 25, 2023, at 19:19, Markuske, William wrote:
>
> Hello,
>
>
Sean,
I was just about to mention that this wasn't working, because I had thought of
something similar. I tried 'sacctmgr modify user where name= set
maxjobs=0', but that was still allowing the user to 'srun --pty bash'.
Apparently doing it by modifying the account, as you stated, does work
though
I would, but unfortunately they were creating 100s of TBs of data, and I need
them to log in and delete it, but I don't want them creating more in the
meantime.
Regards,
--
Willy Markuske
HPC Systems Engineer
SDSC - Research Data Services
(619) 519-4435
wmarku...@sdsc.edu
On May 25, 2023, at 16:
Hello,
David Gauchard writes:
> slurm-23.02 on ubuntu-20.04,
>
> seff is not working anymore:
Perhaps it is something specific to 20.04? I'm on Ubuntu 22.04 with
slurm-23.02.1 here and have no problems with seff, except that the memory
efficiency part seems broken (I always seem to get 0.00% efficien
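(One way to cross-check what seff reports, using the job id quoted earlier in the thread as an example, is to pull the raw numbers straight from sacct:)
```
sacct -j 4911385 --format=JobID,Elapsed,TotalCPU,ReqMem,MaxRSS
```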