[slurm-users] Nodes stuck in drain state

2023-05-25 Thread Roger Mason
Hello,

slurm 20.02.7 on FreeBSD.

I have a couple of nodes stuck in the drain state.  I have tried

scontrol update nodename=node012 state=down reason="stuck in drain state"
scontrol update nodename=node012 state=resume

without success.

I then tried

/usr/local/sbin/slurmctld -c
scontrol update nodename=node012 state=idle

also without success.

Is there some other method I can use to get these nodes back up?

Thanks,
Roger



Re: [slurm-users] Nodes stuck in drain state

2023-05-25 Thread Ole Holm Nielsen

On 5/25/23 13:59, Roger Mason wrote:

slurm 20.02.7 on FreeBSD.


Uh, that's old!


I have a couple of nodes stuck in the drain state.  I have tried

scontrol update nodename=node012 state=down reason="stuck in drain state"
scontrol update nodename=node012 state=resume

without success.

I then tried

/usr/local/sbin/slurmctld -c
scontrol update nodename=node012 state=idle

also without success.

Is there some other method I can use to get these nodes back up?


What's the output of "scontrol show node node012"?

/Ole



Re: [slurm-users] Nodes stuck in drain state

2023-05-25 Thread Doug Meyer
Could also review the node log in /var/log/slurm/ .  Often sinfo -lR will
tell you the cause, for example memory not matching the config.

Doug

On Thu, May 25, 2023 at 5:32 AM Ole Holm Nielsen 
wrote:

> On 5/25/23 13:59, Roger Mason wrote:
> > slurm 20.02.7 on FreeBSD.
>
> Uh, that's old!
>
> > I have a couple of nodes stuck in the drain state.  I have tried
> >
> > scontrol update nodename=node012 state=down reason="stuck in drain state"
> > scontrol update nodename=node012 state=resume
> >
> > without success.
> >
> > I then tried
> >
> > /usr/local/sbin/slurmctld -c
> > scontrol update nodename=node012 state=idle
> >
> > also without success.
> >
> > Is there some other method I can use to get these nodes back up?
>
> What's the output of "scontrol show node node012"?
>
> /Ole
>
>


Re: [slurm-users] Nodes stuck in drain state

2023-05-25 Thread Roger Mason


Ole Holm Nielsen  writes:

> On 5/25/23 13:59, Roger Mason wrote:
>> slurm 20.02.7 on FreeBSD.
>
> Uh, that's old!

Yes.  It is what is available in ports.

> What's the output of "scontrol show node node012"?

NodeName=node012 CoresPerSocket=2 
   CPUAlloc=0 CPUTot=4 CPULoad=N/A
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=(null)
   NodeAddr=node012 NodeHostName=node012 
   RealMemory=10193 AllocMem=0 FreeMem=N/A Sockets=2 Boards=1
   State=UNKNOWN+DRAIN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A 
MCS_label=N/A
   Partitions=macpro 
   BootTime=None SlurmdStartTime=None
   CfgTRES=cpu=4,mem=10193M,billing=4
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Reason=Low RealMemory [slurm@2023-05-25T09:26:59]

But the 'Low RealMemory' is incorrect.  The entry in slurm.conf for
node012 is:

NodeName=node012 CPUs=4 Boards=1 SocketsPerBoard=2 CoresPerSocket=2
ThreadsPerCore=1 RealMemory=10193  State=UNKNOWN

Thanks for the help.
Roger



Re: [slurm-users] Nodes stuck in drain state

2023-05-25 Thread Roger Mason
Hello,

Doug Meyer  writes:

> Could also review the node log in /var/log/slurm/ .  Often sinfo -lR will tell 
> you the cause, for example memory not matching the config.
>
REASON   USER TIMESTAMP   STATE  NODELIST 
Low RealMemory   slurm(468)   2023-05-25T09:26:59 drain* node012 
Not responding   slurm(468)   2023-05-25T09:30:31 down*
node[001-003,008]

But, as I said in my response to Ole, the memory in slurm.conf and in
the 'show node' output match.

Many thanks for the help.

Roger



Re: [slurm-users] Nodes stuck in drain state

2023-05-25 Thread Davide DelVento
Can you ssh into the node and check the actual availability of memory?
Maybe there is a zombie process (or a healthy one with a memory leak bug)
that's hogging all the memory?
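
For example (a quick sketch, assuming the node runs FreeBSD as mentioned
earlier in the thread):

# how much RAM the OS sees, to compare against RealMemory in slurm.conf
sysctl hw.physmem hw.realmem

# top processes ordered by resident size, to spot anything hogging memory
top -b -o res 10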

On Thu, May 25, 2023 at 7:31 AM Roger Mason  wrote:

> Hello,
>
> Doug Meyer  writes:
>
> > Could also review the node log in /var/log/slurm/ .  Often sinfo -lR will
> tell you the cause, for example memory not matching the config.
> >
> REASON   USER TIMESTAMP   STATE  NODELIST
> Low RealMemory   slurm(468)   2023-05-25T09:26:59 drain* node012
> Not responding   slurm(468)   2023-05-25T09:30:31 down*
> node[001-003,008]
>
> But, as I said in my response to Ole, the memory in slurm.conf and in
> the 'show node' output match.
>
> Many thanks for the help.
>
> Roger
>
>


Re: [slurm-users] Nodes stuck in drain state

2023-05-25 Thread Ole Holm Nielsen

On 5/25/23 15:23, Roger Mason wrote:

NodeName=node012 CoresPerSocket=2
CPUAlloc=0 CPUTot=4 CPULoad=N/A
AvailableFeatures=(null)
ActiveFeatures=(null)
Gres=(null)
NodeAddr=node012 NodeHostName=node012
RealMemory=10193 AllocMem=0 FreeMem=N/A Sockets=2 Boards=1
State=UNKNOWN+DRAIN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A 
MCS_label=N/A
Partitions=macpro
BootTime=None SlurmdStartTime=None
CfgTRES=cpu=4,mem=10193M,billing=4
AllocTRES=
CapWatts=n/a
CurrentWatts=0 AveWatts=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
Reason=Low RealMemory [slurm@2023-05-25T09:26:59]

But the 'Low RealMemory' is incorrect.  The entry in slurm.conf for
node012 is:

NodeName=node012 CPUs=4 Boards=1 SocketsPerBoard=2 CoresPerSocket=2
ThreadsPerCore=1 RealMemory=10193  State=UNKNOWN


Thanks for the info.  Some questions arise:

1. Is slurmd running on the node?

2. What's the output of "slurmd -C" on the node?

3. Define State=UP in slurm.conf instead of UNKNOWN

4. Why have you configured TmpDisk=0?  It should be the size of the /tmp 
filesystem.


Since you run Slurm 20.02, there are some suggestions in my Wiki page 
https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_configuration/#compute-node-configuration 
where this might be useful:



Note for Slurm 20.02: The Boards=1 SocketsPerBoard=2 configuration gives error 
messages, see bug_9241 and bug_9233. Use Sockets= instead:
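
For node012 that would be something like this (a sketch based on your current
entry, with Boards=/SocketsPerBoard= replaced by Sockets=; the TmpDisk value is
just a placeholder for the real size of /tmp in MB):

NodeName=node012 CPUs=4 Sockets=2 CoresPerSocket=2 ThreadsPerCore=1 RealMemory=10193 TmpDisk=10000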


I hope changing these slurm.conf parameters will help.

Best regards,
Ole






Re: [slurm-users] Nodes stuck in drain state

2023-05-25 Thread Roger Mason
Hello,

Davide DelVento  writes:

> Can you ssh into the node and check the actual availability of memory?
> Maybe there is a zombie process (or a healthy one with a memory leak
> bug) that's hogging all the memory?

This is what top shows:

last pid: 45688;  load averages:  0.00,  0.00,  0.00
   up 0+03:56:52  11:58:13
26 processes:  1 running, 25 sleeping
CPU:  0.0% user,  0.0% nice,  0.1% system,  0.0% interrupt, 99.9% idle
Mem: 9452K Active, 69M Inact, 290M Wired, 287K Buf, 5524M Free
ARC: 125M Total, 37M MFU, 84M MRU, 168K Anon, 825K Header, 3476K Other
 36M Compressed, 89M Uncompressed, 2.46:1 Ratio
Swap: 10G Total, 10G Free

Thanks for the suggestion.

Roger



Re: [slurm-users] Nodes stuck in drain state

2023-05-25 Thread Roger Mason


Ole Holm Nielsen  writes:

> 1. Is slurmd running on the node?
Yes.

> 2. What's the output of "slurmd -C" on the node?
NodeName=node012 CPUs=4 Boards=1 SocketsPerBoard=2 CoresPerSocket=2
ThreadsPerCore=1 RealMemory=6097

> 3. Define State=UP in slurm.conf instead of UNKNOWN
Will do.

> 4. Why have you configured TmpDisk=0?  It should be the size of the
> /tmp filesystem.
I have not configured TmpDisk.  This is the entry in slurm.conf for that
node:
NodeName=node012 CPUs=4 Boards=1 SocketsPerBoard=2 CoresPerSocket=2
ThreadsPerCore=1 RealMemory=10193  State=UNKNOWN

But I do notice that slurmd -C now says there is less memory than
configured.

Thanks again.

Roger



Re: [slurm-users] Nodes stuck in drain state

2023-05-25 Thread Brian Andrus

That output of slurmd -C is your answer.

Slurmd only sees 6GB of memory and you are claiming it has 10GB.

I would run some memtests, look at meminfo on the node, etc.

Maybe even check that the type/size of memory in there is what you think 
it is.


Brian Andrus

On 5/25/2023 7:30 AM, Roger Mason wrote:

Ole Holm Nielsen  writes:


1. Is slurmd running on the node?

Yes.


2. What's the output of "slurmd -C" on the node?

NodeName=node012 CPUs=4 Boards=1 SocketsPerBoard=2 CoresPerSocket=2
ThreadsPerCore=1 RealMemory=6097


3. Define State=UP in slurm.conf instead of UNKNOWN

Will do.


4. Why have you configured TmpDisk=0?  It should be the size of the
/tmp filesystem.

I have not configured TmpDisk.  This is the entry in slurm.conf for that
node:
NodeName=node012 CPUs=4 Boards=1 SocketsPerBoard=2 CoresPerSocket=2
ThreadsPerCore=1 RealMemory=10193  State=UNKNOWN

But I do notice that slurmd -C now says there is less memory than
configured.

Thanks again.

Roger





Re: [slurm-users] Nodes stuck in drain state

2023-05-25 Thread Groner, Rob
A quick test to see if it's a configuration error is to set config_overrides in 
your slurm.conf and see if the node then responds to scontrol update.
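
Something like this (a sketch; assuming your Slurm version supports the
config_overrides option of SlurmdParameters, which tells slurmctld to trust the
node definition in slurm.conf rather than draining nodes that report fewer
resources), followed by restarting the daemons:

SlurmdParameters=config_overrides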


From: slurm-users  on behalf of Brian 
Andrus 
Sent: Thursday, May 25, 2023 10:54 AM
To: slurm-users@lists.schedmd.com 
Subject: Re: [slurm-users] Nodes stuck in drain state

That output of slurmd -C is your answer.

Slurmd only sees 6GB of memory and you are claiming it has 10GB.

I would run some memtests, look at meminfo on the node, etc.

Maybe even check that the type/size of memory in there is what you think
it is.

Brian Andrus

On 5/25/2023 7:30 AM, Roger Mason wrote:
> Ole Holm Nielsen  writes:
>
>> 1. Is slurmd running on the node?
> Yes.
>
>> 2. What's the output of "slurmd -C" on the node?
> NodeName=node012 CPUs=4 Boards=1 SocketsPerBoard=2 CoresPerSocket=2
> ThreadsPerCore=1 RealMemory=6097
>
>> 3. Define State=UP in slurm.conf instead of UNKNOWN
> Will do.
>
>> 4. Why have you configured TmpDisk=0?  It should be the size of the
>> /tmp filesystem.
> I have not configured TmpDisk.  This is the entry in slurm.conf for that
> node:
> NodeName=node012 CPUs=4 Boards=1 SocketsPerBoard=2 CoresPerSocket=2
> ThreadsPerCore=1 RealMemory=10193  State=UNKNOWN
>
> But I do notice that slurmd -C now says there is less memory than
> configured.
>
> Thanks again.
>
> Roger
>



Re: [slurm-users] hi-priority partition and preemption

2023-05-25 Thread Reed Dier
After trying to approach this with preempt/partition_prio, we ended up moving 
to QOS based preemption due to some issues with suspend/requeue, and also 
wanting to use QOS for quicker/easier tweaks than changing partitions as a 
whole.

> PreemptType=preempt/qos
> PreemptMode=SUSPEND,GANG
> PartitionName=part-lopri Nodes=nodes[000-NNN] Default=NO MaxTime=INFINITE OverSubscribe=FORCE:1 PriorityTier=10  State=UP
> PartitionName=part-hipri Nodes=nodes[000-NNN] Default=NO MaxTime=INFINITE OverSubscribe=NO      PriorityTier=100 State=UP PreemptMode=OFF

We then have a few QOS that have different Priority values, as well as 
PreemptMode, QOS it can preempt, etc.
>       Name   Priority    Preempt PreemptMode
> ---------- ---------- ---------- -----------
>         rq         10                requeue
>       susp         11                suspend
>      hipri        100    rq,susp     cluster
>       test         50         rq     requeue

The rq qos is stateless and can be requeued, susp qos is stateful and needs to 
be suspended.
Hipri can preempt rq and susp.
We also have a test qos with very strict limits (wall clock, job count, tres 
count, etc) that allows small jobs to jump the queue, for quick testing before 
submitting into the full queue.
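
Roughly how those QOS look when created (a sketch from memory; the limits on
the test QOS here are made-up example values):

sacctmgr add qos rq Priority=10 PreemptMode=requeue
sacctmgr add qos susp Priority=11 PreemptMode=suspend
sacctmgr add qos hipri Priority=100 Preempt=rq,susp PreemptMode=cluster
sacctmgr add qos test Priority=50 Preempt=rq PreemptMode=requeue MaxWall=00:30:00 MaxJobsPerUser=2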

The tricky part for us was that we have some stateful jobs that need to be 
suspended, and some stateless jobs that can just be requeued without issue.
But we want the hipri partition to take precedence, on the same hardware pool.
We also didn’t want gang scheduling to flip-flop running jobs, which, if memory 
serves me correctly, is how/why we ended up going with duplicate partitions 
for the purpose of priority: we couldn’t get preemption to work 
intra-partition correctly.
In a perfect world, we would have just the single partition and everything 
handled in QOS, but it’s working, and that’s what mattered.

I’m not sure how any of this would work with FORCE:20 oversubscribe, but 
hopefully it offers something useful to try next.

Reed

> On May 24, 2023, at 8:42 AM, Groner, Rob  wrote:
> 
> What you are describing is definitely doable.  We have our system setup 
> similarly.  All nodes are in the "open" partition and "prio" partition, but a 
> job submitted to the "prio" partition will preempt the open jobs.
> 
> I don't see anything clearly wrong with your slurm.conf settings.  Ours are 
> very similar, though we use only FORCE:1 for oversubscribe.  You might try 
> that just to see if there's a difference.
> 
> What are the sbatch settings you are using when you submit the jobs?
> 
> Do you have PreemptExemptTime set to anything in slurm.conf?
> 
> What is the reason squeue gives for the high priority jobs to be pending?
> 
> For your "run regularly" goal, you might consider scrontab.  If we can figure 
> out priority and preemption, then that will start the job at a regular time.
> 
> Rob
> 
> From: slurm-users  on behalf of Fabrizio Roccato <f.rocc...@isac.cnr.it>
> Sent: Wednesday, May 24, 2023 7:17 AM
> To: slurm-users@lists.schedmd.com
> Subject: [slurm-users] hi-priority partition and preemption
>  
> 
> Hi all,
> i'm trying to have two overlapping partitions, say normal and hi-pri,
> so that when jobs are launched in the second one they can preempt the jobs
> already running in the first one, automatically putting them in suspend
> state. After completion, the jobs in the normal partition must be
> automatically resumed.
> 
> here are my (relevant) slurm.conf settings:
> 
> > PreemptMode=suspend,gang
> > PreemptType=preempt/partition_prio
> >
> > PartitionName=normal Nodes=node0[01-08] MaxTime=1800 PriorityTier=100 
> > AllowAccounts=group1,group2 OverSubscribe=FORCE:20 PreemptMode=suspend
> > PartitionName=hi-pri Nodes=node0[01-08] MaxTime=360 PriorityTier=500 
> > AllowAccounts=group2 OverSubscribe=FORCE:20 PreemptMode=off
> 
> But so, jobs in the hi-pri partition were put in PD state and the ones
> already running in the normal partition continued in their R status.
> What am I doing wrong? What am I missing?
> 
> Since I have jobs that must run at specific times and must have priority over
> all the others, is this the correct way to do it?
> 
> 
> Thanks
> 
> FR



[slurm-users] seff in slurm-23.02

2023-05-25 Thread David Gauchard
Hello,

slurm-23.02 on ubuntu-20.04,

seff is not working anymore:

```
# ./seff 4911385
Use of uninitialized value $FindBin::Bin in concatenation (.) or string at 
./seff line 11.
Name "FindBin::Bin" used only once: possible typo at ./seff line 11,  
line 602.
perl: error: slurm_persist_conn_open: Something happened with the 
receiving/processing of the persistent connection init message to 
localhost:6819: Failed to unpack
SLURM_PERSIST_INIT message
perl: error: Sending PersistInit msg: Message receive failure
Use of uninitialized value in subroutine entry at ./seff line 58,  line 
602.
perl: error: [...]
```


while using 
https://github.com/SchedMD/slurm/blob/ce7d569807c495516ebfa6fcef25ad36ccc76827/contribs/seff/seff#LL19C3-L19C124
 :

```
# sacct -P -n -a --format 
JobID,User,Group,State,Cluster,AllocCPUS,REQMEM,TotalCPU,Elapsed,MaxRSS,ExitCode,NNodes,NTasks
 -j 4911385
4911385|user|part|FAILED|hpc|1|2000M|00:23.041|00:00:31||0:9|1|
4911385.batch|||CANCELLED by 0|hpc|1||00:23.041|00:00:31|5936692K|0:9|1|1
```

I wonder whether this is an installation error and contrib/seff is working
for other 23.02 users.

Thanks



Re: [slurm-users] [EXTERNAL] seff in slurm-23.02

2023-05-25 Thread Mike Robbert
How did you install seff? I don’t know exactly where this happens, but it 
looks like line 11 in the source file for seff is supposed to get transformed 
to include an actual path. I am running on CentOS and install Slurm by building 
the RPMs using the included spec files and here is a diff of the file in the 
source tree and the file that got installed to /usr/bin/seff 

$ diff contribs/seff/seff /usr/bin/seff 
11c11 
< use lib "${FindBin::Bin}/../lib/perl"; 
--- 
> use lib qw(/usr/lib64/perl5); 

Mike Robbert 
Cyberinfrastructure Specialist, Cyberinfrastructure and Advanced Research 
Computing 
Information and Technology Solutions (ITS) 
303-273-3786 | mrobb...@mines.edu  

Our values: Trust | Integrity | Respect | Responsibility 




From: slurm-users  on behalf of David 
Gauchard 
Date: Thursday, May 25, 2023 at 10:02
To: slurm-us...@schedmd.com 
Subject: [EXTERNAL] [slurm-users] seff in slurm-23.02 



Hello,

slurm-23.02 on ubuntu-20.04,

seff is not working anymore:

```
# ./seff 4911385
Use of uninitialized value $FindBin::Bin in concatenation (.) or string at 
./seff line 11.
Name "FindBin::Bin" used only once: possible typo at ./seff line 11,  
line 602.
perl: error: slurm_persist_conn_open: Something happened with the 
receiving/processing of the persistent connection init message to 
localhost:6819: Failed to unpack
SLURM_PERSIST_INIT message
perl: error: Sending PersistInit msg: Message receive failure
Use of uninitialized value in subroutine entry at ./seff line 58,  line 
602.
perl: error: [...]
```


while using 
https://github.com/SchedMD/slurm/blob/ce7d569807c495516ebfa6fcef25ad36ccc76827/contribs/seff/seff#LL19C3-L19C124 :

```
# sacct -P -n -a --format 
JobID,User,Group,State,Cluster,AllocCPUS,REQMEM,TotalCPU,Elapsed,MaxRSS,ExitCode,NNodes,NTasks
 -j 4911385
4911385|user|part|FAILED|hpc|1|2000M|00:23.041|00:00:31||0:9|1|
4911385.batch|||CANCELLED by 0|hpc|1||00:23.041|00:00:31|5936692K|0:9|1|1
```

I wonder whether this is an installation error and contrib/seff is working
for other 23.02 users.

Thanks 






Re: [slurm-users] seff in slurm-23.02

2023-05-25 Thread David Gauchard

Well, sorry, I indeed ran the raw script for that mail.
Running the one installed by `make install`, which sets the line 11 
path correctly:

use lib qw(/usr/local/slurm-23.02.2/lib/x86_64-linux-gnu/perl/5.30.0);

I get:

perl: error: slurm_persist_conn_open: Something happened with the 
receiving/processing of the persistent connection init message to 
localhost:6819: Failed to unpack SLURM_PERSIST_INIT message

perl: error: Sending PersistInit msg: Message receive failure
Use of uninitialized value in subroutine entry at 
/usr/local/slurm/bin/seff line 57,  line 564.

perl: error: g_slurm_auth_pack: protocol_version 6500 not supported
perl: error: slurm_send_node_msg: g_slurm_auth_pack: 
REQUEST_PERSIST_INIT has  authentication error: No error
perl: error: slurm_persist_conn_open: failed to send persistent 
connection init message to localhost:6819

perl: error: Sending PersistInit msg: Protocol authentication error
perl: error: DBD_GET_JOBS_COND failure: Unspecified error
Job not found.

Slurm is otherwise running well after an update from 20.11 -> 21.08 -> 
23.02.


# sinfo -V
slurm 23.02.2
# sinfo -O nodehost,Version
HOSTNAMES   VERSION
x   23.02.2
x   23.02.2
x   23.02.2
x   23.02.2
x   23.02.2
x   23.02.2
x   23.02.2
x   23.02.2
x   23.02.2
x   23.02.2


On 5/25/23 18:33, Mike Robbert wrote:
How did you install seff? I don’t know exactly where this happens, but 
it looks like line 11 in the source file for seff is supposed to get 
transformed to include an actual path. I am running on CentOS and 
install Slurm by building the RPMs using the included spec files and 
here is a diff of the file in the source tree and the file that got 
installed to /usr/bin/seff


$ diff contribs/seff/seff /usr/bin/seff
11c11
< use lib "${FindBin::Bin}/../lib/perl";
---
> use lib qw(/usr/lib64/perl5);


*Mike Robbert*

*Cyberinfrastructure Specialist, Cyberinfrastructure and Advanced 
Research Computing*


Information and Technology Solutions (ITS)

303-273-3786 | mrobb...@mines.edu 


*Our values:*Trust | Integrity | Respect | Responsibility

*From: *slurm-users  on behalf of 
David Gauchard 

*Date: *Thursday, May 25, 2023 at 10:02
*To: *slurm-us...@schedmd.com 
*Subject: *[EXTERNAL] [slurm-users] seff in slurm-23.02




Hello,

slurm-23.02 on ubuntu-20.04,

seff is not working anymore:

```
# ./seff 4911385
Use of uninitialized value $FindBin::Bin in concatenation (.) or string 
at ./seff line 11.
Name "FindBin::Bin" used only once: possible typo at ./seff line 11, 
 line 602.
perl: error: slurm_persist_conn_open: Something happened with the 
receiving/processing of the persistent connection init message to 
localhost:6819: Failed to unpack

SLURM_PERSIST_INIT message
perl: error: Sending PersistInit msg: Message receive failure
Use of uninitialized value in subroutine entry at ./seff line 58,  
line 602.

perl: error: [...]
```


while using 
https://github.com/SchedMD/slurm/blob/ce7d569807c495516ebfa6fcef25ad36ccc76827/contribs/seff/seff#LL19C3-L19C124 :


```
# sacct -P -n -a --format 
JobID,User,Group,State,Cluster,AllocCPUS,REQMEM,TotalCPU,Elapsed,MaxRSS,ExitCode,NNodes,NTasks -j 4911385

4911385|user|part|FAILED|hpc|1|2000M|00:23.041|00:00:31||0:9|1|
4911385.batch|||CANCELLED by 0|hpc|1||00:23.041|00:00:31|5936692K|0:9|1|1
```

I wonder whether this is an installation error and contrib/seff is working
for other 23.02 users.

Thanks



Re: [slurm-users] Nodes stuck in drain state

2023-05-25 Thread Roger Mason
Hello,

"Groner, Rob"  writes:

> A quick test to see if it's a configuration error is to set
> config_overrides in your slurm.conf and see if the node then responds
> to scontrol update.

Thanks to all who helped.  It turned out that memory was the issue.  I
have now reseated the RAM in the offending node and all seems well.

I have another node also stuck in drain that I will investigate.  I
picked up some useful tips from the replies, but if I can't get it back
on-line I hope the friendly people on this list will rescue me.

Thanks again,
Roger



[slurm-users] sbatch mem-per-gpu and gres interaction

2023-05-25 Thread christof . koehler
Hello everybody,

I am observing an interaction between the --mem-per-gpu, --cpus-per-gpu
and --gres settings in sbatch which I do not understand. 

Basically, if the job is submitted with --gres=gpu:2 the --mem-per-gpu 
and --cpus-per-gpu settings appear to be observed. If the job is
submitted with --gres=gpu:a100:2 the settings appear to be ignored and
partition defaults are used instead.

First the partition definition from slurm.conf (slurm 23.02.2) and then 
a demonstration:

PartitionName=gpu Nodes=gpu[001-004] MaxTime=24:00:00 DefMemPerGPU=124000  
 DefCpuPerGPU=12 
TRESBillingWeights="CPU=1.0,Mem=0.25G,GRES/gpu=50.0,Gres/gpu:a100=50.0" 
 State=UP

Submitting this jobscript:

#!/bin/bash
#SBATCH --cpus-per-gpu=4
#SBATCH --mem-per-gpu=5000M
#SBATCH --partition=gpu
#SBATCH --gres=gpu:2
#SBATCH --time=00:20:00
sleep 60

gives 

$ scontrol show job=812|grep -i tres
   ReqTRES=cpu=1,mem=1M,node=1,billing=103,gres/gpu=2
   AllocTRES=cpu=8,mem=1M,node=1,billing=210,gres/gpu=2,gres/gpu:a100=2
   CpusPerTres=gres:gpu:4
   MemPerTres=gres:gpu:5000
   TresPerNode=gres:gpu:2

Changing "--gres=gpu:2" to "--gres=gpu:a100:2" however gives

$ scontrol show job=813|grep -i tres
   ReqTRES=cpu=1,mem=50M,node=1,billing=323,gres/gpu=2,gres/gpu:a100=2
   AllocTRES=cpu=24,mem=248000M,node=1,billing=284,gres/gpu=2,gres/gpu:a100=2
   CpusPerTres=gpu:12
   MemPerTres=gpu:124000
   TresPerNode=gres:gpu:a100:2

So, if "--gres=gpu:2" the settings from --mem-per-gpu and --cpus-per-gpu
are used. But if "--gres=gpu:a100:2" the partition default values are
used.

If I change the partition definition to

PartitionName=gpu Nodes=gpu[001-004] MaxTime=24:00:00 DefMemPerCPU=4096
MaxMemPerCPU=10200 DefCpuPerGPU=12
TRESBillingWeights="CPU=1.0,Mem=0.25G,GRES/gpu=50.0,Gres/gpu:a100=50.0"
State=UP

we observe the same behaviour. With "--gres=gpu:a100:2" partition
default values for memory and number of cpus are used instead of
the values supplied in the jobscript.

I did not find anything describing such an interaction in the
documentation. Is what we observe the expected behaviour for some
reason? Or is there a problem with our configuration?

Best Regards

Christof

-- 
Dr. rer. nat. Christof Köhler   email: c.koeh...@uni-bremen.de
Universitaet Bremen/FB1/BCCMS   phone:  +49-(0)421-218-62334
Am Fallturm 1/ TAB/ Raum 3.06   fax: +49-(0)421-218-62770
28359 Bremen  



[slurm-users] Temporary Stop User Submission

2023-05-25 Thread Markuske, William
Hello,

I have a badly behaving user that I need to speak with and want to temporarily 
disable their ability to submit jobs. I know I can change their account 
settings to stop them. Is there another way to set a block on a specific 
username that I can lift later without removing the user/account associations?

Regards,

--
Willy Markuske

HPC Systems Engineer
SDSC - Research Data Services
(619) 519-4435
wmarku...@sdsc.edu



Re: [slurm-users] Temporary Stop User Submission

2023-05-25 Thread Doug Meyer
I always like

Sacctmgr update user where user= set grpcpus=0
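
For example (hypothetical username shown; setting the limit back to -1 clears
it again later):

sacctmgr update user where user=baduser set grpcpus=0
# ...later, to lift the block:
sacctmgr update user where user=baduser set grpcpus=-1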

On Thu, May 25, 2023, 4:19 PM Markuske, William  wrote:

> Hello,
>
> I have a badly behaving user that I need to speak with and want to
> temporarily disable their ability to submit jobs. I know I can change their
> account settings to stop them. Is there another way to set a block on a
> specific username that I can lift later without removing the user/account
> associations?
>
> Regards,
>
> --
> Willy Markuske
>
> HPC Systems Engineer
> SDSC - Research Data Services
> (619) 519-4435
> wmarku...@sdsc.edu
>
>


Re: [slurm-users] Temporary Stop User Submission

2023-05-25 Thread Sean Crosby
Hi Willy,

sacctmgr modify account slurmaccount user=baduser set maxjobs=0

Sean



From: slurm-users  on behalf of 
Markuske, William 
Sent: Friday, 26 May 2023 09:16
To: slurm-users@lists.schedmd.com 
Subject: [EXT] [slurm-users] Temporary Stop User Submission

External email: Please exercise caution


Hello,

I have a badly behaving user that I need to speak with and want to temporarily 
disable their ability to submit jobs. I know I can change their account 
settings to stop them. Is there another way to set a block on a specific 
username that I can lift later without removing the user/account associations?

Regards,

--
Willy Markuske

HPC Systems Engineer
SDSC - Research Data Services
(619) 519-4435
wmarku...@sdsc.edu



Re: [slurm-users] Temporary Stop User Submission

2023-05-25 Thread Ryan Novosielski
I tend not to let them login. It will get their attention, and prevent them 
from just running their work on the login node when they discover they can’t 
submit. But appreciate seeing the other options.

Sent from my iPhone

> On May 25, 2023, at 19:19, Markuske, William  wrote:
> 
>  Hello,
> 
> I have a badly behaving user that I need to speak with and want to 
> temporarily disable their ability to submit jobs. I know I can change their 
> account settings to stop them. Is there another way to set a block on a 
> specific username that I can lift later without removing the user/account 
> associations?
> 
> Regards,
> 
> --
> Willy Markuske
> 
> HPC Systems Engineer
> SDSC - Research Data Services
> (619) 519-4435
> wmarku...@sdsc.edu
> 


Re: [slurm-users] Temporary Stop User Submission

2023-05-25 Thread Markuske, William
Sean,

I was just about to mention this wasn't working because I had thought of 
something similar. I tried 'sacctmgr modify user where name= set 
maxjobs=0' but that was still allowing the user to 'srun --pty bash'. 
Apparently doing it through modifying the account as you stated does work, 
though, which is odd. I just have to modify each account the user has access to.
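
For example (a sketch with hypothetical user and account names):

# list the user's associations first
sacctmgr show assoc where user=baduser format=cluster,account,user,maxjobs
# then zero MaxJobs on each of their accounts
sacctmgr modify account acct1 user=baduser set maxjobs=0
sacctmgr modify account acct2 user=baduser set maxjobs=0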

Thanks.

Regards,

--
Willy Markuske

HPC Systems Engineer
SDSC - Research Data Services
(619) 519-4435
wmarku...@sdsc.edu

On May 25, 2023, at 16:32, Sean Crosby  wrote:

Hi Willy,

sacctmgr modify account slurmaccount user=baduser set maxjobs=0

Sean



From: slurm-users  on behalf of Markuske, William <wmarku...@sdsc.edu>
Sent: Friday, 26 May 2023 09:16
To: slurm-users@lists.schedmd.com
Subject: [EXT] [slurm-users] Temporary Stop User Submission

External email: Please exercise caution


Hello,

I have a badly behaving user that I need to speak with and want to temporarily 
disable their ability to submit jobs. I know I can change their account 
settings to stop them. Is there another way to set a block on a specific 
username that I can lift later without removing the user/account associations?

Regards,

--
Willy Markuske

HPC Systems Engineer
SDSC - Research Data Services
(619) 519-4435
wmarku...@sdsc.edu



Re: [slurm-users] Temporary Stop User Submission

2023-05-25 Thread Markuske, William
I would, but unfortunately they were creating 100s of TBs of data, and I need 
them to log in and delete it; I just don't want them creating more in the 
meantime.

Regards,

--
Willy Markuske

HPC Systems Engineer
SDSC - Research Data Services
(619) 519-4435
wmarku...@sdsc.edu

On May 25, 2023, at 16:34, Ryan Novosielski  wrote:

I tend not to let them login. It will get their attention, and prevent them 
from just running their work on the login node when they discover they can’t 
submit. But appreciate seeing the other options.

Sent from my iPhone

On May 25, 2023, at 19:19, Markuske, William  wrote:

 Hello,

I have a badly behaving user that I need to speak with and want to temporarily 
disable their ability to submit jobs. I know I can change their account 
settings to stop them. Is there another way to set a block on a specific 
username that I can lift later without removing the user/account associations?

Regards,

--
Willy Markuske

HPC Systems Engineer
SDSC - Research Data Services
(619) 519-4435
wmarku...@sdsc.edu




Re: [slurm-users] seff in slurm-23.02

2023-05-25 Thread Angel de Vicente
Hello,

David Gauchard  writes:

> slurm-23.02 on ubuntu-20.04,
>
> seff is not working anymore:

perhaps it is something specific to 20.04? I'm on Ubuntu 22.04 and
slurm-23.02.1 here and no problems with seff, except that the memory
efficiency part seems broken (I always seem to get 0.00% efficiency)

,----
| State: COMPLETED (exit code 0)
| Nodes: 1
| Cores per node: 20
| CPU Utilized: 05:50:07
| CPU Efficiency: 88.41% of 06:36:00 core-walltime
| Job Wall-clock time: 00:19:48
| Memory Utilized: 5.43 GB
| Memory Efficiency: 0.00% of 16.00 B
`----

-- 
Ángel de Vicente
 Research Software Engineer (Supercomputing and BigData)
 Tel.: +34 922-605-747
 Web.: http://research.iac.es/proyecto/polmag/

 GPG: 0x8BDC390B69033F52

