[slurm-users] Re: _refresh_assoc_mgr_qos_list: no new list given back keeping cached one

2024-05-16 Thread J D via slurm-users
I realized that the mailing list may not be the right place for this message, so 
I've created a bug report instead: 
https://bugs.schedmd.com/show_bug.cgi?id=19894

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Executing srun -n X where X is greater than total CPU in entire cluster

2024-05-16 Thread Dan Healy via slurm-users
Hi there, SLURM community,

I swear I've done this before, but now it's failing on a new cluster I'm
deploying. We have 6 compute nodes with 64 CPUs each (384 CPUs total). When I
run `srun -n 500 hostname`, the job gets queued since there aren't 500
available CPUs.

Wasn't there an option that allows this to run so that the first 384 tasks
execute, and then the remaining ones execute as resources free up?
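
The kind of invocation I'm half-remembering looks something like the sketch
below (the -O/--overcommit flag is my guess at what I used before, so treat it
as an assumption):

# Guess only: -O/--overcommit lets srun launch more tasks than allocated CPUs,
# though I believe it oversubscribes CPUs rather than running the tasks in waves.
srun -n 500 --overcommit hostname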

Here's my conf:

# Slurm Cgroup Configs used on controllers and workers
slurm_cgroup_config:
  CgroupAutomount: yes
  ConstrainCores: yes
  ConstrainRAMSpace: yes
  ConstrainSwapSpace: yes
  ConstrainDevices: yes

# Slurm conf file settings
slurm_config:
  AccountingStorageType: "accounting_storage/slurmdbd"
  AccountingStorageEnforce: "limits"
  AuthAltTypes: "auth/jwt"
  ClusterName: "cluster"
  AccountingStorageHost: "{{ hostvars[groups['controller'][0]].ansible_hostname }}"
  DefMemPerCPU: 1024
  InactiveLimit: 120
  JobAcctGatherType: "jobacct_gather/cgroup"
  JobCompType: "jobcomp/none"
  MailProg: "/usr/bin/mail"
  MaxArraySize: 4
  MaxJobCount: 10
  MinJobAge: 3600
  ProctrackType: "proctrack/cgroup"
  ReturnToService: 2
  SelectType: "select/cons_tres"
  SelectTypeParameters: "CR_Core_Memory"
  SlurmctldTimeout: 30
  SlurmctldLogFile: "/var/log/slurm/slurmctld.log"
  SlurmdLogFile: "/var/log/slurm/slurmd.log"
  SlurmdSpoolDir: "/var/spool/slurm/d"
  SlurmUser: "{{ slurm_user.name }}"
  SrunPortRange: "6-61000"
  StateSaveLocation: "/var/spool/slurm/ctld"
  TaskPlugin: "task/affinity,task/cgroup"
  UnkillableStepTimeout: 120


-- 
Thanks,

Daniel Healy

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Slurm DB upgrade failure behavior

2024-05-16 Thread Yuengling, Philip J. via slurm-users
Hi everyone,


I'm writing up some Ansible code to manage Slurm software updates, and I 
haven't found any documentation about slurmdbd's behavior if the MySQL/MariaDB 
database upgrade doesn't succeed.


What I do know is that if the upgrade is successful I can expect to see "Conversion 
done: success!" in the slurmdbd log.  This is good, but as far as I know minor 
updates do not touch the database.
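
For context, the kind of post-upgrade check I have in mind looks roughly like
this (a sketch only; the log path is an assumption and should match
SlurmdbdLogFile in slurmdbd.conf):

# Sketch: confirm the conversion marker after a major upgrade.
if grep -q "Conversion done: success!" /var/log/slurm/slurmdbd.log; then
    echo "slurmdbd database conversion reported success"
else
    echo "no success marker found -- investigate before proceeding" >&2
fi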


If the Slurm database cannot be upgraded during an update, does slurmdbd always 
shut down with a fatal error?  What other behaviors should I look for if there 
is a failure?


Cheers,

Phil Y

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] memory high water mark reporting

2024-05-16 Thread Emyr James via slurm-users
Hi,

We are trying out Slurm after having run Grid Engine for a long while.
In Grid Engine, the cgroup peak memory and max_rss are captured at the end of 
a job and recorded. It logs the information from the cgroup hierarchy as well 
as doing a getrusage call right at the end on the parent PID of the whole job 
"container" before cleaning up.
With Slurm it seems that the only way memory is recorded is by the acct_gather 
polling. I am trying to add something in an epilog script to read memory.peak, 
but it looks like the cgroup hierarchy has been destroyed by the time the 
epilog is run.
Where in the code is the cgroup hierarchy cleaned up? Is there no way to add 
something so that the accounting is updated during the job cleanup process, 
allowing peak memory usage to be logged accurately?
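
What I'm effectively trying to do is something like the following, except from
an epilog (a rough sketch, assuming cgroup v2 and a kernel recent enough to
expose memory.peak; run from inside the job it reads the current step's cgroup):

# Sketch: read memory.peak from the cgroup this process belongs to,
# while the hierarchy still exists (e.g. at the end of the batch script).
cg_path="/sys/fs/cgroup$(awk -F: '$1 == "0" {print $3}' /proc/self/cgroup)"
if [ -r "${cg_path}/memory.peak" ]; then
    echo "peak memory (bytes): $(cat "${cg_path}/memory.peak")"
fi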

I can reduce the polling interval from 30s to 5s, but I don't know whether that 
causes a lot of overhead, and in any case polling does not seem a sensible way 
to obtain values that should simply be captured by an event right at the end.

Many thanks,

Emyr

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: memory high water mark reporting

2024-05-16 Thread Davide DelVento via slurm-users
Not exactly the answer to your question (which I don't know), but if you can
prefix whatever is executed with https://github.com/NCAR/peak_memusage (which
also uses getrusage), or a variant of it, you will be able to do that.

On Thu, May 16, 2024 at 4:10 PM Emyr James via slurm-users <
slurm-users@lists.schedmd.com> wrote:

> Hi,
>
> We are trying out slurm having been running grid engine for a long while.
> In grid engine, the cgroups peak memory and max_rss are generated at the
> end of a job and recorded. It logs the information from the cgroup
> hierarchy as well as doing a getrusage call right at the end on the parent
> pid of the whole job "container" before cleaning up.
> With slurm it seems that the only way memory is recorded is by the acct
> gather polling. I am trying to add something in an epilog script to get the
> memory.peak but It looks like the cgroup hierarchy has been destroyed by
> the time the epilog is run.
> Where in the code is the cgroup hierarchy cleared up ? Is there no way to
> add something in so that the accounting is updated during the job cleanup
> process so that peak memory usage can be accurately logged ?
>
> I can reduce the polling interval from 30s to 5s but don't know if this
> causes a lot of overhead and in any case this seems to not be a sensible
> way to get values that should just be determined right at the end by an
> event rather than using polling.
>
> Many thanks,
>
> Emyr
>
> --
> slurm-users mailing list -- slurm-users@lists.schedmd.com
> To unsubscribe send an email to slurm-users-le...@lists.schedmd.com
>

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: Best practice for jobs resuming from suspended state

2024-05-16 Thread Davide DelVento via slurm-users
I don't really have an answer for you, just responding to make your message
pop out in the "flood" of other topics we've got since you posted.

On our cluster we configure preemption to cancel jobs instead, because that
makes more sense for our situation, so I have no experience with jobs resuming
from a suspended state. I can think of two possible reasons for this:

- one is memory (have you checked your memory logs to see whether there is a
correlation between node memory occupation and jobs not resuming correctly?)
- the second is some resource disappearing (temp files? maybe in some
circumstances Slurm completely wipes out /tmp when the second job starts -- if
so, that would obviously be a Slurm bug)

Assuming you're stuck without finding a root cause you can address, I guess it
depends on what "doesn't recover" means. It's one thing if the job crashes
immediately. It's another if it just stalls without even restarting, while
Slurm still thinks it's running and the users are charged for the allocation
-- even worse if your cluster does not enforce a wallclock limit (or has a
very long one). Depending on the frequency of the issue, the size of your
cluster and other conditions, you may want to consider writing a watchdog
script that searches for these jobs and cancels them.
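
Something very rough, just to give the idea -- the actual "has it stalled?"
test is the hard part and is only hinted at here (whether AveCPU is a usable
signal for your jobs is an assumption):

# Sketch of a watchdog: list running jobs with per-step CPU usage so that
# jobs which never accumulate CPU after a resume can be spotted and,
# if you decide so, cancelled with scancel.
squeue -h -t RUNNING -o "%A" | while read -r jobid; do
    sstat -a -n -P -j "${jobid}" -o JobID,AveCPU 2>/dev/null |
        awk -F'|' -v j="${jobid}" '{printf "job %s step %s AveCPU=%s\n", j, $1, $2}'
done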

As I said, not really an answer, just my $0.02 (or even less).

On Wed, May 15, 2024 at 1:54 AM Paul Jones via slurm-users <
slurm-users@lists.schedmd.com> wrote:

> Hi,
>
> We use PreemptMode and PriorityTier within Slurm to suspend low priority
> jobs when more urgent work needs to be done. This generally works well, but
> on occasion resumed jobs fail to restart - which is to say Slurm sets the
> job status to running but the actual code doesn't recover from being
> suspended.
>
> Technically everything is working as expected, but I wondered if there was
> any best practice to pass onto users about how to cope with this state?
> Obviously not a direct Slurm question, but wondered if others had
> experience with this and any advice on how best to limit the impact?
>
> Thanks,
> Paul
>
> --
>
> --
> slurm-users mailing list -- slurm-users@lists.schedmd.com
> To unsubscribe send an email to slurm-users-le...@lists.schedmd.com
>

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: memory high water mark reporting

2024-05-16 Thread Emyr James via slurm-users
Hi,

I have a very simple LD_PRELOAD library that can do this. Maybe I should see 
whether I can force slurmstepd to run with that LD_PRELOAD and whether that 
does it.
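
The test I have in mind is roughly this (purely an assumption on my part that a
systemd drop-in on slurmd is an acceptable way to get the preload into the
slurmstepd processes it forks; the library path is made up):

# Sketch only: make slurmd export LD_PRELOAD so its slurmstepd children inherit it.
# Note this also preloads slurmd itself, which may or may not be desirable.
sudo mkdir -p /etc/systemd/system/slurmd.service.d
sudo tee /etc/systemd/system/slurmd.service.d/preload.conf >/dev/null <<'EOF'
[Service]
Environment=LD_PRELOAD=/opt/site/lib/libpeakmem.so
EOF
sudo systemctl daemon-reload
sudo systemctl restart slurmd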

Ultimately I am trying to get all the useful accounting metrics into a ClickHouse 
database. If the LD_PRELOAD on slurmstepd works, then I can extend the C code of 
the preload library to insert the relevant row into the ClickHouse DB.

But still... this seems like a very basic thing to do, and I am very surprised 
that it appears so difficult to achieve with the standard accounting recording 
out of the box.

Emyr James
Head of Scientific IT
CRG - Centre for Genomic Regulation


From: Davide DelVento 
Sent: 17 May 2024 01:02
To: Emyr James 
Cc: slurm-users@lists.schedmd.com 
Subject: Re: [slurm-users] memory high water mark reporting

Not exactly the answer to your question (which I don't know) but if you can get 
to prefix whatever is executed with this 
https://github.com/NCAR/peak_memusage
 (which also uses getrusage) or a variant you will be able to do that.

On Thu, May 16, 2024 at 4:10 PM Emyr James via slurm-users 
<slurm-users@lists.schedmd.com> wrote:
Hi,

We are trying out slurm having been running grid engine for a long while.
In grid engine, the cgroups peak memory and max_rss are generated at the end of 
a job and recorded. It logs the information from the cgroup hierarchy as well 
as doing a getrusage call right at the end on the parent pid of the whole job 
"container" before cleaning up.
With slurm it seems that the only way memory is recorded is by the acct gather 
polling. I am trying to add something in an epilog script to get the 
memory.peak but It looks like the cgroup hierarchy has been destroyed by the 
time the epilog is run.
Where in the code is the cgroup hierarchy cleared up ? Is there no way to add 
something in so that the accounting is updated during the job cleanup process 
so that peak memory usage can be accurately logged ?

I can reduce the polling interval from 30s to 5s but don't know if this causes 
a lot of overhead and in any case this seems to not be a sensible way to get 
values that should just be determined right at the end by an event rather than 
using polling.

Many thanks,

Emyr

--
slurm-users mailing list -- 
slurm-users@lists.schedmd.com
To unsubscribe send an email to 
slurm-users-le...@lists.schedmd.com

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Removing safely a node

2024-05-16 Thread Ratnasamy, Fritz via slurm-users
Hi,

What is the "official" process to remove nodes safely? I have drained the
nodes so that running jobs complete, and put them in the DOWN state once they
are completely drained.
I edited the slurm.conf file to remove the nodes. After some time, I can see
with the sinfo command that the nodes have been removed from the partition.

However, I was told I might need to restart the slurmctld service. Do you
know whether that is necessary? Should I also run scontrol reconfig?
Best,

Fritz Ratnasamy

Data Scientist

Information Technology

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: Removing safely a node

2024-05-16 Thread Ryan Novosielski via slurm-users
If I’m not mistaken, the slurm.conf man page (or one of the others) lists either 
the action needed when changing each option, or has a combined list of which 
changes require what (I can never remember and would have to look it up anyway).

--
#BlackLivesMatter

|| \\UTGERS, |---*O*---
||_// the State  | Ryan Novosielski - novos...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\of NJ  | Office of Advanced Research Computing - MSB A555B, Newark
 `'

On May 16, 2024, at 23:16, Ratnasamy, Fritz via slurm-users 
 wrote:

Hi,

 What is the "official" process to remove nodes safely? I have drained the 
nodes so jobs are completed and put them in down state after they are 
completely drained.
I edited the slurm.conf file to remove the nodes. After some time, I can see 
that the nodes were removed from the partition with the command sinfo

However, I was told I might need to restart the service slurmctld, do you know 
if it is necessary? Should I also run scontrol reconfig?
Best,
Fritz Ratnasamy
Data Scientist
Information Technology


--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: Removing safely a node

2024-05-16 Thread Ole Holm Nielsen via slurm-users

On 5/17/24 05:16, Ratnasamy, Fritz via slurm-users wrote:
  What is the "official" process to remove nodes safely? I have drained 
the nodes so jobs are completed and put them in down state after they are 
completely drained.
I edited the slurm.conf file to remove the nodes. After some time, I can 
see that the nodes were removed from the partition with the command sinfo


However, I was told I might need to restart the service slurmctld, do you 
know if it is necessary? Should I also run scontrol reconfig?


The SchedMD presentations in https://slurm.schedmd.com/publications.html 
describe node add/remove.


I've collected my notes on this in the Wiki page
https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_operations/#add-and-remove-nodes
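
In short, the sequence boils down to something like this (a rough sketch only 
-- check the pages above for the details and caveats that apply to your Slurm 
version; node names are placeholders):

# Drain the nodes and let running jobs finish:
scontrol update NodeName=node[001-002] State=DRAIN Reason="decommission"
# Remove the NodeName/PartitionName entries from slurm.conf and distribute
# the updated file to all nodes, then restart the daemons -- adding or
# removing nodes has traditionally required a restart, not just
# 'scontrol reconfigure':
systemctl restart slurmctld
clush -ba 'systemctl restart slurmd'   # or your preferred parallel shell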

/Ole

--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: Slurm DB upgrade failure behavior

2024-05-16 Thread Ole Holm Nielsen via slurm-users

On 5/16/24 20:27, Yuengling, Philip J. via slurm-users wrote:
I'm writing up some Ansible code to manage Slurm software updates, and I 
haven't found any documentation about slurmdbd behavior if the 
mysql/mariadb database doesn't upgrade successfully.



I would discourage performing the proposed Slurm updates automatically with 
Ansible or any other automation tool!  Unexpected bugs might come to the 
surface during the upgrade!


The MySQL/MariaDB database service itself isn't affected by Slurm updates, 
although the database contents are of course changed :-)


You need to very carefully perform a dry-run slurmdbd upgrade on a test node 
before doing the actual slurmdbd upgrade, and you need to make a backup of 
the database before upgrading!
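
A minimal sketch of such a backup, assuming the default database name 
slurm_acct_db (add credentials as needed):

# Dump the accounting database before upgrading slurmdbd:
mysqldump --single-transaction --databases slurm_acct_db > /root/slurm_acct_db_$(date +%F).sql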


Updates of slurmctld must also be made very carefully with a backup of the 
spool directory (just in case).


The slurmd can in most cases be upgraded with no or only small issues.

My Slurm upgrading notes are in this Wiki page:
https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_installation/#upgrading-slurm

What I do know is that if it is sucessful I can expect to see "Conversion 
done: success!" in the slurmdbd log.  This is good, but minor updates do 
not update the database as far as I know.



If the Slurm database cannot upgrade upon an update, does it always shut 
down with a fatal error?  What other behaviors should I look for if there 
is a failure?


IHTH,
Ole

--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com