[slurm-users] Re: Slurm DB upgrade failure behavior

2024-05-16 Thread Ole Holm Nielsen via slurm-users
On 5/16/24 20:27, Yuengling, Philip J. via slurm-users wrote: I'm writing up some Ansible code to manage Slurm software updates, and I haven't found any documentation about slurmdbd behavior if the mysql/mariadb database doesn't upgrade successfully. I would discourage the proposed Slurm upda

[slurm-users] Re: Removing safely a node

2024-05-16 Thread Ole Holm Nielsen via slurm-users
On 5/17/24 05:16, Ratnasamy, Fritz via slurm-users wrote:  What is the "official" process to remove nodes safely? I have drained the nodes so jobs are completed and put them in down state after they are completely drained. I edited the slurm.conf file to remove the nodes. After some time, I can

[slurm-users] Re: Removing safely a node

2024-05-16 Thread Ryan Novosielski via slurm-users
If I’m not mistaken, the manual for slurm.conf or one of the others lists either what action is needed to change every option, or has a combined list of what requires what (I can never remember and would have to look it up anyway).

[slurm-users] Removing safely a node

2024-05-16 Thread Ratnasamy, Fritz via slurm-users
Hi, What is the "official" process to remove nodes safely? I have drained the nodes so jobs are completed and put them in down state after they are completely drained. I edited the slurm.conf file to remove the nodes. After some time, I can see that the nodes were removed from the partition with

[slurm-users] Re: memory high water mark reporting

2024-05-16 Thread Emyr James via slurm-users
Hi, I have got a very simple LD_PRELOAD that can do this. Maybe I should see if I can force slurmstepd to be run with that LD_PRELOAD and then see if that does it. Ultimately am trying to get all the useful accounting metrics into a clickhouse database. If the LD_PRELOAD on slurmstepd seems to

[slurm-users] Re: Best practice for jobs resuming from suspended state

2024-05-16 Thread Davide DelVento via slurm-users
I don't really have an answer for you, just responding to make your message pop out in the "flood" of other topics we've got since you posted. On our cluster we configure cancelling our jobs because it makes more sense for our situation, so I have no experience with that resume from being suspende

[slurm-users] Re: memory high water mark reporting

2024-05-16 Thread Davide DelVento via slurm-users
Not exactly the answer to your question (which I don't know), but if you can prefix whatever is executed with https://github.com/NCAR/peak_memusage (which also uses getrusage) or a variant, you will be able to do that.

[slurm-users] memory high water mark reporting

2024-05-16 Thread Emyr James via slurm-users
Hi, We are trying out slurm having been running grid engine for a long while. In grid engine, the cgroups peak memory and max_rss are generated at the end of a job and recorded. It logs the information from the cgroup hierarchy as well as doing a getrusage call right at the end on the parent pid
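
For readers unfamiliar with the getrusage approach mentioned in this message, here is a minimal Python sketch of reading a child's peak resident set size after it exits. The helper name `peak_rss_of_children` is hypothetical and not part of Slurm or grid engine; the sketch only illustrates the general technique, not how either scheduler implements it.

```python
import resource
import subprocess
import sys


def peak_rss_of_children(cmd):
    """Run cmd to completion, then return the peak RSS reported for waited-for children.

    Assumption: ru_maxrss is in kilobytes on Linux (it is in bytes on macOS).
    The value is the largest peak among all children this process has waited
    for so far, not only the one started here.
    """
    subprocess.run(cmd, check=True)
    return resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss


if __name__ == "__main__":
    # Illustrative child that allocates roughly 20 MB before exiting.
    child = [sys.executable, "-c", "x = bytearray(20_000_000)"]
    print("peak RSS of children:", peak_rss_of_children(child))
```

Because the call has to happen in the parent after the child has been waited for, a wrapper or preload-style hook is needed to capture it per job step.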

[slurm-users] Slurm DB upgrade failure behavior

2024-05-16 Thread Yuengling, Philip J. via slurm-users
Hi everyone, I'm writing up some Ansible code to manage Slurm software updates, and I haven't found any documentation about slurmdbd behavior if the mysql/mariadb database doesn't upgrade successfully. What I do know is that if it is successful I can expect to see "Conversion done: success!"
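
As a rough illustration of the check described in this message, the following Python sketch looks for the quoted success string in slurmdbd log text. The function names and the log path in the usage example are assumptions made for illustration. The sketch only detects the success message: what slurmdbd writes on a failed conversion is exactly what the message says is undocumented, so a missing marker is not proof of failure.

```python
from pathlib import Path

# The success message quoted in the message above.
SUCCESS_MARKER = "Conversion done: success!"


def conversion_succeeded(log_text: str) -> bool:
    """Return True if the success marker appears anywhere in the log text."""
    return SUCCESS_MARKER in log_text


def check_log_file(path: str) -> bool:
    """Read a log file and report whether the success marker is present.

    Assumption: the caller passes the actual slurmdbd log location, which
    depends on the site's slurmdbd.conf.
    """
    return conversion_succeeded(Path(path).read_text(errors="replace"))


if __name__ == "__main__":
    # Hypothetical path; adjust to the configured log file.
    log_path = "/var/log/slurm/slurmdbd.log"
    if Path(log_path).exists():
        print("success marker found:", check_log_file(log_path))
    else:
        print("log file not found:", log_path)
```

An automated upgrade could run a check like this after starting the new slurmdbd and stop if the marker never appears.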

[slurm-users] Executing srun -n X where X is greater than total CPU in entire cluster

2024-05-16 Thread Dan Healy via slurm-users
Hi there, SLURM community, I swear I've done this before, but now it's failing on a new cluster I'm deploying. We have 6 compute nodes with 64 cpu each (384 CPU total). When I run `srun -n 500 hostname`, the task gets queued since there's not 500 available CPU. Wasn't there an option that allows

[slurm-users] Re: _refresh_assoc_mgr_qos_list: no new list given back keeping cached one

2024-05-16 Thread J D via slurm-users
I figured out that the mailing list may not be appropriate for this message, so I've created a bug report instead: https://bugs.schedmd.com/show_bug.cgi?id=19894