Re: [slurm-users] [External] Re: Slurm queue seems to be completely blocked

2020-05-11 Thread Joakim Hove
And we have ignition - thank you very much! :-)

On Mon, May 11, 2020 at 8:44 PM Alex Chekholko wrote:
> Any time a node goes into DRAIN state you need to manually intervene and
> put it back into service.
> scontrol update nodename=ip-172-31-80-232 state=resume
>
> On Mon, May 11, 2020 at 11:40

Re: [slurm-users] [External] Re: Slurm queue seems to be completely blocked

2020-05-11 Thread Alex Chekholko
Any time a node goes into DRAIN state you need to manually intervene and put it back into service.

scontrol update nodename=ip-172-31-80-232 state=resume

On Mon, May 11, 2020 at 11:40 AM Joakim Hove wrote:
> > You’re on the right track with the DRAIN state. The more specific answer
> > is in the
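
A sketch of the check-then-resume sequence implied here, using the node name from this thread (the grep is just a convenience for pulling out the Reason= field):

    # Inspect why the node was drained before clearing it.
    scontrol show node ip-172-31-80-232 | grep -i reason
    # Return the node to service once the underlying cause has been fixed.
    scontrol update nodename=ip-172-31-80-232 state=resume
    # The node should now report "idle" rather than "drain".
    sinfo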

Re: [slurm-users] [External] Re: Slurm queue seems to be completely blocked

2020-05-11 Thread Joakim Hove
> You’re on the right track with the DRAIN state. The more specific answer
> is in the “Reason=” description on the last line.
>
> It looks like your node has less memory than what you’ve defined for the
> node in slurm.conf
>
Thank you; that sounded meaningful to me. My slurm.conf file had RealMe
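
For context, a minimal sketch of the kind of node definition involved; the numbers are placeholders, not the poster's actual values. On the node itself, "slurmd -C" prints the hardware it detects, and RealMemory in slurm.conf must not exceed that figure, otherwise slurmctld typically drains the node with a "Low RealMemory" reason:

    # Placeholder values for a one-CPU node; compare against "slurmd -C" output.
    NodeName=ip-172-31-80-232 CPUs=1 RealMemory=900 State=UNKNOWN
    PartitionName=debug Nodes=ip-172-31-80-232 Default=YES MaxTime=INFINITE State=UP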

Re: [slurm-users] [External] Re: Slurm queue seems to be completely blocked

2020-05-11 Thread Michael Robbert
You’re on the right track with the DRAIN state. The more specific answer is in the “Reason=” description on the last line.

It looks like your node has less memory than what you’ve defined for the node in slurm.conf

Mike

From: slurm-users on behalf of Joakim Hove
Reply-To: Slurm User

Re: [slurm-users] Slurm queue seems to be completely blocked

2020-05-11 Thread Joakim Hove
ubuntu@ip-172-31-80-232:/var/run/slurm-llnl$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE  NODELIST
debug*    up     infinite   1      drain  ip-172-31-80-232

● slurmd.service - Slurm node daemon
   Loaded: loaded (/lib/systemd/system/slurmd.service; enabled; vendor preset: enabled)
   Active: ac

Re: [slurm-users] Slurm queue seems to be completely blocked

2020-05-11 Thread Joakim Hove
ubuntu@ip-172-31-80-232:/var/run/slurm-llnl$ scontrol show node
NodeName=ip-172-31-80-232 Arch=x86_64 CoresPerSocket=1
   CPUAlloc=0 CPUErr=0 CPUTot=1 CPULoad=0.00
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=(null)
   NodeAddr=ip-172-31-80-232 NodeHostName=ip-172-31-80-232 Version=

Re: [slurm-users] Slurm queue seems to be completely blocked

2020-05-11 Thread Alex Chekholko
You will want to look at the output of 'sinfo' and 'scontrol show node' to see what slurmctld thinks about your compute nodes; then on the compute nodes you will want to check the status of the slurmd service ('systemctl status -l slurmd') and possibly read through the slurmd logs as well. On Mon,
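
The checklist above, spelled out as commands; the node name comes from this thread and the log path is the Debian/Ubuntu packaging default, so both may differ elsewhere:

    sinfo                                  # the controller's view of partitions and nodes
    scontrol show node ip-172-31-80-232    # per-node detail, including State and Reason=
    systemctl status -l slurmd             # run on the compute node itself
    less /var/log/slurm-llnl/slurmd.log    # slurmd log location for the Debian/Ubuntu packages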

[slurm-users] Slurm queue seems to be completely blocked

2020-05-11 Thread Joakim Hove
Hello; I am in the process of familiarizing myself with slurm - I will write a piece of software which will submit jobs to a slurm cluster. Right now I have just made my own "cluster" consisting of one Amazon AWS node and use that to familiarize myself with the sxxx commands - has worked nicely.

Re: [slurm-users] additional jobs killed by scancel.

2020-05-11 Thread Nathan Harper
Overzealous node cleanup epilog script?

> On 11 May 2020, at 17:56, Alastair Neil wrote:
>
> Hi there,
>
> We are using slurm 18.08 and had a weird occurrence over the weekend. A user
> canceled one of his jobs using scancel, and two additional jobs of the user
> running on the same nod
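
For illustration of that hypothesis only (not the poster's actual epilog), a minimal sketch of the kind of guard a cleanup epilog can use so it leaves a user alone while they still have other jobs on the node; SLURM_JOB_USER and SLURM_JOB_ID are assumed to be provided by slurmd in the epilog environment:

    #!/bin/bash
    # Count the user's other RUNNING jobs on this node, excluding the one
    # that is finishing. An unconditional "pkill -u" here is the kind of
    # overzealous cleanup suspected above.
    others=$(squeue -h -t RUNNING -w "$(hostname -s)" -u "$SLURM_JOB_USER" -o "%A" \
             | grep -vc "^${SLURM_JOB_ID}$")
    if [ "$others" -eq 0 ]; then
        pkill -u "$SLURM_JOB_USER"
    fi
    # A non-zero exit from an epilog drains the node, so exit cleanly.
    exit 0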

[slurm-users] additional jobs killed by scancel.

2020-05-11 Thread Alastair Neil
Hi there,

We are using slurm 18.08 and had a weird occurrence over the weekend. A user canceled one of his jobs using scancel, and two additional jobs of the user running on the same node were killed concurrently. The jobs had no dependency, but they were all allocated 1 gpu. I am curious to kno

Re: [slurm-users] slurmdbd crashes with segmentation fault following DBD_GET_ASSOCS

2020-05-11 Thread Steven Dick
Previous versions of mysql are supposed to have nasty security issues. I'm not sure why I had mysql instead of mariadb anyway.

On Mon, May 11, 2020 at 9:29 AM Relu Patrascu wrote:
>
> We've experienced the same problem on several versions of slurmdbd
> (18, 19) so we downgraded mysql and put a ho

Re: [slurm-users] Slurm on Debian Stretch

2020-05-11 Thread Steffen Grunewald
Hi Martijn,

I'm sorry that it took me several weeks to get back to this issue - never fix anything that isn't broken (... too much), and I've been busy with user accounting all over the place...

On Thu, 2020-03-05 at 12:03:38 +0100, Martijn Kruiten wrote:
> Hi Steffen,
>
> We are using Slurm on

Re: [slurm-users] slurmdbd crashes with segmentation fault following DBD_GET_ASSOCS

2020-05-11 Thread Relu Patrascu
We've experienced the same problem on several versions of slurmdbd (18, 19) so we downgraded mysql and put a hold on the package.

Hey Dustin, funny we meet here :)

Relu

On Tue, May 5, 2020 at 3:43 PM Dustin Lang wrote:
>
> I tried upgrading Slurm to 18.08.9 and I am still getting this Segmentati
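
The "hold" mentioned above, sketched for a Debian/Ubuntu-style install; the exact package name depends on whether MySQL or MariaDB is installed:

    # Pin the currently installed server package so routine upgrades do not
    # pull in the version that triggers the slurmdbd crash.
    sudo apt-mark hold mysql-server      # or mariadb-server, as appropriate
    apt-mark showhold                    # verify the hold is in place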