You’re on the right track with the DRAIN state. The more specific answer is in 
the “Reason=” description on the last line. 

It looks like your node has less memory than what you've defined for it in slurm.conf.
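
If that is what happened, a rough sketch of one way to confirm and clear it
(the node name is taken from the scontrol output below; the right RealMemory
value has to come from what slurmd reports on your machine):

# what slurmd actually detects on the node (RealMemory is in MB):
slurmd -C

# if slurm.conf claims more RealMemory than that, lower the RealMemory= value
# on the NodeName= line (or remove it), then push the change and un-drain:
sudo scontrol reconfigure
sudo scontrol update NodeName=ip-172-31-80-232 State=RESUME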

 

Mike

 

From: slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of Joakim 
Hove <joakim.h...@gmail.com>
Reply-To: Slurm User Community List <slurm-users@lists.schedmd.com>
Date: Monday, May 11, 2020 at 11:25
To: Slurm User Community List <slurm-users@lists.schedmd.com>
Subject: [External] Re: [slurm-users] Slurm queue seems to be completely blocked

 

ubuntu@ip-172-31-80-232:/var/run/slurm-llnl$ scontrol show node
NodeName=ip-172-31-80-232 Arch=x86_64 CoresPerSocket=1
   CPUAlloc=0 CPUErr=0 CPUTot=1 CPULoad=0.00
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=(null)
   NodeAddr=ip-172-31-80-232 NodeHostName=ip-172-31-80-232 Version=17.11
   OS=Linux 5.3.0-1017-aws #18~18.04.1-Ubuntu SMP Wed Apr 8 15:12:16 UTC 2020 
   RealMemory=983 AllocMem=0 FreeMem=355 Sockets=1 Boards=1
   State=IDLE+DRAIN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=debug 
   BootTime=2020-05-11T17:02:15 SlurmdStartTime=2020-05-11T17:02:27
   CfgTRES=cpu=1,mem=983M,billing=1
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Reason=Low RealMemory [root@2020-05-11T16:20:02]

 

The "State=IDLE+DRAIN" looks a bit suspicious?

 

 

 

 

On Mon, May 11, 2020 at 7:16 PM Alex Chekholko <a...@calicolabs.com> wrote:

You will want to look at the output of 'sinfo' and 'scontrol show node' to see 
what slurmctld thinks about your compute nodes; then on the compute nodes you 
will want to check the status of the slurmd service ('systemctl status -l 
slurmd') and possibly read through the slurmd logs as well.
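
Spelled out, that checklist looks roughly like this (the slurmd log location
varies by install; scontrol show config will tell you where it actually is):

sinfo                                          # node/partition state as slurmctld sees it
scontrol show node                             # per-node detail, including the drain Reason=
systemctl status -l slurmd                     # on the compute node: is slurmd running?
scontrol show config | grep -i SlurmdLogFile   # where the slurmd log lives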

 

On Mon, May 11, 2020 at 10:11 AM Joakim Hove <joakim.h...@gmail.com> wrote:

Hello;

 

I am in the process of familiarizing myself with slurm - I will write a piece 
of software which will submit jobs to a slurm cluster. Right now I have just 
made my own "cluster" consisting of one Amazon AWS node and use that to 
familiarize myself with the sxxx commands - that has worked nicely.

 

Now I just brought this AWS node completely to its knees (not slurm related) 
and had to stop and start the node from the AWS console - during that process a 
job managed by slurm was killed hard. Now that the node is back up again, slurm 
refuses to start jobs - the queue looks like this:

 

ubuntu@ip-172-31-80-232:~$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
               186     debug tmp-file www-data PD       0:00      1 (Resources)
               187     debug tmp-file www-data PD       0:00      1 (Resources)
               188     debug tmp-file www-data PD       0:00      1 (Resources)
               189     debug tmp-file www-data PD       0:00      1 (Resources)

 

I.e. the jobs are pending due to Resource reasons, but no jobs are running? I 
have tried to scancel all jobs, but when I add new jobs they again just stay 
pending. It should be said that when the node/slurm came back up again, the 
offending job which initially created the havoc was still in "Running" state, 
but the filesystem of that job had been completely wiped, so it was not in a 
sane state. scancel of this job worked fine - but no new jobs will start. It 
seems like there is a "ghost job" blocking the other jobs from starting? I even 
tried to reinstall slurm using the package manager, but the new slurm 
installation would still not start jobs. Any tips on how I can proceed to debug this?

 

Regards

 

Joakim
