Hi Joakim,

one more thing to mention:
On 11.05.2020 at 19:23, Joakim Hove wrote:
ubuntu@ip-172-31-80-232:/var/run/slurm-llnl$ scontrol show node
NodeName=ip-172-31-80-232 Arch=x86_64 CoresPerSocket=1
Reason=Low RealMemory [root@2020-05-11T16:20:02]

The "State=IDLE+DRAIN" looks a bit suspicious?
I assume you think it is suspicious that a node has the states IDLE and DRAIN at the same time, right?
But that is fully OK and fairly easy to explain in this case. There are two different sets of flags, and here you can see one state from each set.
IDLE could also be ALLOCATED or MIXED; DRAIN could also be e.g. DOWN or FAIL... It gets clearer if you look at the sinfo output: a node with ALLOCATED or MIXED together with DRAIN will be shown as DRAINING, while a node with IDLE (no running job, all cores free) together with DRAIN will be shown as DRAINED.
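To make that concrete, something along these lines should show both pieces on your node (the node name is just taken from your scontrol output above; only the exact output layout may differ between Slurm versions):

  sinfo -N -l                                                 # node should show up as "drained" (idle + drain flag)
  sinfo -R                                                    # lists the drain/down reason, here "Low RealMemory"
  scontrol show node ip-172-31-80-232 | grep -E 'State|Reason'

Once the reason behind the drain is sorted out (a "Low RealMemory" reason usually means the RealMemory configured for the node in slurm.conf is larger than what the node actually reports), the DRAIN flag can be cleared with

  scontrol update NodeName=ip-172-31-80-232 State=RESUME

after which the node should go back to plain IDLE and pending jobs can be scheduled again.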
Best,
Marcus
On Mon, May 11, 2020 at 7:16 PM Alex Chekholko <a...@calicolabs.com> wrote:

You will want to look at the output of 'sinfo' and 'scontrol show node' to see what slurmctld thinks about your compute nodes; then on the compute nodes you will want to check the status of the slurmd service ('systemctl status -l slurmd') and possibly read through the slurmd logs as well.

On Mon, May 11, 2020 at 10:11 AM Joakim Hove <joakim.h...@gmail.com> wrote:

Hello;

I am in the process of familiarizing myself with slurm - I will write a piece of software which will submit jobs to a slurm cluster. Right now I have just made my own "cluster" consisting of one Amazon AWS node and use that to familiarize myself with the sxxx commands - that has worked nicely.

Now I just brought this AWS node completely to its knees (not slurm related) and had to stop and start the node from the AWS console - during that process a job managed by slurm was killed hard. Now that the node is back up again, slurm refuses to start up jobs - the queue looks like this:

ubuntu@ip-172-31-80-232:~$ squeue
  JOBID PARTITION     NAME     USER ST  TIME NODES NODELIST(REASON)
    186     debug tmp-file www-data PD  0:00     1 (Resources)
    187     debug tmp-file www-data PD  0:00     1 (Resources)
    188     debug tmp-file www-data PD  0:00     1 (Resources)
    189     debug tmp-file www-data PD  0:00     1 (Resources)

I.e. the jobs are pending due to Resources reasons, but no jobs are running? I have tried to scancel all jobs, but when I add new jobs they again just stay pending. It should be said that when the node/slurm came back up again, the offending job which initially created the havoc was still in "Running" state, but the filesystem of that job had been completely wiped, so it was not in a sane state. scancel of this job worked fine - but no new jobs will start. It seems like there is a "ghost job" blocking the other jobs from starting? I even tried to reinstall slurm using the package manager, but the new slurm installation would still not start jobs.

Any tips on how I can proceed to debug this?

Regards
Joakim
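PS: on the compute-node side, the slurmd checks Alex mentions would be roughly the following (the log path is only the usual slurm-llnl default on Ubuntu; check SlurmdLogFile in slurm.conf if it is elsewhere):

  systemctl status -l slurmd            # is the node daemon running at all?
  journalctl -u slurmd --since today    # recent slurmd messages via systemd
  less /var/log/slurm-llnl/slurmd.log   # or wherever SlurmdLogFile points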