Hi Joakim,

One more thing to mention:

On 11.05.2020 at 19:23, Joakim Hove wrote:

ubuntu@ip-172-31-80-232:/var/run/slurm-llnl$ scontrol show node
NodeName=ip-172-31-80-232 Arch=x86_64 CoresPerSocket=1
    Reason=Low RealMemory [root@2020-05-11T16:20:02]

The "State=IDLE+DRAIN" looks a bit suspicious?



I assume you find it suspicious that a node has the states IDLE and DRAIN at the same time, right?
That is actually fine and fairly easy to explain in this case.
There are two different sets of flags, and what you see here is one state from each set.

IDLE could also be ALLOCATED or MIXED
DRAIN could also be e.g. DOWN or FAIL...

It gets clearer if you look at the sinfo output.

A node that is ALLOCATED or MIXED and also has DRAIN set will be shown as DRAINING.
A node that is IDLE (no running job, all cores free) and also has DRAIN set will be shown as DRAINED, as in the sketch below.
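
In your output the node was drained with "Reason=Low RealMemory", which usually means slurmd detected less memory than the RealMemory value configured for that node in slurm.conf. A minimal sketch of how to check and then clear the DRAIN flag, assuming the node name ip-172-31-80-232 from your scontrol output and that the underlying cause has been fixed:

# list drained/down nodes together with the reason
sinfo -R

# show the full state of the node
scontrol show node ip-172-31-80-232

# clear the DRAIN flag once the cause is fixed
scontrol update NodeName=ip-172-31-80-232 State=RESUME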


Best
Marcus



On Mon, May 11, 2020 at 7:16 PM Alex Chekholko <a...@calicolabs.com> wrote:

    You will want to look at the output of 'sinfo' and 'scontrol show
    node' to see what slurmctld thinks about your compute nodes; then on
    the compute nodes you will want to check the status of the slurmd
    service ('systemctl status -l slurmd') and possibly read through the
    slurmd logs as well.
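
    For example, roughly (a sketch: the first two commands can be run
    anywhere the Slurm client tools are installed, the rest on the
    compute node; the slurmd log location depends on the SlurmdLogFile
    setting in slurm.conf):

    sinfo
    scontrol show node
    systemctl status -l slurmd
    journalctl -u slurmd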

    On Mon, May 11, 2020 at 10:11 AM Joakim Hove <joakim.h...@gmail.com> wrote:

        Hello;

        I am in the process of familiarizing myself with slurm - I will
        write a piece of software which will submit jobs to a slurm
        cluster. Right now I have just made my own "cluster" consisting
        of one Amazon AWS node and use that to familiarize myself with
        the sxxx commands - has worked nicely.

        Now I just brought this AWS node completely to its knees (not
        slurm related) and had to stop and start the node from the AWS
        console - during that process a job managed by slurm was killed
        hard. Now that the node is back up again slurm refuses to start
        any jobs - the queue looks like this:

        ubuntu@ip-172-31-80-232:~$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
               186     debug tmp-file www-data PD       0:00      1 (Resources)
               187     debug tmp-file www-data PD       0:00      1 (Resources)
               188     debug tmp-file www-data PD       0:00      1 (Resources)
               189     debug tmp-file www-data PD       0:00      1 (Resources)

        I.e. the jobs are pending for Resources reasons, but no jobs
        are running? I have tried to scancel all jobs, but when I add
        new jobs they again just stay pending. It should be said that
        when the node/slurm came back up again the offending job which
        initially created the havoc was still in "Running" state, but
        the filesystem of that job had been completely wiped, so it was
        not in a sane state. scancel of this job worked fine - but no
        new jobs will start. It seems like there is a "ghost job"
        blocking the other jobs from starting? I even tried to
        reinstall slurm using the package manager, but the new slurm
        installation would still not start jobs. Any tips on how I can
        proceed to debug this?

        Regards

        Joakim

