Re: [slurm-users] Slurm queue seems to be completely blocked

2020-05-12 Thread Marcus Wagner
Hi Joakim, one more thing to mention. On 11.05.2020 at 19:23, Joakim Hove wrote:
ubuntu@ip-172-31-80-232:/var/run/slurm-llnl$ scontrol show node
NodeName=ip-172-31-80-232 Arch=x86_64 CoresPerSocket=1
   Reason=Low RealMemory [root@2020-05-11T16:20:02]
The "State=IDLE+DRAIN" looks a bit susp
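If the node itself is healthy and the RealMemory mismatch in slurm.conf has been corrected (an assumption; the preview does not show how the issue was resolved), a drained node is normally returned to service with standard scontrol usage:

  scontrol update NodeName=ip-172-31-80-232 State=RESUME   # clears the DRAIN flag

Until the DRAIN flag is cleared, no new jobs are scheduled onto the node, which on a small cluster makes the queue appear completely blocked.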

Re: [slurm-users] Reset TMPDIR for All Jobs

2020-05-12 Thread Marcus Wagner
Hi Erik, the output of the task prolog is sourced/evaluated (not entirely sure how) in the job environment. Thus you should not export a variable in the task prolog, but rather echo the export, e.g. echo export TMPDIR=/scratch/$SLURM_JOB_ID. The variable will then be set in the job environment. Best, Marcus
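A minimal sketch of the task prolog Marcus describes, assuming TaskProlog=/etc/slurm/task-prolog is configured in slurm.conf and that /scratch exists locally on the compute nodes (both assumptions, not stated in the message):

  #!/bin/bash
  # task-prolog: stdout lines of the form "export NAME=value" are
  # injected into the job's environment by slurmd
  echo "export TMPDIR=/scratch/${SLURM_JOB_ID}"

The directory itself still has to be created before the job runs, e.g. by the root-run Prolog; the task prolog here only manipulates the job's environment.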

Re: [slurm-users] Reset TMPDIR for All Jobs

2020-05-12 Thread Brian Andrus
Maybe too obvious, but have you checked your .bashrc, .bash_profile and such? Brian Andrus
On 5/12/2020 10:27 AM, Ellestad, Erik wrote: Which SLURM prolog specifically? I'm not finding that to work for me in either task-prolog or prolog. SLURM_TMPDIR and TMPDIR are still both set to /tmp wh
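A quick, hedged way to check whether shell start-up files are overriding TMPDIR, assuming a bash login shell (the file list is just the usual suspects, not taken from the thread):

  grep -n TMPDIR ~/.bashrc ~/.bash_profile ~/.profile /etc/profile 2>/dev/null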

Re: [slurm-users] additional jobs killed by scancel.

2020-05-12 Thread Steven Dick
What do you get from sacct -o jobid,elapsed,reason,exit -j 533900,533902
On Tue, May 12, 2020 at 4:12 PM Alastair Neil wrote:
> The log is continuous and has all the messages logged by slurmd on the node
> for all the jobs mentioned, below are the entries from the slurmctld log:
>> [2020-0

Re: [slurm-users] additional jobs killed by scancel.

2020-05-12 Thread Alastair Neil
The log is continuous and has all the messages logged by slurmd on the node for all the jobs mentioned, below are the entries from the slurmctld log:
> [2020-05-10T00:26:03.097] _slurm_rpc_kill_job: REQUEST_KILL_JOB JobId=533898 uid 1224431221
> [2020-05-10T00:26:03.098] email msg to sshr...@maso

Re: [slurm-users] Reset TMPDIR for All Jobs

2020-05-12 Thread Ellestad, Erik
Which SLURM prolog specifically? I'm not finding that to work for me in either task-prolog or prolog. SLURM_TMPDIR and TMPDIR are still both set to /tmp when I run a job. Erik -- Erik Ellestad Wynton Cluster SysAdmin UCSF From: slurm-users On Behalf Of Roger Moye Sent: Tuesday, May 12, 2020

Re: [slurm-users] Reset TMPDIR for All Jobs

2020-05-12 Thread Roger Moye
We had issues getting TMPDIR to work as well. We finally did this in our prolog:
export SLURM_TMPDIR="/tmp/slurm/${SLURM_JOB_ID}"
This works. -Roger
From: slurm-users On Behalf Of Ellestad, Erik Sent: Tuesday, May 12, 2020 10:40 AM To: slurm-users@lists.schedmd.com Subject: [slurm-users] R
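For completeness, a sketch of how the per-job directory itself is often created, assuming the script is configured as Prolog= in slurm.conf and therefore runs as root on the compute node (the base path and chown handling are illustrative, not from Roger's message):

  #!/bin/bash
  # Prolog: runs as root on the node before the job starts
  BASE=/tmp/slurm
  mkdir -p "${BASE}/${SLURM_JOB_ID}"
  chown "${SLURM_JOB_UID}" "${BASE}/${SLURM_JOB_ID}"

Note that a plain export inside the Prolog is generally not propagated into the job's environment; the echo-export mechanism in the task prolog (see Marcus Wagner's reply above) is the documented way to set TMPDIR for the job itself.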

Re: [slurm-users] [External] slurm only looking in "default" partition during scheduling

2020-05-12 Thread Michael Robbert
You have defined both of your partitions with “Default=YES”, but Slurm can have only one default partition. You can see from the * on the compute partition in your sinfo output that Slurm selected that one as the default. When you use srun or sbatch, it will only look at the default partition unless
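A minimal sketch of the two usual fixes, using the partition names visible in this thread (the job script name is a placeholder):

  # submit to the non-default partition explicitly ...
  sbatch -p gpu job.sh
  # ... or let Slurm consider both partitions
  sbatch -p gpu,compute job.sh

Alternatively, keep Default=YES on exactly one PartitionName line in slurm.conf and run scontrol reconfigure.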

[slurm-users] Reset TMPDIR for All Jobs

2020-05-12 Thread Ellestad, Erik
I wanted to set TMPDIR from /tmp to a per-job directory I create in local /scratch/$SLURM_JOB_ID (for example). This bug suggests I should be able to do this in a task-prolog: https://bugs.schedmd.com/show_bug.cgi?id=2664 However, adding the following to task-prolog doesn't seem to affect the
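Two hedged sanity checks that may help here (standard commands; nothing cluster-specific is assumed beyond what the thread mentions): first confirm which prolog scripts slurmd is actually configured to run, then inspect what a job really sees:

  scontrol show config | grep -i prolog
  srun printenv TMPDIR SLURM_JOB_ID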

[slurm-users] slurm only looking in "default" partition during scheduling

2020-05-12 Thread Durai Arasan
Hi, We have a cluster with 2 slave nodes. These are the slurm.conf lines describing nodes and partitions:
NodeName=slurm-gpu-1 NodeAddr=192.168.0.200 Procs=16 Gres=gpu:2 State=UNKNOWN
NodeName=slurm-gpu-2 NodeAddr=192.168.0.124 Procs=1 Gres=gpu:0 State=UNKNOWN
PartitionName=gpu Nodes=slurm-gpu

Re: [slurm-users] additional jobs killed by scancel.

2020-05-12 Thread Steven Dick
I see one job cancelled and two jobs failed. Your slurmd log is incomplete -- it doesn't show the two failed jobs exiting/failing, so the real error is not here. It might also be helpful to look through slurmctld's log starting from when the first job was canceled, looking at any messages mentioni
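A sketch of the kind of search Steven suggests, assuming the controller log lives at /var/log/slurmctld.log (the actual path is whatever SlurmctldLogFile in slurm.conf points to) and using the job IDs mentioned in this thread:

  grep -E '533898|533900|533902' /var/log/slurmctld.log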

[slurm-users] sacct returns nothing after reboot

2020-05-12 Thread Roger Mason
Hello, Yesterday I instituted job accounting via mysql on my (FreeBSD 11.3) test cluster. The cluster consists of a machine running slurmctld+slurmdbd and two running slurmd (slurm version 20.02.1). After experiencing a slurmdbd core dump when using mysql-5.7.30 (reported on this list on May 5) I
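A few hedged first checks for an empty sacct after a reboot (standard commands; which one is relevant depends on what actually broke on this cluster):

  sacctmgr show cluster                               # is slurmdbd reachable and the cluster registered?
  scontrol show config | grep -i accountingstorage    # does slurmctld point at slurmdbd?
  sacct -a -S 2020-05-11                              # explicit start time; by default sacct only shows jobs since 00:00 today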

[slurm-users] Yet another issue with AssocGrpMemLimit

2020-05-12 Thread Mahmood Naderan
Hi, With the following memory stats on two nodes:
[root@hpc slurm]# scontrol show node compute-0-0 | grep Memory
  RealMemory=64259 AllocMem=0 FreeMem=63429 Sockets=32 Boards=1
[root@hpc slurm]# scontrol show node compute-0-1 | grep Memory
  RealMemory=120705 AllocMem=1024 FreeMem=103051 Sockets=3
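Since the AssocGrpMemLimit reason comes from the association's memory limit rather than from free memory on the nodes, a hedged way to compare the limit with what the job actually requests (the format fields are standard sacctmgr/scontrol options; <jobid> is a placeholder):

  sacctmgr show assoc format=cluster,account,user,grptres%30
  scontrol show job <jobid> | grep -i tres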