As Michael had suggested earlier, debugflags=gpu will give you detailed output of the gres being reported by the nodes.  This would be in the slurmctld log.

Or show us the output of 'scontrol show node=tiger[01-02]' and 'scontrol show partition=tiger_1'. From your previous message, that should cover a node with a 1080gtx, a node with a k20, and the partition you are submitting to.

-b

On 12/04/2018 09:06 AM, Michael Di Domenico wrote:
Unfortunately, someone smarter than me will have to help further.  I'm
not sure I see anything specifically wrong.  The one thing I might try
is backing the software down to a 17.x release series.  I recently
tried 18.x and had some issues.  I can't say whether it'll be any
different, but you might be exposing an undiagnosed bug in the 18.x
branch.
On Mon, Dec 3, 2018 at 4:17 PM Lou Nicotra <lnico...@interactions.com> wrote:
Made the change in the gres.conf file on the local server and restarted slurmd and 
slurmctld on the master. Unfortunately, same error.

Distributed the corrected gres.conf to all k20 servers and restarted slurmd and 
slurmctld. Still the same error.

On Mon, Dec 3, 2018 at 4:04 PM Brian W. Johanson <bjoha...@psc.edu> wrote:
Is that a lowercase k in k20 specified in the batch script and nodename and an 
uppercase K specified in gres.conf?
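
A minimal sketch of how that comparison could be automated, using the lines quoted further down in this thread as sample input. The helper names gres_types and case_mismatches are hypothetical, and the parsing only covers the simple "Gres=gpu:TYPE:COUNT" and "Name=gpu Type=TYPE" forms seen here. Whether Slurm treats the two spellings as different types may depend on the release, so this only reports the difference; it does not prove it is the cause.

```python
import re

def gres_types(config_text):
    """Collect GPU type strings from slurm.conf or gres.conf text (hypothetical helper)."""
    types = set()
    for line in config_text.splitlines():
        line = line.split("#", 1)[0]  # ignore comments
        # slurm.conf style: Gres=gpu:k20:2
        for match in re.finditer(r"Gres=gpu:([^:\s,]+):\d+", line):
            types.add(match.group(1))
        # gres.conf style: Name=gpu Type=K20 ...
        if re.search(r"\bName=gpu\b", line):
            match = re.search(r"\bType=(\S+)", line)
            if match:
                types.add(match.group(1))
    return types

def case_mismatches(slurm_conf, gres_conf):
    """Return (slurm.conf type, gres.conf type) pairs that differ only by letter case."""
    left = gres_types(slurm_conf)
    right = gres_types(gres_conf)
    return sorted(
        (a, b) for a in left for b in right
        if a != b and a.lower() == b.lower()
    )

# Sample input copied from the original message in this thread.
SLURM_CONF = """
GresTypes=gpu
NodeName=tiger[01,05,10,15,20] Gres=gpu:1080gtx:2
NodeName=tiger[02-04,06-09,11-14,16-19,21-22] Gres=gpu:k20:2
"""
GRES_CONF_K20 = "Name=gpu Type=K20 File=/dev/nvidia[0-1] Cores=0,1"

if __name__ == "__main__":
    print(case_mismatches(SLURM_CONF, GRES_CONF_K20))
```

An empty list would mean the type names agree exactly between the two files, in which case the cause has to be looked for elsewhere.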

On 12/03/2018 09:13 AM, Lou Nicotra wrote:

Hi All, I have recently set up a slurm cluster with my servers and I'm running 
into an issue while submitting GPU jobs. It has something to do with gres 
configurations, but I just can't seem to figure out what is wrong. Non-GPU jobs 
run fine.

After submitting a batch job, the error is as follows:
sbatch: error: Batch job submission failed: Invalid Trackable RESource (TRES) specification

My batch job is as follows:
#!/bin/bash
#SBATCH --partition=tiger_1   # partition name
#SBATCH --gres=gpu:k20:1
#SBATCH --gres-flags=enforce-binding
#SBATCH --time=0:20:00  # wall clock limit
#SBATCH --output=gpu-%J.txt
#SBATCH --account=lnicotra
module load cuda
python gpu1

Where gpu1 is a GPU test script that runs correctly when invoked via python. 
The tiger_1 partition has servers with GPUs, a mix of 1080GTX and K20, as 
specified in slurm.conf.

I have defined GRES resources in the slurm.conf file:
# GPU GRES
GresTypes=gpu
NodeName=tiger[01,05,10,15,20] Gres=gpu:1080gtx:2
NodeName=tiger[02-04,06-09,11-14,16-19,21-22] Gres=gpu:k20:2

And have a local gres.conf on the servers containing GPUs...
lnicotra@tiger11 ~# cat /etc/slurm/gres.conf
# GPU Definitions
# NodeName=tiger[02-04,06-09,11-14,16-19,21-22] Name=gpu Type=K20 File=/dev/nvidia[0-1]
Name=gpu Type=K20 File=/dev/nvidia[0-1] Cores=0,1

and a similar one for the 1080GTX
# GPU Definitions
# NodeName=tiger[01,05,10,15,20] Name=gpu Type=1080GTX File=/dev/nvidia[0-1]
Name=gpu Type=1080GTX File=/dev/nvidia[0-1] Cores=0,1

The account manager seems to know about the GPUs...
lnicotra@tiger11 ~# sacctmgr show tres
     Type            Name     ID
-------- --------------- ------
      cpu                      1
      mem                      2
   energy                      3
     node                      4
  billing                      5
       fs            disk      6
     vmem                      7
    pages                      8
     gres             gpu   1001
     gres         gpu:k20   1002
     gres     gpu:1080gtx   1003
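
A quick way to cross-check a requested GRES against that listing, sketched under the assumption that the table text has been captured into a string. The helper name tres_names is hypothetical. It uses the "type/name" spelling that Slurm uses for TRES (for example gres/gpu:k20), and the lookup is deliberately case-sensitive.

```python
def tres_names(sacctmgr_output):
    """Parse `sacctmgr show tres` table text into 'type' or 'type/name' strings."""
    entries = set()
    for line in sacctmgr_output.splitlines():
        fields = line.split()
        # Data rows end in a numeric ID; header and separator rows do not.
        if len(fields) < 2 or not fields[-1].isdigit():
            continue
        if len(fields) == 2:
            entries.add(fields[0])                    # e.g. "cpu 1"
        else:
            entries.add(f"{fields[0]}/{fields[1]}")   # e.g. "gres gpu:k20 1002"
    return entries

# Sample text copied from the listing above.
SACCTMGR_OUTPUT = """
     Type            Name     ID
-------- --------------- ------
      cpu                      1
      mem                      2
   energy                      3
     node                      4
  billing                      5
       fs            disk      6
     vmem                      7
    pages                      8
     gres             gpu   1001
     gres         gpu:k20   1002
     gres     gpu:1080gtx   1003
"""

if __name__ == "__main__":
    known = tres_names(SACCTMGR_OUTPUT)
    # The batch script requests --gres=gpu:k20:1, i.e. TRES gres/gpu:k20.
    print("gres/gpu:k20" in known)
```

This only confirms that the accounting database has the lowercase name registered; it says nothing about what the nodes themselves report to slurmctld.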

Can anyone point out what I am missing?

Thanks!
Lou


--

Lou Nicotra

IT Systems Engineer - SLT

Interactions LLC

o:  908-673-1833

m: 908-451-6983

lnico...@interactions.com

www.interactions.com

*******************************************************************************

This e-mail and any of its attachments may contain Interactions LLC proprietary 
information, which is privileged, confidential, or subject to copyright 
belonging to the Interactions LLC. This e-mail is intended solely for the use 
of the individual or entity to which it is addressed. If you are not the 
intended recipient of this e-mail, you are hereby notified that any 
dissemination, distribution, copying, or action taken in relation to the 
contents of and attachments to this e-mail is strictly prohibited and may be 
unlawful. If you have received this e-mail in error, please notify the sender 
immediately and permanently delete the original and any copy of this e-mail and 
any printout. Thank You.

*******************************************************************************


