I'm running slurmd version 18.08.0... It seems that the system recognizes the GPUs after a slurmd restart. I turned debug up to 5, restarted, and then submitted a job. Nothing gets logged to the log file on the local server...

[2018-12-03T11:55:18.442] Slurmd shutdown completing
[2018-12-03T11:55:18.484] debug: Log file re-opened
[2018-12-03T11:55:18.485] debug: CPUs:48 Boards:1 Sockets:2 CoresPerSocket:12 ThreadsPerCore:2
[2018-12-03T11:55:18.485] Message aggregation disabled
[2018-12-03T11:55:18.486] debug: CPUs:48 Boards:1 Sockets:2 CoresPerSocket:12 ThreadsPerCore:2
[2018-12-03T11:55:18.486] debug: init: Gres GPU plugin loaded
[2018-12-03T11:55:18.486] Gres Name=gpu Type=K20 Count=2
[2018-12-03T11:55:18.487] gpu device number 0(/dev/nvidia0):c 195:0 rwm
[2018-12-03T11:55:18.487] gpu device number 1(/dev/nvidia1):c 195:1 rwm
[2018-12-03T11:55:18.487] topology NONE plugin loaded
[2018-12-03T11:55:18.487] route default plugin loaded
[2018-12-03T11:55:18.530] debug: Resource spec: No specialized cores configured by default on this node
[2018-12-03T11:55:18.530] debug: Resource spec: Reserved system memory limit not configured for this node
[2018-12-03T11:55:18.530] debug: task NONE plugin loaded
[2018-12-03T11:55:18.530] debug: Munge authentication plugin loaded
[2018-12-03T11:55:18.530] debug: spank: opening plugin stack /etc/slurm/plugstack.conf
[2018-12-03T11:55:18.530] Munge cryptographic signature plugin loaded
[2018-12-03T11:55:18.532] slurmd version 18.08.0 started
[2018-12-03T11:55:18.532] debug: Job accounting gather LINUX plugin loaded
[2018-12-03T11:55:18.532] debug: job_container none plugin loaded
[2018-12-03T11:55:18.532] debug: switch NONE plugin loaded
[2018-12-03T11:55:18.532] slurmd started on Mon, 03 Dec 2018 11:55:18 -0500
[2018-12-03T11:55:18.533] CPUs=48 Boards=1 Sockets=2 Cores=12 Threads=2 Memory=386757 TmpDisk=4758 Uptime=21165906 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
[2018-12-03T11:55:18.533] debug: AcctGatherEnergy NONE plugin loaded
[2018-12-03T11:55:18.533] debug: AcctGatherProfile NONE plugin loaded
[2018-12-03T11:55:18.533] debug: AcctGatherInterconnect NONE plugin loaded
[2018-12-03T11:55:18.533] debug: AcctGatherFilesystem NONE plugin loaded
root@tiger11 slurm#
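Since the slurmd log above shows the gres plugin loading cleanly, gres-specific debug output may help narrow things down further. A minimal sketch, assuming a stock slurm.conf (`Gres` is a standard Slurm DebugFlags value; the exact placement of these lines in your config is an assumption):

```shell
# In slurm.conf on the affected node(s), then restart slurmd:
#   SlurmdDebug=debug5
#   DebugFlags=Gres

# Or toggle the flag at runtime on the controller, without a restart:
scontrol setdebugflags +gres
```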
So, I turned debug up to 5 in slurmctld on the master server, and after I submitted my job, it shows...

[2018-12-03T12:02:10.355] _job_create: account 'lnicotra' has no association for user 1498 using default account 'slt'
[2018-12-03T12:02:10.356] _slurm_rpc_submit_batch_job: Invalid Trackable RESource (TRES) specification

We use LDAP for authentication and my UID is 1498, but I created a user in Slurm using my login name. The default account for all users is "slt". Is this the cause of my problems?

root@panther02 slurm# getent passwd lnicotra
lnicotra:*:1498:1152:Lou Nicotra:/home/lnicotra:/bin/bash

If so, how is this resolved, given that we use multiple servers and there are no local accounts on them?

Thanks!
Lou

On Mon, Dec 3, 2018 at 11:36 AM Michael Di Domenico <mdidomeni...@gmail.com> wrote:
> do you get anything additional in the slurm logs? have you tried
> adding gres to the debugflags? what version of slurm are you running?
> On Mon, Dec 3, 2018 at 9:18 AM Lou Nicotra <lnico...@interactions.com> wrote:
> >
> > Hi All, I have recently set up a Slurm cluster on my servers and I'm
> > running into an issue while submitting GPU jobs. It has something to do
> > with the gres configuration, but I just can't seem to figure out what is
> > wrong. Non-GPU jobs run fine.
> >
> > The error after submitting a batch job is:
> > sbatch: error: Batch job submission failed: Invalid Trackable RESource (TRES) specification
> >
> > My batch job is as follows:
> > #!/bin/bash
> > #SBATCH --partition=tiger_1          # partition name
> > #SBATCH --gres=gpu:k20:1
> > #SBATCH --gres-flags=enforce-binding
> > #SBATCH --time=0:20:00               # wall clock limit
> > #SBATCH --output=gpu-%J.txt
> > #SBATCH --account=lnicotra
> > module load cuda
> > python gpu1
> >
> > Where gpu1 is a GPU test script that runs correctly when invoked via python.
> > The tiger_1 partition has servers with GPUs, a mix of 1080GTX and
> > K20, as specified in slurm.conf.
> >
> > I have defined GRES resources in the slurm.conf file:
> > # GPU GRES
> > GresTypes=gpu
> > NodeName=tiger[01,05,10,15,20] Gres=gpu:1080gtx:2
> > NodeName=tiger[02-04,06-09,11-14,16-19,21-22] Gres=gpu:k20:2
> >
> > And I have a local gres.conf on the servers containing GPUs...
> > lnicotra@tiger11 ~# cat /etc/slurm/gres.conf
> > # GPU Definitions
> > # NodeName=tiger[02-04,06-09,11-14,16-19,21-22] Name=gpu Type=K20 File=/dev/nvidia[0-1]
> > Name=gpu Type=K20 File=/dev/nvidia[0-1] Cores=0,1
> >
> > and a similar one for the 1080GTX:
> > # GPU Definitions
> > # NodeName=tiger[01,05,10,15,20] Name=gpu Type=1080GTX File=/dev/nvidia[0-1]
> > Name=gpu Type=1080GTX File=/dev/nvidia[0-1] Cores=0,1
> >
> > The account manager seems to know about the GPUs...
> > lnicotra@tiger11 ~# sacctmgr show tres
> >     Type            Name     ID
> > -------- --------------- ------
> >      cpu                      1
> >      mem                      2
> >   energy                      3
> >     node                      4
> >  billing                      5
> >       fs            disk      6
> >     vmem                      7
> >    pages                      8
> >     gres             gpu   1001
> >     gres         gpu:k20   1002
> >     gres     gpu:1080gtx   1003
> >
> > Can anyone point out what I am missing?
> >
> > Thanks!
> > Lou
> >
> > --
> > Lou Nicotra
> > IT Systems Engineer - SLT
> > Interactions LLC
> > o: 908-673-1833
> > m: 908-451-6983
> > lnico...@interactions.com
> > www.interactions.com
> >
> > *******************************************************************************
> > This e-mail and any of its attachments may contain Interactions LLC
> > proprietary information, which is privileged, confidential, or subject to
> > copyright belonging to Interactions LLC. This e-mail is intended solely
> > for the use of the individual or entity to which it is addressed.
> > If you are not the intended recipient of this e-mail, you are hereby
> > notified that any dissemination, distribution, copying, or action taken in
> > relation to the contents of and attachments to this e-mail is strictly
> > prohibited and may be unlawful. If you have received this e-mail in error,
> > please notify the sender immediately and permanently delete the original
> > and any copy of this e-mail and any printout. Thank You.
> > *******************************************************************************
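Following up on the association question above: the `_job_create: account 'lnicotra' has no association for user 1498` message suggests the job's `--account=lnicotra` does not match any association in the Slurm accounting database (the user exists there, but only under the default account `slt`). Associations live in slurmdbd, not in local passwd files, so LDAP-only users are fine as long as they resolve via getent. A hedged sketch of checking and fixing this with sacctmgr (the account/user names are from the thread; creating a separate `lnicotra` account is only one possible layout):

```shell
# Inspect which associations the user actually has:
sacctmgr show assoc where user=lnicotra format=cluster,account,user

# Option 1: submit with the account that already has an association:
#   #SBATCH --account=slt

# Option 2: create the account and an association for it (assumption:
# you really want a per-user account named 'lnicotra'):
sacctmgr add account lnicotra
sacctmgr add user lnicotra account=lnicotra
```

Either way, the submitted `--account` value has to match an existing user/account association, or slurmctld rejects the job at submission time.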