Hi,
This isn't quite what you're after, and I'm not 100% sure how close it comes.
We do check that the GPU drivers are working on nodes before launching a job on
them.
Our prolog calls another in-house script (we should move to NHC, to be honest)
that does the following:
if [[ -n $(lspci | grep GK110BGL) ]]; then  # only applicable to boole nodes with GPUs installed
    if [[ -z $(/home/support/apps/apps/NVIDIA_CUDA-8.0_Samples/bin/x86_64/linux/release/deviceQuery |
               grep 'Result = PASS') ]]; then  # deviceQuery is slow
        print_problem "GPU Drivers"
    else
        print_ok "GPU Drivers"
    fi
fi
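
As an aside, since deviceQuery is slow: a cheaper first pass is to check
nvidia-smi's exit status, which is non-zero when the driver isn't loaded or a
device is unreachable. This is just a sketch, not what we currently run
(print_ok/print_problem are the same local helpers as above):

    # quick sanity check: nvidia-smi -L lists the GPUs and fails if the
    # driver is broken; much faster than running deviceQuery
    if nvidia-smi -L > /dev/null 2>&1; then
        print_ok "GPU Drivers"
    else
        print_problem "GPU Drivers"
    fi

It doesn't exercise the CUDA runtime the way deviceQuery does, though, which
is why we keep deviceQuery for the thorough check.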
It's not very elegant, but it alleviates the intermittent problem we were
having where users would request a GPU node but the drivers had stopped
working. If the above or any of the other checks fail, the node is taken out
of production.
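
In case it's useful, the "taken out of production" step amounts to draining
the node. A minimal sketch of doing that directly from a check script,
assuming it runs with scontrol privileges (the Reason string is just
illustrative):

    # drain this node so slurm stops scheduling new jobs onto it
    scontrol update NodeName=$(hostname -s) State=DRAIN Reason="prolog: GPU driver check failed"

In our case the drain is actually handled by the surrounding tooling rather
than by the snippet above.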
Hope that helps.
Sean
On Mon, Jul 31, 2017 at 08:52:34AM -0600, Michael Di Domenico wrote:
>
> do people here running slurm with gres-based gpus check that the gpu
> is actually usable before launching the job? if so, can you detail
> how you're doing it?
>
> my cluster is currently using slurm, but we run htcondor on the nodes
> in the background. when a node isn't currently allocated through
> slurm it's made available to htcondor for use. in general this works
> pretty well.
>
> however, the issue that arises is that condor can't detect a slurm
> allocated node fast enough or halt the job it's running quickly enough.
> when a user srun's a job, it usually errors out with some irrelevant
> error about not being able to use the gpu. generally the user can't
> decipher it and tell what actually happened.
>
> i've tried setting up a prolog on the nodes to kick the jobs off, but
> i've seen issues in the past where users quickly issuing srun commands
> will hork up the nodes. and if the srun takes too long they'll just
> kill it and try again, hastening the problem. whether it's the node,
> slurm, or condor, or a combination of all three, i have not nailed
> down yet.
>
> it might come down to this: i'm doing it correctly, but my script is
> just too chunky. before i spend a bunch of hours tuning, i'd like to
> double-check that i'm going down the right path and/or incorporate
> some other ideas
>
> thanks
>
--
Sean McGrath M.Sc
Systems Administrator
Trinity Centre for High Performance and Research Computing
Trinity College Dublin
[email protected]
https://www.tcd.ie/
https://www.tchpc.tcd.ie/
+353 (0) 1 896 3725