On Tue, Feb 14, 2017 at 12:22:59PM +0000, Mark Dixon wrote:
> On Tue, 14 Feb 2017, William Hay wrote:
> ...
> >We tweak the permissions on the device nodes from a privileged prolog
> >but otherwise I suspect we're doing something similar.
>
> Hi William,
>
> Yeah, but I've put the permission tweaker in the starter, as that fits
> our existing model a bit better (looking ahead to multi-node GPU codes
> in future).

None of our GPU nodes have InfiniBand at the moment, so we don't allow
multi-node GPU jobs.  Our prolog does a parallel ssh (passing through the
appropriate environment variables) into every node assigned to the job
and does the equivalent of a run-parts on a directory full of scripts.
Some of those scripts check whether they are running on the head node.

> >One thing to watch out for is that unless you disable it the device
> >driver can change the permissions on the device nodes behind your back.
>
> The device driver, or do you mean any CUDA program at all?
>
> It's a bit of an eye-opener to see no dev entries being created by the
> kernel module / udev, then strace a simple CUDA program and watch it
> try to mknod some /dev entries and call a privileged binary to do some
> modprobe/mknod's before actually doing what the program's supposed to
> do.

Sounds like your setup is a bit different from ours.  Our devices show up
in the normal way, but we need a file in /etc/modprobe.d with the
following magic module option:

options nvidia NVreg_ModifyDeviceFiles=0

Our prolog ensures /dev/nvidiactl is world accessible and that the
relevant /dev/nvidia? file is owned by the per-job SGE group.  Without
the magic option, various things trying to access the GPUs reset the
permissions on the /dev/nvidia? devices; with it, the permissions are
left alone and jobs only touch the GPU we intend for them.  Given that
this is an option to a kernel module, I assume the module itself is
responsible for resetting the permissions.
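In outline it looks something like this (a rough sketch only: the paths,
the scripts directory and the GPU_ID/JOB_GROUP plumbing are illustrative
stand-ins, not our actual code):

    # /etc/modprobe.d/nvidia.conf
    # Stop the driver putting the device permissions back behind our backs.
    options nvidia NVreg_ModifyDeviceFiles=0

    #!/bin/sh
    # prolog - run by SGE on the head node of the job.
    # PE_HOSTFILE is set by SGE for parallel jobs; its first column is
    # the hostname.  Fall back to the local host for serial jobs.
    PARTS_DIR=/opt/site/prolog.d        # illustrative path

    if [ -n "$PE_HOSTFILE" ]; then
        HOSTS=$(awk '{print $1}' "$PE_HOSTFILE" | sort -u)
    else
        HOSTS=$(hostname)
    fi
    HEAD_NODE=$(hostname)

    for host in $HOSTS; do
        # Parallel ssh into each node, passing through the variables the
        # scripts need, then run everything in the directory, run-parts
        # style.  Scripts can compare $(hostname) against $HEAD_NODE.
        ssh "$host" "export JOB_ID='$JOB_ID' HEAD_NODE='$HEAD_NODE';
            for s in $PARTS_DIR/*; do [ -x \"\$s\" ] && \"\$s\"; done" &
    done
    wait

    #!/bin/sh
    # prolog.d/50-gpu-perms - one of the scripts run above.  GPU_ID and
    # JOB_GROUP are hypothetical: however you map the job to its device
    # and to its per-job SGE group.
    chmod 666 /dev/nvidiactl             # control device world accessible
    chgrp "$JOB_GROUP" "/dev/nvidia${GPU_ID}"
    chmod 660 "/dev/nvidia${GPU_ID}"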
> Would really like to know how to stop it doing that: had been wondering
> about offering the ability to reconfigure or reset the GPU card via a
> job request / JSV / starter method, but at the moment I cannot run
> anything interesting with root privs without screwing up permissions.
> Grr.
>
> ...
> >We have separate requests for memory, gpus, local scratch space, etc
> >with sensible defaults.  If someone did use the command line it could
> >end up looking quite like the example you give.
> ...
>
> Do people fiddle with them and stick funny numbers in, resulting in
> GPUs unintentionally left idle?

GPUs are not terribly popular on Legion.  Most of the heavy GPU users use
a shared facility we don't run (Emerald), so ours are idle, but somewhat
intentionally.  This is probably a good thing, as most of the GPUs are:

a) old;
b) in an external PCI box, so they have a habit of disappearing from the
   PCI bus if you look at them funny.

Emerald is shutting down shortly, so we may get some people attempting to
use them.  That may lead to some contention for the one machine with a
half-decent GPU actually inside it.

William