[slurm-dev] RE: Selecting a network interface with srun

2017-10-26 Thread Sebastian Eastham
Hi all, thank you all for the quick responses! My apologies for my own slow response; it turns out my spam filter got a bit too aggressive and I only just found all of these. As it happens, you are exactly right, and it turns out this was a case of "error exists between keyboard and chair".

[slurm-dev] RE: Selecting a network interface with srun

2017-10-26 Thread Ryan Novosielski
On 10/26/2017 09:58 AM, Sebastian Eastham wrote: > As it happens, you are exactly right, and it turns out this was a > case of "error exists between keyboard and chair". In particular I > wanted to thank John Hearns for pointing out the difference

[slurm-dev] Re: CPU/GPU Affinity Not Working

2017-10-26 Thread Kilian Cavalotti
Hi Dave, On Wed, Oct 25, 2017 at 9:23 PM, Dave Sizer wrote: > For some reason, we are observing that the preferred CPUs defined in > gres.conf for GPU devices are being ignored when running jobs. That is, in > our gres.conf we have gpu resource lines, such as: > > Name=gpu Type=kepler File=/dev
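The gres.conf GPU lines Dave quotes are whitespace-separated Key=Value pairs. As a rough aid for checking which CPUs a GPU line actually lists, here is a minimal Python sketch that parses one such line and expands its CPUs value. The helper names are hypothetical, and the example line (device path and the `0-3,8` range notation) is an assumption for illustration, not taken from his config, which is truncated in the archive.

```python
def expand_cpu_list(spec):
    """Expand a CPU list such as "0-3,8" into [0, 1, 2, 3, 8].

    Assumes comma-separated single IDs and inclusive start-end ranges.
    """
    cpus = []
    for part in spec.split(","):
        if "-" in part:
            start, end = part.split("-", 1)
            cpus.extend(range(int(start), int(end) + 1))
        else:
            cpus.append(int(part))
    return cpus


def parse_gres_line(line):
    """Split a whitespace-separated Key=Value gres.conf line into a dict.

    The CPUs value, if present, is expanded into a list of integers.
    """
    fields = dict(token.split("=", 1) for token in line.split())
    if "CPUs" in fields:
        fields["CPUs"] = expand_cpu_list(fields["CPUs"])
    return fields


if __name__ == "__main__":
    # Hypothetical line; the original message is cut off after File=/dev.
    example = "Name=gpu Type=kepler File=/dev/nvidia0 CPUs=0-3,8"
    print(parse_gres_line(example))
```

This only reads the text of the line; it does not check whether the listed IDs use Slurm's own CPU numbering, which turns out to be the actual problem in this thread.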

[slurm-dev] Re: CPU/GPU Affinity Not Working

2017-10-26 Thread Dave Sizer
Thanks for the tips, Kilian; this really pointed me in the right direction. It turns out the issue was that the CPU IDs we were using in gres.conf were based on how our system numbered them, when they actually needed to be in the platform-agnostic format (CPU_ID = Board_ID x threads_per_board
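The formula is cut off in the archive after the board term. Assuming the remaining terms follow the same pattern (a socket offset, a core offset and a thread index), the abstract ID is a mixed-radix index over board, socket, core and thread. A minimal Python sketch, with hypothetical function and parameter names:

```python
def abstract_cpu_id(board_id, socket_id, core_id, thread_id,
                    sockets_per_board, cores_per_socket, threads_per_core):
    """Return a platform-agnostic CPU index for one hardware thread.

    Assumes the truncated formula continues as
    Board_ID x threads_per_board + Socket_ID x threads_per_socket
    + Core_ID x threads_per_core + Thread_ID.
    """
    threads_per_socket = cores_per_socket * threads_per_core
    threads_per_board = sockets_per_board * threads_per_socket
    return (board_id * threads_per_board
            + socket_id * threads_per_socket
            + core_id * threads_per_core
            + thread_id)


if __name__ == "__main__":
    # Hypothetical node: 1 board, 2 sockets, 8 cores per socket, 2 threads per core.
    layout = dict(sockets_per_board=2, cores_per_socket=8, threads_per_core=2)
    # First thread of every core on socket 1, e.g. for a GPU attached to that socket.
    socket1 = [abstract_cpu_id(0, 1, core, 0, **layout) for core in range(8)]
    print(socket1)
```

On this hypothetical layout the sketch yields 16, 18, ..., 30 for socket 1, which need not match the IDs the operating system reports for the same threads; that mismatch is the kind of error described above.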

[slurm-dev] srun --accel-bind not working

2017-10-26 Thread Dave Sizer
Is anyone familiar with the '--accel-bind=g' option for srun? It seems like using this option when you have CPU affinities set in your gres.conf causes "fatal: Invalid gres data for gpu, CPUs=*", and makes the job hang. But the configuration seems to work correctly if you omit this option. I

[slurm-dev] How to get the realtime output in the job output file

2017-10-26 Thread Chaofeng Zhang
Hi guys, when we submit a Slurm job from the login node, the job uses the login node's environment on the compute nodes, so we added #SBATCH --export=NONE to the job file so that the job uses the compute node's environment instead. We want to see real-time output in the job output file, so we use this command to submit the job file:
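The message is truncated before the submit command, so this only illustrates one common cause of delayed output: when a program's stdout is redirected to a file, Python (like many runtimes) block-buffers it, so lines may not show up in the output file until the buffer fills or the program exits. A minimal sketch of a job payload that flushes after every line; the function name is hypothetical, and running the script with `python -u` or setting `PYTHONUNBUFFERED=1` has a similar effect.

```python
import time


def report_progress(steps, delay_s=0.0):
    """Print one progress line per step, flushing each line immediately.

    flush=True hands the line to the operating system right away instead of
    leaving it in Python's stdout buffer until the buffer fills.
    """
    for step in range(1, steps + 1):
        # ... real work for this step would go here ...
        print(f"step {step}/{steps} done", flush=True)
        time.sleep(delay_s)
    return steps


if __name__ == "__main__":
    # Hypothetical payload: 5 steps, one second apart.
    report_progress(5, delay_s=1.0)
```

Flushing in the application does not control any buffering Slurm itself applies when forwarding output from an srun step.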