Re: [slurm-users] swap size

2018-09-22 Thread Chris Samuel
On Saturday, 22 September 2018 4:19:09 PM AEST Raymond Wan wrote:
> SLURM's ability to suspend jobs must be storing the state in a
> location outside of this 512 GB. So, you're not helping this by
> allocating more swap.

I don't believe that's the case. My understanding is that in this mode it'…
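
For context, job suspension in Slurm is driven with scontrol; a minimal sketch of the mechanism under discussion (the job ID is hypothetical):

    # Suspend a running job: its processes receive SIGSTOP and stop consuming
    # CPU time, but their memory stays resident unless the kernel pages it out,
    # which is why swap size comes up in this thread.
    scontrol suspend 12345

    # Resume it later (SIGCONT); any pages that went to swap are faulted back in.
    scontrol resume 12345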

Re: [slurm-users] Job allocating more CPUs than requested

2018-09-22 Thread Chris Samuel
On Saturday, 22 September 2018 2:35:34 PM AEST Ryan Novosielski wrote:
> We constrain using cgroups, and occasionally someone will request 1
> core (-n1 -c1) and then run something that asks for way more
> cores/threads, or that tries to use the whole machine. They won't
> succeed obviously.

Is th…
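
For reference, the cgroup-based confinement Ryan describes is typically set up along these lines (an illustrative excerpt, not his site's exact configuration):

    # slurm.conf (excerpt)
    ProctrackType=proctrack/cgroup
    TaskPlugin=task/cgroup

    # cgroup.conf
    ConstrainCores=yes        # pin each job's processes to the cores it was allocated
    ConstrainRAMSpace=yes     # optionally enforce the memory request as well

With ConstrainCores=yes, a job submitted with -n1 -c1 can still spawn as many threads as it likes, but they are all time-sliced on the one allocated core, so the node's load average climbs while other jobs keep their own cores.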

Re: [slurm-users] Job allocating more CPUs than requested

2018-09-22 Thread Renfro, Michael
Anecdotally, I’ve had a user cause load averages of 10x the node’s core count. The user caught it and cancelled the job before I noticed it myself. Where I’ve seen it happen live in less severe cases, I’ve never noticed anything other than the excessive load average. Viewed from ‘top’, the offen…
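
A quick way to spot this kind of oversubscription from the Slurm side (a sketch; check the --Format options supported by your sinfo version) is to compare each node's allocated CPU count with its reported load:

    # Per node: hostname, CPU state (allocated/idle/other/total) and load average.
    # A load well above the allocated CPU count points at a job using more
    # cores or threads than it asked for.
    sinfo -N -h -O nodehost,cpusstate,cpusload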

Re: [slurm-users] swap size

2018-09-22 Thread Renfro, Michael
If your workflows are primarily CPU-bound rather than memory-bound, and since you’re the only user, you could ensure all your Slurm scripts ‘nice’ their Python commands, or use the -n flag for slurmd and the PropagatePrioProcess configuration parameter. Both of these are in the thread at https:…
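
A sketch of the first option (the script name and nice level are just examples):

    #!/bin/bash
    #SBATCH -n 1
    # Run the CPU-bound Python step at the lowest scheduling priority so that
    # anything else running on the node is serviced first.
    nice -n 19 python analysis.py

Per the thread referenced above, the slurmd -n flag together with PropagatePrioProcess accomplishes something similar at the daemon/config level, rather than relying on each script to nice itself.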

Re: [slurm-users] swap size

2018-09-22 Thread John Hearns
I would say that, yes, you have a good workflow here with Slurm. As another aside - is anyone working with suspending and resuming containers? I see on the Singularity site that suspend/resume is on the roadmap (I am not talking about checkpointing here). Also it is worth saying that these days on…