On Saturday, 22 September 2018 4:19:09 PM AEST Raymond Wan wrote:
> SLURM's ability to suspend jobs must be storing the state in a
> location outside of this 512 GB. So, you're not helping this by
> allocating more swap.
I don't believe that's the case. My understanding is that in this mode it's
simply sending the job's processes a SIGSTOP, so their memory stays allocated
to them; the extra swap is what gives the kernel somewhere to page that memory
out to while another job runs.
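As a quick illustration (the job ID and username below are placeholders), you
can watch the same thing happen by suspending a job by hand on the node:

  scontrol suspend 12345      # stops every process in job 12345 (SIGSTOP)
  ps -o pid,stat,rss,comm -u someuser
                              # the job's processes show state "T" (stopped);
                              # their RSS only shrinks once the kernel pages
                              # them out to swap under memory pressure
  scontrol resume 12345       # SIGCONT, and the job carries on where it was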
On Saturday, 22 September 2018 2:35:34 PM AEST Ryan Novosielski wrote:
> We constrain using cgroups, and occasionally someone will request 1
> core (-n1 -c1) and then run something that asks for way more
> cores/threads, or that tries to use the whole machine. They won't
> succeed obviously. Is this actually causing any problems?
Anecdotally, I’ve had a user cause load averages of 10x the node’s core count.
The user caught it and cancelled the job before I noticed it myself. Where I’ve
seen it happen live in less severe cases, I’ve never noticed anything other
than the excessive load average. Viewed from ‘top’, the offending processes
were all confined to the cores the job had been allocated, so the rest of the
node carried on as normal.
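If you want to double-check that confinement, something like this should show
the cores a job is actually pinned to (this assumes a cgroup v1 hierarchy with
ConstrainCores=yes in cgroup.conf; the UID and job ID are placeholders):

  cat /sys/fs/cgroup/cpuset/slurm/uid_1000/job_12345/cpuset.cpus
  # then press '1' in top for the per-core view; the overload stays on
  # those cores while the rest of the node sits idle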
If your workflows are primarily CPU-bound rather than memory-bound, and since
you’re the only user, you could ensure all your Slurm scripts ‘nice’ their
Python commands, or use the -n flag for slurmd and the PropagatePrioProcess
configuration parameter. Both of these are in the thread at
https:
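As a rough sketch of the first option (the script name and resource requests
here are made up):

  #!/bin/bash
  #SBATCH --ntasks=1
  #SBATCH --cpus-per-task=4
  # Run the Python step at the lowest CPU priority so that, if it spawns
  # more threads than it asked for, it yields to everything else on the node.
  nice -n 19 python my_analysis.py

The slurmd -n / PropagatePrioProcess route gets you much the same effect
cluster-wide without having to touch every script.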
I would say that, yes, you have a good workflow here with Slurm.
As another aside - is anyone working with suspending and resuming containers?
I see on the Singularity site that suspend/resume is on the roadmap (I
am not talking about checkpointing here).
Also it is worth saying that these days on