On 25/07/16 12:55, Lachlan Musicman wrote:
> - how does slurm put jobs into suspended mode given that some may have
> large amounts of data in memory?
I suspect it depends on how you've configured Slurm for checkpointing.
From the slurm.conf man page:

  CheckpointType
    The system-initiated checkpoint method to be used for user jobs.
BLCR should support resuming the job on another node, but if the
checkpoint type is a restart type then the job might have started again
from scratch - or from its own internal checkpoint system if it has one.
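
To illustrate what an application's "own internal checkpoint system" can
look like, here is a minimal Python sketch. It is not anything Slurm or
BLCR provides; the function name run(), the checkpoint file path and the
stop_after argument (which only simulates the job being killed part-way
through) are all made up for the example.

```python
import json
import os


def run(total_steps, ckpt_path, stop_after=None):
    """Sum the integers 0..total_steps-1, saving progress after every step.

    stop_after is a hypothetical knob used only to simulate the job
    being interrupted after that many steps in this run.
    """
    # Start from scratch unless an earlier run left a checkpoint behind.
    state = {"step": 0, "total": 0}
    if os.path.exists(ckpt_path):
        with open(ckpt_path) as f:
            state = json.load(f)

    done_this_run = 0
    while state["step"] < total_steps:
        if stop_after is not None and done_this_run >= stop_after:
            return None  # simulated interruption; checkpoint stays on disk

        state["total"] += state["step"]
        state["step"] += 1
        done_this_run += 1

        # Write to a temporary file and rename it over the old checkpoint,
        # so a kill in the middle of the write cannot leave a corrupt file.
        tmp_path = ckpt_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(state, f)
        os.replace(tmp_path, ckpt_path)

    return state["total"]
```

If the job is restarted with the same checkpoint path, it picks up at the
saved step rather than at step 0; without the file it simply starts again
from scratch, which is the other case described above.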
Best of luck!
Chris
--
Christopher Samuel Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: [email protected] Phone: +61 (0)3 903 55545
http://www.vlsci.org.au/ http://twitter.com/vlsci