On 17.09.2012 at 17:11, William Hay wrote:

> I'm trying to get blcr checkpointing running on our cluster.   I've
> created a checkpointing environment that looks
> like this:
> 
> ckpt_name          blcr
> interface          application-level
> ckpt_command       /cm/shared/apps/sge/assist/ckpt/blcr/checkpoint.sh $job_pid
> migr_command       /cm/shared/apps/sge/assist/ckpt/blcr/checkpoint.sh $job_pid
> restart_command    none
> clean_command      /bin/true
> ckpt_dir           /tmp
> signal             none
> when               xsmr
> 
> I submit a serial job to the checkpointing environment with
> #$ -c mxs
> #$ -ckpt blcr
> and after it starts running I suspend it.
> 
> The messages file for the node it runs on contains the following:
> 
> 09/17/2012 15:42:44|  main|node-o03|I|initiate migration at job suspend for job 898195 task 1
> 09/17/2012 15:42:44|  main|node-o03|I|SIGNAL jid: 898195 jatask: 1 signal: MIGRATE
> 
> However as far as I can tell neither the ckpt_command nor the
> migr_command are run.  The first line of the
> checkpoint.sh script touches a file in /tmp which does not appear (nor
> do any checkpoints).

Did you check /tmp on the execution node itself?

The ckpt_command is only run at the "min_cpu_interval", which you define in
the queue configuration.
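
For illustration, the interval is one line of the queue configuration (see
queue_conf(5)). This is only a sketch: the queue name all.q and the
five-minute value are placeholders, not values taken from this thread.

```
# excerpt of "qconf -sq all.q" (all.q is a hypothetical queue name)
qname                 all.q
min_cpu_interval      00:05:00
```

The value can be changed with "qconf -mq all.q".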


> The ckpt_command is duplicated to migr_command because I was trying to
> get checkpointing to run without migration
> at first but since the logs mentioned migration  I copied the
> checkpoint script to migr_command to see if it was being run
> instead of ckpt_command when a suitable job is suspended rather than
> as an optional addition to it as the man page implies.

Yes, it should. But the man page is wrong in claiming that a checkpoint is
created just before the migration. You have to do this on your own in the
defined migr_command.
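
A rough sketch of what such a migr_command script could look like, assuming
BLCR's cr_checkpoint is in the PATH on the node. The file names, the
CKPT_DIR and CR_CHECKPOINT variables and the checkpoint_job function are made
up for this example; check the option names against "cr_checkpoint --help" on
your installation.

```shell
#!/bin/sh
# Sketch of a migr_command helper: write a BLCR checkpoint of the job
# before SGE migrates it. Names and paths below are assumptions.

CR_CHECKPOINT="${CR_CHECKPOINT:-cr_checkpoint}"  # BLCR checkpoint utility (overridable)
CKPT_DIR="${CKPT_DIR:-/tmp}"                     # same directory as ckpt_dir in the thread

checkpoint_job() {
    job_pid="$1"
    if [ -z "$job_pid" ]; then
        echo "usage: checkpoint_job <job_pid>" >&2
        return 2
    fi

    # Marker file: shows on the node whether the script was called at all.
    touch "$CKPT_DIR/ckpt_called.$job_pid" || return 1

    # Checkpoint the process and its descendants into a per-job context file.
    "$CR_CHECKPOINT" -f "$CKPT_DIR/context.$job_pid" --tree "$job_pid"
}

# Only act when called with a PID argument, e.g. from the ckpt environment:
#   migr_command  /path/to/checkpoint.sh $job_pid
if [ "$#" -gt 0 ]; then
    checkpoint_job "$1"
fi
```

If the marker file never shows up in /tmp on the execution node, the script
was not called at all, which separates an SGE configuration problem from a
BLCR problem.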

There are HOWTOs:

http://arc.liv.ac.uk/SGE/howto/checkpointing.html
http://arc.liv.ac.uk/SGE/howto/APSTC-TB-2004-005.pdf

-- Reuti


> We're using 6.2u3 (still).
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users

