On 17 September 2012 16:11, William Hay <[email protected]> wrote:
> I'm trying to get blcr checkpointing running on our cluster.   I've
> created a checkpointing environment that looks
> like this:
>
> ckpt_name          blcr
> interface          application-level
> ckpt_command       /cm/shared/apps/sge/assist/ckpt/blcr/checkpoint.sh $job_pid
> migr_command       /cm/shared/apps/sge/assist/ckpt/blcr/checkpoint.sh $job_pid
> restart_command    none
> clean_command      /bin/true
> ckpt_dir           /tmp
> signal             none
> when               xsmr
>
I've figured out my problem, and how I broke the checkpointing
interface :(  A while back we decided to allow a certain program to
run with a longer time limit than most jobs.  To enforce this we ran
long jobs with a restricted shell set via the JSV.  Our cluster had
previously used unix_behavior, so to maintain backward compatibility
I made the default shell /bin/env.  It looks as though /bin/env is
being invoked in a way I hadn't expected when the checkpointing
commands are run with an argument.  Now I just have to figure out
which way to dig that isn't deeper :)
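
To illustrate the kind of surprise I mean, here is a minimal sketch.
It assumes, and this is only a guess at the failure mode and not
something I have confirmed about how the checkpoint commands get
started, that the command and its argument reach env as a single
word.  echo is a hypothetical stand-in for checkpoint.sh and 1234
for $job_pid:

```shell
#!/bin/sh
# Hypothetical stand-in: "echo" plays the role of checkpoint.sh and
# 1234 the role of $job_pid.  Neither comes from the real setup.

# Command and argument as separate words: env runs echo with one argument.
separate=$(env echo 1234 2>/dev/null)
separate_rc=$?

# Whole command line as ONE word: env searches PATH for a program
# literally named "echo 1234" (space included) and does not find one.
joined=$(env "echo 1234" 2>/dev/null)
joined_rc=$?

echo "separate words: rc=$separate_rc out=$separate"
echo "single word:    rc=$joined_rc out=$joined"
```

POSIX specifies exit status 127 for env when the utility cannot be
found, so the second call should fail that way while the first
succeeds.  A command with no argument would behave identically in
both forms, which would fit with the problem only showing up once
$job_pid is appended.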

William
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
