On 17 September 2012 16:11, William Hay <[email protected]> wrote:
> I'm trying to get blcr checkpointing running on our cluster. I've
> created a checkpointing environment that looks
> like this:
>
> ckpt_name          blcr
> interface          application-level
> ckpt_command       /cm/shared/apps/sge/assist/ckpt/blcr/checkpoint.sh $job_pid
> migr_command       /cm/shared/apps/sge/assist/ckpt/blcr/checkpoint.sh $job_pid
> restart_command    none
> clean_command      /bin/true
> ckpt_dir           /tmp
> signal             none
> when               xsmr
>

I've figured out my problem, and how I broke the checkpointing interface :(

A while back we decided to allow a certain program to run with a longer
time limit than most jobs. To enforce this, we ran long jobs with a
restricted shell set via the JSV. Our cluster had previously been set to
unix_behavior, so to maintain backward compatibility I made the default
shell /bin/env. It looks as though /bin/env is being invoked in a way I
hadn't expected when the checkpointing commands are run with an argument.

Now I just have to figure out which way to dig that isn't deeper :)
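For illustration, here is a minimal sketch of one way /bin/env could trip over a command that carries an argument. It assumes the command line reaches env as a single string; I have not confirmed that this is how the checkpoint commands are actually started, and `echo 1234` is only a hypothetical stand-in for `checkpoint.sh $job_pid`.

```python
import shutil
import subprocess

# Stand-in for the configured default shell (/bin/env in my setup).
ENV = shutil.which("env") or "/bin/env"


def run(argv):
    """Run argv directly (no intermediate shell); return (exit status, stdout)."""
    result = subprocess.run(argv, capture_output=True, text=True)
    return result.returncode, result.stdout


# ASSUMPTION: command and argument arrive as ONE string. env then looks for
# a program literally named "echo 1234" and cannot find it.
joined_status, joined_out = run([ENV, "echo 1234"])

# Command and argument passed as separate words: env runs echo normally.
split_status, split_out = run([ENV, "echo", "1234"])

print("joined:", joined_status, repr(joined_out))
print("split: ", split_status, repr(split_out))
```

POSIX specifies exit status 127 for env when the utility cannot be found, so the joined case fails while the split case succeeds. If something like this is what is happening, a real shell as the default would split the string into words where env does not.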
William
