Dear all,

I have figured it out. It was a simple issue: I had not added the "blcr lib" to the $PATH environment variable. Without it, the checkpoint operation still succeeded but the restart operation failed, which was weird.
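For anyone who hits the same error, a quick check along the following lines may help. It is only a sketch: it assumes that "blcr lib" above refers to the BLCR installation directories, that the BLCR command-line tools cr_checkpoint and cr_restart are what needs to be resolvable, and that BLCR_PREFIX (a hypothetical variable, with a made-up default path) points at the BLCR install. Run it on every host in the hostfile, since a non-interactive remote shell may not pick up the same environment as a login shell.

```shell
#!/bin/sh
# Sketch: report whether the BLCR tools are resolvable on this host.
# BLCR_PREFIX and its default are assumptions for illustration; adjust them.
BLCR_PREFIX="${BLCR_PREFIX:-/usr/local/blcr}"

# Return 0 if the named command is found on PATH, non-zero otherwise.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

for tool in cr_checkpoint cr_restart; do
    if have_cmd "$tool"; then
        echo "found:   $tool -> $(command -v "$tool")"
    else
        echo "missing: $tool (try: export PATH=\"$BLCR_PREFIX/bin:\$PATH\")"
    fi
done
```

If a tool is reported missing on one of the hosts, set the variable in a startup file that non-interactive shells also read on that host.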
Best regards,
Xianjun Meng

On 23 December 2010 at 5:35 PM, 孟宪军 <xjun.m...@gmail.com> wrote:

> My main question is:
>
> After I finished the checkpoint operation on a simple task that ran
> on two machines, I can only restart it on one machine. If I run the
> following command to force ompi-restart to run the program on two
> machines:
>
> ompi-restart -hostfile ./machine_names ompi_global_snapshot_XXX.ckpt
> (the machine_names file contains two host names)
>
> the output is:
>
> --------------------------------------------------------------------------
> Error: Unable to obtain the proper restart command to restart from the
> checkpoint file (opal_snapshot_1.ckpt). Returned -1.
>
> --------------------------------------------------------------------------
> [jx-mpi-fcr048:04116] [ 0] /lib64/tls/libpthread.so.0 [0x302b80c420]
> [jx-mpi-fcr048:04116] [ 1] /lib64/tls/libc.so.6(__libc_free+0x25) [0x302af68b85]
> [jx-mpi-fcr048:04116] [ 2] /home/hpc_meng/openmpi/lib/libopen-pal.so.0(opal_argv_free+0x41) [0x2a9557de31]
> [jx-mpi-fcr048:04116] [ 3] /home/hpc_meng/openmpi/lib/libopen-pal.so.0(opal_event_fini+0x27) [0x2a95573ac7]
> [jx-mpi-fcr048:04116] [ 4] /home/hpc_meng/openmpi/lib/libopen-pal.so.0(opal_finalize+0x2f) [0x2a95568a0f]
> [jx-mpi-fcr048:04116] [ 5] opal-restart [0x401888]
> [jx-mpi-fcr048:04116] [ 6] /lib64/tls/libc.so.6(__libc_start_main+0xdb) [0x302af1c4bb]
> [jx-mpi-fcr048:04116] [ 7] opal-restart [0x40147a]
> [jx-mpi-fcr048:04116] *** End of error message ***
> --------------------------------------------------------------------------
> mpirun noticed that process rank 1 with PID 4116 on node
> jx-mpi-fcr048.jx.baidu.com exited on signal 11 (Segmentation fault).
> --------------------------------------------------------------------------
>
> My global_snapshot_meta.data is:
>
> # Seq: 0
> # Timestamp: Thu Dec 23 16:39:46 2010
> # Process: 1680080897.0
> # OPAL CRS Component: blcr
> # Snapshot Reference: opal_snapshot_0.ckpt
> # Snapshot Location: /home/work/checkpoint/ompi_global_snapshot_22817.ckpt/0
> # Process: 1680080897.1
> # OPAL CRS Component: blcr
> # Snapshot Reference: opal_snapshot_1.ckpt
> # Snapshot Location: /home/work/checkpoint/ompi_global_snapshot_22817.ckpt/0
> # Timestamp: Thu Dec 23 16:39:47 2010
> # Finished Seq: 0
>
> Does anybody know why?
>
> Thanks
> Xianjun Meng
>
> 2010/12/23 孟宪军 <xjun.m...@gmail.com>
>
>> Dear all,
>>
>> I have recently been trying the checkpoint/restart function of Open MPI,
>> and after several failures and checking a lot of the documentation, I am
>> still very confused about how to configure it. Can anybody give me a
>> $HOME/.openmpi/mca-params.conf file and tell me which parameters I should
>> specify when I install Open MPI?
>>
>> BTW, I want to install Open MPI 1.5.1 and BLCR 0.8.0.
>>
>> Thanks
>> Xianjun Meng