My main question is this: after checkpointing a simple job that ran on two machines, I can only restart it on one machine. If I run the following command to force ompi-restart to run the program on two machines:

    ompi-restart -hostfile ./machine_names ompi_global_snapshot_XXX.ckpt

(the machine_names file contains two host names), the output is:

    --------------------------------------------------------------------------
    Error: Unable to obtain the proper restart command to restart from the checkpoint file (opal_snapshot_1.ckpt). Returned -1.
    --------------------------------------------------------------------------
    [jx-mpi-fcr048:04116] [ 0] /lib64/tls/libpthread.so.0 [0x302b80c420]
    [jx-mpi-fcr048:04116] [ 1] /lib64/tls/libc.so.6(__libc_free+0x25) [0x302af68b85]
    [jx-mpi-fcr048:04116] [ 2] /home/hpc_meng/openmpi/lib/libopen-pal.so.0(opal_argv_free+0x41) [0x2a9557de31]
    [jx-mpi-fcr048:04116] [ 3] /home/hpc_meng/openmpi/lib/libopen-pal.so.0(opal_event_fini+0x27) [0x2a95573ac7]
    [jx-mpi-fcr048:04116] [ 4] /home/hpc_meng/openmpi/lib/libopen-pal.so.0(opal_finalize+0x2f) [0x2a95568a0f]
    [jx-mpi-fcr048:04116] [ 5] opal-restart [0x401888]
    [jx-mpi-fcr048:04116] [ 6] /lib64/tls/libc.so.6(__libc_start_main+0xdb) [0x302af1c4bb]
    [jx-mpi-fcr048:04116] [ 7] opal-restart [0x40147a]
    [jx-mpi-fcr048:04116] *** End of error message ***
    --------------------------------------------------------------------------
    mpirun noticed that process rank 1 with PID 4116 on node jx-mpi-fcr048.jx.baidu.com exited on signal 11 (Segmentation fault).
    --------------------------------------------------------------------------

My global_snapshot_meta.data is:

    # Seq: 0
    # Timestamp: Thu Dec 23 16:39:46 2010
    # Process: 1680080897.0
    # OPAL CRS Component: blcr
    # Snapshot Reference: opal_snapshot_0.ckpt
    # Snapshot Location: /home/work/checkpoint/ompi_global_snapshot_22817.ckpt/0
    # Process: 1680080897.1
    # OPAL CRS Component: blcr
    # Snapshot Reference: opal_snapshot_1.ckpt
    # Snapshot Location: /home/work/checkpoint/ompi_global_snapshot_22817.ckpt/0
    # Timestamp: Thu Dec 23 16:39:47 2010
    # Finished Seq: 0

Does anybody know why?

Thanks
Xianjun Meng

2010/12/23 孟宪军 <xjun.m...@gmail.com>

> Dear all,
>
> I have recently been trying the checkpoint/restart function of Open MPI,
> and after several failures and checking a lot of the documentation, I am
> still very confused about how to configure it. Can anybody give me a
> $HOME/.openmpi/mca-params.conf file and tell me which parameters I should
> specify when I install Open MPI?
>
> BTW, I want to install Open MPI 1.5.1 and BLCR 0.8.0.
>
> Thanks
> Xianjun Meng
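
As a side note, here is a minimal sketch of how the global_snapshot_meta.data contents shown in this message could be checked on each host before calling ompi-restart. It is not part of Open MPI: the function names are hypothetical, and it assumes that each per-process snapshot directory is the "Snapshot Location" joined with the "Snapshot Reference", which I have not confirmed. If that directory is missing on the second machine, it might be related to the opal_snapshot_1.ckpt error, but that is only a guess.

```python
import os

# Keys copied from the metadata file quoted in this message.
PROCESS_KEYS = ("OPAL CRS Component", "Snapshot Reference", "Snapshot Location")


def parse_snapshot_meta(text):
    """Parse '# Key: value' lines into one dict per '# Process:' entry."""
    processes = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line.startswith("#") or ":" not in line:
            continue
        # Split on the first colon only, so timestamps stay intact.
        key, value = line[1:].split(":", 1)
        key, value = key.strip(), value.strip()
        if key == "Process":
            current = {"Process": value}
            processes.append(current)
        elif current is not None and key in PROCESS_KEYS:
            current[key] = value
    return processes


def local_snapshot_paths(processes):
    """Return (process name, assumed snapshot directory) pairs.

    Assumption: directory = Snapshot Location + '/' + Snapshot Reference.
    """
    return [
        (p["Process"],
         os.path.join(p["Snapshot Location"], p["Snapshot Reference"]))
        for p in processes
        if "Snapshot Location" in p and "Snapshot Reference" in p
    ]


def report(meta_text):
    """Print whether each assumed snapshot directory exists on this host."""
    for name, path in local_snapshot_paths(parse_snapshot_meta(meta_text)):
        status = "found" if os.path.isdir(path) else "MISSING"
        print(f"{name}: {path} [{status}]")
```

Running report(open(".../global_snapshot_meta.data").read()) on every host listed in machine_names would show whether both snapshot directories are visible from each machine, for example through a shared file system.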