[OMPI users] Openmpi Checkpoint/Restart failed
Dear all,

I recently tried the checkpoint/restart function of Openmpi, and after several failures and reading a lot of the documentation, I am still very confused about how to configure it. Can anybody give me a sample $HOME/.openmpi/mca-params.conf file and tell me which parameters I should specify when I install Openmpi?

BTW, I want to install openmpi 1.5.1 and blcr 0.8.0.

Thanks
Xianjun Meng
Re: [OMPI users] Openmpi Checkpoint/Restart failed
My main question is:

After I finished the checkpoint operation on a simple task that ran on two machines, I can only restart it on one machine. If I run the following command to force ompi-restart to run the program on two machines:

    ompi-restart -hostfile ./machine_names ompi_global_snapshot_XXX.ckpt

(the machine_names file contains two host names)

the output is:

    --
    Error: Unable to obtain the proper restart command to restart from the
           checkpoint file (opal_snapshot_1.ckpt). Returned -1.
    --
    [jx-mpi-fcr048:04116] [ 0] /lib64/tls/libpthread.so.0 [0x302b80c420]
    [jx-mpi-fcr048:04116] [ 1] /lib64/tls/libc.so.6(__libc_free+0x25) [0x302af68b85]
    [jx-mpi-fcr048:04116] [ 2] /home/hpc_meng/openmpi/lib/libopen-pal.so.0(opal_argv_free+0x41) [0x2a9557de31]
    [jx-mpi-fcr048:04116] [ 3] /home/hpc_meng/openmpi/lib/libopen-pal.so.0(opal_event_fini+0x27) [0x2a95573ac7]
    [jx-mpi-fcr048:04116] [ 4] /home/hpc_meng/openmpi/lib/libopen-pal.so.0(opal_finalize+0x2f) [0x2a95568a0f]
    [jx-mpi-fcr048:04116] [ 5] opal-restart [0x401888]
    [jx-mpi-fcr048:04116] [ 6] /lib64/tls/libc.so.6(__libc_start_main+0xdb) [0x302af1c4bb]
    [jx-mpi-fcr048:04116] [ 7] opal-restart [0x40147a]
    [jx-mpi-fcr048:04116] *** End of error message ***
    --
    mpirun noticed that process rank 1 with PID 4116 on node
    jx-mpi-fcr048.jx.baidu.com exited on signal 11 (Segmentation fault).
    --

My global_snapshot_meta.data is:

    # Seq: 0
    # Timestamp: Thu Dec 23 16:39:46 2010
    # Process: 1680080897.0
    # OPAL CRS Component: blcr
    # Snapshot Reference: opal_snapshot_0.ckpt
    # Snapshot Location: /home/work/checkpoint/ompi_global_snapshot_22817.ckpt/0
    # Process: 1680080897.1
    # OPAL CRS Component: blcr
    # Snapshot Reference: opal_snapshot_1.ckpt
    # Snapshot Location: /home/work/checkpoint/ompi_global_snapshot_22817.ckpt/0
    # Timestamp: Thu Dec 23 16:39:47 2010
    # Finished Seq: 0

Does anybody know why?
Thanks
Xianjun Meng

2010/12/23 孟宪军
> Dear all,
>
> I recently tried the checkpoint/restart function of Openmpi, and after
> several failures and reading a lot of the documentation, I am still very
> confused about how to configure it. Can anybody give me a sample
> $HOME/.openmpi/mca-params.conf file and tell me which parameters I
> should specify when I install Openmpi?
>
> BTW, I want to install openmpi 1.5.1 and blcr 0.8.0.
>
> Thanks
> Xianjun Meng
Re: [OMPI users] Call to MPI_Test has large time-jitter
On Fri, Dec 17, 2010 at 5:43 PM, Sashi Balasingam wrote:
> Hi,
> I recently started on an MPI-based, 'real-time', pipelined-processing
> application, and the application fails due to large time-jitter in sending
> and receiving messages. Here is the related info:
>
> 1) Platform:
>    a) Intel Box: Two Hex-core, Intel Xeon, 2.668 GHz (...total of 12 cores),
>    b) OS: SUSE Linux Enterprise Server 11 (x86_64) - Kernel \r (\l)
>    c) MPI Rev: (OpenRTE) 1.4, (...Installed OFED package)
>    d) HCA: InfiniBand: Mellanox Technologies MT26428 [ConnectX IB QDR, PCIe
>       2.0 5GT/s] (rev a0)
>
> 2) Application detail
>
>    a) Launching 7 processes, for pipelined processing, where each process
>       waits for a message (sizes vary between 1 KByte and 26 KBytes),
>       then processes the data, and outputs a message (sizes vary between
>       1 KByte and 26 KBytes) to the next process.
>
>    b) MPI transport functions used: MPI_Isend, MPI_Irecv, MPI_Test.
>       i) For receiving messages, I first make an MPI_Irecv call, followed by
>          a busy-loop on MPI_Test, waiting for the message.
>       ii) For sending a message, there is a busy-loop on MPI_Test to ensure
>          the prior buffer was sent, then I use MPI_Isend.
>
>    c) When the job starts, all 7 processes are put in high-priority mode
>       (SCHED_FIFO policy, with a priority setting of 99).
>       The job entails an input data packet stream (and a series of MPI
>       messages), continually at a 40 micro-sec rate, for a few minutes.
>
> 3) The Problem:
>    Most calls to MPI_Test (...which is non-blocking) take a few micro-sec,
>    but for around 10% of the job there is large jitter that varies from 1 to
>    100-odd millisec. This causes some of the application input queues to
>    fill up and causes a failure.
>
> Any suggestions on what to look at in the MPI settings or OS config/issues
> will be much appreciated.
>
> Thanks in advance.
> Sanji > > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users > I had a similar issue, a work-around is to avoid polling too much by placing some kind of a timer in your code before the MPI_Test call.
[OMPI users] srun and openmpi
Can anyone point me towards the most recent documentation for using srun and openmpi?

I followed what I found on the web, enabling the MpiPorts config in Slurm and using the --resv-ports switch, but I'm getting an error from openmpi during setup.

I'm using Slurm 2.1.15 and Openmpi 1.5 w/PSM.

I'm sure I'm missing a step.

Thanks
Re: [OMPI users] srun and openmpi
I'm not sure there is any documentation yet - not much clamor for it. :-/

It would really help if you included the error message. Otherwise, all I can do is guess, which wastes both of our time :-(

My best guess is that the port reservation didn't get passed down to the MPI procs properly - but that's just a guess.

On Dec 23, 2010, at 12:46 PM, Michael Di Domenico wrote:
> Can anyone point me towards the most recent documentation for using
> srun and openmpi?
>
> I followed what I found on the web with enabling the MpiPorts config
> in slurm and using the --resv-ports switch, but I'm getting an error
> from openmpi during setup.
>
> I'm using Slurm 2.1.15 and Openmpi 1.5 w/PSM
>
> I'm sure I'm missing a step.
>
> Thanks
Re: [OMPI users] Openmpi Checkpoint/Restart failed
Dear all,

I have figured it out. It was a simple issue: I hadn't added the "blcr lib" to the $PATH environment variable. Oddly, in that state the checkpoint operation still worked, but the restart operation did not succeed. It was so weird.

Best regards
Xianjun Meng

On December 23, 2010 at 5:35 PM, 孟宪军 wrote:
> My main question is:
> [...]