[OMPI users] Openmpi Checkpoint/Restart failed

2010-12-23 Thread 孟宪军
Dear all,

I recently had to try the checkpoint/restart function of Open MPI, and after
several failures and reading a lot of the documentation, I am still very confused
about how to configure it. Can anybody give me a sample
$HOME/.openmpi/mca-params.conf file and tell me which parameters I
should specify when I install Open MPI?

BTW, I want to install Open MPI 1.5.1 and BLCR 0.8.0.


Thanks
Xianjun Meng


Re: [OMPI users] Openmpi Checkpoint/Restart failed

2010-12-23 Thread 孟宪军
My main question is:

After I finished the checkpoint operation on a simple task that ran on
two machines, I can only restart it on one machine. If I run the following
command to force ompi-restart to run the program on two machines:

ompi-restart -hostfile ./machine_names ompi_global_snapshot_XXX.ckpt
(the machine_names file lists two host names)

the output is:
--
Error: Unable to obtain the proper restart command to restart from the
   checkpoint file (opal_snapshot_1.ckpt). Returned -1.

--
[jx-mpi-fcr048:04116] [ 0] /lib64/tls/libpthread.so.0 [0x302b80c420]
[jx-mpi-fcr048:04116] [ 1] /lib64/tls/libc.so.6(__libc_free+0x25) [0x302af68b85]
[jx-mpi-fcr048:04116] [ 2] /home/hpc_meng/openmpi/lib/libopen-pal.so.0(opal_argv_free+0x41) [0x2a9557de31]
[jx-mpi-fcr048:04116] [ 3] /home/hpc_meng/openmpi/lib/libopen-pal.so.0(opal_event_fini+0x27) [0x2a95573ac7]
[jx-mpi-fcr048:04116] [ 4] /home/hpc_meng/openmpi/lib/libopen-pal.so.0(opal_finalize+0x2f) [0x2a95568a0f]
[jx-mpi-fcr048:04116] [ 5] opal-restart [0x401888]
[jx-mpi-fcr048:04116] [ 6] /lib64/tls/libc.so.6(__libc_start_main+0xdb) [0x302af1c4bb]
[jx-mpi-fcr048:04116] [ 7] opal-restart [0x40147a]
[jx-mpi-fcr048:04116] *** End of error message ***
--
mpirun noticed that process rank 1 with PID 4116 on node
jx-mpi-fcr048.jx.baidu.com exited on signal 11 (Segmentation fault).
--

My global_snapshot_meta.data is:

# Seq: 0
# Timestamp: Thu Dec 23 16:39:46 2010
# Process: 1680080897.0
# OPAL CRS Component: blcr
# Snapshot Reference: opal_snapshot_0.ckpt
# Snapshot Location: /home/work/checkpoint/ompi_global_snapshot_22817.ckpt/0
# Process: 1680080897.1
# OPAL CRS Component: blcr
# Snapshot Reference: opal_snapshot_1.ckpt
# Snapshot Location: /home/work/checkpoint/ompi_global_snapshot_22817.ckpt/0
# Timestamp: Thu Dec 23 16:39:47 2010
# Finished Seq: 0

Does anybody know why?

Thanks
Xianjun Meng


Re: [OMPI users] Call to MPI_Test has large time-jitter

2010-12-23 Thread Yiannis Papadopoulos
On Fri, Dec 17, 2010 at 5:43 PM, Sashi Balasingam wrote:

> Hi,
> I recently started on an MPI-based, 'real-time', pipelined-processing
> application, and the application fails due to large time-jitter in sending
> and receiving messages. Here is the related information:
>
> 1) Platform:
> a) Intel Box: Two Hex-core, Intel Xeon, 2.668 GHz (...total of 12 cores),
> b) OS: SUSE Linux Enterprise Server 11 (x86_64) - Kernel \r (\l)
> c) MPI Rev: (OpenRTE) 1.4, (...Installed OFED package)
> d) HCA: InfiniBand: Mellanox Technologies MT26428 [ConnectX IB QDR, PCIe
> 2.0 5GT/s] (rev a0)
>
> 2) Application detail
>
> a) Launching 7 processes for pipelined processing, where each process
> waits for a message (sizes vary between 1 and 26 KBytes),
> then processes the data and outputs a message (sizes vary between 1 and
> 26 KBytes) to the next process.
>
> b) MPI transport functions used: MPI_Isend, MPI_Irecv, MPI_Test.
>    i) For receiving messages, I first make an MPI_Irecv call, followed by a
>       busy-loop on MPI_Test, waiting for the message.
>    ii) For sending messages, there is a busy-loop on MPI_Test to ensure the
>       prior buffer was sent, then MPI_Isend is used.
>
> c) When the job starts, all 7 processes are put in high-priority mode
> (SCHED_FIFO policy, with a priority setting of 99).
> The job entails an input data packet stream (and a series of MPI messages),
> arriving continually at a 40-microsecond rate, for a few minutes.
>
> 3) The Problem:
> Most calls to MPI_Test (which is non-blocking) take a few microseconds,
> but for around 10% of the job there is large jitter, varying from 1 to
> 100-odd milliseconds. This causes
> some of the application input queues to fill up and leads to a failure.
>
> Any suggestions to look at on the MPI settings or OS config/issues will be
> much appreciated.
>
> Thanks in advance.
> Sanji
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>

I had a similar issue. A workaround is to avoid polling too often by placing
some kind of timer in your code before the MPI_Test call.
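
A minimal sketch of that workaround follows, in C. It is only an illustration
under stated assumptions, not code from the original application: the names
poll_throttled, poll_fn, fake_request and fake_test are hypothetical, and the
pause between polls is implemented here with nanosleep, which is one possible
choice of "timer". To keep the sketch free of an MPI dependency, the poll is a
callback; the comment shows how MPI_Test could plug into it.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L   /* for nanosleep under strict C modes */
#endif
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical callback type: returns true once the operation completed.
 * With MPI (not compiled here), a callback might look like:
 *   bool mpi_test_cb(void *ctx) {
 *       int flag = 0;
 *       MPI_Test((MPI_Request *)ctx, &flag, MPI_STATUS_IGNORE);
 *       return flag != 0;
 *   }
 */
typedef bool (*poll_fn)(void *ctx);

/* Call `poll` until it reports completion or `max_polls` attempts are used,
 * sleeping `pause_ns` nanoseconds after each unsuccessful attempt.
 * Returns the number of polls performed on success, or -1 on giving up. */
long poll_throttled(poll_fn poll, void *ctx, long pause_ns, long max_polls)
{
    assert(poll != NULL && pause_ns >= 0 && pause_ns < 1000000000L);
    struct timespec pause = { .tv_sec = 0, .tv_nsec = pause_ns };
    for (long n = 1; n <= max_polls; ++n) {
        if (poll(ctx))
            return n;
        nanosleep(&pause, NULL);   /* back off instead of spinning */
    }
    return -1;
}

/* Stand-in for a pending request: completes on the `ready_at`-th poll. */
struct fake_request { long polls; long ready_at; };

bool fake_test(void *ctx)
{
    struct fake_request *r = ctx;
    return ++r->polls >= r->ready_at;
}
```

For example, poll_throttled(fake_test, &req, 1000L, 10) polls the stand-in
request with a 1-microsecond pause between attempts. Whether sleeping is
acceptable under SCHED_FIFO with a 40-microsecond input rate depends on the
system; the pause length here is an assumption to be tuned, not a
recommendation.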


[OMPI users] srun and openmpi

2010-12-23 Thread Michael Di Domenico
Can anyone point me towards the most recent documentation for using
srun and openmpi?

I followed what I found on the web, enabling the MpiPorts config
in Slurm and using the --resv-ports switch, but I'm getting an error
from Open MPI during setup.

I'm using Slurm 2.1.15 and Open MPI 1.5 w/PSM.

I'm sure I'm missing a step.

Thanks


Re: [OMPI users] srun and openmpi

2010-12-23 Thread Ralph Castain
I'm not sure there is any documentation yet - not much clamor for it. :-/

It would really help if you included the error message. Otherwise, all I can do 
is guess, which wastes both of our time :-(

My best guess is that the port reservation didn't get passed down to the MPI 
procs properly - but that's just a guess.


Re: [OMPI users] Openmpi Checkpoint/Restart failed

2010-12-23 Thread 孟宪军
Dear all,

I have figured it out. It was a simple issue: I hadn't added the "blcr lib" to
the $PATH environment variable. Even so, the checkpoint operation worked;
only the restart operation failed. It was so weird.


Best regards
Xianjun Meng
