[OMPI users] Grid launcher question.

2007-04-05 Thread Xie, Hugh

Hi,

Has anyone attempted to integrate Open MPI with a commercial grid scheduler
(e.g. Platform Symphony)? The integration would be similar to the
LoadLeveler one. I am trying to understand whether the integration can be done
simply, without Platform Symphony having to change their code. Otherwise, I
will have to contact them to obtain an integration patch.

Thanks.


[OMPI users] openmpi and Torque

2007-04-05 Thread Bas van der Vlies

Hello,

 I am trying to enable PBS/Torque support in Open MPI with the
--with-tm option.  My question is why the utility 'pbs-config' is not
used to determine the location of the include/library directories.  It
is included as standard in the Torque software.



# pbs-config --cflags
{{{
-I/usr/include/torque
}}}

# pbs-config --libs
{{{
-ltorque
}}}
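
A workaround might be to pass the pbs-config output to configure by hand. This
is only an untested sketch, and I do not know whether the tm check honours
these variables in every Open MPI version:

{{{
# Untested workaround sketch: feed pbs-config's flags to Open MPI's configure
./configure --with-tm \
    CPPFLAGS="$(pbs-config --cflags)" \
    LIBS="$(pbs-config --libs)"
}}}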

--
Bas van der Vlies
b...@sara.nl





[OMPI users] MPI 1.2 stuck in pthread_condition_wait

2007-04-05 Thread herve PETIT Perso

Some clarification about this thread.

I have read the answer you provided in the "MPI_Comm_Spawn" thread posted by
rozzen.vincent.
I have reproduced the same behavior on my Debian Sarge installation, i.e.:
1) MPI_Comm_spawn failure after 31 spawns ("--disable-threads" is set)
2) MPI applications lock up when "--enable-threads" is set

* For issue 1)
The Open MPI 1.2 release solves the problem, so it does not seem to be a system
limitation; in any case, it is now behind us.
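
For reference, here is a minimal sketch of the kind of spawn loop that triggers
the failure. It is only an illustration of the scenario, not the actual test
program, and the child executable name is made up:

{{{
/* Sketch of a spawn-loop reproducer (an illustration only, not the real
 * ExeToSpawned6.c). The parent repeatedly spawns one copy of a trivial
 * child executable named "./child". */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    for (int i = 0; i < 40; i++) {           /* failures were seen after 31 spawns */
        MPI_Comm child;
        MPI_Comm_spawn("./child", MPI_ARGV_NULL, 1, MPI_INFO_NULL,
                       0, MPI_COMM_SELF, &child, MPI_ERRCODES_IGNORE);
        printf("spawn %d succeeded\n", i + 1);
        MPI_Comm_disconnect(&child);          /* release the intercommunicator */
    }

    MPI_Finalize();
    return 0;
}
}}}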

* For issue 2)
I have been in contact with Rozenn. After a short discussion with her, I ran a new test
with an "--enable-debug" build of Open MPI 1.2 (stable version).

The gdb log is fairly explicit about the deadlock situation.
-
main***
main : Start MPI*
opal_mutex_lock(): Resource deadlock avoided
[host10:20607] *** Process received signal ***
[host10:20607] Signal: Aborted (6)
[host10:20607] Signal code:  (-6)
[host10:20607] [ 0] [0xe440]
[host10:20607] [ 1] /lib/tls/libc.so.6(abort+0x1d2) [0x4029cfa2]
[host10:20607] [ 2] /usr/local/Mpi/CURRENT_MPI/lib/libmpi.so.0 [0x40061d25]
[host10:20607] [ 3] /usr/local/Mpi/CURRENT_MPI/lib/libmpi.so.0 [0x4006030e]
[host10:20607] [ 4] /usr/local/Mpi/CURRENT_MPI/lib/libmpi.so.0 [0x40061e23]
[host10:20607] [ 5] /usr/local/Mpi/CURRENT_MPI/lib/libmpi.so.0 [0x40060175]
[host10:20607] [ 6] /usr/local/Mpi/CURRENT_MPI/lib/libmpi.so.0 [0x40061da3]
[host10:20607] [ 7] /usr/local/Mpi/CURRENT_MPI/lib/libmpi.so.0 [0x40062315]
[host10:20607] [ 8] 
/usr/local/Mpi/CURRENT_MPI/lib/libmpi.so.0(ompi_proc_unpack+0x15a) [0x40061392]
[host10:20607] [ 9] 
/usr/local/Mpi/CURRENT_MPI/lib/libmpi.so.0(ompi_comm_connect_accept+0x45c) 
[0x4004dd62]
[host10:20607] [10] 
/usr/local/Mpi/CURRENT_MPI/lib/libmpi.so.0(PMPI_Comm_spawn+0x346) [0x400949a8]
[host10:20607] [11] spawn(main+0xe2) [0x80489a6]
[host10:20607] [12] /lib/tls/libc.so.6(__libc_start_main+0xf4) [0x40288974]
[host10:20607] [13] spawn [0x8048821]
[host10:20607] *** End of error message ***
[host10:20602] [0,0,0]-[0,1,0] mca_oob_tcp_msg_recv: readv failed: Connection 
reset by peer (104)



So, it seems that the lock is in the spawn code.
I have also discovered that the spawned program is locked in the spawn
mechanism as well.
Below is a gdb backtrace from the spawned program.


--
#0  0x4019c436 in __lll_mutex_lock_wait () from /lib/tls/libpthread.so.0
#1  0x40199893 in _L_mutex_lock_26 () from /lib/tls/libpthread.so.0
#2  0xb4b8 in ?? ()
#3  0xb4b8 in ?? ()
#4  0x in ?? ()
#5  0x400a663c in __JCR_LIST__ () from 
/usr/local/Mpi/CURRENT_MPI/lib/libmpi.so.0
#6  0x400a663c in __JCR_LIST__ () from 
/usr/local/Mpi/CURRENT_MPI/lib/libmpi.so.0
#7  0x401347a4 in opal_condition_t_class () from 
/usr/local/Mpi/CURRENT_MPI/lib/libopen-pal.so.0
#8  0xb4e8 in ?? ()
#9  0x400554a8 in ompi_proc_construct () from 
/usr/local/Mpi/CURRENT_MPI/lib/libmpi.so.0
#10 0x400554a8 in ompi_proc_construct () from 
/usr/local/Mpi/CURRENT_MPI/lib/libmpi.so.0
#11 0x40056946 in ompi_proc_find_and_add () from 
/usr/local/Mpi/CURRENT_MPI/lib/libmpi.so.0
#12 0x4005609e in ompi_proc_unpack () from 
/usr/local/Mpi/CURRENT_MPI/lib/libmpi.so.0
#13 0x400481cd in ompi_comm_connect_accept () from 
/usr/local/Mpi/CURRENT_MPI/lib/libmpi.so.0
#14 0x40049b2a in ompi_comm_dyn_init () from 
/usr/local/Mpi/CURRENT_MPI/lib/libmpi.so.0
#15 0x40058e6d in ompi_mpi_init () from 
/usr/local/Mpi/CURRENT_MPI/lib/libmpi.so.0
#16 0x4007e122 in PMPI_Init_thread () from 
/usr/local/Mpi/CURRENT_MPI/lib/libmpi.so.0
#17 0x08048a3b in main (argc=1, argv=0xb844) at ExeToSpawned6.c:31
---

Hopefully this can help you investigate.



Herve


Re: [OMPI users] MPI 1.2 stuck in pthread_condition_wait

2007-04-05 Thread Ralph Castain
Thanks Herve - and Rozenn too.

I can't speak to the thread lock issue as it appears to be occurring in the
MPI side of the code.

As to the spawn limit, I honestly never checked the 1.1.x code family as we
aren't planning any repairs to it anyway. My observations were based on the
1.2 family. We have done our own fairly extensive testing and found there
are system-imposed limits that do cause problems, but that the levels at
which these occur are *very* system dependent - i.e., they depend upon
kernel configuration parameters that vary across releases, how your system
admin configured things, etc. They are, therefore, impossible to predict.
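
To give a rough idea (only a sketch; the relevant knobs and their names vary by
kernel version and site configuration), a few of the Linux limits that commonly
come into play can be inspected like this:

{{{
# Sketch: a few Linux limits that can affect repeated spawning
ulimit -u                          # maximum user processes
ulimit -n                          # maximum open file descriptors
cat /proc/sys/kernel/pid_max       # system-wide PID limit
cat /proc/sys/kernel/threads-max   # system-wide thread limit
}}}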

What we are going to do is modify the code so we can at least detect these
situations, alert you to them, and gracefully exit when we encounter them.
Hopefully, we'll have those fixes out soon.

Thanks again
Ralph

