[OMPI users] [openMPI-infiniband] openMPI in IB network when openSM with LASH is running

2007-11-28 Thread Keshetti Mahesh
Has anyone on the list ever tested Open MPI on an InfiniBand network
in which openSM is running with the LASH routing algorithm enabled?

I haven't tested the above case, but I can foresee a problem,
because the LASH routing algorithm in openSM uses virtual
lanes (VLs), which are directly mapped to service levels (SLs).
The LASH routing algorithm assigns different VLs (SLs) to different
paths in the network, and this SL <-> path association is available only
through the subnet manager (openSM) at connection establishment time.
But AFAIK, Open MPI doesn't use the services of the subnet manager for
connection establishment between nodes. So I want to know whether anyone
has thought about this and is working on it.

regards,
Mahesh


[OMPI users] version 1.3

2007-11-28 Thread Neeraj Chourasia
Hello Guys,

   When is version 1.3 scheduled to be released? Since it will contain
checkpointing, a library for non-blocking communication, and ConnectX support
for QPs, it would be great to have it ASAP. I am evaluating MVAPICH against
OpenMPI, and I found that MVAPICH still has the upper hand in terms of
checkpointing, but I am pretty sure that once v1.3 comes out it will help the
HPC community a lot.

I can find the development trunk version, but I am more interested in the
production release version.

-Neeraj
  


[OMPI users] ./configure error on windows while installing openmpi-1.2.4(latest)

2007-11-28 Thread geetha r
Hi,
   Subject: "Need exact command line for ./configure {optionslist} "  to
build OPENMPI-1.2.4 on windows."


While the configure script is checking the Fortran 77 compiler, I am getting
the following error, so the Open MPI build is unsuccessful on Windows (with
the configure script):

 checking for correct handling of FORTRAN logical arrays... no
configure: error: Error determining if arrays of logical values work
properly.


I want to build openmpi-1.2.4 (downloaded; I am using MinGW) on a Windows
2000 machine.

Can somebody give me the proper build command I can use to build Open MPI on
a Windows 2000 machine,

i.e.

 ./configure  ...(options list)

Can somebody please tell me the exact options to pass in the options list?

I am using Cygwin to build Open MPI on Windows.

PS:
I am attaching the output files.

config.log -> the actual log file
config.out -> output of the ./configure run
make.out -> failed, because configure did not succeed on Windows
make.install -> failed, because configure did not succeed on Windows


PS: I am using g77, g++, and gcc from the MinGW package.

I have also downloaded and added g95, but that does not solve my problem.

Thanks,
Geetha






Re: [OMPI users] SegFault with MPI_THREAD_MULTIPLE in 1.2.4

2007-11-28 Thread Jeff Squyres
This is to be expected.  OMPI's support for THREAD_MULTIPLE is incomplete and
most likely doesn't work.



On Nov 25, 2007, at 6:45 PM, Emilio J. Padron wrote:


Hi,

it's my first message here, so greetings to everyone (and sorry about my
poor English) :-)

I'm coding a parallel algorithm, and this week I decided to upgrade the
Open MPI version used in our cluster (1.2.3). After that, problems arose :-/

There seems to be some problem with multithreading support in OpenMPI 1.2.4,
at least in my installation. The problem appears when more than one process
per node is spawned. A simple *hello world* program (with no sends/receives)
works fine in MPI_THREAD_SINGLE mode, but when I try MPI_THREAD_MULTIPLE this
error arises (a minimal sketch of such a test program is included below,
after the trace):

/opt/openmpi/bin/mpirun -np 2 -machinefile /home/users/emilioj/machinefileOpenMPI --debug-daemons justhi

Daemon [0,0,1] checking in as pid 5446 on host c0-0
[pvfs2-compute-0-0.local:05446] [0,0,1] orted: received launch callback
[pvfs2-compute-0-0:05447] *** Process received signal ***
[pvfs2-compute-0-0:05447] Signal: Segmentation fault (11)
[pvfs2-compute-0-0:05447] Signal code: Address not mapped (1)
[pvfs2-compute-0-0:05447] Failing at address: (nil)
[pvfs2-compute-0-0:05448] *** Process received signal ***
[pvfs2-compute-0-0:05448] Signal: Segmentation fault (11)
[pvfs2-compute-0-0:05448] Signal code: Address not mapped (1)
[pvfs2-compute-0-0:05448] Failing at address: (nil)
[pvfs2-compute-0-0:05448] [ 0] /lib/tls/libpthread.so.0 [0xbb2890]
[pvfs2-compute-0-0:05448] [ 1] /opt/openmpi/lib/openmpi/mca_bml_r2.so(mca_bml_r2_progress+0x39) [0x4b1d99]
[pvfs2-compute-0-0:05448] [ 2] /opt/openmpi/lib/libopen-pal.so.0(opal_progress+0x65) [0x592265]
[pvfs2-compute-0-0:05448] [ 3] /opt/openmpi/lib/openmpi/mca_oob_tcp.so(mca_oob_tcp_msg_wait+0x29) [0x20a731]
[pvfs2-compute-0-0:05448] [ 4] /opt/openmpi/lib/openmpi/mca_oob_tcp.so(mca_oob_tcp_recv+0x365) [0x20f301]
[pvfs2-compute-0-0:05448] [ 5] /opt/openmpi/lib/libopen-rte.so.0(mca_oob_recv_packed+0x38) [0x13c6a0]
[pvfs2-compute-0-0:05448] [ 6] /opt/openmpi/lib/libopen-rte.so.0(mca_oob_xcast+0xa0e) [0x13d36a]
[pvfs2-compute-0-0:05448] [ 7] /opt/openmpi/lib/libmpi.so.0(ompi_mpi_init+0x566) [0xda9f22]
[pvfs2-compute-0-0:05447] [ 0] /lib/tls/libpthread.so.0 [0xbb2890]
[pvfs2-compute-0-0:05447] [ 1] /opt/openmpi/lib/openmpi/mca_bml_r2.so(mca_bml_r2_progress+0x39) [0x305d99]
[pvfs2-compute-0-0:05447] [ 2] /opt/openmpi/lib/libopen-pal.so.0(opal_progress+0x65) [0x9fb265]
[pvfs2-compute-0-0:05447] [ 3] /opt/openmpi/lib/openmpi/mca_oob_tcp.so(mca_oob_tcp_msg_wait+0x29) [0x2ed731]
[pvfs2-compute-0-0:05447] [ 4] /opt/openmpi/lib/openmpi/mca_oob_tcp.so(mca_oob_tcp_recv+0x365) [0x2f2301]
[pvfs2-compute-0-0:05447] [ 5] /opt/openmpi/lib/libopen-rte.so.0(mca_oob_recv_packed+0x38) [0x53c6a0]
[pvfs2-compute-0-0:05447] [ 6] /opt/openmpi/lib/openmpi/mca_gpr_proxy.so(orte_gpr_proxy_put+0x1b0) [0x2c4fc8]
[pvfs2-compute-0-0:05447] [ 7] /opt/openmpi/lib/libopen-rte.so.0(orte_smr_base_set_proc_state+0x244) [0x551420]
[pvfs2-compute-0-0:05447] [ 8] /opt/openmpi/lib/libmpi.so.0(ompi_mpi_init+0x52e) [0x13ceea]
[pvfs2-compute-0-0:05447] [ 9] /opt/openmpi/lib/libmpi.so.0(PMPI_Init_thread+0x5c) [0x15e844]
[pvfs2-compute-0-0:05447] [10] justhi(main+0x36) [0x8048782]
[pvfs2-compute-0-0:05448] [ 8] /opt/openmpi/lib/libmpi.so.0(PMPI_Init_thread+0x5c) [0xdcb844]
[pvfs2-compute-0-0:05448] [ 9] justhi(main+0x36) [0x8048782]
[pvfs2-compute-0-0:05448] [10] /lib/tls/libc.so.6(__libc_start_main+0xd3) [0x970de3]
[pvfs2-compute-0-0:05448] [11] justhi [0x80486c5]
[pvfs2-compute-0-0:05448] *** End of error message ***
[pvfs2-compute-0-0:05447] [11] /lib/tls/libc.so.6(__libc_start_main+0xd3) [0x1a0de3]
[pvfs2-compute-0-0:05447] [12] justhi [0x80486c5]
[pvfs2-compute-0-0:05447] *** End of error message ***
[pvfs2-compute-0-0.local:05446] [0,0,1] orted_recv_pls: received message from [0,0,0]
[pvfs2-compute-0-0.local:05446] [0,0,1] orted_recv_pls: received kill_local_procs



[Ctrl+Z and kill -9 are needed to finish the execution]

The machinefile contains:

c0-0 slots=4
c0-1 slots=4
c0-2 slots=4
c0-3 slots=4
...

If processes are forced to be spawned on different nodes (c0-0 slots=1,
c0-1 slots=1, c0-2 slots=1, c0-3 slots=1...) then there is no error :-?
With the 1.2.3 version (same *configure* options) everything runs perfectly.
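
A minimal test program along these lines (a sketch only -- not the literal
justhi source, which isn't attached) would be:

/* justhi.c (sketch): request MPI_THREAD_MULTIPLE and print the rank */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int provided, rank;

    /* ask for full thread support; 'provided' reports what the library grants */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    printf("hi from rank %d (provided thread level %d)\n", rank, provided);
    MPI_Finalize();
    return 0;
}

Built with mpicc and run exactly as above; with MPI_THREAD_SINGLE in place of
MPI_THREAD_MULTIPLE it runs fine.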


The ompi_info for my openmpi 1.2.4 installation:
                Open MPI: 1.2.4
   Open MPI SVN revision: r16187
                Open RTE: 1.2.4
   Open RTE SVN revision: r16187
                    OPAL: 1.2.4
       OPAL SVN revision: r16187
                  Prefix: /opt/openmpi
 Configured architecture: i686-pc-linux-gnu
           Configured by: root
           Configured on: Sun Nov 25 20:13:42 CET 2007
          Configure host: pvfs2-compute-0-0.local
                Built by: root
                Built on: Sun Nov 25 20:19:55 CET 2007
              Built host: pvfs2-compute-0-0.local
              C bindings: yes

Re: [OMPI users] Newbie: Using hostfile

2007-11-28 Thread Jeff Squyres

Well, that's odd.

What happens if you try to mpirun "hostname" (i.e., a non-MPI application)?
Does it run, or does it hang?
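
For example, something like this (the hostfile name is just a placeholder):

  mpirun -np 3 --hostfile myhosts hostname

should simply print each node's name if basic launching works.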



On Nov 23, 2007, at 6:00 AM, Madireddy Samuel Vijaykumar wrote:


I have been using clusters for some tests. My local host is "lynx",
and I have "puma" and "tiger", which make up the cluster. All have
passwordless ssh enabled. Now if I have the following in my hostfile
(one host per line, in this order):

lynx
puma
tiger

my tests (from lynx) run over the cluster without any issues.

But if I move or remove lynx, so the hostfile is either (one host per line,
in this order)


puma
lynx
tiger

or

puma
tiger

my test (from lynx) just does not get anywhere. It hangs and does not
proceed at all. Is this an issue with the way my script handles the cluster
nodes, or is there a required ordering/format for the hostfile? Thanks.

--
Sam aka Vijju
:)~
Linux: Open, True and Cool



--
Jeff Squyres
Cisco Systems



Re: [OMPI users] [openMPI-infiniband] openMPI in IB network when openSM with LASH is running

2007-11-28 Thread Jeff Squyres
There is work starting literally right about now to allow Open MPI to use the
RDMA CM and/or the IBCM for creating OpenFabrics connections (IB or iWARP).



On Nov 28, 2007, at 4:37 AM, Keshetti Mahesh wrote:


Has anyone on the list ever tested Open MPI on an InfiniBand network
in which openSM is running with the LASH routing algorithm enabled?

I haven't tested the above case, but I can foresee a problem,
because the LASH routing algorithm in openSM uses virtual
lanes (VLs), which are directly mapped to service levels (SLs).
The LASH routing algorithm assigns different VLs (SLs) to different
paths in the network, and this SL <-> path association is available only
through the subnet manager (openSM) at connection establishment time.
But AFAIK, Open MPI doesn't use the services of the subnet manager for
connection establishment between nodes. So I want to know whether anyone
has thought about this and is working on it.

regards,
Mahesh



--
Jeff Squyres
Cisco Systems



Re: [OMPI users] OpenIB problems

2007-11-28 Thread Jeff Squyres
Roland thought that the default value of 10 might be a bit too low and that
tuning it to be higher, particularly in apps that pound on a single port,
would probably be acceptable.

Tuning up to 20 is probably a bit overkill.



On Nov 27, 2007, at 3:54 PM, Jeff Squyres wrote:

BTW, Andrew is correct about the unit for btl_openib_ib_timeout and that the
value is simply passed down to the verbs library when making an IB connection.
Open MPI does nothing else with that value; it's an IBTA-defined value.

The help message was wrong on the 1.2 branch for a while; I think it's been
corrected in more recent versions of OMPI (i.e., >1.2 -- I don't recall which
version specifically).



On Nov 27, 2007, at 3:19 PM, Andrew Friedley wrote:




Brock Palen wrote:
> What would be a place to look?  Should this just be the default then for
> OMPI?  ompi_info shows the default as 10 seconds?  Is that right,
> 'seconds'?

The other IB guys can probably answer better than I can -- I'm not an
expert in this part of IB (or really any part, I guess :).  Not sure why
a larger value isn't the default.  No, it's not seconds -- check the
description of the MCA parameter:

4.096 microseconds * (2^btl_openib_ib_timeout)


You sure?
ompi_info --param btl openib

MCA btl: parameter "btl_openib_ib_timeout" (current value: "10")
 InfiniBand transmit timeout, in seconds
(must be >= 1)


Yeah:

MCA btl: parameter "btl_openib_ib_timeout" (current value: "10")
InfiniBand transmit timeout, plugged into formula:
4.096 microseconds * (2^btl_openib_ib_timeout)  (must be >= 0 and <= 31)
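
(For scale -- these numbers are just that formula evaluated, not anything
ompi_info reports: a value of 10 gives 4.096 us * 2^10 ~= 4.2 ms, 14 gives
~= 67 ms, 20 gives ~= 4.3 s, and the maximum of 31 gives ~= 2.4 hours.)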


Reading earlier in the thread, you said OMPI v1.2.0; I got this from a trunk
checkout that's around 3 weeks old.  A quick check shows this description was
changed between 1.2.0 and 1.2.1.  However, the use of this parameter hasn't
changed -- it's simply passed along to IB verbs when creating a queue pair
(aka a connection).

Andrew



--
Jeff Squyres
Cisco Systems





--
Jeff Squyres
Cisco Systems



Re: [OMPI users] OpenIB problems

2007-11-28 Thread Andrew Friedley
What value do you suggest then?  I know I've seen the problem persist at 
values of 14 and 16, and would rather be certain that this isn't going 
to kill the job that just sat in the queue for a week.


Andrew

Jeff Squyres wrote:
Roland thought that the default value of 10 might be a bit too low and  
that tuning it to be higher, particularly in apps that pound on a  
single port, would probably be acceptable.


Tuning up to 20 is probably a bit overkill.



On Nov 27, 2007, at 3:54 PM, Jeff Squyres wrote:

BTW, Andrew is correct about the unit for btl_openib_ib_timeout and  
that the value is simply passed down to the verbs library when  
making an IB connection.  Open MPI does nothing else with that  
value; it's an IBTA-defined value.


The help message was wrong on the 1.2 branch for a while; I think  
it's been corrected in more recent versions of OMPI (i.e., >1.2 -- I  
don't recall which version specifically).



On Nov 27, 2007, at 3:19 PM, Andrew Friedley wrote:



Brock Palen wrote:
> What would be a place to look?  Should this just be the default then for
> OMPI?  ompi_info shows the default as 10 seconds?  Is that right,
> 'seconds'?

The other IB guys can probably answer better than I can -- I'm not an
expert in this part of IB (or really any part, I guess :).  Not sure why
a larger value isn't the default.  No, it's not seconds -- check the
description of the MCA parameter:

4.096 microseconds * (2^btl_openib_ib_timeout)

You sure?
ompi_info --param btl openib

MCA btl: parameter "btl_openib_ib_timeout" (current value: "10")
 InfiniBand transmit timeout, in seconds
(must be >= 1)

Yeah:

MCA btl: parameter "btl_openib_ib_timeout" (current value: "10")
InfiniBand transmit timeout, plugged into formula:
4.096 microseconds * (2^btl_openib_ib_timeout)  (must be >= 0 and <= 31)

Reading earlier in the thread you said OMPI v1.2.0, I got this from a
trunk checkout thats around 3 weeks old.  A quick check shows this
description was changed between 1.2.0 and 1.2.1.  However the use of
this parameter hasn't changed -- it's simply passed along to IB verbs
when creating a queue pair (aka a connection).

Andrew


--
Jeff Squyres
Cisco Systems







Re: [OMPI users] SegFault with MPI_THREAD_MULTIPLE in 1.2.4

2007-11-28 Thread Emilio J. Padron
Hi Jeff,

thank you for your answer...

> >
> > [... regarding SegFault with MPI_THREAD_MULTIPLE in OMPI 1.2.4 ...]
> >
On Wed, Nov 28, 2007 at 11:27:51AM -0500, Jeff Squyres wrote:
> This is to be expected.  OMPI's support for THREAD_MULTIPLE is  
> incomplete and most likely doesn't work.
> 

OK, I knew it was not *too* well tested, but I had been using previous
versions with *more or less* success, with a relatively small number of
communications in a multithreaded (POSIX) environment. This latest 1.2.4
version breaks even with no explicit comms in the program. It just seemed
quite weird to me for a (supposedly) minor-changes revision :-?

Anyway, thanks to the OMPI team for all the hard work. I hope complete
thread-safety support is available in future revisions :-)

Cheers,
E.


Re: [OMPI users] version 1.3

2007-11-28 Thread Jeff Squyres
v1.3's schedule is being developed right now -- it was a little hard to hear
on the teleconference yesterday, but I think I heard Brad Benton from IBM (one
of the two Release Managers for the v1.3 series) say that he'd have a plan for
review by the group next week.

So far, I've been [wildly] estimating 1HCY2008.  I don't think it would be a
good idea to try to be any more precise before we get the RM's input.  :-)




On Nov 28, 2007, at 5:39 AM, Neeraj Chourasia wrote:

   When is version 1.3 scheduled to be released? Since it will contain
checkpointing, a library for non-blocking communication, and ConnectX support
for QPs, it would be great to have it ASAP. I am evaluating MVAPICH against
OpenMPI, and I found that MVAPICH still has the upper hand in terms of
checkpointing, but I am pretty sure that once v1.3 comes out it will help the
HPC community a lot.

I can find the development trunk version, but I am more interested in the
production release version.





--
Jeff Squyres
Cisco Systems



Re: [OMPI users] OpenIB problems

2007-11-28 Thread Ogden, Jeffry Brandon
For what it's worth, Andrew, the RETRY_EXCEEDED_ERRORs can be caused by
flaky hardware as well.  The timeout value is probably best tuned
relative to the size of your IB fabric.  But if reliability is the
biggest criterion, crank up the timeout value to 20.  That's the best
you can do.  If it continues to happen, it is more than likely you have
a flaky HCA, IB link, switch-side software, or node.  We actually have way
too much IB hardware for any sane person, and my experience is that the
RETRY_EXCEEDED_ERRORs can sometimes be really tricky to track down.  One
of my favorites is the spontaneously rebooting node.  We see nodes under
heavy MPI application load sometimes randomly reboot.  This causes the
RETRY_EXCEEDED_ERROR as well.  I would second the recommendation to
watch the IB counters across the entire IB fabric from the subnet
manager.
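
For example, to bump the timeout for a single run (command sketch; the
process count and application name are placeholders):

  mpirun --mca btl_openib_ib_timeout 20 -np 64 ./my_app

or put "btl_openib_ib_timeout = 20" in an MCA parameter file (e.g.
~/.openmpi/mca-params.conf) so that every job picks it up.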

Good luck!

> -Original Message-
> From: users-boun...@open-mpi.org 
> [mailto:users-boun...@open-mpi.org] On Behalf Of Andrew Friedley
> Sent: Wednesday, November 28, 2007 9:36 AM
> To: Open MPI Users
> Subject: Re: [OMPI users] OpenIB problems
> 
> What value do you suggest then?  I know I've seen the problem 
> persist at 
> values of 14 and 16, and would rather be certain that this 
> isn't going 
> to kill the job that just sat in the queue for a week.
> 
> Andrew
> 
> Jeff Squyres wrote:
> > Roland thought that the default value of 10 might be a bit 
> too low and  
> > that tuning it to be higher, particularly in apps that pound on a  
> > single port, would probably be acceptable.
> > 
> > Tuning up to 20 is probably a bit overkill.
> > 
> > 




Re: [OMPI users] ./configure error on windows while installing openmpi-1.2.4(latest)

2007-11-28 Thread George Bosilca
If your F77 compiler does not support arrays of LOGICAL variables (which
seems to be the case if you look in the config.log file), then you're left
with only one option: remove the F77 support from the compilation. This means
adding the --disable-mpi-f77 option to ./configure.
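
For example (the prefix is just a placeholder; keep whatever other options
you were already passing):

  ./configure --prefix=/usr/local/openmpi-1.2.4 --disable-mpi-f77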


  Thanks,
george.

On Nov 28, 2007, at 9:24 AM, geetha r wrote:


Hi,
   Subject: Need the exact command line for ./configure {options list} to
build OpenMPI-1.2.4 on Windows.

While the configure script is checking the Fortran 77 compiler, I am getting
the following error, so the Open MPI build is unsuccessful on Windows (with
the configure script):

 checking for correct handling of FORTRAN logical arrays... no
configure: error: Error determining if arrays of logical values work
properly.

I want to build openmpi-1.2.4 (downloaded; I am using MinGW) on a Windows
2000 machine.

Can somebody give me the proper build command I can use to build Open MPI on
a Windows 2000 machine,

i.e.

 ./configure  ...(options list)

Can somebody please tell me the exact options to pass in the options list?

I am using Cygwin to build Open MPI on Windows.

PS:
I am attaching the output files.

config.log -> the actual log file
config.out -> output of the ./configure run
make.out -> failed, because configure did not succeed on Windows
make.install -> failed, because configure did not succeed on Windows

PS: I am using g77, g++, and gcc from the MinGW package.

I have also downloaded and added g95, but that does not solve my problem.

Thanks,
Geetha








Re: [OMPI users] ./configure error on windows while installing openmpi-1.2.4(latest)

2007-11-28 Thread Terry Frankcombe
On Wed, 2007-11-28 at 13:20 -0500, George Bosilca wrote:
> If your F77 compiler does not support arrays of LOGICAL variables (which
> seems to be the case if you look in the config.log file), then you're left
> with only one option: remove the F77 support from the compilation. This
> means adding the --disable-mpi-f77 option to ./configure.

It's a lot weirder than that.

configure: WARNING: *** Fortran 77 REAL*8 does not have expected size!
configure: WARNING: *** Expected 8, got 8
configure: WARNING: *** Disabling MPI support for Fortran 77 REAL*8

Somehow, 8/=8

:-\




[OMPI users] mca_oob_tcp_peer_try_connect problem

2007-11-28 Thread Bob Soliday

I am new to openmpi and have a problem that I cannot seem to solve.
I am trying to run the hello_c example and I can't get it to work.
I compiled openmpi with:

./configure --prefix=/usr/local/software/openmpi-1.2.4 --disable-ipv6 
--with-openib

The hostfile contains the local host and one other node. When I run it, I
get:


[soliday@max14 mpi-ex]$ /usr/local/software/openmpi-1.2.4/bin/mpirun --debug-daemons -mca oob_tcp_debug 1000 -machinefile hostfile -np 2 hello_c
[max14:31465] [0,0,0] accepting connections via event library
[max14:31465] [0,0,0] mca_oob_tcp_init: calling orte_gpr.subscribe
[max14:31466] [0,0,1] accepting connections via event library
[max14:31466] [0,0,1] mca_oob_tcp_init: calling orte_gpr.subscribe
[max14:31466] [0,0,1]-[0,0,0] mca_oob_tcp_send: tag 2
[max14:31466] [0,0,1]-[0,0,0] mca_oob_tcp_peer_try_connect: connecting port 55152 to: 192.168.2.14:38852
[max14:31466] [0,0,1]-[0,0,0] mca_oob_tcp_peer_complete_connect: sending ack, 0
[max14:31465] [0,0,0] mca_oob_tcp_accept: 192.168.2.14:37255
[max14:31465] [0,0,0]-[0,0,1] accepted: 192.168.2.14 - 192.168.2.14 nodelay 1 sndbuf 262142 rcvbuf 262142 flags 0802
[max14:31466] [0,0,1]-[0,0,0] connected: 192.168.2.14 - 192.168.2.14 nodelay 1 sndbuf 262142 rcvbuf 262142 flags 0802
[max14:31466] [0,0,1]-[0,0,0] mca_oob_tcp_recv: tag 2
[max14:31466] [0,0,1]-[0,0,0] mca_oob_tcp_send: tag 2
[max14:31466] [0,0,1]-[0,0,0] mca_oob_tcp_recv: tag 2
Daemon [0,0,1] checking in as pid 31466 on host max14
[max14:31466] [0,0,1]-[0,0,0] mca_oob_tcp_send: tag 2
[max14:31466] [0,0,1]-[0,0,0] mca_oob_tcp_recv: tag 2
[max15:28222] [0,0,2]-[0,0,0] mca_oob_tcp_peer_try_connect: connect to 192.168.1.14:38852 failed: Software caused connection abort (103)
[max15:28222] [0,0,2]-[0,0,0] mca_oob_tcp_peer_try_connect: connect to 192.168.1.14:38852 failed: Software caused connection abort (103)
[max15:28222] [0,0,2]-[0,0,0] mca_oob_tcp_peer_try_connect: connect to 192.168.1.14:38852 failed, connecting over all interfaces failed!
[max15:28222] OOB: Connection to HNP lost
[max14:31466] [0,0,1] orted_recv_pls: received message from [0,0,0]
[max14:31466] [0,0,1] orted_recv_pls: received kill_local_procs
[max14:31466] [0,0,1]-[0,0,0] mca_oob_tcp_send: tag 15
[max14:31465] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 275
[max14:31465] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1166
[max14:31465] [0,0,0] ORTE_ERROR_LOG: Timeout in file errmgr_hnp.c at line 90
[max14:31465] ERROR: A daemon on node max15 failed to start as expected.
[max14:31465] ERROR: There may be more information available from
[max14:31465] ERROR: the remote shell (see above).
[max14:31465] ERROR: The daemon exited unexpectedly with status 1.
[max14:31466] [0,0,1] orted_recv_pls: received message from [0,0,0]
[max14:31466] [0,0,1] orted_recv_pls: received exit
[max14:31466] [0,0,1]-[0,0,0] mca_oob_tcp_send: tag 15
[max14:31465] [0,0,0]-[0,0,1] mca_oob_tcp_msg_recv: peer closed connection
[max14:31465] [0,0,0]-[0,0,1] mca_oob_tcp_peer_close(0x523100) sd 6 state 4
[max14:31465] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 188
[max14:31465] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1198
--
mpirun was unable to cleanly terminate the daemons for this job. Returned value Timeout instead of ORTE_SUCCESS.
--



I can see that the orted daemon program is starting on both computers, but it
looks to me like they can't talk to each other.

Here is the output from ifconfig on one of the nodes, the other node is similar.

[root@max14 ~]# /sbin/ifconfig
eth0  Link encap:Ethernet  HWaddr 00:17:31:9C:93:A1
  inet addr:192.168.2.14  Bcast:192.168.2.255  Mask:255.255.255.0
  inet6 addr: fe80::217:31ff:fe9c:93a1/64 Scope:Link
  UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
  RX packets:1353 errors:0 dropped:0 overruns:0 frame:0
  TX packets:9572 errors:0 dropped:0 overruns:0 carrier:0
  collisions:0 txqueuelen:1000
  RX bytes:188125 (183.7 KiB)  TX bytes:1500567 (1.4 MiB)
  Interrupt:17

eth1  Link encap:Ethernet  HWaddr 00:17:31:9C:93:A2
  inet addr:192.168.1.14  Bcast:192.168.1.255  Mask:255.255.255.0
  inet6 addr: fe80::217:31ff:fe9c:93a2/64 Scope:Link
  UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
  RX packets:49652796 errors:0 dropped:0 overruns:0 frame:0
  TX packets:49368158 errors:0 dropped:0 overruns:0 carrier:0
  collisions:0 txqueuelen:1000
  RX bytes:21844618928 (20.3 GiB)  TX bytes:16122676331 (15.0 GiB)
  Interrupt:19

loLink encap:Local Loopback
  inet addr:127.0.0.1  Mask:255.0.0.0
  inet6 addr: ::1/128 Scope:Host
  UP LOOPBACK RUNNING  MTU:16436  Metric:1
 

Re: [OMPI users] OpenIB problems

2007-11-28 Thread Brock Palen

Jeff, thanks for all the replies.

I hate to admit it, but at the moment we can't log onto the switch.

However, the ibcheckerrors command returns nothing out of bounds, and I think
that command also checks the switch ports.

Thanks, we will do some tests.

Brock Palen
Center for Advanced Computing
bro...@umich.edu
(734)936-1985


On Nov 27, 2007, at 4:50 PM, Jeff Squyres wrote:


Sorry for jumping in late; the holiday and other travel prevented me
from getting to all my mail recently...  :-\

Have you checked the counters on the subnet manager to see if any
other errors are occurring?  It might be good to clear all the
counters, run the job, and see if the counters are increasing faster
than they should (i.e., any particular counter should advance very
very slowly -- perhaps 1 per day or so).

I'll ask around the kernel-level guys (i.e., Roland) to see what else
could cause this kind of error.



On Nov 27, 2007, at 3:35 PM, Brock Palen wrote:


Ok i will open a case with cisco,


Brock Palen
Center for Advanced Computing
bro...@umich.edu
(734)936-1985


On Nov 27, 2007, at 4:19 PM, Andrew Friedley wrote:




Brock Palen wrote:
> What would be a place to look?  Should this just be the default then for
> OMPI?  ompi_info shows the default as 10 seconds?  Is that right,
> 'seconds'?

The other IB guys can probably answer better than I can -- I'm not an
expert in this part of IB (or really any part, I guess :).  Not sure why
a larger value isn't the default.  No, it's not seconds -- check the
description of the MCA parameter:

4.096 microseconds * (2^btl_openib_ib_timeout)


You sure?
ompi_info --param btl openib

MCA btl: parameter "btl_openib_ib_timeout" (current value: "10")
  InfiniBand transmit timeout, in seconds
(must be >= 1)


Yeah:

MCA btl: parameter "btl_openib_ib_timeout" (current value: "10")
InfiniBand transmit timeout, plugged into formula:
4.096 microseconds * (2^btl_openib_ib_timeout)  (must be >= 0 and <= 31)


Reading earlier in the thread, you said OMPI v1.2.0; I got this from a trunk
checkout that's around 3 weeks old.  A quick check shows this description was
changed between 1.2.0 and 1.2.1.  However, the use of this parameter hasn't
changed -- it's simply passed along to IB verbs when creating a queue pair
(aka a connection).

Andrew







--
Jeff Squyres
Cisco Systems







[OMPI users] Run a process double

2007-11-28 Thread Henry Adolfo Lambis Miranda
Hi everybody out there.

This is my first post to the mailing list.
I have installed Open MPI 1.2.4 on an x86_64 AMD dual-processor machine with
SuSE Linux.
In principle, the installation was successful, with ifort 10.x.
But when I run any code (mpirun -np 2 a.out), instead of sharing the
calculations between the two processors, the system duplicates the executable
and sends one copy to each processor.

I don't know what the h$%& is going on...



regards..

Henry

-- 
Henry Adolfo Lambis Miranda, Chem.Eng.
Molecular Simulation Group I & II
Rovira i Virgili University.
http://www.etseq.urv.es/ms
Av. Països Catalans, 26
C.P. 43007. Tarragona, Catalunya
Espanya.


"You will not be able to stay home, brother.
You will not be able to plug in, turn on and cop out
(...) Because the revolution will not be televised."
Gil Scott-Heron (The Revolution Will Not Be Televised, 1974)

"Success is a rather repugnant thing. Its false resemblance to merit deceives
men." -- Victor Hugo (1802-1885), French novelist.

"The military man is a plant that must be tended with care so that it does
not bear fruit." -- Jacques Tati.

"Freedom comes in small packets, usually TCP/IP."

Colombian reality bite:
http://www.youtube.com/watch?v=jn3vM_5kIgM

http://en.wikipedia.org/wiki/Cartagena,_Colombia

http://www.youtube.com/watch?v=cvxMWSsrwg0

http://www.youtube.com/watch?v=eVmYf5U6x3k












Re: [OMPI users] Run a process double

2007-11-28 Thread Damien Hocking
That's what's supposed to happen; it's how MPI works.  Process 0 is the head
or boss process, the others are slaves, and they execute partially different
code even though they're in the same executable.  MPI is multi-process, not
multi-threaded.


Damien

Henry Adolfo Lambis Miranda wrote:

Hi everybody out there.

This is my first post to the mailing list.
I have installed Open MPI 1.2.4 on an x86_64 AMD dual-processor machine with
SuSE Linux.
In principle, the installation was successful, with ifort 10.x.
But when I run any code (mpirun -np 2 a.out), instead of sharing the
calculations between the two processors, the system duplicates the executable
and sends one copy to each processor.

I don't know what the h$%& is going on...



regards..

Henry





Re: [OMPI users] Run a process double

2007-11-28 Thread Mark Potts

Henry,
   Apologies ahead of time for any unintended insults, but...

   Your "a.out" sounds like it is not truly a parallel code.  If you
   submit a hello_world program using OpenMPI's mpirun, you will simply
   get two copies of "Hello World" printed to the screen.

   If you want the work shared, you must change your serial program
   such that it executes different code pieces or operates on different
   portions of your data, based on something like the "rank" of the
   process.  (Rank is the numerical ID assigned by MPI to each process
   running from a single invocation of mpirun.)
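
   A minimal sketch of what that looks like in C (illustrative only, not
   taken from the original post):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's ID */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of processes */

    if (rank == 0) {
        /* only the "boss" process runs this branch */
        printf("I am the boss, coordinating %d processes\n", size);
    } else {
        /* the other processes work on their own share of the problem */
        printf("I am worker %d of %d\n", rank, size);
    }

    MPI_Finalize();
    return 0;
}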

   All MPI, or specifically OpenMPI, provides you is a vehicle to
   launch multiple copies of a program or programs and then to
   facilitate the communication of those separate processes with
   one another.

   Perhaps a primer on parallel processing would be in order.  Or, since
   you have started with message passing, perhaps the old standard
   "Using MPI: Portable Parallel Programming with the Message-Passing
   Interface" (MIT Press, by Gropp, Lusk, and Skjellum) would give you
   the familiarization needed.  Other books in that series by some of
   the same authors are also good starting points for MPI.  I'm sure
   other readers can pipe in with a host of better references.

   Good luck.
 regards,

Henry Adolfo Lambis Miranda wrote:

Hi everybody out there.

This is my first post to the mailing list.
I have installed Open MPI 1.2.4 on an x86_64 AMD dual-processor machine with
SuSE Linux.
In principle, the installation was successful, with ifort 10.x.
But when I run any code (mpirun -np 2 a.out), instead of sharing the
calculations between the two processors, the system duplicates the executable
and sends one copy to each processor.

I don't know what the h$%& is going on...



regards..

Henry



--
***
>> Mark J. Potts, PhD
>>
>> HPC Applications Inc.
>> phone: 410-992-8360 Bus
>>410-313-9318 Home
>>443-418-4375 Cell
>> email: po...@hpcapplications.com
>>po...@excray.com
***