[OMPI users] mpirun fails on remote applications

2009-05-12 Thread feng chen
hi all,

First of all, I'm new to Open MPI, so I don't know much about MPI setup. That's
why I'm following the manual and FAQ suggestions from the beginning.
Everything went well until I tried to run an application on a remote node using
the 'mpirun -np' command. It just hangs there without doing anything: no error
messages, no complaints whatsoever. What confuses me is that I can run the
application over ssh with no problem, while with mpirun it just gets stuck
there doing nothing.
I'm pretty sure I have everything set up the right way, including passwordless
ssh sign-in and environment variables for both interactive and non-interactive
logins.
A sample list of the commands used follows:




[fch6699@anfield05 test]$ mpicc -o hello hello.f
[fch6699@anfield05 test]$ ssh anfield04 ./hello
0 of 1: Hello world!
[fch6699@anfield05 test]$ mpirun -host anfield05 -np 4 ./hello
0 of 4: Hello world!
2 of 4: Hello world!
3 of 4: Hello world!
1 of 4: Hello world!
[fch6699@anfield05 test]$ mpirun -host anfield04 -np 4 ./hello
just hangs there forever!!!
I need help fixing this!!
If I try it another way:
[fch6699@anfield05 test]$ mpirun -hostfile my_hostfile -np 4 ./hell
still nothing happens: no warnings, no complaints, no error messages!


All other files related to this issue can be found in my_files.tar.gz in 
attachment.

.cshrc
The output of the "ompi_info --all" command.
my_hostfile
hello.c
output of iptables

The only thing I've noticed is that our ssh port has been changed from 22 to
another number for security reasons. I don't know whether that has anything
to do with it.
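For reference, a nonstandard ssh port can be accommodated by pointing Open MPI's rsh launcher at a custom ssh command. This is a sketch, not from the thread: the parameter is spelled plm_rsh_agent in the 1.3 series (pls_rsh_agent in 1.2), and port 2222 is a placeholder for whatever port the site actually uses:

```shell
# Sketch: launch remote daemons over ssh on a custom port (2222 is illustrative).
mpirun --mca plm_rsh_agent "ssh -p 2222" -host anfield04 -np 4 ./hello

# Alternative that needs no mpirun flags: set the port per-host in
# ~/.ssh/config, so every ssh invocation (including mpirun's) picks it up.
cat >> "$HOME/.ssh/config" <<'EOF'
Host anfield04
    Port 2222
EOF
```

Either approach only affects the ssh launch step; MPI traffic itself still uses separate TCP ports.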


Any help will be highly appreciated!!

thanks in advance!

Kevin




my_files.tar.gz
Description: application/gzip-compressed


Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??

2009-05-12 Thread Anton Starikov

Now there is another problem :)

You can oversubscribe a node, at least by one task: if your hostfile and
rankfile limit you to N procs, asking mpirun for N+1 is not rejected.

In reality, though, only N tasks are started. So if your hostfile limit is 4,
then "mpirun -np 4" and "mpirun -np 5" both work, but in both cases there are
only 4 tasks. It isn't crucial, because there is no real oversubscription, but
there is still a bug that could affect something in the future.


--
Anton Starikov.

On May 12, 2009, at 1:45 AM, Ralph Castain wrote:


This is fixed as of r21208.

Thanks for reporting it!
Ralph


On May 11, 2009, at 12:51 PM, Anton Starikov wrote:

Although removing this check solves the problem of having more slots in the
rankfile than necessary, there is another problem.


If I set rmaps_base_no_oversubscribe=1, then, for example, with:


hostfile:

node01
node01
node02
node02

rankfile:

rank 0=node01 slot=1
rank 1=node01 slot=0
rank 2=node02 slot=1
rank 3=node02 slot=0

mpirun -np 4 ./something

complains with:

"There are not enough slots available in the system to satisfy the 4 slots
that were requested by the application"

but "mpirun -np 3 ./something" works; it works when you ask for 1 CPU less.
The behavior is the same in every case (shared nodes, non-shared nodes,
multi-node).


If you switch off rmaps_base_no_oversubscribe, then it works and all
affinities are set as requested in the rankfile; there is no oversubscription.
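For concreteness, the toggle in question can be flipped per run on the command line rather than in a config file; a sketch only, with option spellings as used in the 1.3 series and the file names taken from the example above:

```shell
# With enforcement on, the 4-rank run above was rejected:
mpirun --mca rmaps_base_no_oversubscribe 1 -rf rankfile -hostfile hostfile -np 4 ./something

# With enforcement off, the same run maps all four ranks as the rankfile asks:
mpirun --mca rmaps_base_no_oversubscribe 0 -rf rankfile -hostfile hostfile -np 4 ./something
```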



Anton.

On May 5, 2009, at 3:08 PM, Ralph Castain wrote:

Ah - thx for catching that; I'll remove that check. It is no longer required.


Thx!

On Tue, May 5, 2009 at 7:04 AM, Lenny Verkhovsky wrote:

According to the code, it does care.

$ vi orte/mca/rmaps/rank_file/rmaps_rank_file.c +572

ival = orte_rmaps_rank_file_value.ival;
if ( ival > (np-1) ) {
    orte_show_help("help-rmaps_rank_file.txt", "bad-rankfile", true,
                   ival, rankfile);
    rc = ORTE_ERR_BAD_PARAM;
    goto unlock;
}

If I remember correctly, I used an array to map ranks, and since the length
of the array is np, the maximum index must be less than np; so if you have a
rank number greater than np-1, there is no place to put it inside the array.


"Likewise, if you have more procs than the rankfile specifies, we  
map the additional procs either byslot (default) or bynode (if you  
specify that option). So the rankfile doesn't need to contain an  
entry for every proc."  - Correct point.



Lenny.


On 5/5/09, Ralph Castain wrote:

Sorry Lenny, but that isn't correct. The rankfile mapper doesn't care if the
rankfile contains additional info - it only maps up to the number of
processes, and ignores anything beyond that number. So there is no need to
remove the additional info.


Likewise, if you have more procs than the rankfile specifies, we  
map the additional procs either byslot (default) or bynode (if you  
specify that option). So the rankfile doesn't need to contain an  
entry for every proc.


Just don't want to confuse folks.
Ralph




On Tue, May 5, 2009 at 5:59 AM, Lenny Verkhovsky wrote:

Hi,
The maximum rank number must be less than np.
If np=1, there is only rank 0 in the system, so rank 1 is invalid.

Please remove "rank 1=node2 slot=*" from the rankfile.
Best regards,
Lenny.

On Mon, May 4, 2009 at 11:14 AM, Geoffroy Pignot wrote:

Hi ,

I got the openmpi-1.4a1r21095.tar.gz tarball, but unfortunately my command
doesn't work:


cat rankf:
rank 0=node1 slot=*
rank 1=node2 slot=*

cat hostf:
node1 slots=2
node2 slots=2

mpirun  --rankfile rankf --hostfile hostf  --host node1 -n 1  
hostname : --host node2 -n 1 hostname


Error, invalid rank (1) in the rankfile (rankf)

--
[r011n006:28986] [[45541,0],0] ORTE_ERROR_LOG: Bad parameter in  
file rmaps_rank_file.c at line 403
[r011n006:28986] [[45541,0],0] ORTE_ERROR_LOG: Bad parameter in  
file base/rmaps_base_map_job.c at line 86
[r011n006:28986] [[45541,0],0] ORTE_ERROR_LOG: Bad parameter in  
file base/plm_base_launch_support.c at line 86
[r011n006:28986] [[45541,0],0] ORTE_ERROR_LOG: Bad parameter in  
file plm_rsh_module.c at line 1016



Ralph, could you tell me whether my command syntax is correct? If not, could
you give me the expected one?


Regards

Geoffroy




2009/4/30 Geoffroy Pignot 

Immediately Sir !!! :)

Thanks again Ralph

Geoffroy





--

Message: 2
Date: Thu, 30 Apr 2009 06:45:39 -0600
From: Ralph Castain 
Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??
To: Open MPI Users 
Message-ID:
 <71d2d8cc0904300545v61a42fe1k50086d2704d0f...@mail.gmail.com>
Content-Type: text/plain; charset="iso-8859-1"

I believe this is fixed now in our development trunk - you can download any
tarball starting from last night and give it a try, if you like. Any
feedback would be appreciated.

Ralph


On Apr 14, 2009, at 7:57 AM, Ralph Castain wrote:

Ah now, I didn't say it -worked-, did I? :-)

Clearly a bug exists

[OMPI users] Torque 2.2.1 problem with OpenMPI 1.2.5

2009-05-12 Thread ansul.srivastava1
Hi,

I am using OFED 1.3, in which Open MPI 1.2.5 is included; I have compiled it
with Intel and gcc.

The problem is that during qsub I am not able to run the jobs, but when I use
the mpiexec command directly it works fine without any issue.

Here is my script file; please help me diagnose this issue.



#!/bin/sh

#PBS -N Ad2.0

#PBS -l nodes=3:ppn=8

#PBS -l walltime=100:00:00

date

cd ${PBS_O_WORKDIR}

nprocs=`wc -l < ${PBS_NODEFILE}`

echo "--- PBS_NODEFILE CONTENT ---"

cat $PBS_NODEFILE

echo "--- PBS_NODEFILE CONTENT ---"

echo "Submit host: $(hostname)"

mpiexec -machinefile ${PBS_NODEFILE} -n ${nprocs} ./prewet.out

date
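As an aside, the nprocs line in the script just counts nodefile entries (Torque writes one hostname per allocated slot, so nodes=3:ppn=8 yields 24 lines). The derivation can be exercised standalone; the sample nodefile contents here are illustrative:

```shell
# Mimic a PBS_NODEFILE: one line per slot, hostnames repeated per ppn.
PBS_NODEFILE=$(mktemp)
printf 'node01\nnode01\nnode02\n' > "$PBS_NODEFILE"

# Same derivation as in the script: count the lines.
nprocs=$(wc -l < "$PBS_NODEFILE")
echo "nprocs=$nprocs"

rm -f "$PBS_NODEFILE"
```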

Thanks and Regards

ANSUL SRIVASTAVA
MOBILE -- 9900180278
Sr. CSE
TSG Group(Infrastructure Availability Services )
Wipro Infotech | 146-147, Metagalli Industrial Area
Metagalli, Mysore 570016
Direct Number :  0821-2419074/3088125







Please do not print this email unless it is absolutely necessary.

The information contained in this electronic message and any attachments to 
this message are intended for the exclusive use of the addressee(s) and may 
contain proprietary, confidential or privileged information. If you are not the 
intended recipient, you should not disseminate, distribute or copy this e-mail. 
Please notify the sender immediately and destroy all copies of this message and 
any attachments.

WARNING: Computer viruses can be transmitted via email. The recipient should 
check this email and any attachments for the presence of viruses. The company 
accepts no liability for any damage caused by any virus transmitted by this 
email.

www.wipro.com


[OMPI users] New warning messages in 1.3.2 (not present in 1.2.8)

2009-05-12 Thread Matthieu Brucher
Hi,

I've managed to use 1.3.2 (still not with LSF and InfiniPath; I'm taking one
step at a time), but I have additional warnings that didn't show up in 1.2.8:

[host-b:09180] mca: base: component_find: unable to open
/home/brucher/lib/openmpi/mca_ras_dash_host: file not found (ignored)
[host-b:09180] mca: base: component_find: unable to open
/home/brucher/lib/openmpi/mca_ras_gridengine: file not found (ignored)
[host-b:09180] mca: base: component_find: unable to open
/home/brucher/lib/openmpi/mca_ras_localhost: file not found (ignored)
[host-b:09180] mca: base: component_find: unable to open
/home/brucher/lib/openmpi/mca_errmgr_hnp: file not found (ignored)
[host-b:09180] mca: base: component_find: unable to open
/home/brucher/lib/openmpi/mca_errmgr_orted: file not found (ignored)
[host-b:09180] mca: base: component_find: unable to open
/home/brucher/lib/openmpi/mca_errmgr_proxy: file not found (ignored)
[host-b:09180] mca: base: component_find: unable to open
/home/brucher/lib/openmpi/mca_iof_proxy: file not found (ignored)
[host-b:09180] mca: base: component_find: unable to open
/home/brucher//lib/openmpi/mca_iof_svc: file not found (ignored)

Can this be fixed in some way?

Matthieu
-- 
Information System Engineer, Ph.D.
Website: http://matthieu-brucher.developpez.com/
Blogs: http://matt.eifelle.com and http://blog.developpez.com/?blog=92
LinkedIn: http://www.linkedin.com/in/matthieubrucher


Re: [OMPI users] mpirun fails on remote applications

2009-05-12 Thread Lenny Verkhovsky
Sounds like firewall problems to or from anfield04.
Lenny

On Tue, May 12, 2009 at 8:18 AM, feng chen wrote:



Re: [OMPI users] mpirun fails on remote applications

2009-05-12 Thread Micha Feigin
On Tue, 12 May 2009 11:54:57 +0300
Lenny Verkhovsky  wrote:

> sounds like firewall problems to or from anfield04.
> Lenny,
> 
> On Tue, May 12, 2009 at 8:18 AM, feng chen  wrote:
>

I'm having a similar problem, not sure if it's related (I gave up for the
moment on 1.3+ Open MPI; 1.2.8 works fine, nothing above that).

1. Try taking down the firewall and see if it works.
2. Make sure that passwordless ssh is working (not sure if it's needed for
all things, but still...).
3. Can you test it with Open MPI 1.2.8?
4. Also, does launching the job in the other direction work? (4 -> 5 instead
of 5 -> 4)
[fch6699@anfield04 test]$ mpirun -host anfield05 -np 4 ./hello

From what I can tell, my specific problem on my cluster is that machines have
different addresses depending on which machine you are connecting from (they
are connected directly to each other, not through a switch with a central
name server), and name lookup seems to happen on the master instead of the
client node, so it gets the wrong address.




Re: [OMPI users] mpirun fails on remote applications

2009-05-12 Thread feng chen
Thanks a lot; the firewall it is. It works with the firewall off, but that
brings up another question: is there any way to run mpirun with the firewall
on? If yes, how do we set up the firewall or iptables?

thank you
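For what it's worth, mpirun hangs behind firewalls because Open MPI picks TCP ports dynamically by default. The 1.3-era TCP components expose MCA parameters to pin those ports to a fixed range that iptables can then allow. This is a sketch only: the range 10000-10100 and the subnet are illustrative assumptions, and the exact parameter names should be confirmed with `ompi_info --param btl tcp` and `ompi_info --param oob tcp` on your build:

```shell
# Pin Open MPI's wire-up (OOB) and MPI (TCP BTL) traffic to a known range.
mkdir -p "$HOME/.openmpi"
cat >> "$HOME/.openmpi/mca-params.conf" <<'EOF'
btl_tcp_port_min_v4 = 10000
btl_tcp_port_range_v4 = 100
oob_tcp_port_min_v4 = 10000
oob_tcp_port_range_v4 = 100
EOF

# On every node (as root), allow that range from the cluster subnet;
# ssh (on whatever port you moved it to) must stay open as well.
iptables -A INPUT -p tcp -s 192.168.1.0/24 --dport 10000:10100 -j ACCEPT
```

The firewall rule has to be present on all nodes, since any rank may open connections to any other.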





From: Micha Feigin 
To: us...@open-mpi.org
Sent: Tuesday, May 12, 2009 4:30:30 AM
Subject: Re: [OMPI users] mpirun fails on remote applications


___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] strange bug

2009-05-12 Thread Jeff Squyres

Can you send all the information listed here:

http://www.open-mpi.org/community/help/



On May 11, 2009, at 10:03 PM, Anton Starikov wrote:


By the way, this is Fortran code, which uses the F77 bindings.

--
Anton Starikov.


On May 12, 2009, at 3:06 AM, Anton Starikov wrote:

> Due to rankfile fixes I switched to SVN r21208, now my code dies
> with error
>
> [node037:20519] *** An error occurred in MPI_Comm_dup
> [node037:20519] *** on communicator MPI COMMUNICATOR 32 SPLIT FROM 4
> [node037:20519] *** MPI_ERR_INTERN: internal error
> [node037:20519] *** MPI_ERRORS_ARE_FATAL (your MPI job will now  
abort)

>
> --
> Anton Starikov.
>




--
Jeff Squyres
Cisco Systems



Re: [OMPI users] Torque 2.2.1 problem with OpenMPI 1.2.5

2009-05-12 Thread Ralph Castain
The 1.2.x series has a bug when used with Torque. Simply do not include
-machinefile on your mpiexec command line and it should work fine; it will
automatically pick up the PBS_NODEFILE contents.
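A sketch of the slimmed-down submission script Ralph describes, assuming a Torque-aware Open MPI build (job name, resource requests, and binary are carried over from the original script):

```shell
#!/bin/sh
#PBS -N Ad2.0
#PBS -l nodes=3:ppn=8
#PBS -l walltime=100:00:00

cd ${PBS_O_WORKDIR}

# With Torque support compiled in, mpiexec reads PBS_NODEFILE itself and
# launches one process per allocated slot, so -machinefile and -n can both
# be omitted.
mpiexec ./prewet.out
```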



On May 12, 2009, at 1:17 AM,  wrote:






Re: [OMPI users] New warning messages in 1.3.2 (not present in 1.2.8)

2009-05-12 Thread Ralph Castain
Looking at this output, I would say the problem is that you didn't recompile
your code against 1.3.2. These are warnings about attempts to open components
that were present in 1.2.8 but no longer exist in the 1.3.x series.



On May 12, 2009, at 2:30 AM, Matthieu Brucher wrote:






Re: [OMPI users] New warning messages in 1.3.2 (not present in1.2.8)

2009-05-12 Thread Jeff Squyres
Or it could be that you installed 1.3.2 over 1.2.8 -- some of the 1.2.8
components that no longer exist in the 1.3 series are still in the
installation tree, but failed to open properly (unfortunately, libltdl gives
an incorrect "file not found" error message if it is unable to load a plugin
for any reason, such as when a symbol cannot be resolved from that plugin).


The best thing to do is to install 1.3 in a clean, fresh tree, or  
uninstall your 1.2.8 before installing 1.3.
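A sketch of both options (paths, version numbers, and the new prefix are illustrative; substitute whatever --prefix your builds actually use):

```shell
# Option 1: uninstall 1.2.8 from its prefix, using the old build tree if it
# still exists, before installing 1.3.x into the same place.
cd openmpi-1.2.8 && make uninstall

# Option 2: install 1.3.2 into a brand-new prefix and point your paths at it.
cd ../openmpi-1.3.2
./configure --prefix="$HOME/opt/openmpi-1.3.2"
make all install
export PATH="$HOME/opt/openmpi-1.3.2/bin:$PATH"
export LD_LIBRARY_PATH="$HOME/opt/openmpi-1.3.2/lib:$LD_LIBRARY_PATH"
```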



On May 12, 2009, at 7:35 AM, Ralph Castain wrote:





--
Jeff Squyres
Cisco Systems



Re: [OMPI users] New warning messages in 1.3.2 (not present in1.2.8)

2009-05-12 Thread Matthieu Brucher
2009/5/12 Jeff Squyres:
> Or it could be that you installed 1.3.2 over 1.2.8 -- some of the 1.2.8
> components that no longer exist in the 1.3 series are still in the
> installation tree, but failed to open properly (unfortunately, libltdl gives
> an incorrect "file not found" error message if it is unable to load a plugin
> for any reason, such as if a symbol is unable to be resolved from that
> plugin).
>
> The best thing to do is to install 1.3 in a clean, fresh tree, or uninstall
> your 1.2.8 before installing 1.3.

OK, this is indeed the case. I'll try to clean the tree (I have several other
packages there and deleted the original 1.2.8 package) and test again.

Thanks for the answers!

Matthieu
-- 
Information System Engineer, Ph.D.
Website: http://matthieu-brucher.developpez.com/
Blogs: http://matt.eifelle.com and http://blog.developpez.com/?blog=92
LinkedIn: http://www.linkedin.com/in/matthieubrucher


Re: [OMPI users] strange bug

2009-05-12 Thread Anton Starikov
The hostfile comes from the Torque PBS_NODEFILE (OMPI is compiled with Torque
support).


It happens with or without rankfile.
Started with
mpirun -np 16 ./somecode

mca parameters:

btl = self,sm,openib
mpi_maffinity_alone = 1
rmaps_base_no_oversubscribe = 1 (rmaps_base_no_oversubscribe = 0  
doesn't change it)


I tested with both "btl=self,sm" on 16-core nodes and "btl=self,sm,openib" on
8x dual-core nodes; the result is the same.


It looks like it always occurs at exactly the same point in the execution,
not at the beginning; it is not the first MPI_Comm_dup in the code.


I can't say much about the particular piece of code where it happens, because
it is in a 3rd-party library (MUMPS). When the error occurs, MPI_Comm_dup in
every task deals with a single-task communicator (an MPI_Comm_split of the
initial MPI_COMM_WORLD for 16 processes into 16 groups, 1 process per group).
And I can guess that before this error, MPI_Comm_dup is called something like
100 times by the same piece of code on the same communicators without any
problems.
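The communicator pattern described above can be sketched standalone. This is a hypothetical reproducer, not the MUMPS code, written against the C bindings even though the application uses the F77 bindings:

```c
/* Hypothetical sketch of the reported pattern: split MPI_COMM_WORLD into
 * one-process groups, then MPI_Comm_dup the resulting communicator many
 * times (the report says roughly 100 dups succeed before the failure). */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, i;
    MPI_Comm split_comm, dup_comm;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* color = rank puts every process in its own single-task group */
    MPI_Comm_split(MPI_COMM_WORLD, rank, 0, &split_comm);

    for (i = 0; i < 200; i++) {
        MPI_Comm_dup(split_comm, &dup_comm);
        MPI_Comm_free(&dup_comm);
    }

    MPI_Comm_free(&split_comm);
    printf("rank %d: completed all dups\n", rank);
    MPI_Finalize();
    return 0;
}
```

Run under the same MCA settings (e.g. `mpirun -np 16 ./repro`) to see whether the failure depends on the MUMPS code or just on the dup/split pattern.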


I can say that it used to work correctly with all previous versions of Open
MPI we used (1.2.8-1.3.2 and some earlier versions). It also works correctly
on other platforms and MPI implementations.


All environmental variables (PATH, LD_LIBRARY_PATH) are correct.
I recompiled code and 3rd-party libraries with this version of OMPI.





config.log.gz
Description: GNU Zip compressed data


ompi-info.txt.gz
Description: GNU Zip compressed data


--
Anton Starikov.
Computational Material Science,
Faculty of Science and Technology,
University of Twente.
Phone: +31 (0)53 489 2986
Fax: +31 (0)53 489 2910

On May 12, 2009, at 12:35 PM, Jeff Squyres wrote:






Re: [OMPI users] New warning messages in 1.3.2 (not present in1.2.8)

2009-05-12 Thread Jeff Squyres

On May 12, 2009, at 8:17 AM, Matthieu Brucher wrote:


OK, this is indeed the case. I'll try to clean the tree (I have
several other package and deleted the original 1.2.8 package) and test
again.




This misleading libltdl error message continues to bite us over and  
over again (users and developers alike), so I just put in a workaround  
with some heuristics to try to print a better error message in cases  
like yours.  Now you'll see something like this:


[foo.example.com:24273] mca: base: component_find: unable to open /home/jsquyres/bogus/lib/openmpi/mca_btl_openib: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)


Changeset here (Shiqing tells me it'll work on Windows, too -- MTT  
will tell us for sure :-) ):


https://svn.open-mpi.org/trac/ompi/changeset/21214

And I filed a ticket to move it to v1.3.3 here:

https://svn.open-mpi.org/trac/ompi/ticket/1917

--
Jeff Squyres
Cisco Systems



Re: [OMPI users] New warning messages in 1.3.2 (not present in1.2.8)

2009-05-12 Thread Matthieu Brucher
Thanks a lot for this.

I've just checked everything again and recompiled my code as well (I'm using
SCons, so it detects that the headers and the libraries changed), and it
works without a warning.

Matthieu

2009/5/12 Jeff Squyres:



-- 
Information System Engineer, Ph.D.
Website: http://matthieu-brucher.developpez.com/
Blogs: http://matt.eifelle.com and http://blog.developpez.com/?blog=92
LinkedIn: http://www.linkedin.com/in/matthieubrucher



Re: [OMPI users] Bug in return status of MPI_WAIT()?

2009-05-12 Thread Jeff Squyres

Greetings Jacob; sorry for the slow reply.

This is pretty subtle, but I think that your test is incorrect (I  
remember arguing about this a long time ago and eventually having  
another OMPI developer prove me wrong! :-) ).


1. You're setting MPI_ERRORS_RETURN, which, if you're using the C++  
bindings, means you won't be able to see if an error occurs because  
they don't return the int error codes.


2. The MPI_ERROR field in the status is specifically *not* set for  
MPI_TEST and MPI_WAIT.  It *is* set for the multi-test/wait functions  
(e.g., MPI_TESTANY, MPI_WAITALL).  MPI-2.1 p52:44-48 says:


"Error codes belonging to the error class MPI_ERR_IN_STATUS should be  
returned only by the MPI completion functions that take arrays of  
MPI_STATUS.  For the functions MPI_TEST, MPI_TESTANY, MPI_WAIT, and  
MPI_WAITANY, which return a single MPI_STATUS value, the normal MPI  
error return process should be used (not the MPI_ERROR field in the  
MPI_STATUS argument)."


So I think you need to use MPI::ERRORS_THROW_EXCEPTIONS to catch the  
error in this case, or look at the return value from the C binding for  
MPI_WAIT.
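
For illustration, a hedged C sketch of that second option (not from the  
original mail; assumes an installed Open MPI, run with `mpirun -np 2`):

```c
/* Sketch: the same test against the C bindings, where MPI_ERRORS_RETURN
 * makes the truncation error visible as the return code of MPI_Wait.
 * Per MPI-2.1 p52, the MPI_ERROR field of a single status is NOT set by
 * MPI_Wait itself. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, rc;
    char buf[100] = "h";
    MPI_Request req;
    MPI_Status stat;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    if (rank == 0) {
        MPI_Irecv(buf, 1, MPI_CHAR, MPI_ANY_SOURCE, MPI_ANY_TAG,
                  MPI_COMM_WORLD, &req);
        rc = MPI_Wait(&req, &stat);      /* check the return code ... */
        if (rc != MPI_SUCCESS)
            printf("0: error returned from MPI_Wait\n");
        /* ... whereas with MPI_Waitall(1, &req, &stat) the per-request
         * error would land in stat.MPI_ERROR (MPI_ERR_IN_STATUS case). */
    } else if (rank == 1) {
        MPI_Send(buf, 2, MPI_CHAR, 0, 0, MPI_COMM_WORLD); /* 2 > 1: truncation */
    }

    MPI_Finalize();
    return 0;
}
```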



On May 10, 2009, at 5:51 AM, Katz, Jacob wrote:


Hi,
While trying error-related functionality of OMPI, I came across a  
situation where when I use MPI_ERRORS_RETURN error handler, the  
errors do not come out correctly from WAIT calls.
The program below correctly terminates with a fatal “message  
truncated” error, but when the line setting the error handler to  
MPI_ERRORS_RETURN is uncommented, it silently completes. I expected  
the print out that checks the status after WAIT call to be executed,  
but it wasn’t.

The issue didn’t happen when using blocking recv.

A bug or my incorrect usage?

Thanks!

// mpic++ -o test test.cpp
// mpirun -np 2 ./test
#include "mpi.h"
#include <iostream>
using namespace std;

int main (int argc, char *argv[])
{
int rank;
char buf[100] = "h";
MPI::Status stat;

MPI::Init(argc, argv);
rank = MPI::COMM_WORLD.Get_rank();

//MPI::COMM_WORLD.Set_errhandler(MPI::ERRORS_RETURN);

if (rank == 0)
{
MPI::Request r = MPI::COMM_WORLD.Irecv(buf, 1, MPI_CHAR,  
MPI::ANY_SOURCE, MPI::ANY_TAG);

r.Wait(stat);
if (stat.Get_error() != MPI::SUCCESS)
{
cout << "0: Error during recv" << endl;
}
}
else
{
MPI::COMM_WORLD.Send(buf, 2, MPI_CHAR, 0, 0);
}

MPI::Finalize();
return (0);
}


Jacob M. Katz | jacob.k...@intel.com | Work: +972-4-865-5726 | iNet:  
(8)-465-5726


-
Intel Israel (74) Limited

This e-mail and any attachments may contain confidential material for
the sole use of the intended recipient(s). Any review or distribution
by others is strictly prohibited. If you are not the intended
recipient, please contact the sender and delete all copies.
___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users



--
Jeff Squyres
Cisco Systems




Re: [OMPI users] Bug in return status of MPI_WAIT()?

2009-05-12 Thread Jeff Squyres

On May 12, 2009, at 9:37 AM, Jeff Squyres wrote:

2. The MPI_ERROR field in the status is specifically *not* set for  
MPI_TEST and MPI_WAIT.  It *is* set for the multi-test/wait  
functions (e.g., MPI_TESTANY, MPI_WAITALL).


Oops!  Typo -- I should have said "(e.g., MPI_TESTALL, MPI_WAITALL)".   
Just like MPI_TEST, the MPI_ERROR field is not set for MPI_TESTANY  
because it's a single-completion function.


--
Jeff Squyres
Cisco Systems



Re: [OMPI users] strange bug

2009-05-12 Thread Jeff Squyres

Hey Edgar --

Could this have anything to do with your recent fixes?

On May 12, 2009, at 8:30 AM, Anton Starikov wrote:


hostfile from torque PBS_NODEFILE (OMPI is compilled with torque
support)

It happens with or without rankfile.
Started with
mpirun -np 16 ./somecode

mca parameters:

btl = self,sm,openib
mpi_maffinity_alone = 1
rmaps_base_no_oversubscribe = 1 (rmaps_base_no_oversubscribe = 0
doesn't change it)

I tested with both: "btl=self,sm" on 16c-core nodes and
"btl=self,sm,openib" on 8x dual-core nodes , result is the same.

It looks like it always occurs exactly at the same point in the
execution, not at the beginning, it is not first MPI_Comm_dup in the
code.

I can't say too much about particular piece of the code, where it is
happening, because it is in the 3rd-party library (MUMPS).  When error
occurs, MPI_Comm_dup in every task deals with single-task communicator
(MPI_Comm_split of initial MPI_Comm_world for 16 processes into 16
groups, 1 process per group). And I  can guess that before this error,
MPI_Comm_dup is called something like 100 of times by the same piece
of code on the same communicators without any problems.

I can say that it used to work correctly with all previous versions of
openmpi we used (1.2.8-1.3.2 and some earlier versions). It also works
correctly on other platforms/MPI implementations.

All environmental variables (PATH, LD_LIBRARY_PATH) are correct.
I recompiled code and 3rd-party libraries with this version of OMPI.







--
Jeff Squyres
Cisco Systems



Re: [OMPI users] mpirun fails on remote applications

2009-05-12 Thread Jeff Squyres
Open MPI requires that each MPI process be able to connect to any  
other MPI process in the same job with random TCP ports.  It is  
usually easiest to leave the firewall off, or setup trust  
relationships between your cluster nodes.
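
If you do want to keep iptables on, one hedged sketch is to pin Open  
MPI's TCP traffic to a fixed range and open only that range between the  
nodes. The MCA parameter names below are from the Open MPI FAQ of this  
era; verify them against your build before relying on them:

```shell
# Confine Open MPI's TCP ports to a known range instead of random ports.
# Verify the parameter names with:
#   ompi_info --param btl tcp ; ompi_info --param oob tcp
mpirun --mca btl_tcp_port_min_v4 10000 --mca btl_tcp_port_range_v4 100 \
       --mca oob_tcp_port_min_v4 10000 --mca oob_tcp_port_range_v4 100 \
       -host anfield04 -np 4 ./hello

# Then allow only that range between the cluster peers on each node;
# the exact rules depend on your existing iptables policy:
iptables -A INPUT -s anfield04 -p tcp --dport 10000:10100 -j ACCEPT
iptables -A INPUT -s anfield05 -p tcp --dport 10000:10100 -j ACCEPT
```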



On May 12, 2009, at 6:04 AM, feng chen wrote:

thanks a lot, firewall it is. It works with the firewall off, but that  
brings another question from me: is there any way we can run mpirun  
while the firewall is on? If yes, how do we set up the firewall or  
iptables?


thank you

From: Micha Feigin 
To: us...@open-mpi.org
Sent: Tuesday, May 12, 2009 4:30:30 AM
Subject: Re: [OMPI users] mpirun fails on remote applications

On Tue, 12 May 2009 11:54:57 +0300
Lenny Verkhovsky  wrote:

> sounds like firewall problems to or from anfield04.
> Lenny,
>
> On Tue, May 12, 2009 at 8:18 AM, feng chen   
wrote:

>

I'm having a similar problem, not sure if it's related (gave up for  
the moment

on 1.3+ openmpi; 1.2.8 works fine, nothing above that).

1. Try taking down the firewall and see if it works
2. Make sure that passwordless ssh is working (not sure if it's  
needed for all

things but still ...)
3. can you test it maybe with openmpi 1.2.8?
4. also, does posting the job in the other direction work? (4 -> 5  
instead of 5 -> 4)

[fch6699@anfield04 test]$ mpirun -host anfield05 -np 4 ./hello

From what it seems on my cluster for my specific problem is that  
machines have
different addresses based on which machine you are connecting from  
(they are
connected directly to each other, not through a switch with a  
central name
server), and name lookup seems to happen on the master instead of  
the client

node so it is getting the wrong address.

> >  hi all,
> >
> > First of all,i'm new to openmpi. So i don't know much about mpi  
setting.
> > That's why i'm following manual and FAQ suggestions from the  
beginning.
> > Everything went well untile i try to run a pllication on a  
remote node by
> > using 'mpirun -np' command. It just hanging there without doing  
anything, no

> > error messanges, no
> > complaining or whatsoever. What confused me is that i can run  
application
> > over ssh with no problem, while it comes to mpirun, just stuck  
in there does

> > nothing.
> > I'm pretty sure i got everyting setup in the right way manner,  
including no
> > password signin over ssh, environment variables for bot  
interactive and

> > non-interactive logons.
> > A sample list of commands been used list as following:
> >
> >
> >
> >
> >  [fch6699@anfield05 test]$ mpicc -o hello hello.f
> > [fch6699@anfield05 test]$ ssh anfield04 ./hello
> > 0 of 1: Hello world!
> > [fch6699@anfield05 test]$ mpirun -host anfield05 -np 4 ./hello
> > 0 of 4: Hello world!
> > 2 of 4: Hello world!
> > 3 of 4: Hello world!
> > 1 of 4: Hello world!
> > [fch6699@anfield05 test]$ mpirun -host anfield04 -np 4 ./hello
> > just hanging there for years!!!
> > need help to fix this !!
> > if u try it in another way
> > [fch6699@anfield05 test]$ mpirun -hostfile my_hostfile -np 4 ./ 
hell
> > still nothing happened, no warnnings, no complains, no error  
messages.. !!

> >
> > All other files related to this issue can be found in  
my_files.tar.gz in

> > attachment.
> >
> > .cshrc
> > The output of the "ompi_info --all" command.
> > my_hostfile
> > hello.c
> > output of iptables
> >
> > The only thing i've noticed is that the port of our ssh has been  
changed

> > from 22 to other number for security issues.
> > Don't know will that have anything to with it or not.
> >
> >
> > Any help will be highly appreciated!!
> >
> > thanks in advance!
> >
> > Kevin
> >
> >
> >
> >
> > ___
> > users mailing list
> > us...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
> >

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users



--
Jeff Squyres
Cisco Systems



Re: [OMPI users] strange bug

2009-05-12 Thread Edgar Gabriel
I would say the probability is large that it is due to the recent 'fix'. 
 I will try to create a testcase similar to what you suggested. Could 
you give us maybe some hints on which functionality of MUMPS you are 
using, or even share the code/ a code fragment?


Thanks
Edgar

Jeff Squyres wrote:

Hey Edgar --

Could this have anything to do with your recent fixes?

On May 12, 2009, at 8:30 AM, Anton Starikov wrote:


hostfile from torque PBS_NODEFILE (OMPI is compilled with torque
support)

It happens with or without rankfile.
Started with
mpirun -np 16 ./somecode

mca parameters:

btl = self,sm,openib
mpi_maffinity_alone = 1
rmaps_base_no_oversubscribe = 1 (rmaps_base_no_oversubscribe = 0
doesn't change it)

I tested with both: "btl=self,sm" on 16c-core nodes and
"btl=self,sm,openib" on 8x dual-core nodes , result is the same.

It looks like it always occurs exactly at the same point in the
execution, not at the beginning, it is not first MPI_Comm_dup in the
code.

I can't say too much about particular piece of the code, where it is
happening, because it is in the 3rd-party library (MUMPS).  When error
occurs, MPI_Comm_dup in every task deals with single-task communicator
(MPI_Comm_split of initial MPI_Comm_world for 16 processes into 16
groups, 1 process per group). And I  can guess that before this error,
MPI_Comm_dup is called something like 100 of times by the same piece
of code on the same communicators without any problems.

I can say that it used to work correctly with all previous versions of
openmpi we used (1.2.8-1.3.2 and some earlier versions). It also works
correctly on other platforms/MPI implementations.

All environmental variables (PATH, LD_LIBRARY_PATH) are correct.
I recompiled code and 3rd-party libraries with this version of OMPI.









--
Edgar Gabriel
Assistant Professor
Parallel Software Technologies Lab  http://pstl.cs.uh.edu
Department of Computer Science  University of Houston
Philip G. Hoffman Hall, Room 524Houston, TX-77204, USA
Tel: +1 (713) 743-3857  Fax: +1 (713) 743-3335


Re: [OMPI users] Bug in return status of MPI_WAIT()?

2009-05-12 Thread Katz, Jacob
Ah... Thanks, Jeff.
If the standard explicitly mentioned that MPI::ERRORS_RETURN is useless 
with the C++ bindings, life would be a little easier...


Jacob M. Katz | jacob.k...@intel.com | Work: +972-4-865-5726 | iNet: 
(8)-465-5726


-Original Message-
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf 
Of Jeff Squyres
Sent: Tuesday, May 12, 2009 16:37
To: Open MPI Users
Subject: Re: [OMPI users] Bug in return status of MPI_WAIT()?

Greetings Jacob; sorry for the slow reply.

This is pretty subtle, but I think that your test is incorrect (I
remember arguing about this a long time ago and eventually having
another OMPI developer prove me wrong! :-) ).

1. You're setting MPI_ERRORS_RETURN, which, if you're using the C++
bindings, means you won't be able to see if an error occurs because
they don't return the int error codes.

2. The MPI_ERROR field in the status is specifically *not* set for
MPI_TEST and MPI_WAIT.  It *is* set for the multi-test/wait functions
(e.g., MPI_TESTANY, MPI_WAITALL).  MPI-2.1 p52:44-48 says:

"Error codes belonging to the error class MPI_ERR_IN_STATUS should be
returned only by the MPI completion functions that take arrays of
MPI_STATUS.  For the functions MPI_TEST, MPI_TESTANY, MPI_WAIT, and
MPI_WAITANY, which return a single MPI_STATUS value, the normal MPI
error return process should be used (not the MPI_ERROR field in the
MPI_STATUS argument)."

So I think you need to use MPI::ERRORS_THROW_EXCEPTIONS to catch the
error in this case, or look at the return value from the C binding for
MPI_WAIT.


On May 10, 2009, at 5:51 AM, Katz, Jacob wrote:

> Hi,
> While trying error-related functionality of OMPI, I came across a
> situation where when I use MPI_ERRORS_RETURN error handler, the
> errors do not come out correctly from WAIT calls.
> The program below correctly terminates with a fatal "message
> truncated" error, but when the line setting the error handler to
> MPI_ERRORS_RETURN is uncommented, it silently completes. I expected
> the print out that checks the status after WAIT call to be executed,
> but it wasn't.
> The issue didn't happen when using blocking recv.
>
> A bug or my incorrect usage?
>
> Thanks!
>
> // mpic++ -o test test.cpp
> // mpirun -np 2 ./test
> #include "mpi.h"
> #include <iostream>
> using namespace std;
>
> int main (int argc, char *argv[])
> {
> int rank;
> char buf[100] = "h";
> MPI::Status stat;
>
> MPI::Init(argc, argv);
> rank = MPI::COMM_WORLD.Get_rank();
>
> //MPI::COMM_WORLD.Set_errhandler(MPI::ERRORS_RETURN);
>
> if (rank == 0)
> {
> MPI::Request r = MPI::COMM_WORLD.Irecv(buf, 1, MPI_CHAR,
> MPI::ANY_SOURCE, MPI::ANY_TAG);
> r.Wait(stat);
> if (stat.Get_error() != MPI::SUCCESS)
> {
> cout << "0: Error during recv" << endl;
> }
> }
> else
> {
> MPI::COMM_WORLD.Send(buf, 2, MPI_CHAR, 0, 0);
> }
>
> MPI::Finalize();
> return (0);
> }
>
> 
> Jacob M. Katz | jacob.k...@intel.com | Work: +972-4-865-5726 | iNet:
> (8)-465-5726
>
> -
> Intel Israel (74) Limited
>
> This e-mail and any attachments may contain confidential material for
> the sole use of the intended recipient(s). Any review or distribution
> by others is strictly prohibited. If you are not the intended
> recipient, please contact the sender and delete all copies.
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users


--
Jeff Squyres
Cisco Systems


___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
-
Intel Israel (74) Limited

This e-mail and any attachments may contain confidential material for
the sole use of the intended recipient(s). Any review or distribution
by others is strictly prohibited. If you are not the intended
recipient, please contact the sender and delete all copies.




[OMPI users] Problem installing Dalton with OpenMPI over PelicanHPC

2009-05-12 Thread Silviu Groza
Dear all,
I am trying to install Dalton quantum chemistry program with OpenMPI over 
PelicanHPC, but it ends with an error.
PelicanHPC comes with both LAM and OpenMPI preinstalled. The version of OpenMPI 
is "OMPI_VERSION "1.2.7rc2"" (from version.h).
The wrappers that I use are mpif77.openmpi and mpicc.openmpi.
Below, you can see the "link" and "include" flags of the wrappers:

++

pelican:/# mpicc.openmpi -show
gcc -I/usr/lib/openmpi/include/openmpi -I/usr/lib/openmpi/include -pthread 
-L/usr/lib/openmpi/lib -lmpi -lopen-rte -lopen-pal -ldl -Wl,--export-dynamic 
-lnsl -lutil -lm -ldl

pelican:/# mpif77.openmpi -show
gfortran -I/usr/lib/openmpi/include -pthread -L/usr/lib/openmpi/lib -lmpi_f77 
-lmpi -lopen-rte -lopen-pal -ldl -Wl,--export-dynamic -lnsl -lutil -lm -ldl

pelican:/# mpif77.openmpi -v
Using built-in specs.
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Debian 4.3.2-1.1' 
--with-bugurl=file:///usr/share/doc/gcc-4.3/README.Bugs 
--enable-languages=c,c++,fortran,objc,obj-c++ --prefix=/usr --enable-shared 
--with-system-zlib --libexecdir=/usr/lib --without-included-gettext 
--enable-threads=posix --enable-nls --with-gxx-include-dir=/usr/include/c++/4.3 
--program-suffix=-4.3 --enable-clocale=gnu --enable-libstdcxx-debug 
--enable-objc-gc --enable-mpfr --enable-cld --enable-checking=release 
--build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu
Thread model: posix
gcc version 4.3.2 (Debian 4.3.2-1.1)

+

The Makefile.conf of Dalton is:

++

ARCH    = linux
#
#
CPPFLAGS  = -DVAR_G77 -DSYS_LINUX -DVAR_MFDS -DVAR_SPLITFILES 
-D'INSTALL_WRKMEM=6000' -D'INSTALL_BASDIR="/root/Fig/dalton-2.0/basis/"' 
-DVAR_MPI -DIMPLICIT_NONE
F77   = mpif77.openmpi
CC    = mpicc.openmpi
RM    = rm -f
FFLAGS    = -march=x86-64 -O3 -ffast-math -fexpensive-optimizations 
-funroll-loops -fno-range-check -fsecond-underscore
SAFEFFLAGS    = -march=x86-64 -O3 -ffast-math -fexpensive-optimizations 
-funroll-loops -fno-range-check -fsecond-underscore
CFLAGS    = -march=x86-64 -O3 -ffast-math -fexpensive-optimizations 
-funroll-loops -std=c99 -DRESTRICT=restrict
INCLUDES  = -I../include 
LIBS  = -L/usr/lib -llapack -lblas 
INSTALLDIR    = /root/Fig/dalton-2.0/bin
PDPACK_EXTRAS = linpack.o eispack.o
GP_EXTRAS = 
AR    = ar
ARFLAGS   = rvs
# flags for ftnchek on Dalton /hjaaj
CHEKFLAGS  = -nopure -nopretty -nocommon -nousage -noarray -notruncation -quiet 
 -noargumants -arguments=number  -usage=var-unitialized
# -usage=var-unitialized:arg-const-modified:arg-alias
# -usage=var-unitialized:var-set-unused:arg-unused:arg-const-modified:arg-alias
#
default : linuxparallel.x
#
# Parallel initialization
#
MPI_INCLUDE_DIR = -I/usr/lib/openmpi/include
MPI_LIB_PATH    = -L/usr/lib/openmpi/lib
MPI_LIB = -lmpi
#
#
# Suffix rules
# hjaaj Oct 04: .g is a "cheat" suffix, for debugging.
#   'make x.g' will create x.o from x.F or x.c with -g debug flag 
set.
#
.SUFFIXES : .F .o .c .i .g

.F.o:
    $(F77) $(INCLUDES) $(CPPFLAGS) $(FFLAGS) -c $*.F 

.F.g:
    $(F77) $(INCLUDES) $(CPPFLAGS) $(FFLAGS) -g -c $*.F 

.c.o:
    $(CC) $(INCLUDES) $(CPPFLAGS) $(CFLAGS) -c $*.c 

.c.g:
    $(CC) $(INCLUDES) $(CPPFLAGS) $(CFLAGS) -g -c $*.c 

.F.i:
    $(F77) $(INCLUDES) $(CPPFLAGS) -E $*.F > $*.i



The "make" command gives me this error:

+++

---> Linking sequential dalton.x ...
mpif77.openmpi -march=x86-64 -O3 -ffast-math -fexpensive-optimizations 
-funroll-loops -fno-range-check -fsecond-underscore \
    -o /root/Fig/dalton-2.0/bin/dalton.x abacus/dalton.o cc/crayio.o 
abacus/linux_mem_allo.o \
    abacus/herpar.o eri/eri2par.o amfi/amfi.o amfi/symtra.o gp/mpi_dummy.o 
-Labacus -labacus -Lrsp -lrsp -Lsirius -lsirius -labacus -Leri -leri -Ldensfit 
-ldensfit -Lcc  -lcc -Ldft -ldft -Lgp -lgp -Lpdpack -lpdpack -L/usr/lib 
-llapack -lblas
dft/libdft.a(general.o): In function `mpi_sync_data':
general.c:(.text+0x78): undefined reference to `ompi_mpi_comm_world'
general.c:(.text+0xc3): undefined reference to `ompi_mpi_comm_world'
general.c:(.text+0xdc): undefined reference to `ompi_mpi_comm_world'
general.c:(.text+0xff): undefined reference to `ompi_mpi_comm_world'
general.c:(.text+0x122): undefined reference to `ompi_mpi_comm_world'
dft/libdft.a(general.o):general.c:(.text+0x136): more undefined references to 
`ompi_mpi_comm_world' follow
dft/libdft.a(general.o): In function `dft_cslave__':
general.c:(.text+0x44e): undefined reference to `ompi_mpi_int'
dft/libdft.a(general.o): In function `dft_wake_slaves':
general.c:(.text+0x485): undefined reference to `ompi_mpi_comm_world'
general.c:(.text+0x4e7): undefined reference to `ompi_mpi_comm_world'
general.c:(.text+0x4ee): undefined reference to `ompi_mpi_int'
general.c:(.te
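
A hedged diagnostic sketch for this class of link error: undefined  
`ompi_mpi_comm_world` references alongside `gp/mpi_dummy.o` on a  
*sequential* link line usually mean MPI-compiled objects were pulled  
into a build that never links `-lmpi`. Assuming the paths shown above:

```shell
# 1. Check what the wrapper actually links:
mpicc.openmpi -show

# 2. Confirm the archive member really references Open MPI internals
#    (i.e. dft was compiled with the real MPI headers):
nm dft/libdft.a | grep ompi_mpi_comm_world

# 3. The Makefile's default target is the parallel binary; build that
#    rather than the sequential dalton.x with mpi_dummy.o:
make linuxparallel.x
```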

Re: [OMPI users] mpirun fails on remote applications

2009-05-12 Thread Micha Feigin
It is usually best to separate the cluster (mpi) interfaces from the internet
interface.

Usually on a dedicated cluster it is best to have a master node that is
connected to the internet and client nodes that are connected to the master
node (and if needed tunnel the connection through it to the internet), or via a
gateway machine. That way the cluster machines don't need a firewall.

In case all machines are connected directly to the internet it is better to have
one (usually cheap) connection to the internet that can be firewalled, and a
(highend) connection inside the cluster that doesn't need a firewall.

On Tue, 12 May 2009 10:22:28 -0400
Jeff Squyres  wrote:

> Open MPI requires that each MPI process be able to connect to any  
> other MPI process in the same job with random TCP ports.  It is  
> usually easiest to leave the firewall off, or setup trust  
> relationships between your cluster nodes.
> 
> 
> On May 12, 2009, at 6:04 AM, feng chen wrote:
> 
> > thanks a lot. firewall it is.. It works with firewall's off, while  
> > that brings another questions from me. Is there anyway we can run  
> > mpirun while firwall 's on? If yes, how do we setup firewall or  
> > iptables?
> >
> > thank you
> >
> > From: Micha Feigin 
> > To: us...@open-mpi.org
> > Sent: Tuesday, May 12, 2009 4:30:30 AM
> > Subject: Re: [OMPI users] mpirun fails on remote applications
> >
> > On Tue, 12 May 2009 11:54:57 +0300
> > Lenny Verkhovsky  wrote:
> >
> > > sounds like firewall problems to or from anfield04.
> > > Lenny,
> > >
> > > On Tue, May 12, 2009 at 8:18 AM, feng chen   
> > wrote:
> > >
> >
> > I'm having a similar problem, not sure if it's related (gave up for  
> > the moment
> > on 1.3+ openmpi, 1.2.8 works fine nothing above that).
> >
> > 1. Try taking down the firewall and see if it works
> > 2. Make sure that passwordless ssh is working (not sure if it's  
> > needed for all
> > things but still ...)
> > 3. can you test it maybe with openmpi 1.2.8?
> > 4. also, does posting the job in the other direction work? (4 -> 5  
> > instead of 5 -> 4)
> > [fch6699@anfield04 test]$ mpirun -host anfield05 -np 4 ./hello
> >
> > From what it seems on my cluster for my specific problem is that  
> > machines have
> > different addresses based on which machine you are connecting from  
> > (they are
> > connected directly to each other, not through a switch with a  
> > central name
> > server), and name lookup seems to happen on the master instead of  
> > the client
> > node so it is getting the wrong address.
> >
> > > >  hi all,
> > > >
> > > > First of all,i'm new to openmpi. So i don't know much about mpi  
> > setting.
> > > > That's why i'm following manual and FAQ suggestions from the  
> > beginning.
> > > > Everything went well untile i try to run a pllication on a  
> > remote node by
> > > > using 'mpirun -np' command. It just hanging there without doing  
> > anything, no
> > > > error messanges, no
> > > > complaining or whatsoever. What confused me is that i can run  
> > application
> > > > over ssh with no problem, while it comes to mpirun, just stuck  
> > in there does
> > > > nothing.
> > > > I'm pretty sure i got everyting setup in the right way manner,  
> > including no
> > > > password signin over ssh, environment variables for bot  
> > interactive and
> > > > non-interactive logons.
> > > > A sample list of commands been used list as following:
> > > >
> > > >
> > > >
> > > >
> > > >  [fch6699@anfield05 test]$ mpicc -o hello hello.f
> > > > [fch6699@anfield05 test]$ ssh anfield04 ./hello
> > > > 0 of 1: Hello world!
> > > > [fch6699@anfield05 test]$ mpirun -host anfield05 -np 4 ./hello
> > > > 0 of 4: Hello world!
> > > > 2 of 4: Hello world!
> > > > 3 of 4: Hello world!
> > > > 1 of 4: Hello world!
> > > > [fch6699@anfield05 test]$ mpirun -host anfield04 -np 4 ./hello
> > > > just hanging there for years!!!
> > > > need help to fix this !!
> > > > if u try it in another way
> > > > [fch6699@anfield05 test]$ mpirun -hostfile my_hostfile -np 4 ./ 
> > hell
> > > > still nothing happened, no warnnings, no complains, no error  
> > messages.. !!
> > > >
> > > > All other files related to this issue can be found in  
> > my_files.tar.gz in
> > > > attachment.
> > > >
> > > > .cshrc
> > > > The output of the "ompi_info --all" command.
> > > > my_hostfile
> > > > hello.c
> > > > output of iptables
> > > >
> > > > The only thing i've noticed is that the port of our ssh has been  
> > changed
> > > > from 22 to other number for security issues.
> > > > Don't know will that have anything to with it or not.
> > > >
> > > >
> > > > Any help will be highly appreciated!!
> > > >
> > > > thanks in advance!
> > > >
> > > > Kevin
> > > >
> > > >
> > > >
> > > >
> > > > ___
> > > > users mailing list
> > > > us...@open-mpi.org
> > > > http://www.open-mpi.org/mailman/listinfo.cgi/users
> > > >
> >
> > ___
> > users mailing 

Re: [OMPI users] strange bug

2009-05-12 Thread Edgar Gabriel
hm, so I am out of ideas. I created multiple variants of test-programs 
which did what you basically described, and they all passed and did not 
generate problems. I compiled the MUMPS library and ran the tests that 
they have in the examples directory, and they all worked.


Additionally, I checked in the source code of Open MPI. In comm_dup 
there is only a single location where we raise the error MPI_ERR_INTERN 
(which was reported in your email). I am fairly positive, that this can 
not occur, else we would segfault prior to that (it is a stupid check, 
don't ask). Furthermore, the code segment that has been modified does 
not raise anywhere MPI_ERR_INTERN. Of course, it could be a secondary 
effect and be created somewhere else (PML_ADD or collective module 
selection) and comm_dup just passes the error code up.


One way or the other, I need more hints on what the code does. Any 
chance of getting a smaller code fragment which replicates the problem? 
It could use the MUMPS library, I am fine with that since I just 
compiled and installed it with the current ompi trunk...
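
As a starting point, a minimal standalone sketch of the pattern as  
described (assuming plain MPI_Comm_split/MPI_Comm_dup; this is NOT the  
actual MUMPS code) might look like:

```c
/* Sketch of a reproducer: split MPI_COMM_WORLD into one group per rank,
 * then repeatedly MPI_Comm_dup the singleton communicator, checking each
 * return code. Build with mpicc, run e.g. with: mpirun -np 16 ./dup_test */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, i, rc;
    MPI_Comm self_comm, dup_comm;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    /* 16 processes -> 16 groups, 1 process per group, as in the report */
    MPI_Comm_split(MPI_COMM_WORLD, rank, 0, &self_comm);

    for (i = 0; i < 200; i++) {   /* the report saw ~100 dups succeed */
        rc = MPI_Comm_dup(self_comm, &dup_comm);
        if (rc != MPI_SUCCESS) {
            fprintf(stderr, "rank %d: MPI_Comm_dup failed at iteration %d"
                            " (rc=%d)\n", rank, i, rc);
            break;
        }
        MPI_Comm_free(&dup_comm);
    }

    MPI_Comm_free(&self_comm);
    MPI_Finalize();
    return 0;
}
```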


Thanks
Edgar

Edgar Gabriel wrote:
I would say the probability is large that it is due to the recent 'fix'. 
 I will try to create a testcase similar to what you suggested. Could 
you give us maybe some hints on which functionality of MUMPS you are 
using, or even share the code/ a code fragment?


Thanks
Edgar

Jeff Squyres wrote:

Hey Edgar --

Could this have anything to do with your recent fixes?

On May 12, 2009, at 8:30 AM, Anton Starikov wrote:


hostfile from torque PBS_NODEFILE (OMPI is compilled with torque
support)

It happens with or without rankfile.
Started with
mpirun -np 16 ./somecode

mca parameters:

btl = self,sm,openib
mpi_maffinity_alone = 1
rmaps_base_no_oversubscribe = 1 (rmaps_base_no_oversubscribe = 0
doesn't change it)

I tested with both: "btl=self,sm" on 16c-core nodes and
"btl=self,sm,openib" on 8x dual-core nodes , result is the same.

It looks like it always occurs exactly at the same point in the
execution, not at the beginning, it is not first MPI_Comm_dup in the
code.

I can't say too much about particular piece of the code, where it is
happening, because it is in the 3rd-party library (MUMPS).  When error
occurs, MPI_Comm_dup in every task deals with single-task communicator
(MPI_Comm_split of initial MPI_Comm_world for 16 processes into 16
groups, 1 process per group). And I  can guess that before this error,
MPI_Comm_dup is called something like 100 of times by the same piece
of code on the same communicators without any problems.

I can say that it used to work correctly with all previous versions of
openmpi we used (1.2.8-1.3.2 and some earlier versions). It also works
correctly on other platforms/MPI implementations.

All environmental variables (PATH, LD_LIBRARY_PATH) are correct.
I recompiled code and 3rd-party libraries with this version of OMPI.











--
Edgar Gabriel
Assistant Professor
Parallel Software Technologies Lab  http://pstl.cs.uh.edu
Department of Computer Science  University of Houston
Philip G. Hoffman Hall, Room 524Houston, TX-77204, USA
Tel: +1 (713) 743-3857  Fax: +1 (713) 743-3335


Re: [OMPI users] strange bug

2009-05-12 Thread Anton Starikov

I will try to prepare test-case.


--
Anton Starikov.


On May 12, 2009, at 6:57 PM, Edgar Gabriel wrote:

hm, so I am out of ideas. I created multiple variants of test- 
programs which did what you basically described, and they all passed  
and did not generate problems. I compiled the MUMPS library and ran  
the tests that they have in the examples directory, and they all  
worked.


Additionally, I checked in the source code of Open MPI. In comm_dup  
there is only a single location where we raise the error  
MPI_ERR_INTERN (which was reported in your email). I am fairly  
positive, that this can not occur, else we would segfault prior to  
that (it is a stupid check, don't ask). Furthermore, the code  
segment that has been modified does not raise anywhere  
MPI_ERR_INTERN. Of course, it could be a secondary effect and be  
created somewhere else (PML_ADD or collective module selection) and  
comm_dup just passes the error code up.


One way or the other, I need more hints on what the code does. Any  
chance of getting a smaller code fragment which replicates the  
problem? It could use the MUMPS library, I am fine with that since I  
just compiled and installed it with the current ompi trunk...


Thanks
Edgar

Edgar Gabriel wrote:
I would say the probability is large that it is due to the recent  
'fix'.  I will try to create a testcase similar to what you  
suggested. Could you give us maybe some hints on which  
functionality of MUMPS you are using, or even share the code/ a  
code fragment?

Thanks
Edgar
Jeff Squyres wrote:

Hey Edgar --

Could this have anything to do with your recent fixes?

On May 12, 2009, at 8:30 AM, Anton Starikov wrote:


hostfile from torque PBS_NODEFILE (OMPI is compilled with torque
support)

It happens with or without rankfile.
Started with
mpirun -np 16 ./somecode

mca parameters:

btl = self,sm,openib
mpi_maffinity_alone = 1
rmaps_base_no_oversubscribe = 1 (rmaps_base_no_oversubscribe = 0
doesn't change it)

I tested with both: "btl=self,sm" on 16c-core nodes and
"btl=self,sm,openib" on 8x dual-core nodes , result is the same.

It looks like it always occurs exactly at the same point in the
execution, not at the beginning, it is not first MPI_Comm_dup in  
the

code.

I can't say too much about particular piece of the code, where it  
is
happening, because it is in the 3rd-party library (MUMPS).  When  
error
occurs, MPI_Comm_dup in every task deals with single-task  
communicator

(MPI_Comm_split of initial MPI_Comm_world for 16 processes into 16
groups, 1 process per group). And I  can guess that before this  
error,
MPI_Comm_dup is called something like 100 of times by the same  
piece

of code on the same communicators without any problems.

I can say that it used to work correctly with all previous  
versions of
openmpi we used (1.2.8-1.3.2 and some earlier versions). It also  
works

correctly on other platforms/MPI implementations.

All environmental variables (PATH, LD_LIBRARY_PATH) are correct.
I recompiled code and 3rd-party libraries with this version of  
OMPI.










--
Edgar Gabriel
Assistant Professor
Parallel Software Technologies Lab  http://pstl.cs.uh.edu
Department of Computer Science  University of Houston
Philip G. Hoffman Hall, Room 524Houston, TX-77204, USA
Tel: +1 (713) 743-3857  Fax: +1 (713) 743-3335
___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??

2009-05-12 Thread Ralph Castain

Okay, I fixed this today too (r21219)


On May 11, 2009, at 11:27 PM, Anton Starikov wrote:


Now there is another problem :)

You can oversubscribe a node, at least by 1 task. If your hostfile and rankfile limit you to N procs, you can ask mpirun for N+1 and it will not be rejected.

Although in reality there will only be N tasks. So, if your hostfile limit is 4, then "mpirun -np 4" and "mpirun -np 5" both work, but in both cases there are only 4 tasks. It isn't crucial, because there is no real oversubscription, but there is still some bug which could affect something in the future.


--
Anton Starikov.

On May 12, 2009, at 1:45 AM, Ralph Castain wrote:


This is fixed as of r21208.

Thanks for reporting it!
Ralph


On May 11, 2009, at 12:51 PM, Anton Starikov wrote:

Although removing this check solves the problem of having more slots in the rankfile than necessary, there is another problem.


If I set rmaps_base_no_oversubscribe=1 then, for example, with:


hostfile:

node01
node01
node02
node02

rankfile:

rank 0=node01 slot=1
rank 1=node01 slot=0
rank 2=node02 slot=1
rank 3=node02 slot=0

mpirun -np 4 ./something

complains with:

"There are not enough slots available in the system to satisfy the 4 slots that were requested by the application"

but "mpirun -np 3 ./something" will work. It works when you ask for 1 CPU less, and the behavior is the same in every case (shared nodes, non-shared nodes, multi-node).


If you switch off rmaps_base_no_oversubscribe, then it works and all affinities are set as requested in the rankfile; there is no oversubscription.



Anton.

On May 5, 2009, at 3:08 PM, Ralph Castain wrote:

Ah - thx for catching that, I'll remove that check. It is no longer required.


Thx!

On Tue, May 5, 2009 at 7:04 AM, Lenny Verkhovsky wrote:

According to the code, it does care.

$vi orte/mca/rmaps/rank_file/rmaps_rank_file.c +572

ival = orte_rmaps_rank_file_value.ival;
if ( ival > (np-1) ) {
    orte_show_help("help-rmaps_rank_file.txt", "bad-rankfile", true, ival, rankfile);
    rc = ORTE_ERR_BAD_PARAM;
    goto unlock;
}

If I remember correctly, I used an array to map ranks, and since the length of the array is NP, the maximum index must be less than NP; so if you have a rank number > NP-1, there is no place to put it in the array.


"Likewise, if you have more procs than the rankfile specifies, we map the additional procs either byslot (default) or bynode (if you specify that option). So the rankfile doesn't need to contain an entry for every proc." - Correct point.



Lenny.


On 5/5/09, Ralph Castain wrote:
Sorry Lenny, but that isn't correct. The rankfile mapper doesn't care if the rankfile contains additional info - it only maps up to the number of processes, and ignores anything beyond that number. So there is no need to remove the additional info.


Likewise, if you have more procs than the rankfile specifies, we map the additional procs either byslot (default) or bynode (if you specify that option). So the rankfile doesn't need to contain an entry for every proc.
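[Editor's note: a hypothetical illustration of the partial-rankfile behavior described above — node names, file names, and slot counts are made up.]

```
# rankfile (rankf): only ranks 0-1 are pinned explicitly
rank 0=node1 slot=0
rank 1=node1 slot=1

# hostfile (hostf)
node1 slots=2
node2 slots=2
```

With "mpirun -np 4 --rankfile rankf --hostfile hostf ./app", ranks 0-1 would be placed as specified, and ranks 2-3 would be mapped byslot onto the remaining slots (or bynode, if that option is given).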


Just don't want to confuse folks.
Ralph




On Tue, May 5, 2009 at 5:59 AM, Lenny Verkhovsky wrote:

Hi,
the maximum rank number must be less than np.
if np=1 then there is only rank 0 in the system, so rank 1 is invalid.

please remove "rank 1=node2 slot=*" from the rankfile
Best regards,
Lenny.

On Mon, May 4, 2009 at 11:14 AM, Geoffroy Pignot wrote:

Hi ,

I got the openmpi-1.4a1r21095.tar.gz tarball, but unfortunately my command doesn't work:


cat rankf:
rank 0=node1 slot=*
rank 1=node2 slot=*

cat hostf:
node1 slots=2
node2 slots=2

mpirun --rankfile rankf --hostfile hostf --host node1 -n 1 hostname : --host node2 -n 1 hostname


Error, invalid rank (1) in the rankfile (rankf)

--
[r011n006:28986] [[45541,0],0] ORTE_ERROR_LOG: Bad parameter in file rmaps_rank_file.c at line 403
[r011n006:28986] [[45541,0],0] ORTE_ERROR_LOG: Bad parameter in file base/rmaps_base_map_job.c at line 86
[r011n006:28986] [[45541,0],0] ORTE_ERROR_LOG: Bad parameter in file base/plm_base_launch_support.c at line 86
[r011n006:28986] [[45541,0],0] ORTE_ERROR_LOG: Bad parameter in file plm_rsh_module.c at line 1016



Ralph, could you tell me if my command syntax is correct or not? If not, could you give me the expected one?


Regards

Geoffroy




2009/4/30 Geoffroy Pignot 

Immediately Sir !!! :)

Thanks again Ralph

Geoffroy





--

Message: 2
Date: Thu, 30 Apr 2009 06:45:39 -0600
From: Ralph Castain 
Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??
To: Open MPI Users 
Message-ID:
<71d2d8cc0904300545v61a42fe1k50086d2704d0f...@mail.gmail.com>
Content-Type: text/plain; charset="iso-8859-1"

I believe this is fixed now in our development trunk - you can download any tarball starting from last night and give it a try, if you like. Any feedback would be appreciated.

Ralph


On Apr 14, 2009, at 

[OMPI users] ****---How to configure NIS and MPI on spread NICs?----****

2009-05-12 Thread shan axida
Hello all,
I want to configure NIS and MPI on different networks. For example, NIS uses eth0 and MPI uses eth1, something like that. How can I do that?


Axida
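[Editor's note: the thread ends here without a reply. As a hedged sketch of one common approach, Open MPI's TCP components can be restricted to particular interfaces with MCA parameters, which keeps MPI traffic on one NIC while NIS stays on the other; the interface names below are assumptions taken from the question.]

```shell
# Keep MPI point-to-point TCP traffic and the runtime's out-of-band
# wire-up traffic on eth1; NIS can keep using eth0.
mpirun --mca btl_tcp_if_include eth1 \
       --mca oob_tcp_if_include eth1 \
       -np 4 ./hello
```

The same parameters can be made persistent by placing `btl_tcp_if_include = eth1` and `oob_tcp_if_include = eth1` in `$HOME/.openmpi/mca-params.conf`.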