Re: [OMPI users] mpirun hangs when launching job on remote node

2009-03-18 Thread Raymond Wan


Hi Ron,


Ron Babich wrote:
Thanks for your response.  I had noticed your thread, which is why I'm 
embarrassed (but happy) to say that it looks like my problem was the 
same as yours.  I mentioned in my original email that there was no 
firewall running, which it turns out was a lie.  I think that when I 
checked before, I must have forgotten "sudo."  Instead of "permission 
denied" or the like, I got this misleading response:



Oops!  My turn to apologize -- I have to admit that I read through your post
very quickly and skipped the sentence about the lack of a firewall.  However, 
everything else sounded exactly like what I was getting.

Actually, it would be good to find out what the problem with the firewall is 
and to not simply turn it off.  As I'm not the sysadmin, I can't really play 
with the settings to find out.

In your message (which I will now read more carefully :-) ), it says that you are using 
CentOS, which is based on RH.  Our systems are using Fedora.  Perhaps it has something to 
do with RH's defaults for the firewall settings?  Another system that worked 
"immediately" was a Debian system.  Anyway, if you find out a solution that 
doesn't require the firewall to be turned off, please let me know -- I think our sysadmin 
would be interested, too.

Ray



Re: [OMPI users] openmpi 1.3 and gridengine tight integration problem

2009-03-18 Thread Reuti

Hi,

it shouldn't be necessary to supply a machinefile, as the one
generated by SGE is taken automatically (i.e. the granted nodes are
honored). Did you submit the job requesting a PE?
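
For example, a submission requesting a PE looks roughly like this (the PE
name and slot count are placeholders):

   qsub -pe your_pe 16 simple-job.sh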


-- Reuti


On 18.03.2009, at 04:51, Salmon, Rene wrote:



Hi,

I have looked through the list archives and google but could not  
find anything related to what I am seeing. I am simply trying to  
run the basic cpi.c code using SGE and tight integration.


If I run outside SGE I can run my jobs just fine:
hpcp7781(salmr0)132:mpiexec -np 2 --machinefile x a.out
Process 0 on hpcp7781
Process 1 on hpcp7782
pi is approximately 3.1416009869231241, Error is 0.0809
wall clock time = 0.032325


If I submit to SGE I get this:

[hpcp7781:08527] mca: base: components_open: Looking for plm components
[hpcp7781:08527] mca: base: components_open: opening plm components
[hpcp7781:08527] mca: base: components_open: found loaded component rsh
[hpcp7781:08527] mca: base: components_open: component rsh has no register function
[hpcp7781:08527] mca: base: components_open: component rsh open function successful
[hpcp7781:08527] mca: base: components_open: found loaded component slurm
[hpcp7781:08527] mca: base: components_open: component slurm has no register function
[hpcp7781:08527] mca: base: components_open: component slurm open function successful
[hpcp7781:08527] mca:base:select: Auto-selecting plm components
[hpcp7781:08527] mca:base:select:(  plm) Querying component [rsh]
[hpcp7781:08527] [[INVALID],INVALID] plm:rsh: using /hpc/SGE/bin/lx24-amd64/qrsh for launching
[hpcp7781:08527] mca:base:select:(  plm) Query of component [rsh] set priority to 10
[hpcp7781:08527] mca:base:select:(  plm) Querying component [slurm]
[hpcp7781:08527] mca:base:select:(  plm) Skipping component [slurm]. Query failed to return a module
[hpcp7781:08527] mca:base:select:(  plm) Selected component [rsh]
[hpcp7781:08527] mca: base: close: component slurm closed
[hpcp7781:08527] mca: base: close: unloading component slurm
Starting server daemon at host "hpcp7782"
error: executing task of job 1702026 failed:
--------------------------------------------------------------------------
A daemon (pid 8528) died unexpectedly with status 1 while attempting
to launch so we are aborting.

There may be more information reported by the environment (see above).

This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--------------------------------------------------------------------------
mpirun: clean termination accomplished

[hpcp7781:08527] mca: base: close: component rsh closed
[hpcp7781:08527] mca: base: close: unloading component rsh




Seems to me orted is not starting on the remote node.  I have
LD_LIBRARY_PATH set in my shell startup files.  If I do an ldd on
orted I see this:


hpcp7781(salmr0)135:ldd /bphpc7/vol0/salmr0/ompi/bin/orted
libopen-rte.so.0 => /bphpc7/vol0/salmr0/ompi/lib/libopen-rte.so.0 (0x2ac5b14e2000)
libopen-pal.so.0 => /bphpc7/vol0/salmr0/ompi/lib/libopen-pal.so.0 (0x2ac5b1628000)

libdl.so.2 => /lib64/libdl.so.2 (0x2ac5b17a9000)
libnsl.so.1 => /lib64/libnsl.so.1 (0x2ac5b18ad000)
libutil.so.1 => /lib64/libutil.so.1 (0x2ac5b19c4000)
libm.so.6 => /lib64/libm.so.6 (0x2ac5b1ac7000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x2ac5b1c1c000)
libc.so.6 => /lib64/libc.so.6 (0x2ac5b1d34000)
/lib64/ld-linux-x86-64.so.2 (0x2ac5b13c6000)


Looks like gridengine is using qrsh to start orted on the remote  
nodes. qrsh might not be reading my shell startup file and setting  
LD_LIBRARY_PATH.


Thanks for any help with this.

Rene






Re: [OMPI users] open mpi on non standard ssh port

2009-03-18 Thread Bernhard Knapp

Come on, it must somehow be possible to use Open MPI on a port other than 22!? ;-)



--

Message: 3
Date: Tue, 17 Mar 2009 09:45:29 +0100
From: Bernhard Knapp 
Subject: [OMPI users] open mpi on non standard ssh port
To: us...@open-mpi.org
Message-ID: <49bf6329.8090...@meduniwien.ac.at>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed

Hi

I want to start a GROMACS simulation on a small cluster where non-standard
ports are used for ssh. If I just use a "normal" machine list file (with the
IPs of the nodes), the following error comes up:

ssh: connect to host 192.168.0.103 port 22: Connection refused

I guess that I need to somehow tell it to use the other ports. I tried
it in the following way (machine list file):

192.168.0.101 -p 5101
192.168.0.102 -p 5102
192.168.0.103 -p 5103
192.168.0.104 -p 5104

But it seems this is not the correct way to specify the port:
Open RTE detected a parse error in the hostfile:
   /home/bknapp/scripts/machinefile.txt
It occured on line number 1 on token 5:
   -p

How can I tell it to use port 5101 on machine 192.168.0.101?
Maybe the question is stupid, but I could not find a solution via Google
or the list search function ...


cheers
Bernhard


 



Re: [OMPI users] open mpi on non standard ssh port

2009-03-18 Thread Reuti

Bernhard,

On 18.03.2009, at 09:19, Bernhard Knapp wrote:

come on, it must be somehow possible to use openmpi not on port  
22!? ;-)


it's not an issue of Open MPI but of ssh. You need a file ~/.ssh/config
in your home directory with two lines:


host *
   port 1234

or whatever port you need.
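
If each node listens on a different port, per-host entries work as well -- a
sketch using the addresses and ports from your original mail:

host 192.168.0.101
   port 5101
host 192.168.0.102
   port 5102
host 192.168.0.103
   port 5103
host 192.168.0.104
   port 5104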

-- Reuti




--

Message: 3
Date: Tue, 17 Mar 2009 09:45:29 +0100
From: Bernhard Knapp 
Subject: [OMPI users] open mpi on non standard ssh port
To: us...@open-mpi.org
Message-ID: <49bf6329.8090...@meduniwien.ac.at>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed

Hi

I want to start a gromacs simulation on a small cluster where non  
standard ports are used for ssh. If I just use a "normal"  
maschinelist file (with the ips of the nodes), consequently, the  
following error comes up:

ssh: connect to host 192.168.0.103 port 22: Connection refused

I guess that I need to somehow tell him to use the other ports. I  
tried it in the following way (maschinelist file):

192.168.0.101 -p 5101
192.168.0.102 -p 5102
192.168.0.103 -p 5103
192.168.0.104 -p 5104

But it seems this is not the correct way to specifiy the port:
Open RTE detected a parse error in the hostfile:
   /home/bknapp/scripts/machinefile.txt
It occured on line number 1 on token 5:
   -p

How can I tell him to use port 5101 on machine 192.168.0.101?
May be the question is stupid but I could not find a solution via  
google or search function ...


cheers
Bernhard








Re: [OMPI users] mpirun hangs when launching job on remote node

2009-03-18 Thread Bogdan Costescu

On Wed, 18 Mar 2009, Raymond Wan wrote:


Perhaps it has something to do with RH's defaults for the firewall settings?


If your sysadmin uses kickstart to configure the systems, (s)he has to
add 'firewall --disabled'; similarly for SELinux, which seems to have
caused problems for another person on this list. OTOH, if (s)he blindly
copied the config for a workstation to a cluster node, maybe some more
education is needed first...
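
For reference, the kickstart lines in question would look something like this
(the selinux line is the analogous directive; whether disabling them is
appropriate for your site is a separate decision):

   firewall --disabled
   selinux --disabled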



Another system that worked "immediately" was a Debian system.


That's because Debian doesn't configure a firewall or SELinux, leaving
the admin the responsibility to do it.


Anyway, if you find out a solution that doesn't require the firewall 
to be turned off, please let me know -- I think our sysadmin would 
be interested, too.


Depending on your definition of 'firewall turned off', the new feature
of restricting the ports used by Open MPI will help. The firewall can stay
on, but it should be configured to open the range of ports used by
Open MPI.
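
As a sketch of what that could look like, assuming your build exposes the TCP
port-range parameters (check with "ompi_info --param btl tcp" and
"ompi_info --param oob tcp"; the range here is just an example):

   mpirun --mca btl_tcp_port_min_v4 10000 --mca btl_tcp_port_range_v4 100 ...

and then open only that range in the firewall, e.g.:

   iptables -A INPUT -p tcp --dport 10000:10099 -j ACCEPT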


--
Bogdan Costescu

IWR, University of Heidelberg, INF 368, D-69120 Heidelberg, Germany
Phone: +49 6221 54 8240, Fax: +49 6221 54 8850
E-mail: bogdan.coste...@iwr.uni-heidelberg.de


Re: [OMPI users] mpirun hangs when launching job on remote node

2009-03-18 Thread Raymond Wan


Hi Bogdan,

Thanks for the information; I'm looking forward to the new Open MPI feature of
port restriction...

About Debian, I was wondering about that... I've had no problems with it and I
was thinking everything was just done for me; of course, another possibility is
that there was no firewall to begin with and I didn't know about it.  Alas,
it's the latter... I'd better look into it, as I was basically oblivious to the
lack of a firewall...

Ray



Bogdan Costescu wrote:

On Wed, 18 Mar 2009, Raymond Wan wrote:

Perhaps it has something to do with RH's defaults for the firewall 
settings?


If your sysadmin uses kickstart to configure the systems, (s)he has to 
add 'firewall --disabled'; similar for SELinux which seems to have 
caused problems to another person on this list. OTOH, if (s)he blindly 
copied the config for a workstation to a cluster node, maybe some more 
education is needed first...



Another system that worked "immediately" was a Debian system.


That's because Debian doesn't configure a firewall or SELinux, leaving 
the admin the responsability to do it.


Anyway, if you find out a solution that doesn't require the firewall 
to be turned off, please let me know -- I think our sysadmin would be 
interested, too.


Depending on your definition of 'firewall turned off', the new feature 
of restricting ports used by OpenMPI will help. The firewall can stay 
on, but it should be configured to open a range of ports used by OpenMPI.






[OMPI users] [Fwd: Re: open mpi on non standard ssh port]

2009-03-18 Thread Bernhard Knapp

Hey again,

I tried to build a workaround via port redirection: iptables -t nat -A
PREROUTING -i eth1 -p tcp --dport 22 -j REDIRECT --to-port 5101



If I do that then I can start the job:

mpirun -np 2 -machinefile /home/bknapp/scripts/machinefile.txt mdrun 
-np 2 -nice 0 -s 1fyt_PKYVKQNTLELAT_bindingRegionsOnly.md.tpr -o 
1fyt_PKYVKQNTLELAT_bindingRegionsOnly.md.trr -c 
1fyt_PKYVKQNTLELAT_bindingRegionsOnly.md.pdb -g 
1fyt_PKYVKQNTLELAT_bindingRegionsOnly.md.log -e 
1fyt_PKYVKQNTLELAT_bindingRegionsOnly.md.edr -v

bknapp@192.168.0.104's password:
NNODES=2, MYRANK=0, HOSTNAME=quoVadis01
NNODES=2, MYRANK=1, HOSTNAME=quoVadis04

but it comes up with 
"[quoVadis01][[24802,1],0][btl_tcp_endpoint.c:631:mca_btl_tcp_endpoint_complete_connect] 
connect() failed: No route to host (113)". The CPUs are busy calculating on
both (physically different) machines, but unfortunately no results are
written ...


Was the port redirection of 22 not enough or is there another problem?

thx
Bernhard





 Original Message 
Subject:Re: open mpi on non standard ssh port
Date:   Wed, 18 Mar 2009 09:19:18 +0100
From:   Bernhard Knapp 
To: us...@open-mpi.org
References: 



come on, it must be somehow possible to use openmpi not on port 22!? ;-)



--

Message: 3
Date: Tue, 17 Mar 2009 09:45:29 +0100
From: Bernhard Knapp 
Subject: [OMPI users] open mpi on non standard ssh port
To: us...@open-mpi.org
Message-ID: <49bf6329.8090...@meduniwien.ac.at>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed

Hi

I want to start a gromacs simulation on a small cluster where non 
standard ports are used for ssh. If I just use a "normal" maschinelist 
file (with the ips of the nodes), consequently, the following error 
comes up:

ssh: connect to host 192.168.0.103 port 22: Connection refused

I guess that I need to somehow tell him to use the other ports. I tried 
it in the following way (maschinelist file):

192.168.0.101 -p 5101
192.168.0.102 -p 5102
192.168.0.103 -p 5103
192.168.0.104 -p 5104

But it seems this is not the correct way to specifiy the port:
Open RTE detected a parse error in the hostfile:
   /home/bknapp/scripts/machinefile.txt
It occured on line number 1 on token 5:
   -p

How can I tell him to use port 5101 on machine 192.168.0.101?
May be the question is stupid but I could not find a solution via google 
or search function ...


cheers
Bernhard


 




--
Dipl.-Ing. (FH) Bernhard Knapp
Univ.-Ass.postgrad.
Unit for Medical Statistics and Informatics - Section for Biomedical 
Computersimulation and Bioinformatics
Medical University of Vienna - General Hospital
Spitalgasse 23 A-1090 WIEN / AUSTRIA
Room: BT88 - 88.03.712
Phone: +43(1) 40400-6673



Re: [OMPI users] openmpi 1.3 and gridengine tight integration problem

2009-03-18 Thread Rene Salmon
Hi,

Thanks for the help.  I only use the machine file to run outside of SGE
just to test/prove that things work outside of SGE.

When I run within SGE, here is what the job script looks like:

hpcp7781(salmr0)128:cat simple-job.sh
#!/bin/csh
#
#$ -S /bin/csh
setenv LD_LIBRARY_PATH /bphpc7/vol0/salmr0/ompi/lib
mpirun --mca plm_base_verbose 20 --prefix /bphpc7/vol0/salmr0/ompi -np
16 /bphpc7/vol0/salmr0/SGE/a.out


We are using PEs.  Here is what the PE looks like:

hpcp7781(salmr0)129:qconf -sp pavtest
pe_name   pavtest
slots 16
user_listsNONE
xuser_lists   NONE
start_proc_args   /bin/true
stop_proc_args/bin/true
allocation_rule   8
control_slavesFALSE
job_is_first_task FALSE
urgency_slots min


Here is the qsub line to submit the job:

>>qsub -pe pavtest 16 simple-job.sh


The job seems to run fine with no problems within SGE if I contain the
job within one node.  As soon as the job has to use more than one node,
things stop working with the message I posted about LD_LIBRARY_PATH, and
orted seems not to start on the remote nodes.

Thanks
Rene




On Wed, 2009-03-18 at 07:45 +, Reuti wrote:
> Hi,
> 
> it shouldn't be necessary to supply a machinefile, as the one 
> generated by SGE is taken automatically (i.e. the granted nodes are 
> honored). You submitted the job requesting a PE?
> 
> -- Reuti
> 
> 
> Am 18.03.2009 um 04:51 schrieb Salmon, Rene:
> 
> >
> > Hi,
> >
> > I have looked through the list archives and google but could not 
> > find anything related to what I am seeing. I am simply trying to 
> > run the basic cpi.c code using SGE and tight integration.
> >
> > If run outside SGE i can run my jobs just fine:
> > hpcp7781(salmr0)132:mpiexec -np 2 --machinefile x a.out
> > Process 0 on hpcp7781
> > Process 1 on hpcp7782
> > pi is approximately 3.1416009869231241, Error is 0.0809
> > wall clock time = 0.032325
> >
> >
> > If I submit to SGE I get this:
> >
> > [hpcp7781:08527] mca: base: components_open: Looking for plm 
> > components
> > [hpcp7781:08527] mca: base: components_open: opening plm components
> > [hpcp7781:08527] mca: base: components_open: found loaded component 
> > rsh
> > [hpcp7781:08527] mca: base: components_open: component rsh has no 
> > register function
> > [hpcp7781:08527] mca: base: components_open: component rsh open 
> > function successful
> > [hpcp7781:08527] mca: base: components_open: found loaded component 
> > slurm
> > [hpcp7781:08527] mca: base: components_open: component slurm has no 
> > register function
> > [hpcp7781:08527] mca: base: components_open: component slurm open 
> > function successful
> > [hpcp7781:08527] mca:base:select: Auto-selecting plm components
> > [hpcp7781:08527] mca:base:select:(  plm) Querying component [rsh]
> > [hpcp7781:08527] [[INVALID],INVALID] plm:rsh: using /hpc/SGE/bin/
> > lx24-amd64/qrsh for launching
> > [hpcp7781:08527] mca:base:select:(  plm) Query of component [rsh] 
> > set priority to 10
> > [hpcp7781:08527] mca:base:select:(  plm) Querying component [slurm]
> > [hpcp7781:08527] mca:base:select:(  plm) Skipping component 
> > [slurm]. Query failed to return a module
> > [hpcp7781:08527] mca:base:select:(  plm) Selected component [rsh]
> > [hpcp7781:08527] mca: base: close: component slurm closed
> > [hpcp7781:08527] mca: base: close: unloading component slurm
> > Starting server daemon at host "hpcp7782"
> > error: executing task of job 1702026 failed:
> >
> --
> > 
> > A daemon (pid 8528) died unexpectedly with status 1 while attempting
> > to launch so we are aborting.
> >
> > There may be more information reported by the environment (see
> above).
> >
> > This may be because the daemon was unable to find all the needed 
> > shared
> > libraries on the remote node. You may set your LD_LIBRARY_PATH to 
> > have the
> > location of the shared libraries on the remote nodes and this will
> > automatically be forwarded to the remote nodes.
> >
> --
> > 
> >
> --
> > 
> > mpirun noticed that the job aborted, but has no info as to the
> process
> > that caused that situation.
> >
> --
> > 
> > mpirun: clean termination accomplished
> >
> > [hpcp7781:08527] mca: base: close: component rsh closed
> > [hpcp7781:08527] mca: base: close: unloading component rsh
> >
> >
> >
> >
> > Seems to me orted is not starting on the remote node.  I have 
> > LD_LIBRARY_PATH set on my shell startup files.  If I do an ldd on 
> > orted i see this:
> >
> > hpcp7781(salmr0)135:ldd /bphpc7/vol0/salmr0/ompi/bin/orted
> > libopen-rte.so.0 => /bphpc7/vol0/salmr0/ompi/lib/libopen-
> > rte.so.0 (0x2ac5b14e2000)
> > libopen-pal.so.0 => /bphpc7/vol0/salmr0/ompi/lib/libopen-
> > pal.so.0 (0x00

Re: [OMPI users] openmpi 1.3 and gridengine tight integration problem

2009-03-18 Thread Reuti

Hi,

On 18.03.2009, at 14:25, Rene Salmon wrote:

Thanks for the help.  I only use the machine file to run outside of  
SGE

just to test/prove that things work outside of SGE.


aha. Did you compile Open MPI 1.3 with the SGE option?
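
For reference, in the 1.3 series gridengine support is no longer built by
default; the configure step would look roughly like this (the prefix is just
your install path):

   ./configure --prefix=/bphpc7/vol0/salmr0/ompi --with-sge
   make all install

Afterwards "ompi_info | grep gridengine" should list the gridengine component.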



When I run with in SGE here is what the job script looks like:

hpcp7781(salmr0)128:cat simple-job.sh
#!/bin/csh
#
#$ -S /bin/csh


-S will only work if the queue configuration is set to  
posix_compliant. If it's set to unix_behavior, the first line of the  
script is already sufficient.
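
You can check this with something like (the queue name is a placeholder):

   qconf -sq all.q | grep shell_start_mode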




setenv LD_LIBRARY_PATH /bphpc7/vol0/salmr0/ompi/lib


Maybe you have to set this LD_LIBRARY_PATH in your .cshrc, so it's  
known automatically on the nodes.
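
A sketch of what that could look like in ~/.cshrc, using the path from your
job script (the guard avoids an "undefined variable" error when the variable
is not already set):

   if ( $?LD_LIBRARY_PATH ) then
      setenv LD_LIBRARY_PATH /bphpc7/vol0/salmr0/ompi/lib:${LD_LIBRARY_PATH}
   else
      setenv LD_LIBRARY_PATH /bphpc7/vol0/salmr0/ompi/lib
   endif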



mpirun --mca plm_base_verbose 20 --prefix /bphpc7/vol0/salmr0/ompi -np
16 /bphpc7/vol0/salmr0/SGE/a.out


Do you use --mca... only for debugging or why is it added here?

-- Reuti




We are using PEs.  Here is what the PE looks like:

hpcp7781(salmr0)129:qconf -sp pavtest
pe_name   pavtest
slots 16
user_listsNONE
xuser_lists   NONE
start_proc_args   /bin/true
stop_proc_args/bin/true
allocation_rule   8
control_slavesFALSE
job_is_first_task FALSE
urgency_slots min


here is he qsub line to submit the job:


qsub -pe pavtest 16 simple-job.sh



The job seems to run fine with no problems with in SGE if I contain  
the

job with in one node.  As soon as the job has to use more than one one
things stop working with the message I posted about LD_LIBRARY_PATH  
and

orted seems not to start on the remote nodes.

Thanks
Rene




On Wed, 2009-03-18 at 07:45 +, Reuti wrote:

Hi,

it shouldn't be necessary to supply a machinefile, as the one
generated by SGE is taken automatically (i.e. the granted nodes are
honored). You submitted the job requesting a PE?

-- Reuti


Am 18.03.2009 um 04:51 schrieb Salmon, Rene:



Hi,

I have looked through the list archives and google but could not
find anything related to what I am seeing. I am simply trying to
run the basic cpi.c code using SGE and tight integration.

If run outside SGE i can run my jobs just fine:
hpcp7781(salmr0)132:mpiexec -np 2 --machinefile x a.out
Process 0 on hpcp7781
Process 1 on hpcp7782
pi is approximately 3.1416009869231241, Error is 0.0809
wall clock time = 0.032325


If I submit to SGE I get this:

[hpcp7781:08527] mca: base: components_open: Looking for plm
components
[hpcp7781:08527] mca: base: components_open: opening plm components
[hpcp7781:08527] mca: base: components_open: found loaded component
rsh
[hpcp7781:08527] mca: base: components_open: component rsh has no
register function
[hpcp7781:08527] mca: base: components_open: component rsh open
function successful
[hpcp7781:08527] mca: base: components_open: found loaded component
slurm
[hpcp7781:08527] mca: base: components_open: component slurm has no
register function
[hpcp7781:08527] mca: base: components_open: component slurm open
function successful
[hpcp7781:08527] mca:base:select: Auto-selecting plm components
[hpcp7781:08527] mca:base:select:(  plm) Querying component [rsh]
[hpcp7781:08527] [[INVALID],INVALID] plm:rsh: using /hpc/SGE/bin/
lx24-amd64/qrsh for launching
[hpcp7781:08527] mca:base:select:(  plm) Query of component [rsh]
set priority to 10
[hpcp7781:08527] mca:base:select:(  plm) Querying component [slurm]
[hpcp7781:08527] mca:base:select:(  plm) Skipping component
[slurm]. Query failed to return a module
[hpcp7781:08527] mca:base:select:(  plm) Selected component [rsh]
[hpcp7781:08527] mca: base: close: component slurm closed
[hpcp7781:08527] mca: base: close: unloading component slurm
Starting server daemon at host "hpcp7782"
error: executing task of job 1702026 failed:

- 
-


A daemon (pid 8528) died unexpectedly with status 1 while attempting
to launch so we are aborting.

There may be more information reported by the environment (see

above).


This may be because the daemon was unable to find all the needed
shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to
have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.

- 
-



- 
-


mpirun noticed that the job aborted, but has no info as to the

process

that caused that situation.

- 
-


mpirun: clean termination accomplished

[hpcp7781:08527] mca: base: close: component rsh closed
[hpcp7781:08527] mca: base: close: unloading component rsh




Seems to me orted is not starting on the remote node.  I have
LD_LIBRARY_PATH set on my shell startup files.  If I do an ldd on
orted i see this:

hpcp7781(salmr0)135:ldd /bphpc7/vol0/salmr0/ompi/bin/orted
libopen-rte.so.0 => /bphpc7/vol0/salmr0/ompi/lib/libopen-

Re: [OMPI users] openmpi 1.3 and gridengine tight integration problem

2009-03-18 Thread Rene Salmon

> 
> aha. Did you compile Open MPI 1.3 with the SGE option?
> 

Yes I did.

hpcp7781(salmr0)142:ompi_info |grep grid
 MCA ras: gridengine (MCA v2.0, API v2.0, Component v1.3)


> 
> > setenv LD_LIBRARY_PATH /bphpc7/vol0/salmr0/ompi/lib
> 
> Maybe you have to set this LD_LIBRARY_PATH in your .cshrc, so it's 
> known automatically on the nodes.
> 

Yes. I also have "setenv LD_LIBRARY_PATH /bphpc7/vol0/salmr0/ompi/lib"
in my .cshrc as well.  I just wanted to make doubly sure that it was
there.

I even tried putting "/bphpc7/vol0/salmr0/ompi/lib"
in /etc/ld.so.conf system-wide just to see if that would help,
but still the same results.

> > mpirun --mca plm_base_verbose 20 --prefix /bphpc7/vol0/salmr0/ompi
> -np
> > 16 /bphpc7/vol0/salmr0/SGE/a.out
> 
> Do you use --mca... only for debugging or why is it added here?
> 


I only put that there for debugging.  Is there a different flag I should
use to get more debug info?

Thanks
Rene




> -- Reuti
> 
> 
> >
> > We are using PEs.  Here is what the PE looks like:
> >
> > hpcp7781(salmr0)129:qconf -sp pavtest
> > pe_name   pavtest
> > slots 16
> > user_listsNONE
> > xuser_lists   NONE
> > start_proc_args   /bin/true
> > stop_proc_args/bin/true
> > allocation_rule   8
> > control_slavesFALSE
> > job_is_first_task FALSE
> > urgency_slots min
> >
> >
> > here is he qsub line to submit the job:
> >
> >>> qsub -pe pavtest 16 simple-job.sh
> >
> >
> > The job seems to run fine with no problems with in SGE if I contain 
> > the
> > job with in one node.  As soon as the job has to use more than one
> one
> > things stop working with the message I posted about LD_LIBRARY_PATH 
> > and
> > orted seems not to start on the remote nodes.
> >
> > Thanks
> > Rene
> >
> >
> >
> >
> > On Wed, 2009-03-18 at 07:45 +, Reuti wrote:
> >> Hi,
> >>
> >> it shouldn't be necessary to supply a machinefile, as the one
> >> generated by SGE is taken automatically (i.e. the granted nodes are
> >> honored). You submitted the job requesting a PE?
> >>
> >> -- Reuti
> >>
> >>
> >> Am 18.03.2009 um 04:51 schrieb Salmon, Rene:
> >>
> >>>
> >>> Hi,
> >>>
> >>> I have looked through the list archives and google but could not
> >>> find anything related to what I am seeing. I am simply trying to
> >>> run the basic cpi.c code using SGE and tight integration.
> >>>
> >>> If run outside SGE i can run my jobs just fine:
> >>> hpcp7781(salmr0)132:mpiexec -np 2 --machinefile x a.out
> >>> Process 0 on hpcp7781
> >>> Process 1 on hpcp7782
> >>> pi is approximately 3.1416009869231241, Error is
> 0.0809
> >>> wall clock time = 0.032325
> >>>
> >>>
> >>> If I submit to SGE I get this:
> >>>
> >>> [hpcp7781:08527] mca: base: components_open: Looking for plm
> >>> components
> >>> [hpcp7781:08527] mca: base: components_open: opening plm
> components
> >>> [hpcp7781:08527] mca: base: components_open: found loaded
> component
> >>> rsh
> >>> [hpcp7781:08527] mca: base: components_open: component rsh has no
> >>> register function
> >>> [hpcp7781:08527] mca: base: components_open: component rsh open
> >>> function successful
> >>> [hpcp7781:08527] mca: base: components_open: found loaded
> component
> >>> slurm
> >>> [hpcp7781:08527] mca: base: components_open: component slurm has
> no
> >>> register function
> >>> [hpcp7781:08527] mca: base: components_open: component slurm open
> >>> function successful
> >>> [hpcp7781:08527] mca:base:select: Auto-selecting plm components
> >>> [hpcp7781:08527] mca:base:select:(  plm) Querying component [rsh]
> >>> [hpcp7781:08527] [[INVALID],INVALID] plm:rsh: using /hpc/SGE/bin/
> >>> lx24-amd64/qrsh for launching
> >>> [hpcp7781:08527] mca:base:select:(  plm) Query of component [rsh]
> >>> set priority to 10
> >>> [hpcp7781:08527] mca:base:select:(  plm) Querying component
> [slurm]
> >>> [hpcp7781:08527] mca:base:select:(  plm) Skipping component
> >>> [slurm]. Query failed to return a module
> >>> [hpcp7781:08527] mca:base:select:(  plm) Selected component [rsh]
> >>> [hpcp7781:08527] mca: base: close: component slurm closed
> >>> [hpcp7781:08527] mca: base: close: unloading component slurm
> >>> Starting server daemon at host "hpcp7782"
> >>> error: executing task of job 1702026 failed:
> >>>
> >>
> -
> >> -
> >>> 
> >>> A daemon (pid 8528) died unexpectedly with status 1 while
> attempting
> >>> to launch so we are aborting.
> >>>
> >>> There may be more information reported by the environment (see
> >> above).
> >>>
> >>> This may be because the daemon was unable to find all the needed
> >>> shared
> >>> libraries on the remote node. You may set your LD_LIBRARY_PATH to
> >>> have the
> >>> location of the shared libraries on the remote nodes and this will
> >>> automatically be forwarded to the remote nodes.
> >>>
> >>
> -
> >> -
> >>> 
> >>>
> >

Re: [OMPI users] openmpi 1.3 and gridengine tight integration problem

2009-03-18 Thread Rolf Vandevaart

On 03/18/09 09:52, Reuti wrote:

Hi,

On 18.03.2009, at 14:25, Rene Salmon wrote:


Thanks for the help.  I only use the machine file to run outside of SGE
just to test/prove that things work outside of SGE.


aha. Did you compile Open MPI 1.3 with the SGE option?



When I run with in SGE here is what the job script looks like:

hpcp7781(salmr0)128:cat simple-job.sh
#!/bin/csh
#
#$ -S /bin/csh


-S will only work if the queue configuration is set to posix_compliant. 
If it's set to unix_behavior, the first line of the script is already 
sufficient.




setenv LD_LIBRARY_PATH /bphpc7/vol0/salmr0/ompi/lib


Maybe you have to set this LD_LIBRARY_PATH in your .cshrc, so it's known 
automatically on the nodes.



mpirun --mca plm_base_verbose 20 --prefix /bphpc7/vol0/salmr0/ompi -np
16 /bphpc7/vol0/salmr0/SGE/a.out


Do you use --mca... only for debugging or why is it added here?

-- Reuti




We are using PEs.  Here is what the PE looks like:

hpcp7781(salmr0)129:qconf -sp pavtest
pe_name   pavtest
slots 16
user_listsNONE
xuser_lists   NONE
start_proc_args   /bin/true
stop_proc_args/bin/true
allocation_rule   8
control_slavesFALSE
job_is_first_task FALSE
urgency_slots min


At this FAQ, we show an example of a parallel environment setup.
http://www.open-mpi.org/faq/?category=running#run-n1ge-or-sge

I am wondering if control_slaves needs to be TRUE.
Also double-check that the PE (pavtest) is on the list for the queue
(also mentioned in the FAQ).  And perhaps start by trying to run hostname
first.
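
A sketch of the change in question (the rest of the PE can stay as you have
it; the queue name below is a placeholder):

   qconf -mp pavtest
      control_slaves    TRUE

and verify the PE is attached to the queue, e.g. "qconf -sq your.q | grep pe_list".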


Rolf





here is he qsub line to submit the job:


qsub -pe pavtest 16 simple-job.sh



The job seems to run fine with no problems with in SGE if I contain the
job with in one node.  As soon as the job has to use more than one one
things stop working with the message I posted about LD_LIBRARY_PATH and
orted seems not to start on the remote nodes.

Thanks
Rene




On Wed, 2009-03-18 at 07:45 +, Reuti wrote:

Hi,

it shouldn't be necessary to supply a machinefile, as the one
generated by SGE is taken automatically (i.e. the granted nodes are
honored). You submitted the job requesting a PE?

-- Reuti


Am 18.03.2009 um 04:51 schrieb Salmon, Rene:



Hi,

I have looked through the list archives and google but could not
find anything related to what I am seeing. I am simply trying to
run the basic cpi.c code using SGE and tight integration.

If run outside SGE i can run my jobs just fine:
hpcp7781(salmr0)132:mpiexec -np 2 --machinefile x a.out
Process 0 on hpcp7781
Process 1 on hpcp7782
pi is approximately 3.1416009869231241, Error is 0.0809
wall clock time = 0.032325


If I submit to SGE I get this:

[hpcp7781:08527] mca: base: components_open: Looking for plm
components
[hpcp7781:08527] mca: base: components_open: opening plm components
[hpcp7781:08527] mca: base: components_open: found loaded component
rsh
[hpcp7781:08527] mca: base: components_open: component rsh has no
register function
[hpcp7781:08527] mca: base: components_open: component rsh open
function successful
[hpcp7781:08527] mca: base: components_open: found loaded component
slurm
[hpcp7781:08527] mca: base: components_open: component slurm has no
register function
[hpcp7781:08527] mca: base: components_open: component slurm open
function successful
[hpcp7781:08527] mca:base:select: Auto-selecting plm components
[hpcp7781:08527] mca:base:select:(  plm) Querying component [rsh]
[hpcp7781:08527] [[INVALID],INVALID] plm:rsh: using /hpc/SGE/bin/
lx24-amd64/qrsh for launching
[hpcp7781:08527] mca:base:select:(  plm) Query of component [rsh]
set priority to 10
[hpcp7781:08527] mca:base:select:(  plm) Querying component [slurm]
[hpcp7781:08527] mca:base:select:(  plm) Skipping component
[slurm]. Query failed to return a module
[hpcp7781:08527] mca:base:select:(  plm) Selected component [rsh]
[hpcp7781:08527] mca: base: close: component slurm closed
[hpcp7781:08527] mca: base: close: unloading component slurm
Starting server daemon at host "hpcp7782"
error: executing task of job 1702026 failed:


--


A daemon (pid 8528) died unexpectedly with status 1 while attempting
to launch so we are aborting.

There may be more information reported by the environment (see

above).


This may be because the daemon was unable to find all the needed
shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to
have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.


--




--


mpirun noticed that the job aborted, but has no info as to the

process

that caused that situation.


--


mpirun: clean termination accomplished

[hpcp7781:08527] mca: bas

[OMPI users] Fwd: New MPI-2.1 standard in hardcover - the yellow book

2009-03-18 Thread Jeff Squyres
I can't remember if I've forwarded this to the OMPI lists before;  
pardon if you have seen this before.  I have one of these books and I  
find it quite handy.  IMHO: it's quite a steal for US$25 (~600 pages).


Begin forwarded message:


From: "Rolf Rabenseifner" 
Date: March 18, 2009 10:21:31 AM EDT
Subject: New MPI-2.1 standard in hardcover - the yellow book

-
Please forward this announcement to any colleagues who may be
interested. Our apologies if you receive multiple copies.
-

Dear MPI user,

now, the new official MPI-2.1 standard (June 2008)
in hardcover can be shipped to anywhere in the world.

As a service (at costs) for users of the Message Passing Interface,
HLRS has printed the new Standard, Version 2.1, June 23, 2008
in hardcover. The price is only 17 Euro or 25 US-$.

You can find a picture of the book and the text of the standard at
http://www.mpi-forum.org/docs/docs.html
Selling & shipping of the book is done through

https://fs.hlrs.de/projects/par/mpi/mpi21/

It is the complete MPI in one book! This was one of the goals of  
MPI-2.1.

I was responsible for this project in the international MPI Forum.

HLRS is not a commercial publisher. Therefore, in the first months,
the book was only available at some conferences.
Now, the international shipping is organized through a web store:
https://fs.hlrs.de/projects/par/mpi/mpi21/

Only a limited number of books is available.

Best regards
Rolf Rabenseifner


-
Dr. Rolf Rabenseifner .. . . . . . . . . . email rabenseif...@hlrs.de
High Performance Computing Center (HLRS) . phone ++49(0)711/685-65530
University of Stuttgart .. . . . . . . . . fax : ++49(0)711/685-65832
Head of Dpmt Parallel Computing .. .. www.hlrs.de/people/rabenseifner
Nobelstr. 19, D-70550 Stuttgart, Germany . . (Office: Allmandring 30)
-




--
Jeff Squyres
Cisco Systems



Re: [OMPI users] openmpi 1.3 and gridengine tight integration problem

2009-03-18 Thread Rene Salmon

> 
> At this FAQ, we show an example of a parallel environment setup.
> http://www.open-mpi.org/faq/?category=running#run-n1ge-or-sge
> 
> I am wondering if the control_slaves needs to be TRUE.
> And double check the that the PE (pavtest) is on the list for the
> queue
> (also mentioned at the FAQ).  And perhaps start trying to run hostname
> first.


Changing control_slaves to true did not make things work, but it did
provide me with a bit more info on this -- enough to figure things out.
Now when I run I get a message about "rcmd: socket: Permission denied":



Starting server daemon at host "hpcp7782"
Server daemon successfully started with task id "1.hpcp7782"
Establishing /hpc/SGE/utilbin/lx24-amd64/rsh session to host
hpcp7782 ...
rcmd: socket: Permission denied
/hpc/SGE/utilbin/lx24-amd64/rsh exited with exit code 1
reading exit code from shepherd ... timeout (60 s) expired while waiting
on socket fd 4
error: error reading returncode of remote command
--
A daemon (pid 31961) died unexpectedly with status 255 while attempting
to launch so we are aborting.

There may be more information reported by the environment (see above).

This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have
the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
--
--
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--
mpirun: clean termination accomplished

[hpcp7781:31960] mca: base: close: component rsh closed
[hpcp7781:31960] mca: base: close: unloading component rsh




So it turns out the NFS mount for SGE on the clients had the "nosuid"
option set, which does not allow the SGE qrsh/rsh binaries to run because
they are setuid.  Got rid of the "nosuid" and now things work just fine.
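
For anyone hitting the same thing, something like this shows and clears the
option (the mount point here is just an example):

   mount | grep SGE                  # look for "nosuid" in the options
   mount -o remount,suid /hpc/SGE

and the corresponding /etc/fstab entry should lose "nosuid" as well so the
fix survives a reboot.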

Thank you for the help

Rene



Re: [OMPI users] open mpi on non standard ssh port

2009-03-18 Thread Jeff Squyres

FWIW, two other people said the same thing already:

http://www.open-mpi.org/community/lists/users/2009/03/8479.php
http://www.open-mpi.org/community/lists/users/2009/03/8481.php

:-)


On Mar 18, 2009, at 4:51 AM, Reuti wrote:


Bernhard,

On 18.03.2009, at 09:19, Bernhard Knapp wrote:

> come on, it must be somehow possible to use openmpi not on port
> 22!? ;-)

it's not an issue of Open MPI but ssh. You need in your home a file
~/.ssh/config with two lines:

host *
port 1234

or whatever port you need.

-- Reuti


>>
>> --
>>
>> Message: 3
>> Date: Tue, 17 Mar 2009 09:45:29 +0100
>> From: Bernhard Knapp 
>> Subject: [OMPI users] open mpi on non standard ssh port
>> To: us...@open-mpi.org
>> Message-ID: <49bf6329.8090...@meduniwien.ac.at>
>> Content-Type: text/plain; charset=ISO-8859-1; format=flowed
>>
>> Hi
>>
>> I want to start a gromacs simulation on a small cluster where non
>> standard ports are used for ssh. If I just use a "normal"
>> maschinelist file (with the ips of the nodes), consequently, the
>> following error comes up:
>> ssh: connect to host 192.168.0.103 port 22: Connection refused
>>
>> I guess that I need to somehow tell him to use the other ports. I
>> tried it in the following way (maschinelist file):
>> 192.168.0.101 -p 5101
>> 192.168.0.102 -p 5102
>> 192.168.0.103 -p 5103
>> 192.168.0.104 -p 5104
>>
>> But it seems this is not the correct way to specifiy the port:
>> Open RTE detected a parse error in the hostfile:
>>/home/bknapp/scripts/machinefile.txt
>> It occured on line number 1 on token 5:
>>-p
>>
>> How can I tell him to use port 5101 on machine 192.168.0.101?
>> May be the question is stupid but I could not find a solution via
>> google or search function ...
>>
>> cheers
>> Bernhard
>>
>>
>>



--
Jeff Squyres
Cisco Systems



Re: [OMPI users] [Fwd: Re: open mpi on non standard ssh port]

2009-03-18 Thread Jeff Squyres
It means you started the jobs OK (via ssh), but Open MPI wasn't able to
open TCP sockets between the two MPI processes.  Open MPI needs to be
able to communicate via random TCP ports between its MPI processes.
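
If a firewall on the nodes is blocking those connections, one sketch of a
workaround (assuming the 192.168.0.0/24 addresses from your machinefile) is
to allow TCP between the cluster nodes:

   iptables -A INPUT -p tcp -s 192.168.0.0/24 -j ACCEPT

or alternatively restrict Open MPI to a fixed port range and open only that
range in the firewall.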



On Mar 18, 2009, at 8:39 AM, Bernhard Knapp wrote:


Hey again,

I tried to build a work around  via port redirection: iptables -t  
nat -A PREROUTING -i eth1 -p tcp --dport 22 -j REDIRECT --to-port 5101



If I do that then I can start the job:

 mpirun -np 2 -machinefile /home/bknapp/scripts/machinefile.txt  
mdrun -np 2 -nice 0 -s 1fyt_PKYVKQNTLELAT_bindingRegionsOnly.md.tpr - 
o 1fyt_PKYVKQNTLELAT_bindingRegionsOnly.md.trr -c  
1fyt_PKYVKQNTLELAT_bindingRegionsOnly.md.pdb -g  
1fyt_PKYVKQNTLELAT_bindingRegionsOnly.md.log -e  
1fyt_PKYVKQNTLELAT_bindingRegionsOnly.md.edr -v

bknapp@192.168.0.104's password:
NNODES=2, MYRANK=0, HOSTNAME=quoVadis01
NNODES=2, MYRANK=1, HOSTNAME=quoVadis04

but it comes up with "[quoVadis01][[24802,1],0][btl_tcp_endpoint.c: 
631:mca_btl_tcp_endpoint_complete_connect] connect() failed: No  
route to host (113)". The CPUs are calculating on both (physically  
different machines) but unfortunately no results are written ...


Was the port redirection of 22 not enough or is there another problem?

thx
Bernhard





 Original Message 
Subject: Re: open mpi on non standard ssh port
Date: Wed, 18 Mar 2009 09:19:18 +0100
From: Bernhard Knapp 
To: us...@open-mpi.org
References:



come on, it must be somehow possible to use openmpi not on port  
22!? ;-)


>
>--
>
>Message: 3
>Date: Tue, 17 Mar 2009 09:45:29 +0100
>From: Bernhard Knapp 
>Subject: [OMPI users] open mpi on non standard ssh port
>To: us...@open-mpi.org
>Message-ID: <49bf6329.8090...@meduniwien.ac.at>
>Content-Type: text/plain; charset=ISO-8859-1; format=flowed
>
>Hi
>
>I want to start a gromacs simulation on a small cluster where non
>standard ports are used for ssh. If I just use a "normal"  
maschinelist

>file (with the ips of the nodes), consequently, the following error
>comes up:
>ssh: connect to host 192.168.0.103 port 22: Connection refused
>
>I guess that I need to somehow tell him to use the other ports. I  
tried

>it in the following way (maschinelist file):
>192.168.0.101 -p 5101
>192.168.0.102 -p 5102
>192.168.0.103 -p 5103
>192.168.0.104 -p 5104
>
>But it seems this is not the correct way to specifiy the port:
>Open RTE detected a parse error in the hostfile:
>/home/bknapp/scripts/machinefile.txt
>It occured on line number 1 on token 5:
>-p
>
>How can I tell him to use port 5101 on machine 192.168.0.101?
>May be the question is stupid but I could not find a solution via  
google

>or search function ...
>
>cheers
>Bernhard
>
>
>
>


--
Dipl.-Ing. (FH) Bernhard Knapp
Univ.-Ass.postgrad.
Unit for Medical Statistics and Informatics - Section for Biomedical  
Computersimulation and Bioinformatics

Medical University of Vienna - General Hospital
Spitalgasse 23 A-1090 WIEN / AUSTRIA
Room: BT88 - 88.03.712
Phone: +43(1) 40400-6673



--
Jeff Squyres
Cisco Systems



[OMPI users] selected pml cm, but peer [[2469, 1], 0] on compute-0-0 selected pml ob1

2009-03-18 Thread Gary Draving

Hi all,

Has anyone ever seen an error like this? It seems like I have some setting
wrong in Open MPI.  I thought I had it set up like the other machines, but it
seems as though I have missed something. I only get the error when
adding machine "fs1" to the hostfile list.  The other 40+ machines seem
fine.


[fs1.calvin.edu:01750] [[2469,1],6] selected pml cm, but peer 
[[2469,1],0] on compute-0-0 selected pml ob1


When I use ompi_info, the output looks like that of my other machines:

[root@fs1 openmpi-1.3]# ompi_info | grep btl
MCA btl: ofud (MCA v2.0, API v2.0, Component v1.3)
MCA btl: openib (MCA v2.0, API v2.0, Component v1.3)
MCA btl: self (MCA v2.0, API v2.0, Component v1.3)
MCA btl: sm (MCA v2.0, API v2.0, Component v1.3)

The whole error is below; any help would be greatly appreciated.

Gary

[admin@dahl 00.greetings]$ /usr/local/bin/mpirun --mca btl ^tcp 
--hostfile machines -np 7 greetings
[fs1.calvin.edu:01959] [[2212,1],6] selected pml cm, but peer 
[[2212,1],0] on compute-0-0 selected pml ob1

--
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

 PML add procs failed
 --> Returned "Unreachable" (-12) instead of "Success" (0)
--
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
[fs1.calvin.edu:1959] Abort before MPI_INIT completed successfully; not 
able to guarantee that all other processes were killed!

--
At least one pair of MPI processes are unable to reach each other for
MPI communications.  This means that no Open MPI device has indicated
that it can be used to communicate between these processes.  This is
an error; Open MPI requires that all MPI processes be able to reach
each other.  This error can sometimes be the result of forgetting to
specify the "self" BTL.

 Process 1 ([[2212,1],3]) is on host: dahl.calvin.edu
 Process 2 ([[2212,1],0]) is on host: compute-0-0
 BTLs attempted: openib self sm

Your MPI job is now going to abort; sorry.
--
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
[dahl.calvin.edu:16884] Abort before MPI_INIT completed successfully; 
not able to guarantee that all other processes were killed!

*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
[compute-0-0.local:1591] Abort before MPI_INIT completed successfully; 
not able to guarantee that all other processes were killed!

*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
[fs2.calvin.edu:8826] Abort before MPI_INIT completed successfully; not 
able to guarantee that all other processes were killed!

--
mpirun has exited due to process rank 3 with PID 16884 on
node dahl.calvin.edu exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--
[dahl.calvin.edu:16879] 3 more processes have sent help message 
help-mpi-runtime / mpi_init:startup:internal-failure
[dahl.calvin.edu:16879] Set MCA parameter "orte_base_help_aggregate" to 
0 to see all help / error messages
[dahl.calvin.edu:16879] 2 more processes have sent help message 
help-mca-bml-r2.txt / unreachable proc
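
One thing worth checking (a sketch, assuming fs1 has an extra interconnect
stack such as PSM or MX available that makes it prefer the cm PML): compare
the pml/mtl components on fs1 and the other nodes, and try forcing the same
PML everywhere:

   ompi_info | grep -E 'pml|mtl'
   /usr/local/bin/mpirun --mca pml ob1 --mca btl ^tcp --hostfile machines -np 7 greetings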