Hi Ralph,

Thanks a lot for the fast response.

Could you give me more detail on which command the "--display-allocation" 
and "--display-map" options go with? mpirun? ./configure?

Also, we have tested that in our PBS script, if we put node=1, the helloworld 
works. But when we put node=2 or more, it hangs until it times out, and the 
error message is something like:
 node0006 - daemon did not report back when launched

However, if we don't go through the scheduler and run MPI manually, everything 
works fine:
/home/software/ompi/1.3.2-pgi/bin/mpirun -machinefile ./nodes -np 16 ./a.out
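
For concreteness, the two launch paths being compared can be sketched as
below. This is only an illustration, not our actual job script: the
"#PBS -l nodes=2" request, the use of Torque's PBS_NODEFILE variable, and
the helper name unique_hosts are assumptions. The mpirun path, ./nodes, and
-np 16 come from the manual run above. The commands are printed rather than
executed.

```shell
#!/bin/sh
#PBS -l nodes=2
# (assumed resource request; the real job script is not shown in this thread)

# Path of the Open MPI 1.3.2 mpirun used in the manual run.
MPIRUN=/home/software/ompi/1.3.2-pgi/bin/mpirun

# Hypothetical helper: print the number of distinct hosts in a machinefile.
unique_hosts() {
    sort -u "$1" | grep -c .
}

# Manual launch: hosts come from a hand-written machinefile.
manual_cmd="$MPIRUN -machinefile ./nodes -np 16 ./a.out"

# Scheduler launch: in a build configured with --with-tm, mpirun asks Torque
# for the allocation, so no machinefile is passed.
torque_cmd="$MPIRUN -np 16 ./a.out"

echo "manual: $manual_cmd"
echo "torque: $torque_cmd"

# Inside a job, Torque sets PBS_NODEFILE; report how many hosts it granted.
if [ -n "${PBS_NODEFILE:-}" ] && [ -r "$PBS_NODEFILE" ]; then
    echo "hosts in allocation: $(unique_hosts "$PBS_NODEFILE")"
fi
```

Comparing the host count from PBS_NODEFILE with the hosts listed in ./nodes
would at least show whether both paths are targeting the same machines.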

What do you think the problem could be? It's not a network issue, because 
manually running MPI works. That is why we are questioning Torque compatibility.
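
Assuming the flags belong on the mpirun command line inside the PBS job
(rather than on ./configure), the checks suggested in the quoted message
below would look roughly like this sketch. The binary path and
"-np 16 ./a.out" are taken from the manual run above; the variable names are
illustrative only, and the commands are printed rather than executed so they
can be reviewed first.

```shell
#!/bin/sh
# mpirun binary and application arguments from the manual run in this message.
MPIRUN=/home/software/ompi/1.3.2-pgi/bin/mpirun
APP="-np 16 ./a.out"

# Step 1: show the allocation mpirun believes Torque provided.
alloc_cmd="$MPIRUN --display-allocation $APP"

# Step 2: additionally show where each process would be placed.
map_cmd="$MPIRUN --display-allocation --display-map $APP"

# Step 3: if ssh between nodes is allowed, launch the daemons via rsh/ssh
# instead of Torque's tm interface.
rsh_cmd="$MPIRUN -mca plm rsh $APP"

# Print the three command lines for review.
for cmd in "$alloc_cmd" "$map_cmd" "$rsh_cmd"; do
    echo "$cmd"
done
```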

Thanks again,

Kai

--------------------
Kai Song
<ks...@lbl.gov> 1.510.486.4894
High Performance Computing Services (HPCS) Intern
Lawrence Berkeley National Laboratory - http://scs.lbl.gov


----- Original Message -----
From: Ralph Castain <r...@open-mpi.org>
List-Post: users@lists.open-mpi.org
Date: Tuesday, July 21, 2009 12:12 pm
Subject: Re: [OMPI users] Open-MPI-1.3.2 compatibility with old torque?
To: Open MPI Users <us...@open-mpi.org>

> I'm afraid I have no idea - I've never seen a Torque version that old,
> however, so it is quite possible that we don't work with it. It also looks
> like it may have been modified (given the p2-aspen3 on the end), so I have
> no idea how the system would behave.
> 
> First thing you could do is verify that the allocation is being read
> correctly. Add a --display-allocation to the cmd line and see what we think
> Torque gave us. Then add --display-map to see where it plans to place the
> processes.
> 
> If all that looks okay, and if you allow ssh, then try -mca plm rsh on the
> cmd line and see if that works.
> 
> HTH
> Ralph
> 
> 
> On Tue, Jul 21, 2009 at 12:57 PM, Song, Kai Song <ks...@lbl.gov> wrote:
> > Hi All,
> >
> > I am building open-mpi-1.3.2 on centos-3.4, with torque-1.1.0p2-aspen3 and
> > myrinet. I compiled it just fine with this configuration:
> > ./configure --prefix=/home/software/ompi/1.3.2-pgi --with-gm=/usr/local/
> > --with-gm-libdir=/usr/local/lib64/ --enable-static --disable-shared
> > --with-tm=/usr/ --without-threads CC=pgcc CXX=pgCC FC=pgf90 F77=pgf77
> > LDFLAGS=-L/usr/lib64/torque/
> >
> > However, when I submit jobs for 2 or more nodes through the torque
> > scheduler, the jobs just hang here. It shows the RUN state, but no
> > communication between the nodes, then jobs will die with timeout.
> >
> > We have confirmed that the myrinet is working because our lam-mpi-7.1 works
> > just fine. We are having a really hard time determining what are the causes
> > for this problem. So, we suspect it's because our torque is too old.
> >
> > What is the lowest version requirement of torque for open-mpi-1.3.2? The
> > README file didn't specify this detail. Does anyone know more about it?
> >
> > Thanks in advance,
> >
> > Kai
> > --------------------
> > Kai Song
> > <ks...@lbl.gov> 1.510.486.4894
> > High Performance Computing Services (HPCS) Intern
> > Lawrence Berkeley National Laboratory - http://scs.lbl.gov
> >
