Hi Ralph,

Thanks a lot for the fast response.
Could you give me more instructions on which command I should put "--display-allocation" and "--display-map" with? mpirun? ./configure? ...

Also, we have tested that in our PBS script, if we put node=1, the helloworld works. But when I put node=2 or more, it hangs until timeout, and the error message is something like:

node0006 - daemon did not report back when launched

However, if we don't go through the scheduler and run MPI manually, everything works fine too:

/home/software/ompi/1.3.2-pgi/bin/mpirun -machinefile ./nodes -np 16 ./a.out

What do you think the problem would be? It's not a network issue, because manually running MPI works. That is why we are questioning the Torque compatibility.

Thanks again,

Kai
--------------------
Kai Song
<ks...@lbl.gov> 1.510.486.4894
High Performance Computing Services (HPCS) Intern
Lawrence Berkeley National Laboratory - http://scs.lbl.gov

----- Original Message -----
From: Ralph Castain <r...@open-mpi.org>
List-Post: users@lists.open-mpi.org
Date: Tuesday, July 21, 2009 12:12 pm
Subject: Re: [OMPI users] Open-MPI-1.3.2 compatibility with old torque?
To: Open MPI Users <us...@open-mpi.org>

> I'm afraid I have no idea - I've never seen a Torque version that old,
> however, so it is quite possible that we don't work with it. It also looks
> like it may have been modified (given the p2-aspen3 on the end), so I have
> no idea how the system would behave.
>
> First thing you could do is verify that the allocation is being read
> correctly. Add a --display-allocation to the cmd line and see what we think
> Torque gave us. Then add --display-map to see where it plans to place the
> processes.
>
> If all that looks okay, and if you allow ssh, then try -mca plm rsh on the
> cmd line and see if that works.
>
> HTH
> Ralph
>
>
> On Tue, Jul 21, 2009 at 12:57 PM, Song, Kai Song <ks...@lbl.gov> wrote:
> > Hi All,
> >
> > I am building open-mpi-1.3.2 on centos-3.4, with torque-1.1.0p2-aspen3 and
> > myrinet.
> > I compiled it just fine with this configuration:
> >
> > ./configure --prefix=/home/software/ompi/1.3.2-pgi --with-gm=/usr/local/
> > --with-gm-libdir=/usr/local/lib64/ --enable-static --disable-shared
> > --with-tm=/usr/ --without-threads CC=pgcc CXX=pgCC FC=pgf90 F77=pgf77
> > LDFLAGS=-L/usr/lib64/torque/
> >
> > However, when I submit jobs for 2 or more nodes through the torque
> > scheduler, the jobs just hang here. It shows the RUN state, but no
> > communication between the nodes, then jobs will die with timeout.
> >
> > We have confirmed that the myrinet is working because our lam-mpi-7.1 works
> > just fine. We are having a really hard time determining what are the causes
> > for this problem. So, we suspect it's because our torque is too old.
> >
> > What is the lowest version requirement of torque for open-mpi-1.3.2? The
> > README file didn't specify this detail. Does anyone know more about it?
> >
> > Thanks in advance,
> >
> > Kai
> > --------------------
> > Kai Song
> > <ks...@lbl.gov> 1.510.486.4894
> > High Performance Computing Services (HPCS) Intern
> > Lawrence Berkeley National Laboratory - http://scs.lbl.gov
> >
> > _______________________________________________
> > users mailing list
> > us...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
> >
>