I'm afraid I have no idea. I've never seen a Torque version that old, though, so it is quite possible that we don't work with it. It also looks like it may have been modified (given the p2-aspen3 on the end), so I can't say how the system would behave.
The first thing you could do is verify that the allocation is being read correctly. Add --display-allocation to the cmd line and see what we think Torque gave us. Then add --display-map to see where it plans to place the processes.

If all that looks okay, and if you allow ssh, then try -mca plm rsh on the cmd line and see if that works.

HTH
Ralph

On Tue, Jul 21, 2009 at 12:57 PM, Song, Kai Song <ks...@lbl.gov> wrote:

> Hi All,
>
> I am building open-mpi-1.3.2 on centos-3.4, with torque-1.1.0p2-aspen3 and
> myrinet. I compiled it just fine with this configuration:
> ./configure --prefix=/home/software/ompi/1.3.2-pgi --with-gm=/usr/local/
> --with-gm-libdir=/usr/local/lib64/ --enable-static --disable-shared
> --with-tm=/usr/ --without-threads CC=pgcc CXX=pgCC FC=pgf90 F77=pgf77
> LDFLAGS=-L/usr/lib64/torque/
>
> However, when I submit jobs for 2 or more nodes through the torque
> scheduler, the jobs just hang. They show the RUN state, but there is no
> communication between the nodes, and then the jobs die with a timeout.
>
> We have confirmed that the myrinet is working because our lam-mpi-7.1 works
> just fine. We are having a really hard time determining the cause of
> this problem. So, we suspect it's because our torque is too old.
>
> What is the lowest version requirement of torque for open-mpi-1.3.2? The
> README file didn't specify this detail. Does anyone know more about it?
>
> Thanks in advance,
>
> Kai
> --------------------
> Kai Song
> <ks...@lbl.gov> 1.510.486.4894
> High Performance Computing Services (HPCS) Intern
> Lawrence Berkeley National Laboratory - http://scs.lbl.gov
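For reference, the three checks suggested in the reply could be laid out roughly as below. This is only a sketch: the application name ./my_mpi_app and the process count of 4 are placeholders that do not come from the original post, and MPIRUN is set to "echo mpirun" so the script just prints the command lines instead of launching anything.

```shell
#!/bin/sh
# Placeholders (assumptions, not from the original post):
APP=./my_mpi_app        # hypothetical MPI executable
NP=4                    # hypothetical process count

# Dry run by default: the commands are printed, not executed.
# Change this to MPIRUN=mpirun inside a Torque job to really run them.
MPIRUN="echo mpirun"

# Step 1: show the allocation mpirun believes Torque handed over.
check_allocation() {
    $MPIRUN --display-allocation -np "$NP" "$APP"
}

# Step 2: show where mpirun plans to place each process.
check_map() {
    $MPIRUN --display-map -np "$NP" "$APP"
}

# Step 3: launch over rsh/ssh instead of the Torque (tm) launcher.
check_rsh() {
    $MPIRUN -mca plm rsh -np "$NP" "$APP"
}

check_allocation
check_map
check_rsh
```

If the first two steps report the expected nodes and placement and the third one runs where the default launch hangs, that would point at the Torque launch path rather than at the Myrinet setup.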