I'm afraid I have no idea. I've never seen a Torque version that old,
however, so it is quite possible that we don't work with it. It also looks
like it may have been modified (given the p2-aspen3 on the end), so I can't
say how the system would behave.

The first thing you could do is verify that the allocation is being read
correctly. Add --display-allocation to the mpirun command line and see what
we think Torque gave us. Then add --display-map to see where it plans to
place the processes.
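
As a rough sketch, a job script for that check might look like the following.
The application name (./my_mpi_app), the process count, and the node request
are placeholders rather than anything from your setup, and the script assumes
this Torque build sets PBS_NODEFILE the way current versions do.

```shell
#!/bin/sh
#PBS -l nodes=2
# Placeholder values (assumptions): substitute your real binary and rank count.
APP=./my_mpi_app
NP=2

# Print the node list Torque handed to the job, if the variable is set,
# so it can be compared with what Open MPI reports below.
if [ -n "$PBS_NODEFILE" ] && [ -r "$PBS_NODEFILE" ]; then
    cat "$PBS_NODEFILE"
fi

# --display-allocation shows the nodes/slots Open MPI believes it was given;
# --display-map shows where it intends to place each process.
CMD="mpirun -np $NP --display-allocation --display-map $APP"
echo "Running: $CMD"

# Launch only when mpirun is on PATH; report a failure instead of aborting.
if command -v mpirun >/dev/null 2>&1; then
    $CMD || echo "mpirun exited with status $?"
fi
```

If the allocation printed by mpirun is empty or does not match the node file,
the problem is probably in how the Torque allocation is being read rather
than in the launch itself.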

If all that looks okay, and if you allow ssh between the nodes, then try
-mca plm rsh on the command line (this launches the remote daemons over
rsh/ssh instead of through Torque) and see if that works.
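
A minimal sketch of that second test, again with a placeholder application
name and process count that you would need to replace with your own:

```shell
#!/bin/sh
# Placeholder values (assumptions): substitute your real binary and rank count.
APP=./my_mpi_app
NP=2

# Select the rsh/ssh launcher explicitly instead of the Torque (tm) one.
CMD="mpirun -np $NP -mca plm rsh $APP"
echo "Running: $CMD"

# Launch only when mpirun is on PATH; report a failure instead of aborting.
if command -v mpirun >/dev/null 2>&1; then
    $CMD || echo "mpirun exited with status $?"
fi
```

If the job runs this way but hangs with the default launcher, that points at
the Torque launch path rather than at Myrinet or the application.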

HTH
Ralph


On Tue, Jul 21, 2009 at 12:57 PM, Song, Kai Song <ks...@lbl.gov> wrote:

> Hi All,
>
> I am building open-mpi-1.3.2 on centos-3.4, with torque-1.1.0p2-aspen3 and
> myrinet. I compiled it just fine with this configuration:
> ./configure --prefix=/home/software/ompi/1.3.2-pgi --with-gm=/usr/local/
> --with-gm-libdir=/usr/local/lib64/ --enable-static --disable-shared
> --with-tm=/usr/ --without-threads CC=pgcc CXX=pgCC FC=pgf90 F77=pgf77
> LDFLAGS=-L/usr/lib64/torque/
>
> However, when I submit jobs for 2 or more nodes through the torque
> scheduler, the jobs just hang. They show the RUN state, but there is no
> communication between the nodes, and then the jobs die with a timeout.
>
> We have confirmed that the myrinet is working because our lam-mpi-7.1 works
> just fine. We are having a really hard time determining the cause of this
> problem, so we suspect it's because our torque is too old.
>
> What is the minimum torque version required for open-mpi-1.3.2? The
> README file doesn't specify this detail. Does anyone know more about it?
>
> Thanks in advance,
>
> Kai
> --------------------
> Kai Song
> <ks...@lbl.gov> 1.510.486.4894
> High Performance Computing Services (HPCS) Intern
> Lawrence Berkeley National Laboratory - http://scs.lbl.gov
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
