Hi Jeff,

I didn't have much time to test this morning, so I checked it again just
now. The trouble seems to depend on the number of nodes used.

This works (fewer than 4 nodes):
mpiexec -bynode -np 4 ./my_program   with   #PBS -l nodes=2:ppn=8
(OMP_NUM_THREADS=4)

This causes an error (4 or more nodes):
mpiexec -bynode -np 8 ./my_program   with   #PBS -l nodes=4:ppn=8
(OMP_NUM_THREADS=4)
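
For reference, here is a minimal sketch of the kind of job script I submit
for the failing case. The walltime and the cd line are just placeholders in
this sketch; the nodes/ppn request, OMP_NUM_THREADS, and the mpiexec line
are the ones shown above.

#!/bin/bash
#PBS -l nodes=4:ppn=8         # 4 nodes x 8 cores each (the failing case)
#PBS -l walltime=00:10:00     # placeholder walltime
cd $PBS_O_WORKDIR             # run from the submission directory

# 4 OpenMP threads per MPI rank; -bynode places the 8 ranks round-robin
# across the 4 nodes, i.e. 2 ranks x 4 threads = 8 cores per node.
export OMP_NUM_THREADS=4
mpiexec -bynode -np 8 ./my_program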

Regards,
Tetsuya Mishima

> Oy; that's weird.
>
> I'm afraid we're going to have to wait for Ralph to answer why that is
> happening -- sorry!
>
>
> On Mar 18, 2013, at 4:45 PM, <tmish...@jcity.maeda.co.jp> wrote:
>
> >
> >
> > Hi Correa and Jeff,
> >
> > Thank you for your comments. I quickly checked your suggestion.
> >
> > As a result, my simple example case worked well.
> > export OMP_NUM_THREADS=4
> > mpiexec -bynode -np 2 ./my_program   &&   #PBS -l nodes=2:ppn=4
> >
> > But a practical case where more than one process was allocated to a
> > node, like the one below, did not work.
> > export OMP_NUM_THREADS=4
> > mpiexec -bynode -np 4 ./my_program   &&   #PBS -l nodes=2:ppn=8
> >
> > The error message is as follows:
> > [node08.cluster:11946] [[30666,0],3] ORTE_ERROR_LOG: A message is
> > attempting to be sent to a process whose contact information is unknown
> > in file rml_oob_send.c at line 316
> > [node08.cluster:11946] [[30666,0],3] unable to find address for
> > [[30666,0],1]
> > [node08.cluster:11946] [[30666,0],3] ORTE_ERROR_LOG: A message is
> > attempting to be sent to a process whose contact information is unknown
> > in file base/grpcomm_base_rollup.c at line 123
> >
> > Here is our openmpi configuration:
> > ./configure \
> > --prefix=/home/mishima/opt/mpi/openmpi-1.7rc8-pgi12.9 \
> > --with-tm \
> > --with-verbs \
> > --disable-ipv6 \
> > CC=pgcc CFLAGS="-fast -tp k8-64e" \
> > CXX=pgCC CXXFLAGS="-fast -tp k8-64e" \
> > F77=pgfortran FFLAGS="-fast -tp k8-64e" \
> > FC=pgfortran FCFLAGS="-fast -tp k8-64e"
> >
> > Regards,
> > Tetsuya Mishima
> >
> >> On Mar 17, 2013, at 10:55 PM, Gustavo Correa <g...@ldeo.columbia.edu>
> >> wrote:
> >>
> >>> In your example, have you tried not modifying the node file, launching
> >>> two MPI processes with mpiexec, and requesting a "-bynode" distribution
> >>> of processes:
> >>>
> >>> mpiexec -bynode -np 2 ./my_program
> >>
> >> This should work in 1.7, too (I use these kinds of options with SLURM
> >> all the time).
> >>
> >> However, we should probably verify that the hostfile functionality in
> >> batch jobs hasn't been broken in 1.7, too, because I'm pretty sure that
> >> what you described should work.  However, Ralph, our run-time guy, is
> >> on vacation this week.  There might be a delay in checking into this.
> >>
>
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
>
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
