Hi Tetsuya Mishima,

Mpiexec offers you a number of possibilities that you could try:
--bynode,
--pernode,
--npernode,
--bysocket,
--bycore,
--cpus-per-proc,
--cpus-per-rank,
--rankfile
and more.

Most likely one or more of them will fit your needs.

There are also associated flags to bind processes to cores,
to sockets, etc., to report the bindings, and so on.

Check the mpiexec man page for details.
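
For instance, for a hybrid MPI+OpenMP job with 4 threads per rank,
something along these lines might be what you want (a rough sketch
using the 1.6 option names; double-check the exact spelling in the
1.7 man page, as I haven't tried this on 1.7 myself):

export OMP_NUM_THREADS=4
mpiexec -np 8 --npernode 2 --cpus-per-proc 4 \
    --bind-to-core --report-bindings ./my_program

That asks for 2 ranks per node with 4 cores each, and prints the
resulting bindings so you can verify the layout.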

Nevertheless, I am surprised that modifying the
$PBS_NODEFILE doesn't work for you in OMPI 1.7.
I have done this many times in older versions of OMPI.
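
A rough sketch of the kind of thing I mean (here just keeping one
entry per node and passing a trimmed copy to mpiexec with -hostfile,
rather than editing $PBS_NODEFILE in place; the file name is my own
choice):

sort -u $PBS_NODEFILE > hostfile.$PBS_JOBID
mpiexec -hostfile hostfile.$PBS_JOBID -np 2 ./my_program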

Would it work for you to go back to the stable OMPI 1.6.X,
or does it lack any special feature that you need?

I hope this helps,
Gus Correa

On 03/19/2013 03:00 AM, tmish...@jcity.maeda.co.jp wrote:


Hi Jeff,

I didn't have much time to test this morning, so I checked it again
now. The trouble seems to depend on the number of nodes used.

This works (nodes < 4):
mpiexec -bynode -np 4 ./my_program     (with #PBS -l nodes=2:ppn=8)
(OMP_NUM_THREADS=4)

This causes an error (nodes >= 4):
mpiexec -bynode -np 8 ./my_program     (with #PBS -l nodes=4:ppn=8)
(OMP_NUM_THREADS=4)

Regards,
Tetsuya Mishima

Oy; that's weird.

I'm afraid we're going to have to wait for Ralph to answer why that is
happening -- sorry!


On Mar 18, 2013, at 4:45 PM, <tmish...@jcity.maeda.co.jp> wrote:



Hi Correa and Jeff,

Thank you for your comments. I quickly checked your suggestion.

As a result, my simple example case worked well:
export OMP_NUM_THREADS=4
mpiexec -bynode -np 2 ./my_program     (with #PBS -l nodes=2:ppn=4)

But a practical case, where more than one process is allocated to a
node as below, did not work:
export OMP_NUM_THREADS=4
mpiexec -bynode -np 4 ./my_program     (with #PBS -l nodes=2:ppn=8)

The error message is as follows:
[node08.cluster:11946] [[30666,0],3] ORTE_ERROR_LOG: A message is attempting to be sent to a process whose contact information is unknown in file rml_oob_send.c at line 316
[node08.cluster:11946] [[30666,0],3] unable to find address for [[30666,0],1]
[node08.cluster:11946] [[30666,0],3] ORTE_ERROR_LOG: A message is attempting to be sent to a process whose contact information is unknown in file base/grpcomm_base_rollup.c at line 123

Here is our Open MPI configuration:
./configure \
--prefix=/home/mishima/opt/mpi/openmpi-1.7rc8-pgi12.9 \
--with-tm \
--with-verbs \
--disable-ipv6 \
CC=pgcc CFLAGS="-fast -tp k8-64e" \
CXX=pgCC CXXFLAGS="-fast -tp k8-64e" \
F77=pgfortran FFLAGS="-fast -tp k8-64e" \
FC=pgfortran FCFLAGS="-fast -tp k8-64e"

Regards,
Tetsuya Mishima

On Mar 17, 2013, at 10:55 PM, Gustavo Correa <g...@ldeo.columbia.edu>
wrote:

In your example, have you tried not modifying the node file, but
instead launching two MPI processes with mpiexec and requesting a
"-bynode" distribution of processes:

mpiexec -bynode -np 2 ./my_program

This should work in 1.7, too (I use these kinds of options with SLURM
all the time).

However, we should probably verify that the hostfile functionality in
batch jobs hasn't been broken in 1.7, too, because I'm pretty sure that
what you described should work.  Unfortunately, Ralph, our run-time
guy, is on vacation this week, so there might be a delay in looking
into this.

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/



