On 6/1/2015 2:45 PM, Nathan Hjelm wrote:
It looks to me like the default queue pair settings differ between the two
systems. You can try setting the mca_btl_openib_receive_queues variable by
hand. If this is InfiniBand, I recommend not using any per-peer queue pairs
and using something like:

This is iWARP. The QP settings differ because one machine is interpreting the device ID in the wrong byte order. I think we tried manually overriding the RQ settings, but still no dice.



S,2048,1024,1008,64:S,12288,1024,1008,64:S,65536,1024,1008,64
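For example (a sketch: the --mca name drops the "mca_" prefix, and "..." stands
for the rest of your normal mpirun command line), that value could be passed as:

mpirun --mca btl_openib_receive_queues \
    S,2048,1024,1008,64:S,12288,1024,1008,64:S,65536,1024,1008,64 ...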

Otherwise, use either of the settings that were printed out:

P,128,256,192,128:S,2048,1024,1008,64:S,12288,1024,1008,64:S,65536,1024,1008,64

or

P,65536,64
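
Whichever string you pick, the same value has to be in effect for every process
in the job. One way to do that (a sketch, assuming a bash-style shell; mpirun
should forward OMPI_MCA_* settings from its environment to the remote
processes) is the environment-variable form:

export OMPI_MCA_btl_openib_receive_queues=P,128,256,192,128:S,2048,1024,1008,64:S,12288,1024,1008,64:S,65536,1024,1008,64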

-Nathan

On Mon, Jun 01, 2015 at 09:28:28AM -0500, Steve Wise wrote:
Hello,

I'm seeing an error trying to run a simple OMPI job on a 2-node cluster where
one node is PPC64 (big-endian byte order) and the other is x86_64
(little-endian).  OMPI 1.8.4 is configured with --enable-heterogeneous:

./configure --with-openib=/usr  CC=gcc CXX=g++ F77=gfortran FC=gfortran
--enable-mpirun-prefix-by-default --prefix=/usr/mpi/gcc/openmpi-1.8.4/
--with-openib-libdir=/usr/lib64/ --libdir=/usr/mpi/gcc/openmpi-1.8.4/lib64/
--with-contrib-vt-flags=--disable-iotrace --enable-mpi-thread-multiple
--with-threads=posix --enable-heterogeneous && make -j8 && make -j8 install

And the job started this way:

/usr/mpi/gcc/openmpi-1.8.4/bin/mpirun -np 2 -host
ppc64,atlas3 --allow-run-as-root --mca btl_openib_ipaddr_include 102.1.1.0/24
--mca btl openib,sm,self /usr/mpi/gcc/openmpi-1.8.4/tests/IMB-3.2/IMB-MPI1
pingpong

But we see the following error.  Note that atlas3 reports the vendor ID in
the wrong byte order (0x25140000 instead of 0x1425):

The Open MPI receive queue configuration for the OpenFabrics devices
on two nodes are incompatible, meaning that MPI processes on two
specific nodes were unable to communicate with each other.  This
generally happens when you are using OpenFabrics devices from
different vendors on the same network.  You should be able to use the
mca_btl_openib_receive_queues MCA parameter to set a uniform receive
queue configuration for all the devices in the MPI job, and therefore
be able to run successfully.

   Local host:       ppc64-rhel71
   Local adapter:    cxgb4_0 (vendor 0x1425, part ID 21505)
   Local queues:     P,65536,64

   Remote host:      atlas3
   Remote adapter:   (vendor 0x25140000, part ID 22282240)
   Remote queues:
P,128,256,192,128:S,2048,1024,1008,64:S,12288,1024,1008,64:S,65536,1024,1008,64
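
(For what it's worth, both remote fields look like the local values with their
four bytes reversed: vendor ID 0x00001425 <-> 0x25140000, and part ID
21505 = 0x00005401 <-> 0x01540000 = 22282240.)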


Am I missing some OMPI parameter to allow this job to run?

Thanks,

Steve.
