It looks to me like the default queue pair settings are different on the two systems. You can try setting the btl_openib_receive_queues MCA variable by hand. If this is InfiniBand, I recommend not using any per-peer queue pairs; use something like:

S,2048,1024,1008,64:S,12288,1024,1008,64:S,65536,1024,1008,64

Otherwise, use one of the settings that were printed out:

P,128,256,192,128:S,2048,1024,1008,64:S,12288,1024,1008,64:S,65536,1024,1008,64

or

P,65536,64
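In each colon-separated entry, P describes a per-peer queue and S a shared receive queue. Whatever value you pick has to be uniform across every process in the job, so the easiest route is to pass it on the mpirun command line. A minimal sketch, reusing the hosts and paths from the quoted job below together with the shared-queue setting above (untested, adjust to your setup):

  /usr/mpi/gcc/openmpi-1.8.4/bin/mpirun -np 2 -host ppc64,atlas3 \
      --allow-run-as-root \
      --mca btl openib,sm,self \
      --mca btl_openib_receive_queues S,2048,1024,1008,64:S,12288,1024,1008,64:S,65536,1024,1008,64 \
      /usr/mpi/gcc/openmpi-1.8.4/tests/IMB-3.2/IMB-MPI1 pingpong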
-Nathan

On Mon, Jun 01, 2015 at 09:28:28AM -0500, Steve Wise wrote:
> Hello,
>
> I'm seeing an error trying to run a simple OMPI job on a 2 node cluster
> where one node is a PPC64 BE byte order and the other is a X86_64 LE
> byte order node. OMPI 1.8.4 is configured with --enable-heterogeneous:
>
> ./configure --with-openib=/usr CC=gcc CXX=g++ F77=gfortran FC=gfortran
> --enable-mpirun-prefix-by-default --prefix=/usr/mpi/gcc/openmpi-1.8.4/
> --with-openib-libdir=/usr/lib64/ --libdir=/usr/mpi/gcc/openmpi-1.8.4/lib64/
> --with-contrib-vt-flags=--disable-iotrace --enable-mpi-thread-multiple
> --with-threads=posix --enable-heterogeneous && make -j8 && make -j8 install
>
> And the job started this way:
>
> /usr/mpi/gcc/openmpi-1.8.4/bin/mpirun -np 2 -host ppc64,atlas3
> --allow-run-as-root --mca btl_openib_addr_include 102.1.1.0/24
> --mca btl openib,sm,self /usr/mpi/gcc/openmpi-1.8.4/tests/IMB-3.2/IMB-MPI1
> pingpong
>
> But we see the following error. Note atlas3 is using the vendor ID that
> is in the wrong byte order (0x25140000 instead of 0x1425):
>
> The Open MPI receive queue configuration for the OpenFabrics devices
> on two nodes are incompatible, meaning that MPI processes on two
> specific nodes were unable to communicate with each other. This
> generally happens when you are using OpenFabrics devices from
> different vendors on the same network. You should be able to use the
> mca_btl_openib_receive_queues MCA parameter to set a uniform receive
> queue configuration for all the devices in the MPI job, and therefore
> be able to run successfully.
>
> Local host: ppc64-rhel71
> Local adapter: cxgb4_0 (vendor 0x1425, part ID 21505)
> Local queues: P,65536,64
>
> Remote host: atlas3
> Remote adapter: (vendor 0x25140000, part ID 22282240)
> Remote queues:
> P,128,256,192,128:S,2048,1024,1008,64:S,12288,1024,1008,64:S,65536,1024,1008,64
>
> Am I missing some OMPI parameter to allow this job to run?
>
> Thanks,
>
> Steve.
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2015/06/27010.php