Well, I checked, and it looks to me like --hetero-apps is a stale option, at least in master; I don't see where it gets used.
Looking at the code, I would suspect that something didn't get configured correctly: either the --enable-heterogeneous flag didn't get set on one side, or we incorrectly failed to identify the BE machine, or both. You might run ompi_info on the two sides and verify that they were both built correctly (a quick check is sketched at the end of this message).

> On Jun 1, 2015, at 7:40 AM, Ralph Castain <r...@open-mpi.org> wrote:
>
> Just to check the obvious: I assume that the /usr/mpi directory is not
> network mounted, and that both the application and the OMPI code are
> appropriately compiled on each side?
>
> There is another mpirun flag, --hetero-apps, that you may need to provide.
> It has been so long since someone tried this that I'd have to look to
> remember what it does.
>
>
>> On Jun 1, 2015, at 7:28 AM, Steve Wise <sw...@opengridcomputing.com> wrote:
>>
>> Hello,
>>
>> I'm seeing an error trying to run a simple OMPI job on a 2-node cluster
>> where one node is PPC64 (BE byte order) and the other is x86_64 (LE byte
>> order). OMPI 1.8.4 is configured with --enable-heterogeneous:
>>
>> ./configure --with-openib=/usr CC=gcc CXX=g++ F77=gfortran FC=gfortran
>> --enable-mpirun-prefix-by-default --prefix=/usr/mpi/gcc/openmpi-1.8.4/
>> --with-openib-libdir=/usr/lib64/ --libdir=/usr/mpi/gcc/openmpi-1.8.4/lib64/
>> --with-contrib-vt-flags=--disable-iotrace --enable-mpi-thread-multiple
>> --with-threads=posix --enable-heterogeneous && make -j8 && make -j8 install
>>
>> And the job is started this way:
>>
>> /usr/mpi/gcc/openmpi-1.8.4/bin/mpirun -np 2 -host ppc64,atlas3
>> --allow-run-as-root --mca btl_openib_addr_include 102.1.1.0/24
>> --mca btl openib,sm,self /usr/mpi/gcc/openmpi-1.8.4/tests/IMB-3.2/IMB-MPI1
>> pingpong
>>
>> But we see the following error. Note that atlas3 is reporting the vendor
>> ID in the wrong byte order (0x25140000 instead of 0x1425):
>>
>> The Open MPI receive queue configuration for the OpenFabrics devices
>> on two nodes are incompatible, meaning that MPI processes on two
>> specific nodes were unable to communicate with each other. This
>> generally happens when you are using OpenFabrics devices from
>> different vendors on the same network. You should be able to use the
>> mca_btl_openib_receive_queues MCA parameter to set a uniform receive
>> queue configuration for all the devices in the MPI job, and therefore
>> be able to run successfully.
>>
>> Local host:     ppc64-rhel71
>> Local adapter:  cxgb4_0 (vendor 0x1425, part ID 21505)
>> Local queues:   P,65536,64
>>
>> Remote host:    atlas3
>> Remote adapter: (vendor 0x25140000, part ID 22282240)
>> Remote queues:
>> P,128,256,192,128:S,2048,1024,1008,64:S,12288,1024,1008,64:S,65536,1024,1008,64
>>
>> Am I missing some OMPI parameter to allow this job to run?
>>
>> Thanks,
>>
>> Steve.
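
As a concrete version of the ompi_info check suggested above (a minimal sketch, assuming the install prefix from Steve's configure line and the stock ompi_info output wording, which can vary a little between OMPI versions), run on each node:

  /usr/mpi/gcc/openmpi-1.8.4/bin/ompi_info | grep -i hetero

Both builds should report heterogeneous support as enabled; if the x86_64 side reports "no", that build was configured without --enable-heterogeneous, which would fit the byte-swapped vendor ID shown in the error.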
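
If you also want to try the workaround the quoted error text itself suggests, forcing the same receive queue specification on both sides would look roughly like the command below (the queue string is simply the one atlas3 already reports, not a recommendation, and the rest is Steve's original mpirun line):

  /usr/mpi/gcc/openmpi-1.8.4/bin/mpirun -np 2 -host ppc64,atlas3 --allow-run-as-root \
    --mca btl_openib_addr_include 102.1.1.0/24 --mca btl openib,sm,self \
    --mca btl_openib_receive_queues P,128,256,192,128:S,2048,1024,1008,64:S,12288,1024,1008,64:S,65536,1024,1008,64 \
    /usr/mpi/gcc/openmpi-1.8.4/tests/IMB-3.2/IMB-MPI1 pingpong

Given the byte-swapped vendor ID, though, the mismatch looks like a symptom of the heterogeneous build/configuration problem rather than a genuine queue difference, so the ompi_info check is the first thing to verify.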