What's your mpirun or mpiexec command line? The error "BTLs attempted: self sm tcp" shows that it did not even try the MX BTL (for Open-MX). Did you use the MX MTL instead? And are you sure that Open-MX is actually being used in the runs that do not mix AMD and Intel nodes?
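If you haven't already, it is worth checking which MX components your Open MPI 1.6.5 build actually contains, and then forcing MX with verbose component selection so we can see why it gets discarded on the mixed AMD/Intel run. A rough sketch (the process count and "./my_cfd_app" below are only placeholders for your real job):

  # Was this Open MPI build compiled with MX (Open-MX) support at all?
  ompi_info | grep -i mx

  # Force the MX BTL and print why each BTL is kept or discarded
  mpirun --mca btl mx,sm,self --mca btl_base_verbose 100 \
         -np 8 -host AMD-Node-1,Intel-Node-1 ./my_cfd_app

  # Or try the MX MTL path instead
  mpirun --mca pml cm --mca mtl mx --mca mtl_base_verbose 100 \
         -np 8 -host AMD-Node-1,Intel-Node-1 ./my_cfd_app

If the forced-MX run fails only when the Intel node is in the host list, that points at the Open-MX installation on that node rather than at mixing AMD and Intel hardware.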
Brice

On 02/03/2014 08:06, Victor wrote:
> I got 4 x AMD A-10 6800K nodes on loan for a few months and added them
> to my existing Intel nodes.
>
> All nodes share the relevant directories via NFS. I have OpenMPI 1.6.5
> which was built with Open-MX 1.5.3 support, networked via GbE.
>
> All nodes run Ubuntu 12.04.
>
> Problem:
>
> I can run a job EITHER on 4 x AMD nodes OR on 2 x Intel nodes, but I
> cannot run a job on any combination of an AMD and an Intel node, i.e. 1 x
> AMD node + 1 x Intel node = the error below.
>
> The error that I get during job setup is:
>
> --------------------------------------------------------------------------
> At least one pair of MPI processes are unable to reach each other for
> MPI communications. This means that no Open MPI device has indicated
> that it can be used to communicate between these processes. This is
> an error; Open MPI requires that all MPI processes be able to reach
> each other. This error can sometimes be the result of forgetting to
> specify the "self" BTL.
>
>   Process 1 ([[2229,1],1]) is on host: AMD-Node-1
>   Process 2 ([[2229,1],8]) is on host: Intel-Node-1
>   BTLs attempted: self sm tcp
>
> Your MPI job is now going to abort; sorry.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> MPI_INIT has failed because at least one MPI process is unreachable
> from another. This *usually* means that an underlying communication
> plugin -- such as a BTL or an MTL -- has either not loaded or not
> allowed itself to be used. Your MPI job will now abort.
>
> You may wish to try to narrow down the problem;
>
> * Check the output of ompi_info to see which BTL/MTL plugins are
>   available.
> * Run your application with MPI_THREAD_SINGLE.
> * Set the MCA parameter btl_base_verbose to 100 (or mtl_base_verbose,
>   if using MTL-based communications) to see exactly which
>   communication plugins were considered and/or discarded.
> --------------------------------------------------------------------------
> [AMD-Node-1:3932] *** An error occurred in MPI_Init
> [AMD-Node-1:3932] *** on a NULL communicator
> [AMD-Node-1:3932] *** Unknown error
> [AMD-Node-1:3932] *** MPI_ERRORS_ARE_FATAL: your MPI job will now
> abort
> --------------------------------------------------------------------------
> An MPI process is aborting at a time when it cannot guarantee that all
> of its peer processes in the job will be killed properly. You should
> double check that everything has shut down cleanly.
>
>   Reason:     Before MPI_INIT completed
>   Local host: AMD-Node-1
>   PID:        3932
> --------------------------------------------------------------------------
>
> What I would like to know is: is it actually difficult (impossible) to
> mix AMD and Intel machines in the same cluster and have them run the
> same job, or am I missing something obvious, or not so obvious, when it
> comes to the communication stack on the Intel nodes for example?
>
> I set up the AMD nodes just yesterday, but I used the same OpenMPI and
> Open-MX versions. However, I may have inadvertently done something
> different, so I am thinking (hoping) that it is possible to run such a
> heterogeneous cluster, and that all I need to do is ensure that all
> OpenMPI modules are correctly installed on all nodes.
>
> I need the extra 32 GB of RAM that the AMD nodes bring, as I need to
> validate our CFD application, and our additional Intel nodes are still
> not here (ETA 2 weeks).
>
> Thank you,
>
> Victor
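Independently of Open MPI, you could also check the Open-MX layer directly between one AMD node and one Intel node. A quick sanity check, assuming the usual Open-MX test tools from your 1.5.3 install are available on both nodes (check their help output for the exact options):

  # On both AMD-Node-1 and Intel-Node-1: is the open-mx driver loaded,
  # and does the peer table list the other node?
  omx_info

  # Then run omx_pingpong between the two nodes (receiver on one node,
  # sender pointed at the other) to confirm raw Open-MX traffic works.

If that already fails across the AMD/Intel pair, the problem is in the Open-MX setup (driver or peer discovery) rather than in Open MPI.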