What's your mpirun or mpiexec command line? The error "BTLs attempted: self sm tcp" shows that it did not even try the MX BTL (for Open-MX). Did you use the MX MTL instead? And are you sure that Open-MX is actually being used in the runs that do not mix AMD and Intel nodes?
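If you haven't already, it is worth checking which MX components your Open MPI 1.6.5 build actually contains, and then forcing MX with verbose component selection so we can see why it gets discarded on the mixed AMD/Intel run. A rough sketch (the process count and "./my_cfd_app" below are only placeholders for your real job):

  # Was this Open MPI build compiled with MX (Open-MX) support at all?
  ompi_info | grep -i mx

  # Force the MX BTL and print why each BTL is kept or discarded
  mpirun --mca btl mx,sm,self --mca btl_base_verbose 100 \
         -np 8 -host AMD-Node-1,Intel-Node-1 ./my_cfd_app

  # Or try the MX MTL path instead
  mpirun --mca pml cm --mca mtl mx --mca mtl_base_verbose 100 \
         -np 8 -host AMD-Node-1,Intel-Node-1 ./my_cfd_app

If the forced-MX run fails only when the Intel node is in the host list, that points at the Open-MX installation on that node rather than at mixing AMD and Intel hardware.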
Brice

On 02/03/2014 08:06, Victor wrote:
> I got 4 x AMD A-10 6800K nodes on loan for a few months and added them
> to my existing Intel nodes.
>
> All nodes share the relevant directories via NFS. I have OpenMPI 1.6.5
> which was built with Open-MX 1.5.3 support, networked via GbE.
>
> All nodes run Ubuntu 12.04.
>
> Problem:
>
> I can run a job EITHER on 4 x AMD nodes OR on 2 x Intel nodes, but I
> cannot run a job on any combination of an AMD and an Intel node, i.e. 1 x
> AMD node + 1 x Intel node = the error below.
>
> The error that I get during job setup is:
>
> --------------------------------------------------------------------------
> At least one pair of MPI processes are unable to reach each other for
> MPI communications. This means that no Open MPI device has indicated
> that it can be used to communicate between these processes. This is
> an error; Open MPI requires that all MPI processes be able to reach
> each other. This error can sometimes be the result of forgetting to
> specify the "self" BTL.
>
>   Process 1 ([[2229,1],1]) is on host: AMD-Node-1
>   Process 2 ([[2229,1],8]) is on host: Intel-Node-1
>   BTLs attempted: self sm tcp
>
> Your MPI job is now going to abort; sorry.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> MPI_INIT has failed because at least one MPI process is unreachable
> from another. This *usually* means that an underlying communication
> plugin -- such as a BTL or an MTL -- has either not loaded or not
> allowed itself to be used. Your MPI job will now abort.
>
> You may wish to try to narrow down the problem;
>
> * Check the output of ompi_info to see which BTL/MTL plugins are
>   available.
> * Run your application with MPI_THREAD_SINGLE.
> * Set the MCA parameter btl_base_verbose to 100 (or mtl_base_verbose,
>   if using MTL-based communications) to see exactly which
>   communication plugins were considered and/or discarded.
> --------------------------------------------------------------------------
> [AMD-Node-1:3932] *** An error occurred in MPI_Init
> [AMD-Node-1:3932] *** on a NULL communicator
> [AMD-Node-1:3932] *** Unknown error
> [AMD-Node-1:3932] *** MPI_ERRORS_ARE_FATAL: your MPI job will now
> abort
> --------------------------------------------------------------------------
> An MPI process is aborting at a time when it cannot guarantee that all
> of its peer processes in the job will be killed properly. You should
> double check that everything has shut down cleanly.
>
>   Reason:     Before MPI_INIT completed
>   Local host: AMD-Node-1
>   PID:        3932
> --------------------------------------------------------------------------
>
> What I would like to know is: is it actually difficult (impossible) to
> mix AMD and Intel machines in the same cluster and have them run the
> same job, or am I missing something obvious, or not so obvious, when it
> comes to the communication stack on the Intel nodes for example?
>
> I set up the AMD nodes just yesterday, but I used the same OpenMPI and
> Open-MX versions. However, I may have inadvertently done something
> different, so I am thinking (hoping) that it is possible to run such a
> heterogeneous cluster, and that all I need to do is ensure that all
> OpenMPI modules are correctly installed on all nodes.
>
> I need the extra 32 GB of RAM that the AMD nodes bring, as I need to
> validate our CFD application, and our additional Intel nodes are still
> not here (ETA 2 weeks).
>
> Thank you,
>
> Victor
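Independently of Open MPI, you could also check the Open-MX layer directly between one AMD node and one Intel node. A quick sanity check, assuming the usual Open-MX test tools from your 1.5.3 install are available on both nodes (check their help output for the exact options):

  # On both AMD-Node-1 and Intel-Node-1: is the open-mx driver loaded,
  # and does the peer table list the other node?
  omx_info

  # Then run omx_pingpong between the two nodes (receiver on one node,
  # sender pointed at the other) to confirm raw Open-MX traffic works.

If that already fails across the AMD/Intel pair, the problem is in the Open-MX setup (driver or peer discovery) rather than in Open MPI.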