[OMPI users] unknown BTL transport in openmpi 1.5.4 and 1.6.2
We have recently encountered a problem with using openmpi 1.5.3, 1.5.4, and 1.6.2 over compute nodes with two different generations of Infiniband (DDR and QDR). This error is very similar to one posted to the list in 2011:

http://www.open-mpi.org/community/lists/users/2011/06/16773.php

This issue was never resolved on the mailing list. Here is the error:

iwtf-k43-28$ which mpirun
/usr/local/packages/openmpi/1.5.4/gcc-4.4.5/bin/mpirun
iwtf-k43-28$ cat machinefile
iwtf-k43-28
iwm-k43-30
iwtf-k43-28$ mpirun -np 2 -hostfile machinefile ./a.out 0
--
Open MPI detected two different OpenFabrics transport types in the same Infiniband network.
Such mixed network trasport configuration is not supported by Open MPI.

Local host: iwm-k43-30.pace.gatech.edu
Local adapter: mthca0 (vendor 0x2c9, part ID 25204)
Local transport type: MCA_BTL_OPENIB_TRANSPORT_UNKNOWN

Remote host: iwtf-k43-28
Remote Adapter: (vendor 0x2c9, part ID 26428)
Remote transport type: MCA_BTL_OPENIB_TRANSPORT_IB
--
Hello from iwtf-k43-28.pace.gatech.edu: 0 of 2
Hello from iwm-k43-30.pace.gatech.edu: 1 of 2
[iwtf-k43-28.pace.gatech.edu:12695] 1 more process has sent help message help-mpi-btl-openib.txt / conflicting transport types
[iwtf-k43-28.pace.gatech.edu:12695] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
--

iwtf-k43-28$ mpirun -np 2 -hostfile machinefile --mca btl openib,self ./a.out 0
--
Open MPI detected two different OpenFabrics transport types in the same Infiniband network.
Such mixed network trasport configuration is not supported by Open MPI.

Local host: iwm-k43-30.pace.gatech.edu
Local adapter: mthca0 (vendor 0x2c9, part ID 25204)
Local transport type: MCA_BTL_OPENIB_TRANSPORT_UNKNOWN

Remote host: iwtf-k43-28
Remote Adapter: (vendor 0x2c9, part ID 26428)
Remote transport type: MCA_BTL_OPENIB_TRANSPORT_IB
--
--
At least one pair of MPI processes are unable to reach each other for MPI communications. This means that no Open MPI device has indicated that it can be used to communicate between these processes. This is an error; Open MPI requires that all MPI processes be able to reach each other. This error can sometimes be the result of forgetting to specify the "self" BTL.

Process 1 ([[34066,1],1]) is on host: iwm-k43-30.pace.gatech.edu
Process 2 ([[34066,1],0]) is on host: iwtf-k43-28
BTLs attempted: self openib

Your MPI job is now going to abort; sorry.
--
--
MPI_INIT has failed because at least one MPI process is unreachable from another. This *usually* means that an underlying communication plugin -- such as a BTL or an MTL -- has either not loaded or not allowed itself to be used. Your MPI job will now abort.

You may wish to try to narrow down the problem;

* Check the output of ompi_info to see which BTL/MTL plugins are available.
* Run your application with MPI_THREAD_SINGLE.
* Set the MCA parameter btl_base_verbose to 100 (or mtl_base_verbose, if using MTL-based communications) to see exactly which communication plugins were considered and/or discarded.
--
[iwm-k43-30.pace.gatech.edu:9131] *** An error occurred in MPI_Init
[iwm-k43-30.pace.gatech.edu:9131] *** on a NULL communicator
[iwm-k43-30.pace.gatech.edu:9131] *** Unknown error
[iwm-k43-30.pace.gatech.edu:9131] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort
--
An MPI process is aborting at a time when it cannot guarantee that all of its peer processes in the job will be killed properly. You should double check that everything has shut down cleanly.

Reason: Before MPI_INIT completed
Local host: iwm-k43-30.pace.gatech.edu
PID: 9131
--
--
mpirun has exited due to process rank 1 with PID 9131 on node iwm-k43-30 exiting improperly. There are two reasons this could occur:

1. this process did not call "init" before exiting, but others in the job did. This can cause a job to hang indefinitely while it waits for all processes to call "init". By rule, if one process calls "init",
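
For anyone triaging the same report on several nodes, the local and remote fields of the help message quoted above can be pulled out and compared with a few lines of Python. This is only a sketch: the function names parse_transport_report and transports_differ are made up for illustration and are not part of Open MPI, and the field labels are simply copied from the message text in this post.

```python
import re

# Field labels as they appear in the help message quoted in this post.
# IGNORECASE is needed because the message prints "Remote Adapter".
FIELD_RE = re.compile(
    r"^(Local|Remote) (host|adapter|transport type):\s*(.*)$", re.IGNORECASE
)


def parse_transport_report(text):
    """Return {'local': {...}, 'remote': {...}} for one help-message block."""
    report = {"local": {}, "remote": {}}
    for line in text.splitlines():
        match = FIELD_RE.match(line.strip())
        if match:
            side, field, value = match.groups()
            report[side.lower()][field.lower()] = value.strip()
    return report


def transports_differ(report):
    """True when both transport types were found and they are not equal."""
    local = report["local"].get("transport type")
    remote = report["remote"].get("transport type")
    return local is not None and remote is not None and local != remote


# Sample copied from the error output in this post.
SAMPLE = """\
Local host: iwm-k43-30.pace.gatech.edu
Local adapter: mthca0 (vendor 0x2c9, part ID 25204)
Local transport type: MCA_BTL_OPENIB_TRANSPORT_UNKNOWN
Remote host: iwtf-k43-28
Remote Adapter: (vendor 0x2c9, part ID 26428)
Remote transport type: MCA_BTL_OPENIB_TRANSPORT_IB
"""

if __name__ == "__main__":
    report = parse_transport_report(SAMPLE)
    print(report["local"]["host"], report["local"]["transport type"])
    print(report["remote"]["host"], report["remote"]["transport type"])
    print("mismatch:", transports_differ(report))
```

This only restates what the message already says (the mthca0 side reports UNKNOWN, the other side reports IB); it does not explain why the older adapter is classified that way.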
Re: [OMPI users] [EXTERNAL] Re: problem building 32-bit openmpi-1.9a1r27979 with Sun C
Hi

thank you very much for your patch atomic.diff. I applied it and now get the following error.

sunpc1 openmpi-1.9-SunOS.x86_64.32_cc 21 grep Error log.make.SunOS.x86_64.32_cc
Creating mpi/man/man3/MPI_Error_class.3 man page...
Creating mpi/man/man3/MPI_Error_string.3 man page...
"../../../../openmpi-1.9a1r27979/opal/include/opal/sys/atomic_impl.h", line 106: Error: opal_atomic_add_64(volatile long long*, long long) was previously declared "extern", not "inline".
"../../../../openmpi-1.9a1r27979/opal/include/opal/sys/atomic_impl.h", line 121: Error: opal_atomic_sub_64(volatile long long*, long long) was previously declared "extern", not "inline".
2 Error(s) detected.
make[2]: *** [mpicxx.lo] Error 1
make[1]: *** [all-recursive] Error 1
make: *** [all-recursive] Error 1
sunpc1 openmpi-1.9-SunOS.x86_64.32_cc 22

linpc1 openmpi-1.9-Linux.x86_64.32_cc 116 grep Error log.make.Linux.x86_64.32_cc
Creating mpi/man/man3/MPI_Error_class.3 man page...
Creating mpi/man/man3/MPI_Error_string.3 man page...
"../../../../openmpi-1.9a1r27979/opal/include/opal/sys/atomic_impl.h", line 106: Error: opal_atomic_add_64(volatile long long*, long long) was previously declared "extern", not "inline".
"../../../../openmpi-1.9a1r27979/opal/include/opal/sys/atomic_impl.h", line 121: Error: opal_atomic_sub_64(volatile long long*, long long) was previously declared "extern", not "inline".
2 Error(s) detected.
make[2]: *** [mpicxx.lo] Error 1
make[1]: *** [all-recursive] Error 1
make: *** [all-recursive] Error 1
linpc1 openmpi-1.9-Linux.x86_64.32_cc 116

Kind regards

Siegmar
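
A side note on the grep above: matching on the bare word "Error" also picks up the man page lines (MPI_Error_class, MPI_Error_string) and the make summary lines. A stricter filter that keeps only the compiler diagnostics could look like the sketch below. It assumes the diagnostics always have the shape shown in this post ("file", line N: Error: message); the name compiler_errors is made up for illustration.

```python
import re

# Diagnostic shape as it appears in the build log in this post:
#   "path/to/file.h", line 106: Error: message text
DIAG_RE = re.compile(
    r'^"(?P<file>[^"]+)", line (?P<line>\d+): Error: (?P<msg>.*)$'
)


def compiler_errors(log_text):
    """Return (file, line, message) tuples for compiler diagnostics only."""
    errors = []
    for raw in log_text.splitlines():
        match = DIAG_RE.match(raw.strip())
        if match:
            errors.append(
                (match.group("file"), int(match.group("line")), match.group("msg"))
            )
    return errors


# Sample lines copied from the log in this post.
SAMPLE_LOG = '''\
Creating mpi/man/man3/MPI_Error_class.3 man page...
Creating mpi/man/man3/MPI_Error_string.3 man page...
"../../../../openmpi-1.9a1r27979/opal/include/opal/sys/atomic_impl.h", line 106: Error: opal_atomic_add_64(volatile long long*, long long) was previously declared "extern", not "inline".
"../../../../openmpi-1.9a1r27979/opal/include/opal/sys/atomic_impl.h", line 121: Error: opal_atomic_sub_64(volatile long long*, long long) was previously declared "extern", not "inline".
2 Error(s) detected.
make[2]: *** [mpicxx.lo] Error 1
'''

if __name__ == "__main__":
    for path, line_no, message in compiler_errors(SAMPLE_LOG):
        print(f"{path}:{line_no}: {message}")
```

Run against the full log.make file, this would list only the two atomic_impl.h diagnostics rather than every line containing "Error".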
Re: [OMPI users] I have still a problem with rankfiles in openmpi-1.6.4rc3
On Jan 31, 2013, at 12:39 PM, Siegmar Gross wrote:

> Hi
>
>> Hmmm, well, it certainly works for me:
>>
>> [rhc@odin ~/v1.6]$ cat rf
>> rank 0=odin093 slot=0:0-1,1:0-1
>> rank 1=odin094 slot=0:0-1
>> rank 2=odin094 slot=1:0
>> rank 3=odin094 slot=1:1
>>
>>
>> [rhc@odin ~/v1.6]$ mpirun -n 4 -rf ./rf --report-bindings
>> -mca opal_paffinity_alone 0 hostname
>> [odin093.cs.indiana.edu:04617] MCW rank 0 bound to
>> socket 0[core 0-1] socket 1[core 0-1]: [B B][B B] (slot list 0:0-1,1:0-1)
>> odin093.cs.indiana.edu
>> odin094.cs.indiana.edu
>> [odin094.cs.indiana.edu:04426] MCW rank 1 bound to
>> socket 0[core 0-1]: [B B][. .] (slot list 0:0-1)
>> odin094.cs.indiana.edu
>> [odin094.cs.indiana.edu:04426] MCW rank 2 bound to
>> socket 1[core 0]: [. .][B .] (slot list 1:0)
>> [odin094.cs.indiana.edu:04426] MCW rank 3 bound to
>> socket 1[core 1]: [. .][. B] (slot list 1:1)
>> odin094.cs.indiana.edu
>
> Interesting that it works on your machines.
>
>
>> I see one thing of concern to me in your output - your second node
>> appears to be a Sun computer. Is it the same physical architecture?
>> Is it also running Linux? Are you sure it is using the same version
>> of OMPI, built for that environment and hardware?
>
> Both machines (in fact all four machines: sunpc0, sunpc1, linpc0, and
> linpc1) use the same hardware. "linpc" uses openSUSE 12.1 and "sunpc"
> Solaris 10 x86_64. All machines use the same version of Open MPI,
> built for that environment. At the moment I can only use sunpc1 and
> linpc1 ("my" developer machines). Next week I will have access to all
> machines, so I can test whether I get a different behaviour when I
> use two machines with the same operating system (although mixed
> operating systems weren't a problem in the past, only machines with
> different endianness). I will let you know my results.

I suspect the problem is Solaris being on the remote machine. I don't know how far our Solaris support may have rotted by now.
>
> Kind regards
>
> Siegmar
>
>
>> On Jan 30, 2013, at 2:08 AM, Siegmar Gross wrote:
>>
>>> Hi
>>>
>>> I applied your patch "rmaps.diff" to openmpi-1.6.4rc3r27923 and
>>> it works for my previous rankfile.
>>>
>>> #3493: Handle the case where rankfile provides the allocation
>>> ---+---
>>> Reporter: rhc | Owner: jsquyres
>>> Type: changeset move request | Status: new
>>> Priority: critical | Milestone: Open MPI 1.6.4
>>> Version: trunk | Keywords:
>>> ---+---
>>> Please apply the attached patch that corrects the rmaps function for
>>> obtaining the available nodes when rankfile is providing the allocation.
>>>
>>>
>>> tyr rankfiles 129 more rf_linpc1
>>> # mpiexec -report-bindings -rf rf_linpc1 hostname
>>> rank 0=linpc1 slot=0:0-1,1:0-1
>>>
>>> tyr rankfiles 130 mpiexec -report-bindings -rf rf_linpc1 hostname
>>> [linpc1:31603] MCW rank 0 bound to socket 0[core 0-1]
>>> socket 1[core 0-1]: [B B][B B] (slot list 0:0-1,1:0-1)
>>>
>>>
>>>
>>> Unfortunately I don't get the expected result for the following
>>> rankfile.
>>>
>>> tyr rankfiles 114 more rf_bsp
>>> # mpiexec -report-bindings -rf rf_bsp hostname
>>> rank 0=linpc1 slot=0:0-1,1:0-1
>>> rank 1=sunpc1 slot=0:0-1
>>> rank 2=sunpc1 slot=1:0
>>> rank 3=sunpc1 slot=1:1
>>>
>>> I would expect that rank 0 gets all four cores from linpc1, rank 1
>>> both cores of socket 0 from sunpc1, rank 2 core 0 of socket 1, and
>>> rank 3 core 1 of socket 1 from sunpc1. Everything is fine for my
>>> processes with rank 0 and 1, but it's wrong for ranks 2 and 3,
>>> because they both get all four cores of sunpc1. Is something wrong
>>> with my rankfile or with your mapping of processes to cores? I have
>>> removed the output from "hostname" and wrapped long lines.
>>>
>>> tyr rankfiles 115 mpiexec -report-bindings -rf rf_bsp hostname
>>> [linpc1:31092] MCW rank 0 bound to socket 0[core 0-1] socket 1[core 0-1]:
>>> [B B][B B] (slot list 0:0-1,1:0-1)
>>> [sunpc1:12916] MCW rank 1 bound to socket 0[core 0-1]:
>>> [B B][. .] (slot list 0:0-1)
>>> [sunpc1:12916] MCW rank 2 bound to socket 0[core 0-1] socket 1[core 0-1]:
>>> [B B][B B] (slot list 1:0)
>>> [sunpc1:12916] MCW rank 3 bound to socket 0[core 0-1] socket 1[core 0-1]:
>>> [B B][B B] (slot list 1:1)
>>>
>>>
>>> I get the following output, if I add the options which you mentioned
>>> in a previous email.
>>>
>>> tyr rankfiles 124 mpiexec -report-bindings -rf rf_bsp \
>>> -display-allocation -mca ras_base_verbose 5 hostname
>>> [tyr.informatik.hs-fulda.de:19401] mca:base:select:( ras)
>>> Querying component [cm]
>>> [tyr.informatik.hs-fulda.de:19401] mca:base:select:( ras)
>>> Skipping component [cm]. Query failed to return a module
>>> [tyr.informatik.hs-fulda.de:19401] mca:base:select:( ras)
>>> No component selected!
>>> [tyr.informatik.hs-ful
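
To make the expectation in the quoted message concrete, here is a small Python sketch that turns the rankfile lines from this thread into the bracket mask that --report-bindings prints, so the expected mask can be compared with the reported one. It is not Open MPI code. It assumes two sockets with two cores each (as in the output quoted in this thread) and handles only the socket:core and socket:first-last slot forms that appear here; the function names are made up for illustration.

```python
def parse_slot_list(slot_list):
    """Parse a slot list such as '0:0-1,1:0-1' into {socket: {core, ...}}.

    Only the 'socket:core' and 'socket:first-last' forms used in this
    thread are handled; anything else raises ValueError.
    """
    binding = {}
    for entry in slot_list.split(","):
        socket_text, sep, cores_text = entry.partition(":")
        if not sep:
            raise ValueError(f"unsupported slot entry: {entry!r}")
        first, dash, last = cores_text.partition("-")
        cores = range(int(first), int(last if dash else first) + 1)
        binding.setdefault(int(socket_text), set()).update(cores)
    return binding


def parse_rankfile_line(line):
    """Parse 'rank 0=linpc1 slot=0:0-1,1:0-1' into (rank, host, binding)."""
    rank_part, slot_part = line.split()[1:3]
    rank_text, host = rank_part.split("=", 1)
    return int(rank_text), host, parse_slot_list(slot_part.split("=", 1)[1])


def binding_mask(binding, sockets=2, cores_per_socket=2):
    """Render a binding as a bracket mask, one [..] group per socket.

    The 2x2 default topology is an assumption taken from the output
    quoted in this thread.
    """
    groups = []
    for sock in range(sockets):
        bound = binding.get(sock, ())
        marks = ["B" if core in bound else "." for core in range(cores_per_socket)]
        groups.append("[" + " ".join(marks) + "]")
    return "".join(groups)


# The rf_bsp rankfile from the quoted message.
RF_BSP = [
    "rank 0=linpc1 slot=0:0-1,1:0-1",
    "rank 1=sunpc1 slot=0:0-1",
    "rank 2=sunpc1 slot=1:0",
    "rank 3=sunpc1 slot=1:1",
]

if __name__ == "__main__":
    for line in RF_BSP:
        rank, host, binding = parse_rankfile_line(line)
        print(f"rank {rank} on {host}: expected {binding_mask(binding)}")
```

For ranks 2 and 3 the masks computed this way match what the odin run reports for slot lists 1:0 and 1:1, whereas the sunpc1 run reports [B B][B B] for both, which is the discrepancy being described.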
Re: [OMPI users] [EXTERNAL] Re: problem building 32-bit openmpi-1.9a1r27979 with Sun C
Hi

> be provided, so it all falls down. Siegmar, can you send me your
> opal/include/opal_config.h file when running with the Studio compilers? I
> don't have them available on x86 and it's probably faster for you to send
> me the files than for me to try to set up a Linux box with the compilers
> installed.

I have attached both header files. Thank you very much in advance for your help.

Kind regards

Siegmar

/* opal/include/opal_config.h. Generated from opal_config.h.in by configure. */
/* opal/include/opal_config.h.in. Generated from configure.ac by autoheader. */

/* -*- c -*-
 *
 * Copyright (c) 2004-2005 The Trustees of Indiana University.
 * All rights reserved.
 * Copyright (c) 2004-2005 The Trustees of the University of Tennessee.
 * All rights reserved.
 * Copyright (c) 2004-2005 High Performance Computing Center Stuttgart,
 * University of Stuttgart. All rights reserved.
 * Copyright (c) 2004-2005 The Regents of the University of California.
 * All rights reserved.
 * $COPYRIGHT$
 *
 * Additional copyrights may follow
 *
 * $HEADER$
 *
 * Function: - OS, CPU and compiler dependent configuration
 */

#ifndef OPAL_CONFIG_H
#define OPAL_CONFIG_H

#include "opal_config_top.h"

/* Define if building universal (internal helper macro) */
/* #undef AC_APPLE_UNIVERSAL_BUILD */

/* enable openib BTL failover */
#define BTL_OPENIB_FAILOVER_ENABLED 0

/* Whether the openib BTL malloc hooks are enabled */
#define BTL_OPENIB_MALLOC_HOOKS_ENABLED 1

/* BLCR cr_request_file check */
/* #undef CRS_BLCR_HAVE_CR_REQUEST */

/* BLCR cr_request_checkpoint check */
/* #undef CRS_BLCR_HAVE_CR_REQUEST_CHECKPOINT */

/* BLCRs cr_checkpoint_info.requester member availability */
/* #undef CRS_BLCR_HAVE_INFO_REQUESTER */

/* Define to 1 if you have the header file. */
#define HAVE_AIO_H 1

/* Define to 1 if you have the header file. */
#define HAVE_ALLOCA_H 1

/* Define to 1 if you have the header file. */
/* #undef HAVE_ALPS_APINFO_H */

/* Define to 1 if you have the header file. */
#define HAVE_ARPA_INET_H 1

/* Define to 1 if you have the `asprintf' function. */
#define HAVE_ASPRINTF 1

/* Define to 1 if you have the `backtrace' function. */
#define HAVE_BACKTRACE 1

/* Define to 1 if the system has the type `CACHE_DESCRIPTOR'. */
/* #undef HAVE_CACHE_DESCRIPTOR */

/* Define to 1 if the system has the type `CACHE_RELATIONSHIP'. */
/* #undef HAVE_CACHE_RELATIONSHIP */

/* Define to 1 if you have the header file. */
/* #undef HAVE_CATAMOUNT_CNOS_MPI_OS_H */

/* Define to 1 if you have the header file. */
/* #undef HAVE_CATAMOUNT_DCLOCK_H */

/* Define to 1 if you have the `ceil' function. */
#define HAVE_CEIL 1

/* Define to 1 if you have the `clz' function. */
/* #undef HAVE_CLZ */

/* Define to 1 if you have the `clzl' function. */
/* #undef HAVE_CLZL */

/* Define to 1 if you have the header file. */
/* #undef HAVE_CNOS_MPI_OS_H */

/* Define to 1 if you have the header file. */
#define HAVE_COMPLEX_H 1

/* Define to 1 if you have the `cpuset_setaffinity' function. */
/* #undef HAVE_CPUSET_SETAFFINITY */

/* Define to 1 if you have the header file. */
/* #undef HAVE_CRT_EXTERNS_H */

/* Define to 1 if you have the `cr_request_checkpoint' function. */
/* #undef HAVE_CR_REQUEST_CHECKPOINT */

/* Define to 1 if you have the `cr_request_file' function. */
/* #undef HAVE_CR_REQUEST_FILE */

/* uDAPL DAT_MEM_TYPE_SO_VIRTUAL check */
/* #undef HAVE_DAT_MEM_TYPE_SO_VIRTUAL */

/* Define to 1 if you have the `dbm_open' function. */
/* #undef HAVE_DBM_OPEN */

/* Define to 1 if you have the `dbopen' function. */
/* #undef HAVE_DBOPEN */

/* Define to 1 if you have the header file. */
#define HAVE_DB_H 1

/* Define to 1 if you have the declaration of `AF_INET6', and to 0 if you don't. */
#define HAVE_DECL_AF_INET6 1

/* Define to 1 if you have the declaration of `AF_UNSPEC', and to 0 if you don't. */
#define HAVE_DECL_AF_UNSPEC 1

/* Define to 1 if you have the declaration of `CTL_HW', and to 0 if you don't. */
#define HAVE_DECL_CTL_HW 0

/* Define to 1 if you have the declaration of `fabsf', and to 0 if you don't. */
#define HAVE_DECL_FABSF 1

/* Define to 1 if you have the declaration of `HW_NCPU', and to 0 if you don't. */
#define HAVE_DECL_HW_NCPU 0

/* Define to 1 if you have the declaration of `HZ', and to 0 if you don't. */
#define HAVE_DECL_HZ 1

/* Define to 1 if you have the declaration of `IBV_ACCESS_SO', and to 0 if you don't. */
/* #undef HAVE_DECL_IBV_ACCESS_SO */

/* Define to 1 if you have the declaration of `IBV_EVENT_CLIENT_REREGISTER', and to 0 if you don't. */
/* #undef HAVE_DECL_IBV_EVENT_CLIENT_REREGISTER */

/* Define to 1 if you have the declaration of `IBV_LINK_LAYER_ETHERNET', and to 0 if you don't. */
/* #undef HAVE_DECL_IBV_LINK_LAYER_ETHERNET */

/* Define to 1 if you have the declaration of `IBV_M_WR_CALC_RDMA_WRITE_WITH_IMM', and to 0 if you don't. */
/* #