[OMPI users] OpenMPI 1.4.2 with Myrinet MX, mpirun seg faults
We are doing a test build of a new cluster. We are re-using our Myrinet 10G gear from a previous cluster. I have built OpenMPI 1.4.2 with PGI 10.4. We use this combination regularly on our InfiniBand-based cluster, and all the install elements were readily available. After a few go-arounds with the Myrinet MX stack, we are now running MX 1.2.12 with allowances for more than the default maximum of 16 endpoints. Each node has 24 cores. The cluster is running Rocks 5.3. As part of the initial build, I installed the Myrinet_MX Rocks Roll from Myricom. With the default limitation of 16 endpoints, we could not run on all nodes. As mentioned above, the MX stack was replaced. Myricom provided a build of OpenMPI 1.4.1. That build works, but it is compiled only with gcc and gfortran, and we want it built with the compilers we normally use, e.g. PGI, PathScale and Intel. We can compile with the OpenMPI 1.4.2 / PGI 10.4 build. However, we cannot launch jobs with mpirun; it seg faults:

--
mpirun noticed that the job aborted, but has no info as to the process that caused that situation.
--
[enet1-head2-eth1:29532] *** Process received signal ***
[enet1-head2-eth1:29532] Signal: Segmentation fault (11)
[enet1-head2-eth1:29532] Signal code: Address not mapped (1)
[enet1-head2-eth1:29532] Failing at address: 0x6c
[enet1-head2-eth1:29532] *** End of error message ***
Segmentation fault

However, if we launch the job with the Myricom-supplied mpirun in their OpenMPI tree, the job runs successfully. This works even with a test program compiled with the OpenMPI 1.4.2 / PGI 10.4 build.
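A quick way to confirm which mpirun is actually being invoked (the test program name hello_mpi is only a placeholder; the install path is the one shown in the ldd output below):

  # which launcher is first in PATH, and what version it reports
  which mpirun
  mpirun --version

  # launch through the PGI build explicitly, bypassing PATH
  /share/apps/opt/OpenMPI/1.4.2/PGI/10.4/bin/mpirun -np 4 ./hello_mpi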
Re: [OMPI users] OpenMPI 1.4.2 with Myrinet MX, mpirun seg faults
On 10/20/2010 7:59 PM, Ralph Castain wrote: The error message seems to imply that mpirun itself didn't segfault, but that something else did. Is that segfault pid from mpirun? This kind of problem usually is caused by mismatched builds - i.e., you compile against your new build, but you pick up the Myrinet build when you try to run because of path and ld_library_path issues. You might check to ensure you are running against what you built with.

The PATH and LD_LIBRARY_PATH are set explicitly (through modules) on the frontend and each node. The PGI compiler and the OpenMPI build I am trying to run are set on each.

ldd /share/apps/opt/OpenMPI/1.4.2/PGI/10.4/bin/mpirun
  libopen-rte.so.0 => /share/apps/opt/OpenMPI/1.4.2/PGI/10.4/lib/libopen-rte.so.0 (0x2b6a16552000)
  libopen-pal.so.0 => /share/apps/opt/OpenMPI/1.4.2/PGI/10.4/lib/libopen-pal.so.0 (0x2b6a167aa000)
  libdl.so.2 => /lib64/libdl.so.2 (0x003a7dc0)
  libnsl.so.1 => /lib64/libnsl.so.1 (0x003a8040)
  libutil.so.1 => /lib64/libutil.so.1 (0x003a88a0)
  libpthread.so.0 => /lib64/libpthread.so.0 (0x003a7e00)
  libm.so.6 => /lib64/libm.so.6 (0x003a7d80)
  libc.so.6 => /lib64/libc.so.6 (0x003a7d40)
  libpgc.so => /share/apps/opt/PGI/10.4/linux86-64/10.4/libso/libpgc.so (0x2b6a16a28000)
  /lib64/ld-linux-x86-64.so.2 (0x003a7d00)

The one that works, from the other tree:

ldd /opt/openmpi-myrinet_mx/bin/mpirun
  libopen-rte.so.0 => /opt/openmpi-myrinet_mx/lib/libopen-rte.so.0 (0x2b51c71b)
  libopen-pal.so.0 => /opt/openmpi-myrinet_mx/lib/libopen-pal.so.0 (0x2b51c743)
  libdl.so.2 => /lib64/libdl.so.2 (0x003a7dc0)
  libnsl.so.1 => /lib64/libnsl.so.1 (0x003a8040)
  libutil.so.1 => /lib64/libutil.so.1 (0x003a88a0)
  libm.so.6 => /lib64/libm.so.6 (0x003a7d80)
  libpthread.so.0 => /lib64/libpthread.so.0 (0x003a7e00)
  libc.so.6 => /lib64/libc.so.6 (0x003a7d40)
  /lib64/ld-linux-x86-64.so.2 (0x003a7d00)
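Since non-interactive shells do not always get the same module environment, it may also be worth checking what a compute node actually sees (the node name below is just an example):

  ssh compute-0-0 'which mpirun; echo $LD_LIBRARY_PATH'
  ssh compute-0-0 'ldd /share/apps/opt/OpenMPI/1.4.2/PGI/10.4/bin/mpirun | grep libopen-'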
Re: [OMPI users] OpenMPI 1.4.2 with Myrinet MX, mpirun seg faults
On 10/20/2010 8:30 PM, Scott Atchley wrote: On Oct 20, 2010, at 9:22 PM, Raymond Muno wrote: On 10/20/2010 7:59 PM, Ralph Castain wrote: The error message seems to imply that mpirun itself didn't segfault, but that something else did. Is that segfault pid from mpirun? This kind of problem usually is caused by mismatched builds - i.e., you compile against your new build, but you pick up the Myrinet build when you try to run because of path and ld_library_path issues. You might check to ensure you are running against what you built with. The PATH and LD_LIBRARY_PATH are set explicitly (through modules) on the frontend and each node. The PGI compiler and the OpenMPI build I am trying to run are set on each.

Are you building OMPI with support for both MX and IB? If not and you only want MX support, try configuring OMPI using --disable-memory-manager (check configure for the exact option). We have fixed this bug in the most recent 1.4.x and 1.5.x releases. Scott

I just downloaded 1.4.3 and compiled it with PGI 10.4. I get the same result. I did confirm that the process ID shown is that of mpirun. This cluster only has Myrinet. The install is separate from the IB cluster and is a fresh build. I will try the configure option.
Re: [OMPI users] OpenMPI 1.4.2 with Myrinet MX, mpirun seg faults
On 10/20/2010 8:30 PM, Scott Atchley wrote: Are you building OMPI with support for both MX and IB? If not and you only want MX support, try configuring OMPI using --disable-memory-manager (check configure for the exact option). We have fixed this bug in the most recent 1.4.x and 1.5.x releases. Scott

Hmmm, not sure which configure option you want me to try.

$ ./configure --help | grep memory
  --enable-mem-debug      enable memory debugging (debugging only) (default:
  --enable-mem-profile    enable memory profiling (debugging only) (default:
  --enable-memchecker     Enable memory and buffer checks. Note that disabling
  --with-memory-manager=TYPE
                          Use TYPE for intercepting memory management calls to
                          control memory pinning.

$ ./configure --help | grep disable
  --cache-file=FILE       cache test results in FILE [disabled]
  --disable-option-checking  ignore unrecognized --enable/--with options
  --disable-FEATURE       do not include FEATURE (same as --enable-FEATURE=no)
                          disabled)
                          disabled)
                          building Open MPI (default: disabled)
                          general MPI users!) (default: disabled)
  --disable-debug-symbols Disable adding compiler flags to enable debugging
  --enable-peruse         Support PERUSE interface (default: disabled)
  --enable-pty-support    Enable/disable PTY support for STDIO forwarding.
                          dlopen implies --disable-mca-dso. (default: enabled)
                          support (default: disabled)
                          MPI applications (default: disabled)
                          This option ignores the --disable-binaries option
  --disable-ipv6          Disable IPv6 support (default: enabled, but only if
  --disable-dependency-tracking  speeds up one-time build
                          disabled)
  --enable-smp-locks      enable smp locks in atomic ops. Do not disable if
                          disabled)
  --disable-ft-thread     Disable fault tolerance thread running inside all
                          disable building all maffinity components and the
                          as static disables it building as a DSO.
  --enable-mca-no-build   list (default: disabled)
  --disable-executables   Using --disable-executables disables building and
                          --disable-included-mode, meaning that the PLPA is in
                          InfiniBand ConnectX adapters, you may disable the
                          (default: disabled)
  --disable-mpi-io        Disable built-in support for MPI-2 I/O, likely
  --disable-io-romio      Disable the ROMIO MPI-IO component
                          "--enable-contrib-no-build=libtrace,vt" will disable
  --disable-libtool-lock  avoid locking (might break parallel builds)
                          (default: disabled).
                          (default: disabled)
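Going by the help output above, the memory manager is controlled by --with-memory-manager=TYPE rather than a --disable flag, so disabling it would presumably look like the following (compiler names and prefix are only examples matching the PGI build discussed here):

  ./configure CC=pgcc CXX=pgCC F77=pgf77 FC=pgf90 \
      --prefix=/share/apps/opt/OpenMPI/1.4.2/PGI/10.4 \
      --with-memory-manager=none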
Re: [OMPI users] OpenMPI 1.4.2 with Myrinet MX, mpirun seg faults
On 10/20/2010 8:30 PM, Scott Atchley wrote: We have fixed this bug in the most recent 1.4.x and 1.5.x releases. Scott

OK, a few more tests. I was using PGI 10.4 as the compiler. I have now tried OpenMPI 1.4.3 with PGI 10.8 and Intel 11.1. I get the same result in each case: mpirun seg faults. (I really did not expect that to change anything.) I tried OpenMPI 1.5. Under PGI, I could not get it to compile. With Intel 11.1, it compiles, but when I try to run a simple test, mpirun just seems to hang and I never see anything start on the nodes. I would rather stick with 1.4.x for now since that is what we are running on our other production cluster, so I will leave this for a later day. I grabbed the 1.4.3 version from this page: http://www.open-mpi.org/software/ompi/v1.4/ When you say this bug is fixed in recent 1.4.x releases, should I try one from here? http://www.open-mpi.org/nightly/v1.4/ For grins, I compiled the OpenMPI 1.4.1 tree. This is what Myricom supplied with the MX roll. Same result. I can still run with their compiled version of mpirun, even when I compile with the other build trees and compilers. I just do not know what options they compiled with. Any insight would be appreciated. -Ray Muno University of Minnesota
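To see what options Myricom's tree was configured with, ompi_info from that install should report them (the grep is just a convenience; the exact field names vary a little between versions):

  /opt/openmpi-myrinet_mx/bin/ompi_info --all | grep -i configure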
[OMPI users] Problem building OpenMPI with SunStudio compilers
We are implementing a new cluster that is InfiniBand based. I am working on getting OpenMPI built for our various compile environments. So far it is working for PGI 7.2 and PathScale 3.1. I found some workarounds for issues with the Pathscale compilers (seg faults) in the OpenMPI FAQ. When I try to build with SunStudio, I cannot even get past the configure stage. It dies in the stage that checks for C++.

*** C++ compiler and preprocessor
checking whether we are using the GNU C++ compiler... no
checking whether CC accepts -g... yes
checking dependency style of CC... none
checking how to run the C++ preprocessor... CC -E
checking for the C++ compiler vendor... sun
checking if C++ compiler works... no
**********************************************************************
* It appears that your C++ compiler is unable to produce working
* executables. A simple test application failed to properly
* execute. Note that this is likely not a problem with Open MPI,
* but a problem with the local compiler installation. More
* information (including exactly what command was given to the
* compiler and what error resulted when the command was executed) is
* available in the config.log file in this directory.
**********************************************************************
configure: error: Could not run a simple C++ program. Aborting.

The section in config.log looks to be:

configure:21722: CC -c -DNDEBUG conftest.cpp >&5
configure:21728: $? = 0
configure:21907: result: sun
configure:21929: checking if C++ compiler works
configure:22006: CC -o conftest -DNDEBUG conftest.cpp >&5
/usr/lib64/libm.so: file not recognized: File format not recognized
configure:22009: $? = 1
configure: program exited with status 1
configure: failed program was:

The attempt to configure was done with:

./configure CC=cc CXX=CC F77=f77 FC=f90 --prefix=path_to_install

All the SunStudio binaries are at the front of the path. I found the entry in the FAQ for the SunStudio compilers, http://www.open-mpi.org/faq/?category=building#build-sun-compilers , and followed that as well, with no success. It still dies at the configure step. The SunStudio version is 12. The target (and compilation) platform is AMD Opteron, Barcelona. We have been using the SunStudio compilers on this cluster on a routine basis and have not had issues.
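The failing step can be reproduced outside of configure with a trivial test program, which makes it easier to see whether the Sun CC installation itself can link against /usr/lib64/libm.so (the file name is arbitrary):

  cat > hello.cpp <<'EOF'
  #include <iostream>
  int main() { std::cout << "ok" << std::endl; return 0; }
  EOF
  CC -o hello hello.cpp -lm && ./hello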
Re: [OMPI users] Problem building OpenMPI with SunStudio compilers
Raymond Muno wrote: We are implementing a new cluster that is InfiniBand based. I am working on getting OpenMPI built for our various compile environments. So far it is working for PGI 7.2 and PathScale 3.1. I found some workarounds for issues with the Pathscale compilers (seg faults) in the OpenMPI FAQ. When I try to build with SunStudio, I cannot even get past the configure stage. It dies in the stage that checks for C++. ...

It looks like the problem is with SunStudio itself. Even a simple CC program fails to compile:

/usr/lib64/libm.so: file not recognized: File format not recognized
Re: [OMPI users] Problem building OpenMPI with SunStudio compilers
Raymond Muno wrote: Raymond Muno wrote: We are implementing a new cluster that is InfiniBand based. I am working on getting OpenMPI built for our various compile environments. So far it is working for PGI 7.2 and PathScale 3.1. I found some workarounds for issues with the Pathscale compilers (seg faults) in the OpenMPI FAQ. When I try to build with SunStudio, I cannot even get past the configure stage. It dies in the stage that checks for C++. It looks like the problem is with SunStudio itself. Even a simple CC program fails to compile. /usr/lib64/libm.so: file not recognized: File format not recognized

OK, I took care of the linker issue for C++ as recommended on Sun's support site (replace the Sun-supplied ld with /usr/bin/ld). Now I get farther along, but the build fails at (small excerpt):

mutex.c:(.text+0x30): multiple definition of `opal_atomic_cmpset_32'
asm/.libs/libasm.a(asm.o):asm.c:(.text+0x30): first defined here
threads/.libs/mutex.o: In function `opal_atomic_cmpset_64':
mutex.c:(.text+0x50): multiple definition of `opal_atomic_cmpset_64'
asm/.libs/libasm.a(asm.o):asm.c:(.text+0x50): first defined here
make[2]: *** [libopen-pal.la] Error 1
make[2]: Leaving directory `/home/muno/OpenMPI/SunStudio/openmpi-1.2.7/opal'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/home/muno/OpenMPI/SunStudio/openmpi-1.2.7/opal'
make: *** [all-recursive] Error 1

I based the configure on what was found in the FAQ here: http://www.open-mpi.org/faq/?category=building#build-sun-compilers Perhaps this is much more specific to our platform/OS. The environment is AMD Opteron, Barcelona running CentOS 5 (Rocks 5.03) with SunStudio 12 compilers. Does anyone have any insight as to how to successfully build OpenMPI for this OS/compiler selection? As I said in the first post, we have it built for Pathscale 3.1 and PGI 7.2. -Ray Muno University of Minnesota, Aerospace Engineering
[OMPI users] Building OpenMPI with Lustre support using PGI fails
I am trying to build OpenMPI with Lustre support using PGI 18.7 on CentOS 7.5 (1804). It builds successfully with Intel compilers, but fails to find the necessary Lustre components with the PGI compiler. I have tried building OpenMPI 4.0.0, 3.1.3 and 2.1.5. I can build OpenMPI, but configure does not find the proper Lustre files. Lustre is installed from current client RPMS, version 2.10.5. Include files are in /usr/include/lustre. When specifying --with-lustre, I get:

--- MCA component fs:lustre (m4 configuration macro)
checking for MCA component fs:lustre compile mode... dso
checking --with-lustre value... simple ok (unspecified value)
looking for header without includes
checking lustre/lustreapi.h usability... yes
checking lustre/lustreapi.h presence... yes
checking for lustre/lustreapi.h... yes
checking for library containing llapi_file_create... -llustreapi
checking if liblustreapi requires libnl v1 or v3...
checking for required lustre data structures... no
configure: error: Lustre support requested but not found. Aborting

-- Ray Muno IT Manager University of Minnesota Aerospace Engineering and Mechanics Mechanical Engineering 110 Union St. S.E. 111 Church Street SE Minneapolis, MN 55455 Minneapolis, MN 55455
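When that last test fails, configure leaves the details in config.log; pulling out the section around the failing probe usually shows the exact compile line and compiler diagnostics (the grep pattern simply matches the message printed above):

  grep -n -A 25 'required lustre data structures' config.log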
Re: [OMPI users] Building OpenMPI with Lustre support using PGI fails
I apologize. I did not realize that I did not reply to the list. Going with the view that this is a PGI problem, I noticed they recently released version 18.10. I had just installed 18.7 within the last couple of weeks. The problem is resolved in 18.10. -Ray Muno

On 11/27/18 7:55 PM, Gilles Gouaillardet wrote: Folks, sorry for the late follow-up. The config.log was indeed sent offline. Here is the relevant part:

configure:294375: checking for required lustre data structures
configure:294394: pgcc -O -DNDEBUG -Iyes/include -c conftest.c
PGC-S-0040-Illegal use of symbol, u_int64_t (/usr/include/sys/quota.h: 157)
PGC-W-0156-Type not specified, 'int' assumed (/usr/include/sys/quota.h: 157)
PGC-S-0040-Illegal use of symbol, u_int64_t (/usr/include/sys/quota.h: 158)
PGC-W-0156-Type not specified, 'int' assumed (/usr/include/sys/quota.h: 158)
PGC-S-0040-Illegal use of symbol, u_int64_t (/usr/include/sys/quota.h: 159)
PGC-W-0156-Type not specified, 'int' assumed (/usr/include/sys/quota.h: 159)
PGC-S-0040-Illegal use of symbol, u_int64_t (/usr/include/sys/quota.h: 160)
PGC-W-0156-Type not specified, 'int' assumed (/usr/include/sys/quota.h: 160)
PGC-S-0040-Illegal use of symbol, u_int64_t (/usr/include/sys/quota.h: 161)
PGC-W-0156-Type not specified, 'int' assumed (/usr/include/sys/quota.h: 161)
PGC-S-0040-Illegal use of symbol, u_int64_t (/usr/include/sys/quota.h: 162)
PGC-W-0156-Type not specified, 'int' assumed (/usr/include/sys/quota.h: 162)
PGC-S-0040-Illegal use of symbol, u_int64_t (/usr/include/sys/quota.h: 163)
PGC-W-0156-Type not specified, 'int' assumed (/usr/include/sys/quota.h: 163)
PGC-S-0040-Illegal use of symbol, u_int64_t (/usr/include/sys/quota.h: 164)
PGC-W-0156-Type not specified, 'int' assumed (/usr/include/sys/quota.h: 164)
PGC-S-0040-Illegal use of symbol, u_int64_t (/usr/include/sys/quota.h: 211)
PGC-W-0156-Type not specified, 'int' assumed (/usr/include/sys/quota.h: 211)
PGC-S-0040-Illegal use of symbol, u_int64_t (/usr/include/sys/quota.h: 212)
PGC-W-0156-Type not specified, 'int' assumed (/usr/include/sys/quota.h: 212)
PGC/x86-64 Linux 18.7-0: compilation completed with severe errors
configure:294401: $? = 2
configure:294415: result: no
configure:294424: error: Lustre support requested but not found. Aborting

Here is the conftest.c that triggers the error:

#include "lustre/lustreapi.h"
void alloc_lum()
{
  int v1, v3;
  v1 = sizeof(struct lov_user_md_v1) + LOV_MAX_STRIPE_COUNT * sizeof(struct lov_user_ost_data_v1);
  v3 = sizeof(struct lov_user_md_v3) + LOV_MAX_STRIPE_COUNT * sizeof(struct lov_user_ost_data_v1);
}

The same code was reported to work with the gcc compiler, so at this stage this looks like a PGI or an environment issue (sometimes the sysadmin has to re-run makelocalrc if some dependencies have changed), so I recommend this error is submitted to PGI support. I reviewed the code and filed a PR that gets rid of the "-Iyes/include" flag. Merged or not, that does not fix the real issue here. Cheers, Gilles

On 11/28/2018 6:04 AM, Gabriel, Edgar wrote: Gilles submitted a patch for that, and I approved it a couple of days back; I *think* it has not been merged, however. This was a bug in the Open MPI Lustre configure logic; it should be fixed after this one, however. https://github.com/open-mpi/ompi/pull/6080 Thanks Edgar

-Original Message- From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of Latham, Robert J. via users Sent: Tuesday, November 27, 2018 2:03 PM To: users@lists.open-mpi.org Cc: Latham, Robert J.; gi...@rist.or.jp Subject: Re: [OMPI users] Building OpenMPI with Lustre support using PGI fails

On Tue, 2018-11-13 at 21:57 -0600, gil...@rist.or.jp wrote: Raymond, can you please compress and post your config.log? I didn't see the config.log in response to this. Maybe Ray and Gilles took the discussion off list? As someone who might have introduced the offending configure-time checks, I'm particularly interested in fixing Lustre detection. ==rob Cheers, Gilles

- Original Message - I am trying to build OpenMPI with Lustre support using PGI 18.7 on CentOS 7.5 (1804). It builds successfully with Intel compilers, but fails to find the necessary Lustre components with the PGI compiler. I have tried building OpenMPI 4.0.0, 3.1.3 and 2.1.5. I can build OpenMPI, but configure does not find the proper Lustre files. Lustre is installed from current client RPMS, version 2.10.5. Include files are in /usr/include/lustre. When specifying --with-lustre, I get:

--- MCA component fs:lustre (m4 configuration macro)
checking for MCA component fs:lustre compile mode... dso
checking --with-lustre value... simple ok (unspecified value)
looking for header without includes
checking lustre/lustreapi.h usability... yes
checking lustre/lustreapi.h presence... yes
checking for lustre/lustreapi.h... yes
checking for library containing llapi_file_create... -llustreapi
checking if liblust
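Since the probe itself is shown above, it is straightforward to reproduce by hand and confirm whether the failure is specific to the PGI install (as Gilles suggests, possibly a stale localrc); the flags mirror the config.log line, minus the spurious -Iyes/include:

  # conftest.c = the alloc_lum() probe quoted above
  pgcc -O -DNDEBUG -c conftest.c   # reproduces the PGC-S-0040 u_int64_t errors
  gcc  -O -DNDEBUG -c conftest.c   # reportedly compiles cleanly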
[OMPI users] UCX errors after upgrade
We are primarily using OpenMPI 3.1.4 but also have 4.0.1 installed. On our cluster, we were running CentOS 7.5 with updates, alongside MLNX_OFED 4.5.x. OpenMPI was compiled with the GCC, Intel, PGI and AOCC compilers. We could run with no issues. To accommodate updates needed to get our IB gear all running at HDR100 (EDR50 previously), we upgraded to CentOS 7.6.1810 and the current MLNX_OFED 4.6.x. We can no longer reliably run on more than two nodes. We see errors like:

[epyc-compute-3-2.local:42447] pml_ucx.c:380 Error: ucp_ep_create(proc=276) failed: Destination is unreachable
[epyc-compute-3-2.local:42447] pml_ucx.c:447 Error: Failed to resolve UCX endpoint for rank 276
[epyc-compute-3-2:42447] *** An error occurred in MPI_Allreduce
[epyc-compute-3-2:42447] *** reported by process [47894553493505,47893180318004]
[epyc-compute-3-2:42447] *** on communicator MPI_COMM_WORLD
[epyc-compute-3-2:42447] *** MPI_ERR_OTHER: known error not in list
[epyc-compute-3-2:42447] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[epyc-compute-3-2:42447] *** and potentially your MPI job)
[epyc-compute-3-17.local:36637] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2079
[epyc-compute-3-17.local:37008] pml_ucx.c:380 Error: ucp_ep_create(proc=147) failed: Destination is unreachable
[epyc-compute-3-17.local:37008] pml_ucx.c:447 Error: Failed to resolve UCX endpoint for rank 147
[epyc-compute-3-7.local:39776] 1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[epyc-compute-3-7.local:39776] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

UCX appears to be part of the MLNX_OFED release, and is version 1.6.0. OpenMPI is built on the same OS and MLNX_OFED as we are running on the compute nodes. I have a case open with Mellanox, but it is not clear where this error is coming from.

-- Ray Muno IT Manager e-mail: m...@aem.umn.edu Phone: (612) 625-9531 University of Minnesota Aerospace Engineering and Mechanics Mechanical Engineering 110 Union St. S.E. 111 Church Street SE Minneapolis, MN 55455 Minneapolis, MN 55455
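A low-level check that is sometimes useful for "Destination is unreachable" is to confirm that UCX sees the same devices and transports on each node involved (ucx_info ships with UCX; which nodes to check depends on the failing job):

  ucx_info -v                              # confirm the UCX version (1.6.0 here)
  ucx_info -d | grep -i -e transport -e device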
Re: [OMPI users] UCX errors after upgrade
We are running against 4.0.2RC2 now. This is using the current Intel compilers, version 2019 Update 4. Still having issues.

[epyc-compute-1-3.local:17402] common_ucx.c:149 Warning: UCX is unable to handle VM_UNMAP event. This may cause performance degradation or data corruption.
[epyc-compute-1-3.local:17669] common_ucx.c:149 Warning: UCX is unable to handle VM_UNMAP event. This may cause performance degradation or data corruption.
[epyc-compute-1-3.local:17683] common_ucx.c:149 Warning: UCX is unable to handle VM_UNMAP event. This may cause performance degradation or data corruption.
[epyc-compute-1-3.local:16626] pml_ucx.c:385 Error: ucp_ep_create(proc=265) failed: Destination is unreachable
[epyc-compute-1-3.local:16626] pml_ucx.c:452 Error: Failed to resolve UCX endpoint for rank 265
[epyc-compute-1-3:16626] *** An error occurred in MPI_Allreduce
[epyc-compute-1-3:16626] *** reported by process [47001162088449,46999827120425]
[epyc-compute-1-3:16626] *** on communicator MPI_COMM_WORLD
[epyc-compute-1-3:16626] *** MPI_ERR_OTHER: known error not in list
[epyc-compute-1-3:16626] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[epyc-compute-1-3:16626] *** and potentially your MPI job)

On 9/25/19 1:28 PM, Jeff Squyres (jsquyres) via users wrote: Can you try the latest 4.0.2rc tarball? We're very, very close to releasing v4.0.2... I don't know if there's a specific UCX fix in there, but there are a ton of other good bug fixes in there since v4.0.1.

On Sep 25, 2019, at 2:12 PM, Raymond Muno via users <users@lists.open-mpi.org> wrote: We are primarily using OpenMPI 3.1.4 but also have 4.0.1 installed. On our cluster, we were running CentOS 7.5 with updates, alongside MLNX_OFED 4.5.x. OpenMPI was compiled with GCC, Intel, PGI and AOCC compilers. We could run with no issues. To accommodate updates needed to get our IB gear all running at HDR100 (EDR50 previously) we upgraded to CentOS 7.6.1810 and the current MLNX_OFED 4.6.x. We can no longer reliably run on more than two nodes. We see errors like:

[epyc-compute-3-2.local:42447] pml_ucx.c:380 Error: ucp_ep_create(proc=276) failed: Destination is unreachable
[epyc-compute-3-2.local:42447] pml_ucx.c:447 Error: Failed to resolve UCX endpoint for rank 276
[epyc-compute-3-2:42447] *** An error occurred in MPI_Allreduce
[epyc-compute-3-2:42447] *** reported by process [47894553493505,47893180318004]
[epyc-compute-3-2:42447] *** on communicator MPI_COMM_WORLD
[epyc-compute-3-2:42447] *** MPI_ERR_OTHER: known error not in list
[epyc-compute-3-2:42447] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[epyc-compute-3-2:42447] *** and potentially your MPI job)
[epyc-compute-3-17.local:36637] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2079
[epyc-compute-3-17.local:37008] pml_ucx.c:380 Error: ucp_ep_create(proc=147) failed: Destination is unreachable
[epyc-compute-3-17.local:37008] pml_ucx.c:447 Error: Failed to resolve UCX endpoint for rank 147
[epyc-compute-3-7.local:39776] 1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[epyc-compute-3-7.local:39776] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

UCX appears to be part of the MLNX_OFED release, and is version 1.6.0. OpenMPI is built on the same OS and MLNX_OFED as we are running on the compute nodes. I have a case open with Mellanox but it is not clear where this error is coming from.

-- Jeff Squyres jsquy...@cisco.com

-- Ray Muno IT Manager University of Minnesota Aerospace Engineering and Mechanics Mechanical Engineering
Re: [OMPI users] UCX errors after upgrade
As a test, I rebooted a set of nodes. The user could then run on 480 cores, on 5 nodes. We could not run beyond two nodes previous to that. We still get the VM_UNMAP warning, however.

On 9/25/19 2:09 PM, Raymond Muno via users wrote: We are running against 4.0.2RC2 now. This is using the current Intel compilers, version 2019 Update 4. Still having issues.

[epyc-compute-1-3.local:17402] common_ucx.c:149 Warning: UCX is unable to handle VM_UNMAP event. This may cause performance degradation or data corruption.
[epyc-compute-1-3.local:17669] common_ucx.c:149 Warning: UCX is unable to handle VM_UNMAP event. This may cause performance degradation or data corruption.
[epyc-compute-1-3.local:17683] common_ucx.c:149 Warning: UCX is unable to handle VM_UNMAP event. This may cause performance degradation or data corruption.
[epyc-compute-1-3.local:16626] pml_ucx.c:385 Error: ucp_ep_create(proc=265) failed: Destination is unreachable
[epyc-compute-1-3.local:16626] pml_ucx.c:452 Error: Failed to resolve UCX endpoint for rank 265
[epyc-compute-1-3:16626] *** An error occurred in MPI_Allreduce
[epyc-compute-1-3:16626] *** reported by process [47001162088449,46999827120425]
[epyc-compute-1-3:16626] *** on communicator MPI_COMM_WORLD
[epyc-compute-1-3:16626] *** MPI_ERR_OTHER: known error not in list
[epyc-compute-1-3:16626] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[epyc-compute-1-3:16626] *** and potentially your MPI job)

-- Ray Muno IT Manager University of Minnesota Aerospace Engineering and Mechanics Mechanical Engineering
Re: [OMPI users] UCX errors after upgrade
We are now using OpenMPI 4.0.2RC2 and RC3, compiled (with Intel, PGI and GCC) against MLNX_OFED 4.7 (released a couple of days ago), which supplies UCX 1.7. So far, it seems like things are working well. Any estimate on when OpenMPI 4.0.2 will be released?

On 9/25/19 2:27 PM, Jeff Squyres (jsquyres) wrote: Thanks Raymond; I have filed an issue for this on GitHub and tagged the relevant Mellanox people: https://github.com/open-mpi/ompi/issues/7009

On Sep 25, 2019, at 3:09 PM, Raymond Muno via users <users@lists.open-mpi.org> wrote: We are running against 4.0.2RC2 now. This is using the current Intel compilers, version 2019 Update 4. Still having issues.

[epyc-compute-1-3.local:17402] common_ucx.c:149 Warning: UCX is unable to handle VM_UNMAP event. This may cause performance degradation or data corruption.
[epyc-compute-1-3.local:17669] common_ucx.c:149 Warning: UCX is unable to handle VM_UNMAP event. This may cause performance degradation or data corruption.
[epyc-compute-1-3.local:17683] common_ucx.c:149 Warning: UCX is unable to handle VM_UNMAP event. This may cause performance degradation or data corruption.
[epyc-compute-1-3.local:16626] pml_ucx.c:385 Error: ucp_ep_create(proc=265) failed: Destination is unreachable
[epyc-compute-1-3.local:16626] pml_ucx.c:452 Error: Failed to resolve UCX endpoint for rank 265
[epyc-compute-1-3:16626] *** An error occurred in MPI_Allreduce
[epyc-compute-1-3:16626] *** reported by process [47001162088449,46999827120425]
[epyc-compute-1-3:16626] *** on communicator MPI_COMM_WORLD
[epyc-compute-1-3:16626] *** MPI_ERR_OTHER: known error not in list
[epyc-compute-1-3:16626] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[epyc-compute-1-3:16626] *** and potentially your MPI job)

On 9/25/19 1:28 PM, Jeff Squyres (jsquyres) via users wrote: Can you try the latest 4.0.2rc tarball? We're very, very close to releasing v4.0.2... I don't know if there's a specific UCX fix in there, but there are a ton of other good bug fixes in there since v4.0.1.

On Sep 25, 2019, at 2:12 PM, Raymond Muno via users <users@lists.open-mpi.org> wrote: We are primarily using OpenMPI 3.1.4 but also have 4.0.1 installed. On our cluster, we were running CentOS 7.5 with updates, alongside MLNX_OFED 4.5.x. OpenMPI was compiled with GCC, Intel, PGI and AOCC compilers. We could run with no issues. To accommodate updates needed to get our IB gear all running at HDR100 (EDR50 previously) we upgraded to CentOS 7.6.1810 and the current MLNX_OFED 4.6.x. We can no longer reliably run on more than two nodes. We see errors like:

[epyc-compute-3-2.local:42447] pml_ucx.c:380 Error: ucp_ep_create(proc=276) failed: Destination is unreachable
[epyc-compute-3-2.local:42447] pml_ucx.c:447 Error: Failed to resolve UCX endpoint for rank 276
[epyc-compute-3-2:42447] *** An error occurred in MPI_Allreduce
[epyc-compute-3-2:42447] *** reported by process [47894553493505,47893180318004]
[epyc-compute-3-2:42447] *** on communicator MPI_COMM_WORLD
[epyc-compute-3-2:42447] *** MPI_ERR_OTHER: known error not in list
[epyc-compute-3-2:42447] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[epyc-compute-3-2:42447] *** and potentially your MPI job)
[epyc-compute-3-17.local:36637] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2079
[epyc-compute-3-17.local:37008] pml_ucx.c:380 Error: ucp_ep_create(proc=147) failed: Destination is unreachable
[epyc-compute-3-17.local:37008] pml_ucx.c:447 Error: Failed to resolve UCX endpoint for rank 147
[epyc-compute-3-7.local:39776] 1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[epyc-compute-3-7.local:39776] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

UCX appears to be part of the MLNX_OFED release, and is version 1.6.0. OpenMPI is built on the same OS and MLNX_OFED as we are running on the compute nodes. I have a case open with Mellanox but it is not clear where this error is coming from.

-- Jeff Squyres jsquy...@cisco.com

-- Ray Muno IT Manager University of Minnesota Aerospace Engineering and Mechanics Mechanical Engineering
[OMPI users] Parameters at run time
Is there a way to determine, at run time, what choices OpenMPI made in terms of the transports being utilized? We want to verify we are running UCX over InfiniBand. I have two users, executing identical code with the same mpirun options, getting vastly different execution times on the same cluster. -- Ray Muno IT Manager University of Minnesota Aerospace Engineering and Mechanics Mechanical Engineering
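One common way to see what was selected at run time is to turn up the PML framework's verbosity, or to force the UCX PML so the job fails instead of silently falling back (program name and process count below are placeholders):

  # report which PML (ucx, ob1, cm, ...) the processes selected
  mpirun --mca pml_base_verbose 10 -np 2 ./a.out 2>&1 | grep -i pml

  # require UCX; the run aborts if the UCX PML cannot be used
  mpirun --mca pml ucx -np 2 ./a.out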
Re: [OMPI users] [External] Re: AMD EPYC 7281: does NOT, support binding memory to the process location
We are running EPYC 7451 and 7702 nodes. I do not recall that CentOS 6 was able to support these. We moved to CentOS 7.6 at first and are now running 7.7 to support the EPYC2/Rome nodes. The kernel in earlier releases did not support x2APIC and could not handle 256 threads. Not an issue on EPYC/Naples, but it was an issue on dual 64-core EPYC2. Red Hat lists 7.4 as the minimum for EPYC (Naples) support and 7.6.6 for EPYC2 (Rome). -Ray Muno

On 1/8/20 2:51 PM, Prentice Bisbal via users wrote: On 1/8/20 3:30 PM, Brice Goglin via users wrote: On 08/01/2020 at 21:20, Prentice Bisbal via users wrote: We just added about a dozen nodes to our cluster, which have AMD EPYC 7281 processors. When a particular user's jobs fall on one of these nodes, he gets these error messages:

--
WARNING: a request was made to bind a process. While the system supports binding the process itself, at least one node does NOT support binding memory to the process location.
Node: dawson205
--

I wonder if the CentOS 6 kernel properly supports these recent processors. Does lstopo show NUMA nodes as expected? Brice

lstopo shows different NUMA nodes, and it appears to be correct, but I don't use lstopo that much, so I'm not 100% confident that what it's showing is correct. I'm at about 98%. Prentice

-- Ray Muno IT Manager University of Minnesota Aerospace Engineering and Mechanics Mechanical Engineering
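Two quick ways to check how the OS sees the NUMA layout on one of those nodes (the expected node count depends on BIOS memory-interleave settings, so treat any numbers as illustrative):

  lstopo --of console | grep -i numanode
  numactl --hardware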
Re: [OMPI users] [External] Re: AMD EPYC 7281: does NOT, support binding memory to the process location
AMD lists the minimum supported kernel for EPYC/Naples as RHEL/CentOS kernel 3.10-862, which is RHEL/CentOS 7.5 or later. Upgraded kernels can be used in 7.4. http://developer.amd.com/wp-content/resources/56420.pdf -Ray Muno

On 1/8/20 7:37 PM, Raymond Muno wrote: We are running EPYC 7451 and 7702 nodes. I do not recall that CentOS 6 was able to support these. We moved to CentOS 7.6 at first and are now running 7.7 to support the EPYC2/Rome nodes. The kernel in earlier releases did not support x2APIC and could not handle 256 threads. Not an issue on EPYC/Naples, but it was an issue on dual 64-core EPYC2. Red Hat lists 7.4 as the minimum for EPYC (Naples) support and 7.6.6 for EPYC2 (Rome). -Ray Muno

On 1/8/20 2:51 PM, Prentice Bisbal via users wrote: On 1/8/20 3:30 PM, Brice Goglin via users wrote: On 08/01/2020 at 21:20, Prentice Bisbal via users wrote: We just added about a dozen nodes to our cluster, which have AMD EPYC 7281 processors. When a particular user's jobs fall on one of these nodes, he gets these error messages:

--
WARNING: a request was made to bind a process. While the system supports binding the process itself, at least one node does NOT support binding memory to the process location.
Node: dawson205
--

I wonder if the CentOS 6 kernel properly supports these recent processors. Does lstopo show NUMA nodes as expected? Brice

lstopo shows different NUMA nodes, and it appears to be correct, but I don't use lstopo that much, so I'm not 100% confident that what it's showing is correct. I'm at about 98%. Prentice

-- Ray Muno IT Manager University of Minnesota Aerospace Engineering and Mechanics Mechanical Engineering
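A minimal check against that requirement on one of the nodes in question (the RHEL 7.5 kernel series referenced above is 3.10.0-862):

  uname -r
  cat /etc/redhat-release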
[OMPI users] OpenMPI 4.0.2 with PGI 19.10, will not build with hcoll
I am having issues building OpenMPI 4.0.2 using the PGI 19.10 compilers. The OS is CentOS 7.7, with MLNX_OFED 4.7.3. It dies at:

PGC/x86-64 Linux 19.10-0: compilation completed with warnings
  CCLD     mca_coll_hcoll.la
pgcc-Error-Unknown switch: -pthread
make[2]: *** [mca_coll_hcoll.la] Error 1
make[2]: Leaving directory `/project/muno/OpenMPI/PGI/openmpi-4.0.2/ompi/mca/coll/hcoll'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/project/muno/OpenMPI/PGI/openmpi-4.0.2/ompi'
make: *** [all-recursive] Error 1

I tried with PGI 19.9 and had the same issue. If I do not include hcoll, it builds. I have successfully built OpenMPI 4.0.2 with the GCC, Intel and AOCC compilers, all using the same options. hcoll is provided by MLNX_OFED 4.7.3, and configure is run with --with-hcoll=/opt/mellanox/hcoll

-- Ray Muno IT Manager e-mail: m...@aem.umn.edu University of Minnesota Aerospace Engineering and Mechanics Mechanical Engineering
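The failing step is the libtool link of the hcoll component, where -pthread is handed to pgcc. One workaround that has been reported for this class of error (an assumption here, not something confirmed in this thread) is to have pgcc downgrade unknown switches to warnings:

  ./configure CC="pgcc -noswitcherror" CXX=pgc++ FC=pgfortran \
      --with-hcoll=/opt/mellanox/hcoll ...

Whether a compiler name with an embedded flag survives all of libtool's relinking steps would need to be verified; the alternative is a PGI siterc rule that maps -pthread to something pgcc accepts.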
Re: [OMPI users] OpenMPI 4.1.1, CentOS 7.9, nVidia HPC-SDK, build hints?
Added --enable-mca-no-build=op-avx to the configure line. Still dies in the same place.

  CCLD     mca_op_avx.la
./.libs/liblocal_ops_avx512.a(liblocal_ops_avx512_la-op_avx_functions.o):(.data+0x0): multiple definition of `ompi_op_avx_functions_avx2'
./.libs/liblocal_ops_avx2.a(liblocal_ops_avx2_la-op_avx_functions.o):(.data+0x0): first defined here
./.libs/liblocal_ops_avx512.a(liblocal_ops_avx512_la-op_avx_functions.o): In function `ompi_op_avx_2buff_min_uint16_t_avx2':
/project/muno/OpenMPI/BUILD/SRC/openmpi-4.1.1/ompi/mca/op/avx/op_avx_functions.c:651: multiple definition of `ompi_op_avx_3buff_functions_avx2'
./.libs/liblocal_ops_avx2.a(liblocal_ops_avx2_la-op_avx_functions.o):/project/muno/OpenMPI/BUILD/SRC/openmpi-4.1.1/ompi/mca/op/avx/op_avx_functions.c:651: first defined here
make[2]: *** [mca_op_avx.la] Error 2
make[2]: Leaving directory `/project/muno/OpenMPI/BUILD/4.1.1/ROME/NV-HPC/21.9/ompi/mca/op/avx'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/project/muno/OpenMPI/BUILD/4.1.1/ROME/NV-HPC/21.9/ompi'
make: *** [all-recursive] Error 1

On 9/30/21 5:54 AM, Carl Ponder wrote: For now, you can suppress this error building OpenMPI 4.1.1

./.libs/liblocal_ops_avx512.a(liblocal_ops_avx512_la-op_avx_functions.o):(.data+0x0): multiple definition of `ompi_op_avx_functions_avx2'
./.libs/liblocal_ops_avx2.a(liblocal_ops_avx2_la-op_avx_functions.o):(.data+0x0): first defined here
./.libs/liblocal_ops_avx512.a(liblocal_ops_avx512_la-op_avx_functions.o): In function `ompi_op_avx_2buff_min_uint16_t_avx2':
/project/muno/OpenMPI/BUILD/SRC/openmpi-4.1.1/ompi/mca/op/avx/op_avx_functions.c:651: multiple definition of `ompi_op_avx_3buff_functions_avx2'
./.libs/liblocal_ops_avx2.a(liblocal_ops_avx2_la-op_avx_functions.o):/project/muno/OpenMPI/BUILD/SRC/openmpi-4.1.1/ompi/mca/op/avx/op_avx_functions.c:651: first defined here

with the NVHPC/PGI 21.9 compiler by using the setting

configure --enable-mca-no-build=op-avx ...

We're still looking at the cause here. I don't have any advice about the problem with 21.7.

Subject: Re: [OMPI users] OpenMPI 4.1.1, CentOS 7.9, nVidia HPC-SDK, build hints?
Date: Wed, 29 Sep 2021 12:25:43 -0500
From: Ray Muno via users
Reply-To: Open MPI Users
To: users@lists.open-mpi.org
CC: Ray Muno

External email: Use caution opening links or attachments

Tried this configure: CC='nvc -fPIC' CXX='nvc++ -fPIC' FC='nvfortran -fPIC'

Configure completes. It compiles quite a way through, but dies in a different place. It does get past the first error, however, with libmpi_usempif08.la:

  FCLD     libmpi_usempif08.la
make[2]: Leaving directory `/project/muno/OpenMPI/BUILD/4.1.1/ROME/NV-HPC/21.9/ompi/mpi/fortran/use-mpi-f08'
Making all in mpi/fortran/mpiext-use-mpi-f08
make[2]: Entering directory `/project/muno/OpenMPI/BUILD/4.1.1/ROME/NV-HPC/21.9/ompi/mpi/fortran/mpiext-use-mpi-f08'
  PPFC     mpi-f08-ext-module.lo
  FCLD     libforce_usempif08_module_to_be_built.la
make[2]: Leaving directory `/project/muno/OpenMPI/BUILD/4.1.1/ROME/NV-HPC/21.9/ompi/mpi/fortran/mpiext-use-mpi-f08'

Dies here now.

  CCLD     liblocal_ops_avx512.la
  CCLD     mca_op_avx.la
./.libs/liblocal_ops_avx512.a(liblocal_ops_avx512_la-op_avx_functions.o):(.data+0x0): multiple definition of `ompi_op_avx_functions_avx2'
./.libs/liblocal_ops_avx2.a(liblocal_ops_avx2_la-op_avx_functions.o):(.data+0x0): first defined here
./.libs/liblocal_ops_avx512.a(liblocal_ops_avx512_la-op_avx_functions.o): In function `ompi_op_avx_2buff_min_uint16_t_avx2':
/project/muno/OpenMPI/BUILD/SRC/openmpi-4.1.1/ompi/mca/op/avx/op_avx_functions.c:651: multiple definition of `ompi_op_avx_3buff_functions_avx2'
./.libs/liblocal_ops_avx2.a(liblocal_ops_avx2_la-op_avx_functions.o):/project/muno/OpenMPI/BUILD/SRC/openmpi-4.1.1/ompi/mca/op/avx/op_avx_functions.c:651: first defined here
make[2]: *** [mca_op_avx.la] Error 2
make[2]: Leaving directory `/project/muno/OpenMPI/BUILD/4.1.1/ROME/NV-HPC/21.9/ompi/mca/op/avx'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/project/muno/OpenMPI/BUILD/4.1.1/ROME/NV-HPC/21.9/ompi'
make: *** [all-recursive] Error 1

On 9/29/21 11:42 AM, Bennet Fauber via users wrote: Ray, If all the errors about not being compiled with -fPIC are still appearing, there may be a bug that is preventing the option from getting through to the compiler(s). It might be worth looking through the logs to see the full compile command for one or more of them to see whether that is true? Say, libs/comm_spawn_multiple_f08.o for example? If -fPIC is missing, you may be able to recompile that manually with the -fPIC in place, then remake and see if that also causes the link error to go away; that would be a good start. Hope this helps, -- bennet

On Wed, Sep 29, 2021 at 12:29 PM Ray Muno via users <users@lists.open-mpi.org> wrote: I did try that an
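Since --enable-mca-no-build=op-avx only takes effect at configure time, one possibility (an assumption, not something established in this thread) is that objects from the earlier configuration are still present in the build tree. Re-running the configuration from a clean tree would rule that out:

  make distclean          # or start from a freshly extracted/empty build directory
  ./configure CC=nvc CXX=nvc++ FC=nvfortran --enable-mca-no-build=op-avx ...
  make -j 8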