[OMPI users] Specifying second Ethernet port
Hello, I'm performing some tests with OMPI v4. The initial configuration used one Ethernet port (10 Gbps), but I have added a second one with the same characteristics. The documentation mentions that the OMPI installation will try to use as much network capacity as is available. However, my tests show no performance gain after adding the second port. I was wondering if there is any way to tell mpirun to use both ports. I was thinking of using the parameter 'btl_tcp_if_include', but I'm unsure whether it can take two inputs. Could I use something like mpirun --mca btl_tcp_if_include eno1,eno2d1 -np 128 ...? If not, any recommendation on how to proceed? Thank you, Arturo
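A minimal sketch of what such an invocation could look like, assuming the interface names eno1 and eno2d1 from the question and a placeholder application name (btl_tcp_if_include does accept a comma-delimited list of interface names or networks):

# restrict the TCP BTL to the two 10 Gbps ports; ./my_app and the rank count are placeholders
mpirun --mca btl tcp,self,vader \
       --mca btl_tcp_if_include eno1,eno2d1 \
       -np 128 ./my_app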
[OMPI users] CUDA-aware codes not using GPU
Hello OpenMPI Team,
I'm trying to use CUDA-aware OpenMPI, but the system simply ignores the GPU and the code runs on the CPUs. I've tried different software but will focus on the OSU benchmarks (collective and pt2pt communications). Let me provide some data about the configuration of the system:
-OFED v4.17-1-rc2 (the NIC is virtualized, but I also tried a Mellanox card with MOFED a few days ago and found the same issue)
-CUDA v10.1
-gdrcopy v1.3
-UCX 1.6.0
-OpenMPI 4.0.1
Everything looks good (CUDA programs work fine, MPI programs run on the CPUs without any problem), and ompi_info outputs what I was expecting (but maybe I'm missing something):
mca:opal:base:param:opal_built_with_cuda_support:synonym:name:mpi_built_with_cuda_support
mca:mpi:base:param:mpi_built_with_cuda_support:value:true
mca:mpi:base:param:mpi_built_with_cuda_support:source:default
mca:mpi:base:param:mpi_built_with_cuda_support:status:read-only
mca:mpi:base:param:mpi_built_with_cuda_support:level:4
mca:mpi:base:param:mpi_built_with_cuda_support:help:Whether CUDA GPU buffer support is built into library or not
mca:mpi:base:param:mpi_built_with_cuda_support:enumerator:value:0:false
mca:mpi:base:param:mpi_built_with_cuda_support:enumerator:value:1:true
mca:mpi:base:param:mpi_built_with_cuda_support:deprecated:no
mca:mpi:base:param:mpi_built_with_cuda_support:type:bool
mca:mpi:base:param:mpi_built_with_cuda_support:synonym_of:name:opal_built_with_cuda_support
mca:mpi:base:param:mpi_built_with_cuda_support:disabled:false
The available BTLs are the usual self, openib, tcp & vader, plus smcuda, uct & usnic. The full output from ompi_info is attached. If I try the flag '--mca opal_cuda_verbose 10', it doesn't output anything, which seems to agree with the lack of GPU use. If I try '--mca btl smcuda', it makes no difference. I have also tried specifying that the program use host and device buffers (e.g. mpirun -np 2 ./osu_latency D H), but with the same result. I am probably missing something, but I'm not sure where else to look or what else to try.
Thank you,
AFernandez

$ ompi_info -param all all
MCA allocator: basic (MCA v2.1.0, API v2.0.0, Component v4.0.1)
MCA allocator: bucket (MCA v2.1.0, API v2.0.0, Component v4.0.1)
MCA backtrace: execinfo (MCA v2.1.0, API v2.0.0, Component v4.0.1)
MCA btl: self (MCA v2.1.0, API v3.1.0, Component v4.0.1)
MCA btl: openib (MCA v2.1.0, API v3.1.0, Component v4.0.1)
MCA btl: smcuda (MCA v2.1.0, API v3.1.0, Component v4.0.1)
MCA btl: tcp (MCA v2.1.0, API v3.1.0, Component v4.0.1)
MCA btl: uct (MCA v2.1.0, API v3.1.0, Component v4.0.1)
MCA btl: usnic (MCA v2.1.0, API v3.1.0, Component v4.0.1)
MCA btl: vader (MCA v2.1.0, API v3.1.0, Component v4.0.1)
MCA compress: bzip (MCA v2.1.0, API v2.0.0, Component v4.0.1)
MCA compress: gzip (MCA v2.1.0, API v2.0.0, Component v4.0.1)
MCA crs: none (MCA v2.1.0, API v2.0.0, Component v4.0.1)
MCA dl: dlopen (MCA v2.1.0, API v1.0.0, Component v4.0.1)
MCA event: libevent2022 (MCA v2.1.0, API v2.0.0, Component v4.0.1)
MCA hwloc: hwloc201 (MCA v2.1.0, API v2.0.0, Component v4.0.1)
MCA if: linux_ipv6 (MCA v2.1.0, API v2.0.0, Component v4.0.1)
MCA if: posix_ipv4 (MCA v2.1.0, API v2.0.0, Component v4.0.1)
MCA installdirs: env (MCA v2.1.0, API v2.0.0, Component v4.0.1)
MCA installdirs: config (MCA v2.1.0, API v2.0.0, Component v4.0.1)
MCA memory: patcher (MCA v2.1.0, API v2.0.0, Component v4.0.1)
MCA mpool: hugepage (MCA v2.1.0, API v3.0.0, Component v4.0.1)
MCA patcher: overwrite (MCA v2.1.0, API v1.0.0, Component v4.0.1)
MCA pmix: isolated (MCA v2.1.0, API v2.0.0, Component v4.0.1)
MCA pmix: flux (MCA v2.1.0, API v2.0.0, Component v4.0.1)
MCA pmix: pmix3x (MCA v2.1.0, API v2.0.0, Component v4.0.1)
MCA pstat: linux (MCA v2.1.0, API v2.0.0, Component v4.0.1)
MCA rcache: grdma (MCA v2.1.0, API v3.3.0, Component v4.0.1)
MCA rcache: gpusm (MCA v2.1.0, API v3.3.0, Component v4.0.1)
MCA rcache: rgpusm (MCA v2.1.0, API v3.3.0, Component v4.0.1)
MCA reachable: weighted (MCA v2.1.0, API v2.0.0, Component v4.0.1)
MCA reachable: netlink (MCA v2.1.0, API v2.0.0, Component v4.0.1)
MCA shmem: mmap (MCA v2.1.0, API v2.0.0, Component v4.0.1)
MCA shmem: posix (MCA v2.1.0, API v2.0.0, Component v4.0.1)
MCA shmem: sysv (MCA v2.1.0, API v2.0.0, Component v4.0.1)
MCA timer: linux (MCA v2.1.0, API v2.0.0, Component v4.0.1)
MCA dfs: app (MCA v2.1.0, A
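A compact way to separate the build-time flag shown above from the run-time behavior, as a sketch (the benchmark path is a placeholder):

# build-time: query the parameter directly instead of scanning the full dump
ompi_info --parsable --all | grep mpi_built_with_cuda_support:value

# run-time: request verbose output from the CUDA glue while moving device buffers
mpirun -np 2 --mca opal_cuda_verbose 10 ./osu_latency D D

Note that when the ucx pml handles the transfer, the CUDA handling may happen inside UCX itself, so the opal-level verbose output can stay quiet even when the GPU path is working.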
Re: [OMPI users] CUDA-aware codes not using GPU
Hi Akshay,
I'm building both UCX and OpenMPI as you mention. The relevant portions of the build script read, for UCX:
./configure --prefix=/usr/local/ucx-cuda-install --with-cuda=/usr/local/cuda-10.1 --with-gdrcopy=/home/odyhpc/gdrcopy --disable-numa
sudo make install
and for OpenMPI:
./configure --with-cuda=/usr/local/cuda-10.1 --with-cuda-libdir=/usr/local/cuda-10.1/lib64 --with-ucx=/usr/local/ucx-cuda-install --prefix=/opt/openmpi
sudo make all install
As for job submission, I have tried several combinations of MCA flags (yesterday I forgot to include the '--mca pml ucx' flag, as it had made no difference in the past). I just tried your suggested syntax (mpirun -np 2 --mca pml ucx --mca btl ^smcuda,openib ./osu_latency D H) with the same results. The latency times are of the same order no matter which flags I include. As for checking GPU usage, I'm not familiar with 'nvprof' and am simply using the basic continuous report (nvidia-smi -l). I'm trying all of this in a cloud environment, and my suspicion is that there might be some interference (maybe because of some virtualization component), but I cannot pinpoint the cause.
Thanks,
Arturo

From: Akshay Venkatesh
Sent: Friday, September 06, 2019 11:14 AM
To: Open MPI Users
Cc: Joshua Ladd; Arturo Fernandez
Subject: Re: [OMPI users] CUDA-aware codes not using GPU

Hi, Arturo. Usually, for OpenMPI+UCX we use the following recipe. For UCX:
./configure --prefix=/path/to/ucx-cuda-install --with-cuda=/usr/local/cuda --with-gdrcopy=/usr
make -j install
then OpenMPI:
./configure --with-cuda=/usr/local/cuda --with-ucx=/path/to/ucx-cuda-install
make -j install
Can you run with the following to see if it helps:
mpirun -np 2 --mca pml ucx --mca btl ^smcuda,openib ./osu_latency D H
There are details here that may be useful: https://www.open-mpi.org/faq/?category=runcuda#run-ompi-cuda-ucx
Also, note that for short messages the D->H path for inter-node transfers may not involve calls to the CUDA API (if you're using nvprof to detect CUDA activity) because the GPUDirect RDMA path and gdrcopy are used.

On Fri, Sep 6, 2019 at 7:36 AM Arturo Fernandez via users <users@lists.open-mpi.org> wrote:
Josh,
Thank you. Yes, I built UCX with CUDA and gdrcopy support. I also had to disable NUMA (--disable-numa) as requested during the installation.
AFernandez

Joshua Ladd wrote:
Did you build UCX with CUDA support (--with-cuda)?
Josh
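On the UCX side, one further check could help, as a sketch (the install prefix matches the build above; the UCX_TLS transport list is only an example and can be left unset):

# confirm the CUDA transports were compiled into this UCX install
/usr/local/ucx-cuda-install/bin/ucx_info -d | grep -i cuda

# run the benchmark while explicitly allowing the CUDA transports
mpirun -np 2 --mca pml ucx -x UCX_TLS=sm,tcp,cuda_copy,cuda_ipc,gdr_copy ./osu_latency D D

If ucx_info shows no cuda_copy/cuda_ipc/gdr_copy entries, the UCX build itself did not pick up CUDA, which would explain the benchmarks never touching the GPU.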
Re: [OMPI users] Cannot locate PMIx
Please disregard my previous question, as the PMIx error was triggered by something else (I'm not sure why ompi_info wasn't outputting any PMIx components, but now it does).

On Nov 23, 2021, at 6:01 PM, Arturo Fernandez via users wrote:
>Hello,
>This is kind of an odd issue, as it had not happened earlier in many builds.
>The configuration (./configure --with-ofi=PATH_TO_LIBFABRIC, with libfabric installed from https://github.com/ofiwg/libfabric) for v4.1.1 returns:
>...
>Miscellaneous
>---
>CUDA support: no
>HWLOC support: internal
>Libevent support: internal
>PMIx support: Internal
>...
>So it was a surprise to get the error 'PMIX ERROR: UNREACHABLE in file server/pmix-server.c' for one of the apps being tested (the others were working fine). I checked ompi_info and there's no trace of PMIx, which was another surprise because similar configurations used to have isolated, flux and pmix3x as MCA pmix components.
>My question is twofold: Will OpenMPI build without PMIx support even if the configuration says the opposite? If so, could the libfabric components be causing this behavior?
>Thanks,
>Arturo
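For what it's worth, two quick checks consistent with the situation described above (the build-tree path is a placeholder):

# list the pmix components this Open MPI install actually contains
ompi_info | grep -i pmix

# see what configure recorded about PMIx in the build tree
grep -i pmix /path/to/openmpi-4.1.1/config.log | head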
[OMPI users] Seg error when using v5.0.1
Hello,
I upgraded one of the systems to v5.0.1 and have compiled everything exactly as dozens of previous times with v4. I wasn't expecting any issue (and the compilations didn't report anything out of the ordinary), but running several apps has resulted in error messages such as:
Backtrace for this error:
#0 0x7f7c9571f51f in ???
 at ./signal/../sysdeps/unix/sysv/linux/x86_64/libc_sigaction.c:0
#1 0x7f7c957823fe in __GI___libc_free
 at ./malloc/malloc.c:3368
#2 0x7f7c93a635c3 in ???
#3 0x7f7c95f84048 in ???
#4 0x7f7c95f1cef1 in ???
#5 0x7f7c95e34b7b in ???
#6 0x6e05be in ???
#7 0x6e58d7 in ???
#8 0x405d2c in ???
#9 0x7f7c95706d8f in __libc_start_call_main
 at ../sysdeps/nptl/libc_start_call_main.h:58
#10 0x7f7c95706e3f in __libc_start_main_impl
 at ../csu/libc-start.c:392
#11 0x405d64 in ???
#12 0x in ???
OS is Ubuntu 22.04, OpenMPI was built with GCC 13.2, and before building OpenMPI, I had previously built the hwloc (2.10.0) library at /usr/lib/x86_64-linux-gnu. Maybe I'm missing something pretty basic, but the problem seems to be related to memory allocation. Thanks.
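As an aside, the unresolved '???' frames in a backtrace like this can sometimes be mapped back to source lines with addr2line, a sketch assuming the failing binary is ./app (a placeholder) and was linked as a non-PIE executable, which the low addresses such as 0x405d2c suggest:

# resolve the static-address frames against the executable's debug info
addr2line -f -e ./app 0x6e05be 0x6e58d7 0x405d2c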
Re: [OMPI users] Seg error when using v5.0.1
Hi Joseph,
It's happening with several apps, including WRF. I was trying to find a quick answer or fix, but it seems that I'll have to recompile it in debug mode. I will report back with the extra info. Thanks.

Joseph Schuchart via users wrote:
Hello,
This looks like memory corruption. Do you have more details on what your app is doing? I don't see any MPI calls inside the call stack. Could you rebuild Open MPI with debug information enabled (by adding `--enable-debug` to configure)? If this error occurs on singleton runs (1 process) then you can easily attach gdb to it to get a better stack trace. Also, valgrind may help pin down the problem by telling you which memory block is being free'd here.
Thanks
Joseph

On 1/30/24 07:41, afernandez via users wrote:
Hello, I upgraded one of the systems to v5.0.1 and have compiled everything exactly as dozens of previous times with v4. [...]
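A sketch of the steps Joseph suggests, with placeholder paths and binary name (wrf.exe):

# rebuild Open MPI with debugging enabled
./configure --prefix=/opt/openmpi-debug --enable-debug
make -j
sudo make install

# singleton run under gdb for a readable stack trace
gdb --args ./wrf.exe

# or let valgrind point at the memory block being freed
mpirun -np 1 valgrind --track-origins=yes ./wrf.exe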
Re: [OMPI users] Seg error when using v5.0.1
Hello Joseph,
Sorry for the delay, but I didn't know if I was missing something yesterday evening and wanted to double-check everything this morning. This is for WRF, but other apps exhibit the same behavior.
* I had no problem with the serial version (and gdb obviously didn't report any issue).
* I tried compiling with the --enable-debug flag, but it was generating errors during the compilation and never completed.
* I went back to my standard flags for debugging: -g -fbacktrace -ggdb -fcheck=bounds,do,mem,pointer -ffpe-trap=invalid,zero,overflow. WRF is still crashing with little extra info vs. yesterday:
Backtrace for this error:
#0 0x7f5a4e54451f in ???
 at ./signal/../sysdeps/unix/sysv/linux/x86_64/libc_sigaction.c:0
#1 0x7f5a4e5a73fe in __GI___libc_free
 at ./malloc/malloc.c:3368
#2 0x7f5a4c7aa5c3 in ???
#3 0x7f5a4e83b048 in ???
#4 0x7f5a4e7d3ef1 in ???
#5 0x7f5a4e8dab7b in ???
#6 0x8f6bbf in __module_dm_MOD_split_communicator
 at /home/ubuntu/WRF-4.5.2/frame/module_dm.f90:5734
#7 0x1879ebd in init_modules_
 at /home/ubuntu/WRF-4.5.2/share/init_modules.f90:63
#8 0x406fe4 in __module_wrf_top_MOD_wrf_init
 at ../main/module_wrf_top.f90:130
#9 0x405ff3 in wrf
 at /home/ubuntu/WRF-4.5.2/main/wrf.f90:22
#10 0x40605c in main
 at /home/ubuntu/WRF-4.5.2/main/wrf.f90:6
--
Primary job terminated normally, but 1 process returned a non-zero exit code. Per user-direction, the job has been aborted.
--
--
mpirun noticed that process rank 0 with PID 0 on node ip-172-31-31-163 exited on signal 11 (Segmentation fault).
--
Any pointers on what might be going on here, as this never happened with OMPI v4? Thanks.

Joseph Schuchart via users wrote:
Hello,
This looks like memory corruption. Do you have more details on what your app is doing? I don't see any MPI calls inside the call stack. [...]
Re: [OMPI users] Seg error when using v5.0.1
Hi Gilles,
I created the ticket (#12296). The crash happened with either 1 or 2 MPI ranks (I have not tried with more, but I doubt that it would make any difference).
Thanks,
Arturo

Gilles Gouaillardet via users wrote:
Hi,
Please open an issue on GitHub at https://github.com/open-mpi/ompi/issues and provide the requested information. If the compilation failed when configured with --enable-debug, please share the logs.
The name of the WRF subroutine suggests the crash might occur in MPI_Comm_split(); if so, are you able to craft a reproducer that causes the crash? How many nodes and MPI tasks are needed in order to evidence the crash?
Cheers,
Gilles

On Wed, Jan 31, 2024 at 10:09 PM afernandez via users <users@lists.open-mpi.org> wrote:
Hello Joseph,
Sorry for the delay, but I didn't know if I was missing something yesterday evening and wanted to double-check everything this morning. This is for WRF, but other apps exhibit the same behavior. [...]
Re: [OMPI users] Seg error when using v5.0.1
Hello,
I'm sorry, as I totally messed up here. It turns out that the problem was caused by a previous installation of OpenMPI (v4.1.6): the codes compiled against v5 were being run with the mpirun from v4. I always set up the systems so that the OS picks up the latest MPI version, but that apparently didn't take effect this time, leading me to the wrong conclusion. I should have realized this earlier and not wasted everyone's time. My apologies.
Arturo

Gilles Gouaillardet via users wrote:
Hi,
Please open an issue on GitHub at https://github.com/open-mpi/ompi/issues and provide the requested information. If the compilation failed when configured with --enable-debug, please share the logs. [...]
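For reference, a few quick checks that would have exposed this kind of launcher/library mismatch (the binary name is a placeholder):

# which launcher is first in PATH, and which version it is
which mpirun
mpirun --version

# which Open MPI library the application was actually linked against
ldd ./wrf.exe | grep -i mpi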