[OMPI users] Specifying second Ethernet port

2019-05-17 Thread AFernandez via users
Hello,

I'm performing some tests with OMPI v4. The initial configuration used one
Ethernet port (10 Gbps), but I have added a second one with the same
characteristics. The documentation mentions that the OMPI installation will
try to use as much network capacity as is available. However, my tests show
no performance gain when adding the second port. I was wondering if there is
any way to tell the wrapper to use both ports. I was thinking of using the
MCA parameter 'btl_tcp_if_include' but I'm unsure whether it can take two
interfaces. Could I use something like

mpirun --mca btl_tcp_if_include eno1,eno2d1 -np 128 ...?
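
From the FAQ, my understanding is that btl_tcp_if_include accepts a
comma-delimited list of interface names, so presumably the full command would
look something like the following (./my_app is just a placeholder here):

mpirun --mca btl tcp,self,vader --mca btl_tcp_if_include eno1,eno2d1 -np 128 ./my_app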

If not, any recommendation on how to proceed?

Thank you,

Arturo


[OMPI users] CUDA-aware codes not using GPU

2019-09-05 Thread AFernandez via users
Hello OpenMPI Team,

I'm trying to use CUDA-aware OpenMPI but the system simply ignores the GPU
and the code runs on the CPUs. I've tried different software but will focus
on the OSU benchmarks (collective and pt2pt communications). Let me provide
some data about the configuration of the system:

-OFED v4.17-1-rc2 (the NIC is virtualized but I also tried a Mellanox card
with MOFED a few days ago and found the same issue)

-CUDA v10.1

-gdrcopy v1.3

-UCX 1.6.0

-OpenMPI 4.0.1

Everything looks good (CUDA programs work fine, MPI programs run on the
CPUs without any problem), and ompi_info outputs what I was expecting
(but maybe I'm missing something):

mca:opal:base:param:opal_built_with_cuda_support:synonym:name:mpi_built_with_cuda_support

mca:mpi:base:param:mpi_built_with_cuda_support:value:true

mca:mpi:base:param:mpi_built_with_cuda_support:source:default

mca:mpi:base:param:mpi_built_with_cuda_support:status:read-only

mca:mpi:base:param:mpi_built_with_cuda_support:level:4

mca:mpi:base:param:mpi_built_with_cuda_support:help:Whether CUDA GPU buffer
support is built into library or not

mca:mpi:base:param:mpi_built_with_cuda_support:enumerator:value:0:false

mca:mpi:base:param:mpi_built_with_cuda_support:enumerator:value:1:true

mca:mpi:base:param:mpi_built_with_cuda_support:deprecated:no

mca:mpi:base:param:mpi_built_with_cuda_support:type:bool

mca:mpi:base:param:mpi_built_with_cuda_support:synonym_of:name:opal_built_with_cuda_support

mca:mpi:base:param:mpi_built_with_cuda_support:disabled:false

The available btls are the usual self, openib, tcp & vader plus smcuda, uct
& usnic. The full output from ompi_info is attached. If I try the flag
'--mca opal_cuda_verbose 10', it doesn't output anything, which seems to
agree with the lack of GPU use. If I try with '--mca btl smcuda', it makes
no difference. I have also tried to tell the program to use host and
device (e.g. mpirun -np 2 ./osu_latency D H) but get the same result. I am
probably missing something but am not sure where else to look or what else
to try.
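
For what it's worth, the FAQ also suggests checking CUDA awareness
programmatically via the MPIX extension; a minimal sketch (assuming the
mpi-ext.h header shipped with this build is on the include path) would be:

#include <stdio.h>
#include <mpi.h>
#include <mpi-ext.h>   /* Open MPI extensions; defines MPIX_CUDA_AWARE_SUPPORT */

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);
#if defined(MPIX_CUDA_AWARE_SUPPORT) && MPIX_CUDA_AWARE_SUPPORT
    printf("compile-time CUDA support: yes\n");
#else
    printf("compile-time CUDA support: no or unknown\n");
#endif
#if defined(MPIX_CUDA_AWARE_SUPPORT)
    /* run-time check of what the library actually enabled */
    printf("run-time CUDA support: %s\n",
           MPIX_Query_cuda_support() ? "yes" : "no");
#endif
    MPI_Finalize();
    return 0;
}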

Thank you,

AFernandez

$ ompi_info -param all all
  MCA allocator: basic (MCA v2.1.0, API v2.0.0, Component v4.0.1)
  MCA allocator: bucket (MCA v2.1.0, API v2.0.0, Component v4.0.1)
  MCA backtrace: execinfo (MCA v2.1.0, API v2.0.0, Component v4.0.1)
  MCA btl: self (MCA v2.1.0, API v3.1.0, Component v4.0.1)
  MCA btl: openib (MCA v2.1.0, API v3.1.0, Component v4.0.1)
  MCA btl: smcuda (MCA v2.1.0, API v3.1.0, Component v4.0.1)
  MCA btl: tcp (MCA v2.1.0, API v3.1.0, Component v4.0.1)
  MCA btl: uct (MCA v2.1.0, API v3.1.0, Component v4.0.1)
  MCA btl: usnic (MCA v2.1.0, API v3.1.0, Component v4.0.1)
  MCA btl: vader (MCA v2.1.0, API v3.1.0, Component v4.0.1)
  MCA compress: bzip (MCA v2.1.0, API v2.0.0, Component v4.0.1)
  MCA compress: gzip (MCA v2.1.0, API v2.0.0, Component v4.0.1)
  MCA crs: none (MCA v2.1.0, API v2.0.0, Component v4.0.1)
  MCA dl: dlopen (MCA v2.1.0, API v1.0.0, Component v4.0.1)
  MCA event: libevent2022 (MCA v2.1.0, API v2.0.0, Component v4.0.1)
  MCA hwloc: hwloc201 (MCA v2.1.0, API v2.0.0, Component v4.0.1)
  MCA if: linux_ipv6 (MCA v2.1.0, API v2.0.0, Component v4.0.1)
  MCA if: posix_ipv4 (MCA v2.1.0, API v2.0.0, Component v4.0.1)
  MCA installdirs: env (MCA v2.1.0, API v2.0.0, Component v4.0.1)
  MCA installdirs: config (MCA v2.1.0, API v2.0.0, Component v4.0.1)
  MCA memory: patcher (MCA v2.1.0, API v2.0.0, Component v4.0.1)
  MCA mpool: hugepage (MCA v2.1.0, API v3.0.0, Component v4.0.1)
  MCA patcher: overwrite (MCA v2.1.0, API v1.0.0, Component v4.0.1)
  MCA pmix: isolated (MCA v2.1.0, API v2.0.0, Component v4.0.1)
  MCA pmix: flux (MCA v2.1.0, API v2.0.0, Component v4.0.1)
  MCA pmix: pmix3x (MCA v2.1.0, API v2.0.0, Component v4.0.1)
  MCA pstat: linux (MCA v2.1.0, API v2.0.0, Component v4.0.1)
  MCA rcache: grdma (MCA v2.1.0, API v3.3.0, Component v4.0.1)
  MCA rcache: gpusm (MCA v2.1.0, API v3.3.0, Component v4.0.1)
  MCA rcache: rgpusm (MCA v2.1.0, API v3.3.0, Component v4.0.1)
  MCA reachable: weighted (MCA v2.1.0, API v2.0.0, Component v4.0.1)
  MCA reachable: netlink (MCA v2.1.0, API v2.0.0, Component v4.0.1)
  MCA shmem: mmap (MCA v2.1.0, API v2.0.0, Component v4.0.1)
  MCA shmem: posix (MCA v2.1.0, API v2.0.0, Component v4.0.1)
  MCA shmem: sysv (MCA v2.1.0, API v2.0.0, Component v4.0.1)
  MCA timer: linux (MCA v2.1.0, API v2.0.0, Component v4.0.1)
  MCA dfs: app (MCA v2.1.0, A

Re: [OMPI users] CUDA-aware codes not using GPU

2019-09-06 Thread AFernandez via users
Hi Akshay,

I'm building both UCX and OpenMPI as you mention. The portions of the script 
read:

./configure --prefix=/usr/local/ucx-cuda-install 
--with-cuda=/usr/local/cuda-10.1  --with-gdrcopy=/home/odyhpc/gdrcopy 
--disable-numa

sudo make install

&

./configure --with-cuda=/usr/local/cuda-10.1 
--with-cuda-libdir=/usr/local/cuda-10.1/lib64 
--with-ucx=/usr/local/ucx-cuda-install --prefix=/opt/openmpi

sudo make all install

As far as job submission goes, I have tried several combinations with different
MCA flags (yesterday I forgot to include the '--mca pml ucx' flag, as it had
made no difference in the past). I just tried your suggested syntax (mpirun -np 2
--mca pml ucx --mca btl ^smcuda,openib ./osu_latency D H) with the same results.
The latency times are of the same order no matter which flags I include. As for
checking GPU usage, I'm not familiar with 'nvprof' and have simply been using
the basic continuous output (nvidia-smi -l). I'm trying all of this in a cloud
environment, and my suspicion is that there might be some interference (maybe
because of some virtualization component), but I cannot pinpoint the cause.
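
If it would help, I can also try to capture GPU activity more directly, e.g.
(assuming nvprof is available on these instances, and using the OSU
device-to-device mode):

mpirun -np 2 nvprof --print-gpu-trace ./osu_latency D D

or watch utilization with 'nvidia-smi dmon -s u' while the benchmark runs.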

Thanks,

Arturo

 

From: Akshay Venkatesh  
Sent: Friday, September 06, 2019 11:14 AM
To: Open MPI Users 
Cc: Joshua Ladd ; Arturo Fernandez 
Subject: Re: [OMPI users] CUDA-aware codes not using GPU

 

Hi, Arturo.

Usually, for OpenMPI+UCX we use the following recipe.

For UCX:

./configure --prefix=/path/to/ucx-cuda-install --with-cuda=/usr/local/cuda --with-gdrcopy=/usr
make -j install

Then OpenMPI:

./configure --with-cuda=/usr/local/cuda --with-ucx=/path/to/ucx-cuda-install
make -j install

Can you run with the following to see if it helps:

mpirun -np 2 --mca pml ucx --mca btl ^smcuda,openib ./osu_latency D H

There are details here that may be useful:
https://www.open-mpi.org/faq/?category=runcuda#run-ompi-cuda-ucx

Also, note that for short messages the inter-node D->H path may not involve any
CUDA API calls (relevant if you're using nvprof to detect CUDA activity),
because the GPUDirect RDMA path and gdrcopy are used.
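
It may also be worth confirming that the UCX build picked up the CUDA
transports, e.g. (assuming the ucx_info from the CUDA-enabled install is the
one in your PATH):

ucx_info -d | grep -iE 'cuda|gdr'

which should show the cuda_copy/cuda_ipc transports (and gdr_copy if gdrcopy
was detected).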

 

On Fri, Sep 6, 2019 at 7:36 AM Arturo Fernandez via users
<users@lists.open-mpi.org> wrote:

Josh, 

Thank you. Yes, I built UCX with CUDA and gdrcopy support. I also had to 
disable numa (--disable-numa) as requested during the installation. 

AFernandez 

 

Joshua Ladd wrote:

Did you build UCX with CUDA support (--with-cuda)?

 

Josh 

 

On Thu, Sep 5, 2019 at 8:45 PM AFernandez via users
<users@lists.open-mpi.org> wrote:

Hello OpenMPI Team, 

I'm trying to use CUDA-aware OpenMPI but the system simply ignores the GPU and 
the code runs on the CPUs. I've tried different software but will focus on the 
OSU benchmarks (collective and pt2pt communications). Let me provide some data 
about the configuration of the system: 

-OFED v4.17-1-rc2 (the NIC is virtualized but I also tried a Mellanox card with 
MOFED a few days ago and found the same issue) 

-CUDA v10.1 

-gdrcopy v1.3 

-UCX 1.6.0 

-OpenMPI 4.0.1 

Everything looks good (CUDA programs work fine, MPI programs run on the
CPUs without any problem), and ompi_info outputs what I was expecting (but
maybe I'm missing something):

mca:opal:base:param:opal_built_with_cuda_support:synonym:name:mpi_built_with_cuda_support
 

mca:mpi:base:param:mpi_built_with_cuda_support:value:true 

mca:mpi:base:param:mpi_built_with_cuda_support:source:default 

mca:mpi:base:param:mpi_built_with_cuda_support:status:read-only 

mca:mpi:base:param:mpi_built_with_cuda_support:level:4 

mca:mpi:base:param:mpi_built_with_cuda_support:help:Whether CUDA GPU buffer 
support is built into library or not 

mca:mpi:base:param:mpi_built_with_cuda_support:enumerator:value:0:false 

mca:mpi:base:param:mpi_built_with_cuda_support:enumerator:value:1:true 

mca:mpi:base:param:mpi_built_with_cuda_support:deprecated:no 

mca:mpi:base:param:mpi_built_with_cuda_support:type:bool 

mca:mpi:base:param:mpi_built_with_cuda_support:synonym_of:name:opal_built_with_cuda_support
 

mca:mpi:base:param:mpi_built_with_cuda_support:disabled:false 

The available btls are the usual self, openib, tcp & vader plus smcuda, uct &
usnic. The full output from ompi_info is attached. If I try the flag '--mca
opal_cuda_verbose 10', it doesn't output anything, which seems to agree with
the lack of GPU use. If I try with '--mca btl smcuda', it makes no difference.
I have also tried to tell the program to use host and device (e.g. mpirun
-np 2 ./osu_latency D H) but get the same result. I am probably missing
something but am not sure where else to look or what else to try.

Thank you, 

AFernandez 


Re: [OMPI users] Cannot locate PMIx

2021-11-28 Thread afernandez--- via users
Please disregard my previous question, as the PMIx error was triggered by
something else (not sure why ompi_info wasn't outputting any PMIx components
earlier, but now it does).
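
(For reference, the check was simply grepping the ompi_info output, e.g.:

ompi_info | grep -i pmix

which now lists the pmix components as expected.)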



On Nov 23, 2021, 6:01 PM, Arturo Fernandez via users
<users@lists.open-mpi.org> wrote:
>Hello,
>This is kind of an odd issue as it had not happened earlier in many
>builds.
>The configuration (./configure --with-ofi=PATH_TO_LIBFABRIC installed
>from
>https://github.com/ofiwg/libfabric) for v4.1.1 returns:
>...
>Miscellaneous
>---
>CUDA support: no
>HWLOC support: internal
>Libevent support: internal
>PMIx support: Internal
>...
>So it was a surprise getting the error 'PMIX ERROR: UNREACHABLE in file
>server/pmix-server.c' for one of the apps being tested (the others were
>working fine). I checked ompi_info and there's no trace of PMIx, which was
>another surprise because similar configurations used to have isolated, flux
>and pmix3x as MCA pmix components.
>My question is twofold: Will OpenMPI build w/o PMIx support even if the
>configuration says the opposite? If so, could the libfabric components be
>causing this behavior?
>Thanks,
>Arturo


[OMPI users] Seg error when using v5.0.1

2024-01-30 Thread afernandez via users

Hello,
I upgraded one of the systems to v5.0.1 and have compiled everything
exactly as dozens of previous times with v4. I wasn't expecting any issue
(and the compilations didn't report anything out of the ordinary), but running
several apps has resulted in error messages such as:
Backtrace for this error:
#0 0x7f7c9571f51f in ???
at ./signal/../sysdeps/unix/sysv/linux/x86_64/libc_sigaction.c:0
#1 0x7f7c957823fe in __GI___libc_free
at ./malloc/malloc.c:3368
#2 0x7f7c93a635c3 in ???
#3 0x7f7c95f84048 in ???
#4 0x7f7c95f1cef1 in ???
#5 0x7f7c95e34b7b in ???
#6 0x6e05be in ???
#7 0x6e58d7 in ???
#8 0x405d2c in ???
#9 0x7f7c95706d8f in __libc_start_call_main
at ../sysdeps/nptl/libc_start_call_main.h:58
#10 0x7f7c95706e3f in __libc_start_main_impl
at ../csu/libc-start.c:392
#11 0x405d64 in ???
#12 0x in ???
OS is Ubuntu 22.04, OpenMPI was built with GCC13.2, and before building
OpenMPI, I had previously built the hwloc (2.10.0) library at
/usr/lib/x86_64-linux-gnu. Maybe I'm missing something pretty basic, but
the problem seems to be related to memory allocation.
Thanks.


Re: [OMPI users] Seg error when using v5.0.1

2024-01-30 Thread afernandez via users

Hi Joseph,
It's happening with several apps including WRF. I was trying to find a
quick answer or fix but it seems that I'll have to recompile it in debug
mode. Will report back with the extra info.
Thanks.
Joseph Schuchart via users wrote:
Hello,
This looks like memory corruption. Do you have more details on what your
app is doing? I don't see any MPI calls inside the call stack. Could you
rebuild Open MPI with debug information enabled (by adding `--enable-debug`
to configure)? If this error occurs on singleton runs (1 process) then you
can easily attach gdb to it to get a better stack trace. Also, valgrind may
help pin down the problem by telling you which memory block is being free'd
here.
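For example (just a sketch; substitute your actual binary name and arguments):

gdb --args ./wrf.exe
valgrind --track-origins=yes ./wrf.exe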
Thanks
Joseph
On 1/30/24 07:41, afernandez via users wrote:

Hello,
I upgraded one of the systems to v5.0.1 and have compiled everything
exactly as dozens of previous times with v4. I wasn't expecting any issue
(and the compilations didn't report anything out of the ordinary) but running
several apps has resulted in error messages such as:
Backtrace for this error:
#0 0x7f7c9571f51f in ???
at ./signal/../sysdeps/unix/sysv/linux/x86_64/libc_sigaction.c:0
#1 0x7f7c957823fe in __GI___libc_free
at ./malloc/malloc.c:3368
#2 0x7f7c93a635c3 in ???
#3 0x7f7c95f84048 in ???
#4 0x7f7c95f1cef1 in ???
#5 0x7f7c95e34b7b in ???
#6 0x6e05be in ???
#7 0x6e58d7 in ???
#8 0x405d2c in ???
#9 0x7f7c95706d8f in __libc_start_call_main
at ../sysdeps/nptl/libc_start_call_main.h:58
#10 0x7f7c95706e3f in __libc_start_main_impl
at ../csu/libc-start.c:392
#11 0x405d64 in ???
#12 0x in ???
OS is Ubuntu 22.04, OpenMPI was built with GCC13.2, and before building
OpenMPI, I had previously built the hwloc (2.10.0) library at
/usr/lib/x86_64-linux-gnu. Maybe I'm missing something pretty basic, but
the problem seems to be related to memory allocation.
Thanks.


Re: [OMPI users] Seg error when using v5.0.1

2024-01-31 Thread afernandez via users

Hello Joseph,
Sorry for the delay but I didn't know if I was missing something yesterday
evening and wanted to double check everything this morning. This is for WRF
but other apps exhibit the same behavior.
* I had no problem with the serial version (and gdb obviously didn't report
any issue).
* I tried compiling with the --enable-debug flag but it was generating
errors during the compilation and never completed.
* I went back to my standard flags for debugging: -g -fbacktrace -ggdb
-fcheck=bounds,do,mem,pointer -ffpe-trap=invalid,zero,overflow. WRF is
still crashing with little extra info vs yesterday:
Backtrace for this error:
#0 0x7f5a4e54451f in ???
at ./signal/../sysdeps/unix/sysv/linux/x86_64/libc_sigaction.c:0
#1 0x7f5a4e5a73fe in __GI___libc_free
at ./malloc/malloc.c:3368
#2 0x7f5a4c7aa5c3 in ???
#3 0x7f5a4e83b048 in ???
#4 0x7f5a4e7d3ef1 in ???
#5 0x7f5a4e8dab7b in ???
#6 0x8f6bbf in __module_dm_MOD_split_communicator
at /home/ubuntu/WRF-4.5.2/frame/module_dm.f90:5734
#7 0x1879ebd in init_modules_
at /home/ubuntu/WRF-4.5.2/share/init_modules.f90:63
#8 0x406fe4 in __module_wrf_top_MOD_wrf_init
at ../main/module_wrf_top.f90:130
#9 0x405ff3 in wrf
at /home/ubuntu/WRF-4.5.2/main/wrf.f90:22
#10 0x40605c in main
at /home/ubuntu/WRF-4.5.2/main/wrf.f90:6
--
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--
--
mpirun noticed that process rank 0 with PID 0 on node ip-172-31-31-163
exited on signal 11 (Segmentation fault).
--
Any pointers on what might be going on here? This never happened with
OMPI v4. Thanks.
Joseph Schuchart via users wrote:
Hello,
This looks like memory corruption. Do you have more details on what your
app is doing? I don't see any MPI calls inside the call stack. Could you
rebuild Open MPI with debug information enabled (by adding `--enable-debug`
to configure)? If this error occurs on singleton runs (1 process) then you
can easily attach gdb to it to get a better stack trace. Also, valgrind may
help pin down the problem by telling you which memory block is being free'd
here.
Thanks
Joseph
On 1/30/24 07:41, afernandez via users wrote:

Hello,
I upgraded one of the systems to v5.0.1 and have compiled everything
exactly as dozens of previous times with v4. I wasn't expecting any issue
(and the compilations didn't report anything out of the ordinary) but running
several apps has resulted in error messages such as:
Backtrace for this error:
#0 0x7f7c9571f51f in ???
at ./signal/../sysdeps/unix/sysv/linux/x86_64/libc_sigaction.c:0
#1 0x7f7c957823fe in __GI___libc_free
at ./malloc/malloc.c:3368
#2 0x7f7c93a635c3 in ???
#3 0x7f7c95f84048 in ???
#4 0x7f7c95f1cef1 in ???
#5 0x7f7c95e34b7b in ???
#6 0x6e05be in ???
#7 0x6e58d7 in ???
#8 0x405d2c in ???
#9 0x7f7c95706d8f in __libc_start_call_main
at ../sysdeps/nptl/libc_start_call_main.h:58
#10 0x7f7c95706e3f in __libc_start_main_impl
at ../csu/libc-start.c:392
#11 0x405d64 in ???
#12 0x in ???
OS is Ubuntu 22.04, OpenMPI was built with GCC13.2, and before building
OpenMPI, I had previously built the hwloc (2.10.0) library at
/usr/lib/x86_64-linux-gnu. Maybe I'm missing something pretty basic, but
the problem seems to be related to memory allocation.
Thanks.


Re: [OMPI users] Seg error when using v5.0.1

2024-01-31 Thread afernandez via users
Hi Gilles,
I created the ticket (#12296). The crash happened with either 1 or 2 MPI ranks
(I have not tried with more, but I doubt that it would make any difference).
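
In case it helps with the ticket, a minimal reproducer along the lines of what
the WRF routine does might look like the sketch below (my own guess, not the
actual WRF code):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size;
    MPI_Comm newcomm;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* split the world communicator into two groups, as a stand-in for
       WRF's split_communicator logic */
    MPI_Comm_split(MPI_COMM_WORLD, rank % 2, rank, &newcomm);
    printf("rank %d of %d: split OK\n", rank, size);

    MPI_Comm_free(&newcomm);
    MPI_Finalize();
    return 0;
}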
Thanks,
Arturo

Gilles Gouaillardet via users wrote:
Hi,
please open an issue on GitHub at https://github.com/open-mpi/ompi/issues 
<https://github.com/open-mpi/ompi/issues>
and provide the requested information.
If the compilation failed when configured with --enable-debug, please share the 
logs.
The name of the WRF subroutine suggests the crash might occur in
MPI_Comm_split();
if so, are you able to craft a reproducer that causes the crash?
How many nodes and MPI tasks are needed in order to trigger the crash?
Cheers,
Gilles
On Wed, Jan 31, 2024 at 10:09 PM afernandez via users
<users@lists.open-mpi.org> wrote:
Hello Joseph,
Sorry for the delay but I didn't know if I was missing something yesterday 
evening and wanted to double check everything this morning. This is for WRF but 
other apps exhibit the same behavior.
* I had no problem with the serial version (and gdb obviously didn't report any 
issue).
* I tried compiling with the --enable-debug flag but it was generating errors 
during the compilation and never completed.
* I went back to my standard flags for debugging: -g -fbacktrace -ggdb 
-fcheck=bounds,do,mem,pointer -ffpe-trap=invalid,zero,overflow. WRF is still 
crashing with little extra info vs yesterday:
Backtrace for this error:
#0 0x7f5a4e54451f in ???
at ./signal/../sysdeps/unix/sysv/linux/x86_64/libc_sigaction.c:0
#1 0x7f5a4e5a73fe in __GI___libc_free
at ./malloc/malloc.c:3368
#2 0x7f5a4c7aa5c3 in ???
#3 0x7f5a4e83b048 in ???
#4 0x7f5a4e7d3ef1 in ???
#5 0x7f5a4e8dab7b in ???
#6 0x8f6bbf in __module_dm_MOD_split_communicator
at /home/ubuntu/WRF-4.5.2/frame/module_dm.f90:5734
#7 0x1879ebd in init_modules_
at /home/ubuntu/WRF-4.5.2/share/init_modules.f90:63
#8 0x406fe4 in __module_wrf_top_MOD_wrf_init
at ../main/module_wrf_top.f90:130
#9 0x405ff3 in wrf
at /home/ubuntu/WRF-4.5.2/main/wrf.f90:22
#10 0x40605c in main
at /home/ubuntu/WRF-4.5.2/main/wrf.f90:6
--
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--
--
mpirun noticed that process rank 0 with PID 0 on node ip-172-31-31-163 exited 
on signal 11 (Segmentation fault).
--
Any pointers on what might be going on here? This never happened with OMPI v4.
Thanks.
Joseph Schuchart via users wrote:
Hello,
This looks like memory corruption. Do you have more details on what your app is 
doing? I don't see any MPI calls inside the call stack. Could you rebuild Open 
MPI with debug information enabled (by adding `--enable-debug` to configure)? 
If this error occurs on singleton runs (1 process) then you can easily attach 
gdb to it to get a better stack trace. Also, valgrind may help pin down the 
problem by telling you which memory block is being free'd here.
Thanks
Joseph
On 1/30/24 07:41, afernandez via users wrote:
Hello,
I upgraded one of the systems to v5.0.1 and have compiled everything
exactly as dozens of previous times with v4. I wasn't expecting any issue
(and the compilations didn't report anything out of the ordinary) but running
several apps has resulted in error messages such as:
Backtrace for this error:
#0 0x7f7c9571f51f in ???
at ./signal/../sysdeps/unix/sysv/linux/x86_64/libc_sigaction.c:0
#1 0x7f7c957823fe in __GI___libc_free
at ./malloc/malloc.c:3368
#2 0x7f7c93a635c3 in ???
#3 0x7f7c95f84048 in ???
#4 0x7f7c95f1cef1 in ???
#5 0x7f7c95e34b7b in ???
#6 0x6e05be in ???
#7 0x6e58d7 in ???
#8 0x405d2c in ???
#9 0x7f7c95706d8f in __libc_start_call_main
at ../sysdeps/nptl/libc_start_call_main.h:58
#10 0x7f7c95706e3f in __libc_start_main_impl
at ../csu/libc-start.c:392
#11 0x405d64 in ???
#12 0x in ???
OS is Ubuntu 22.04, OpenMPI was built with GCC13.2, and before building
OpenMPI, I had previously built the hwloc (2.10.0) library at
/usr/lib/x86_64-linux-gnu. Maybe I'm missing something pretty basic, but
the problem seems to be related to memory allocation.
Thanks.


Re: [OMPI users] Seg error when using v5.0.1

2024-01-31 Thread afernandez via users
Hello,
I'm sorry, as I totally messed up here. It turns out that the problem was caused
by a previous installation of OpenMPI (v4.1.6): the system was trying to run the
codes compiled against v5 with the mpirun from v4. I always set up the systems
so that the OS picks up the latest MPI version, but that apparently didn't take
effect this time, which led me to the wrong conclusion. I should have realized
this earlier and not wasted everyone's time. My apologies.
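(For anyone hitting something similar, the quick sanity check is to confirm
which runtime is actually being picked up, e.g.:

which mpirun
mpirun --version

and make sure both point at the intended installation.)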
Arturo
Gilles Gouaillardet via users wrote:
Hi,
please open an issue on GitHub at https://github.com/open-mpi/ompi/issues 
<https://github.com/open-mpi/ompi/issues>
and provide the requested information.
If the compilation failed when configured with --enable-debug, please share the 
logs.
The name of the WRF subroutine suggests the crash might occur in
MPI_Comm_split();
if so, are you able to craft a reproducer that causes the crash?
How many nodes and MPI tasks are needed in order to trigger the crash?
Cheers,
Gilles
On Wed, Jan 31, 2024 at 10:09 PM afernandez via users
<users@lists.open-mpi.org> wrote:
Hello Joseph,
Sorry for the delay but I didn't know if I was missing something yesterday 
evening and wanted to double check everything this morning. This is for WRF but 
other apps exhibit the same behavior.
* I had no problem with the serial version (and gdb obviously didn't report any 
issue).
* I tried compiling with the --enable-debug flag but it was generating errors 
during the compilation and never completed.
* I went back to my standard flags for debugging: -g -fbacktrace -ggdb 
-fcheck=bounds,do,mem,pointer -ffpe-trap=invalid,zero,overflow. WRF is still 
crashing with little extra info vs yesterday:
Backtrace for this error:
#0 0x7f5a4e54451f in ???
at ./signal/../sysdeps/unix/sysv/linux/x86_64/libc_sigaction.c:0
#1 0x7f5a4e5a73fe in __GI___libc_free
at ./malloc/malloc.c:3368
#2 0x7f5a4c7aa5c3 in ???
#3 0x7f5a4e83b048 in ???
#4 0x7f5a4e7d3ef1 in ???
#5 0x7f5a4e8dab7b in ???
#6 0x8f6bbf in __module_dm_MOD_split_communicator
at /home/ubuntu/WRF-4.5.2/frame/module_dm.f90:5734
#7 0x1879ebd in init_modules_
at /home/ubuntu/WRF-4.5.2/share/init_modules.f90:63
#8 0x406fe4 in __module_wrf_top_MOD_wrf_init
at ../main/module_wrf_top.f90:130
#9 0x405ff3 in wrf
at /home/ubuntu/WRF-4.5.2/main/wrf.f90:22
#10 0x40605c in main
at /home/ubuntu/WRF-4.5.2/main/wrf.f90:6
--
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--
--
mpirun noticed that process rank 0 with PID 0 on node ip-172-31-31-163 exited 
on signal 11 (Segmentation fault).
--
Any pointers on what might be going on here? This never happened with OMPI v4.
Thanks.
Joseph Schuchart via users wrote:
Hello,
This looks like memory corruption. Do you have more details on what your app is 
doing? I don't see any MPI calls inside the call stack. Could you rebuild Open 
MPI with debug information enabled (by adding `--enable-debug` to configure)? 
If this error occurs on singleton runs (1 process) then you can easily attach 
gdb to it to get a better stack trace. Also, valgrind may help pin down the 
problem by telling you which memory block is being free'd here.
Thanks
Joseph
On 1/30/24 07:41, afernandez via users wrote:
Hello,
I upgraded one of the systems to v5.0.1 and have compiled everything
exactly as dozens of previous times with v4. I wasn't expecting any issue
(and the compilations didn't report anything out of the ordinary) but running
several apps has resulted in error messages such as:
Backtrace for this error:
#0 0x7f7c9571f51f in ???
at ./signal/../sysdeps/unix/sysv/linux/x86_64/libc_sigaction.c:0
#1 0x7f7c957823fe in __GI___libc_free
at ./malloc/malloc.c:3368
#2 0x7f7c93a635c3 in ???
#3 0x7f7c95f84048 in ???
#4 0x7f7c95f1cef1 in ???
#5 0x7f7c95e34b7b in ???
#6 0x6e05be in ???
#7 0x6e58d7 in ???
#8 0x405d2c in ???
#9 0x7f7c95706d8f in __libc_start_call_main
at ../sysdeps/nptl/libc_start_call_main.h:58
#10 0x7f7c95706e3f in __libc_start_main_impl
at ../csu/libc-start.c:392
#11 0x405d64 in ???
#12 0x in ???
OS is Ubuntu 22.04, OpenMPI was built with GCC13.2, and before building
OpenMPI, I had previously built the hwloc (2.10.0) library at
/usr/lib/x86_64-linux-gnu. Maybe I'm missing something pretty basic, but
the problem seems to be related to memory allocation.
Thanks.