[OMPI users] OpenMPI 4.0.5 error with Omni-path
Hi,

I'm trying to deploy OpenMPI 4.0.5 on the university's supercomputer:

 * Debian GNU/Linux 9 (stretch)
 * Intel Corporation Omni-Path HFI Silicon 100 Series [discrete] (rev 11)

and for several days I have had a bug (wrong results using MPI_AllToAllW) on this server when using OmniPath.

Running 4 threads on a single node, using OpenMPI 4.0.5 built without omnipath support, the code is working:

CC=$(which gcc) CXX=$(which g++) FC=$(which gfortran) ../configure --with-hwloc --enable-mpirun-prefix-by-default \
  --prefix=/bettik/begou/OpenMPI405-noib --enable-mpi1-compatibility \
  --enable-mpi-cxx --enable-cxx-exceptions --without-verbs --without-ofi --without-psm --without-psm2 --without-openib \
  --without-slurm

If I use omnipath, still with 4 threads on one node, the test case does not work (incorrect results):

CFLAGS="-O3 -march=native -mtune=native" CXXFLAGS="-O3 -march=native -mtune=native" FCFLAGS="-O3 -march=native -mtune=native" \
CC=$(which gcc) CXX=$(which g++) FC=$(which gfortran) ../configure --with-hwloc --enable-mpirun-prefix-by-default \
  --prefix=/bettik/begou/OpenMPI405 --enable-mpi1-compatibility \
  --enable-mpi-cxx --enable-cxx-exceptions --without-verbs

I do not understand what could be wrong, as the code runs on many architectures with various interconnects and OpenMPI versions.

Thanks for your suggestions.

Patrick
[OMPI users] OpenMPI 4.0.5 error with Omni-path
Patrick,

You really have to provide us some detailed information if you want assistance. At a minimum we need to know if you're using the PSM2 MTL or the OFI MTL and what the actual error is.

Please provide the actual command line you are having problems with, along with any errors. In addition, I recommend adding the following to your command line:

  -mca mtl_base_verbose 99

If you have a way to reproduce the problem quickly you might also want to add:

  -x PSM2_TRACEMASK=11

But that will add very detailed debug output to your command and you haven't mentioned that PSM2 is failing, so it may not be useful.
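As a rough illustration, the two diagnostic options suggested above could be combined on one command line as sketched below. The process count and the binary name ./test_layout_array are assumptions (they match what is used later in this thread), and the command is only printed here so it can be reviewed before it is run.

```shell
# Assumed values for illustration: 4 ranks and a test binary in the current directory.
NP=4
APP=./test_layout_array

# mtl_base_verbose reports which MTL gets selected; PSM2_TRACEMASK is exported
# to the ranks with -x and turns on very detailed PSM2 debug output.
CMD="mpirun -np $NP -mca mtl_base_verbose 99 -x PSM2_TRACEMASK=11 $APP"

# Print the command for review instead of executing it.
echo "$CMD"
```

Dropping the -x PSM2_TRACEMASK=11 part keeps the output much shorter when PSM2 itself is not suspected.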
Re: [OMPI users] [EXTERNAL] OpenMPI 4.0.5 error with Omni-path
Hi Patrick,

Also it might not hurt to disable the Open IB BTL by setting

  export OMPI_MCA_btl=^openib

in your shell prior to invoking mpirun.

Howard

From: users on behalf of "Heinz, Michael William via users"
Reply-To: Open MPI Users
Date: Monday, January 25, 2021 at 8:47 AM
To: "users@lists.open-mpi.org"
Cc: "Heinz, Michael William"
Subject: [EXTERNAL] [OMPI users] OpenMPI 4.0.5 error with Omni-path

Patrick,

You really have to provide us some detailed information if you want assistance. At a minimum we need to know if you’re using the PSM2 MTL or the OFI MTL and what the actual error is.

Please provide the actual command line you are having problems with, along with any errors. In addition, I recommend adding the following to your command line:

  -mca mtl_base_verbose 99

If you have a way to reproduce the problem quickly you might also want to add:

  -x PSM2_TRACEMASK=11

But that will add very detailed debug output to your command and you haven’t mentioned that PSM2 is failing, so it may not be useful.
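A minimal sketch of Howard's suggestion, assuming a POSIX-style shell such as bash. Open MPI reads any environment variable named OMPI_MCA_<parameter> as an MCA parameter, so the export applies to every mpirun started from the same shell afterwards.

```shell
# Exclude the openib BTL for every mpirun started from this shell.
# The leading ^ means "use all components except the ones listed".
export OMPI_MCA_btl=^openib

# Show the setting that child processes will inherit.
env | grep '^OMPI_MCA_btl='
```

The same exclusion can be requested for a single run only by passing --mca btl ^openib on the mpirun command line instead of exporting the variable.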
Re: [OMPI users] OpenMPI 4.0.5 error with Omni-path
Hi Howard and Michael,

thanks for your feedback. I did not want to write too long a mail with irrelevant information, so I just showed how the two different builds give different results. I'm using a small test case based on my large code, the same one used to show the memory leak with mpi_Alltoallv calls, but running just 2 iterations. It is a 2D case, and data storage is moved from a distribution "along X axis" to one "along Y axis" with mpi_Alltoallv and subarray types. Data initialization is based on the location in the array, to allow checking for correct exchanges.

When the program runs correctly (on 4 processes in my test) it must only show the max rss size of the processes. When it fails it shows the invalid locations. I've drastically reduced the size of the problem, to nx=5 and ny=7.

Launching the non-working setup with more detail shows:

dahu138 : mpirun -np 4 -mca mtl_base_verbose 99 ./test_layout_array
[dahu138:115761] mca: base: components_register: registering framework mtl components
[dahu138:115763] mca: base: components_register: registering framework mtl components
[dahu138:115763] mca: base: components_register: found loaded component psm2
[dahu138:115763] mca: base: components_register: component psm2 register function successful
[dahu138:115763] mca: base: components_open: opening mtl components
[dahu138:115763] mca: base: components_open: found loaded component psm2
[dahu138:115761] mca: base: components_register: found loaded component psm2
[dahu138:115763] mca: base: components_open: component psm2 open function successful
[dahu138:115761] mca: base: components_register: component psm2 register function successful
[dahu138:115761] mca: base: components_open: opening mtl components
[dahu138:115761] mca: base: components_open: found loaded component psm2
[dahu138:115761] mca: base: components_open: component psm2 open function successful
[dahu138:115760] mca: base: components_register: registering framework mtl components
[dahu138:115760] mca: base: components_register: found loaded component psm2
[dahu138:115760] mca: base: components_register: component psm2 register function successful
[dahu138:115760] mca: base: components_open: opening mtl components
[dahu138:115760] mca: base: components_open: found loaded component psm2
[dahu138:115762] mca: base: components_register: registering framework mtl components
[dahu138:115762] mca: base: components_register: found loaded component psm2
[dahu138:115760] mca: base: components_open: component psm2 open function successful
[dahu138:115762] mca: base: components_register: component psm2 register function successful
[dahu138:115762] mca: base: components_open: opening mtl components
[dahu138:115762] mca: base: components_open: found loaded component psm2
[dahu138:115762] mca: base: components_open: component psm2 open function successful
[dahu138:115760] mca:base:select: Auto-selecting mtl components
[dahu138:115760] mca:base:select:( mtl) Querying component [psm2]
[dahu138:115760] mca:base:select:( mtl) Query of component [psm2] set priority to 40
[dahu138:115761] mca:base:select: Auto-selecting mtl components
[dahu138:115762] mca:base:select: Auto-selecting mtl components
[dahu138:115762] mca:base:select:( mtl) Querying component [psm2]
[dahu138:115762] mca:base:select:( mtl) Query of component [psm2] set priority to 40
[dahu138:115762] mca:base:select:( mtl) Selected component [psm2]
[dahu138:115762] select: initializing mtl component psm2
[dahu138:115761] mca:base:select:( mtl) Querying component [psm2]
[dahu138:115761] mca:base:select:( mtl) Query of component [psm2] set priority to 40
[dahu138:115761] mca:base:select:( mtl) Selected component [psm2]
[dahu138:115761] select: initializing mtl component psm2
[dahu138:115760] mca:base:select:( mtl) Selected component [psm2]
[dahu138:115760] select: initializing mtl component psm2
[dahu138:115763] mca:base:select: Auto-selecting mtl components
[dahu138:115763] mca:base:select:( mtl) Querying component [psm2]
[dahu138:115763] mca:base:select:( mtl) Query of component [psm2] set priority to 40
[dahu138:115763] mca:base:select:( mtl) Selected component [psm2]
[dahu138:115763] select: initializing mtl component psm2
[dahu138:115761] select: init returned success
[dahu138:115761] select: component psm2 selected
[dahu138:115762] select: init returned success
[dahu138:115762] select: component psm2 selected
[dahu138:115763] select: init returned success
[dahu138:115763] select: component psm2 selected
[dahu138:115760] select: init returned success
[dahu138:115760] select: component psm2 selected
On 1 found 1007 but expect 3007
On 2 found 1007 but expect 4007

and with this setup the code freezes at this problem size. Below is the same code with my no-ib setup of OpenMPI on the same node:

dahu138 : mpirun -np 4 -mca mtl_base_verbose 99 ./test_layout_array
[dahu138:116723] mca: base: components_register: registering framework mtl components
[dahu138:116723] mca: base: components_ope
Re: [OMPI users] OpenMPI 4.0.5 error with Omni-path
What happens if you specify -mtl ofi ?

-----Original Message-----
From: users On Behalf Of Patrick Begou via users
Sent: Monday, January 25, 2021 12:54 PM
To: users@lists.open-mpi.org
Cc: Patrick Begou
Subject: Re: [OMPI users] OpenMPI 4.0.5 error with Omni-path

Hi Howard and Michael,

thanks for your feedback. I did not want to write too long a mail with irrelevant information, so I just showed how the two different builds give different results. I'm using a small test case based on my large code, the same one used to show the memory leak with mpi_Alltoallv calls, but running just 2 iterations. It is a 2D case, and data storage is moved from a distribution "along X axis" to one "along Y axis" with mpi_Alltoallv and subarray types. Data initialization is based on the location in the array, to allow checking for correct exchanges.

When the program runs correctly (on 4 processes in my test) it must only show the max rss size of the processes. When it fails it shows the invalid locations. I've drastically reduced the size of the problem, to nx=5 and ny=7.
[OMPI users] MCA parameter "orte_base_help_aggregate"
Hello:

I am testing a rather large code on several computers. It works fine on all except for a Linux Pop!_OS machine. I tried both OpenMPI 2.1.1 and 4.0.5. I fear there is an issue because of the Pop!_OS but before I contact System76 I would like to explore things further.

I get the following message while running the code on a box called jp1:

[jp1:3331418] 7 more processes have sent help message help-mpi-btl-base.txt / btl:no-nics
[jp1:3331418] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

and then

mpirun noticed that process rank 3 with PID 0 on node jp1 exited on signal 9 (Killed).

It seems I should set this MCA parameter "orte_base_help_aggregate" to 0 in order to see the error messages.

How can I do this? I suppose I should do it before running the code. Is this correct?

Thank you,
Paul
Re: [OMPI users] OpenMPI 4.0.5 error with Omni-path
I think you mean add "--mca mtl ofi" to the mpirun cmd line

> On Jan 25, 2021, at 10:18 AM, Heinz, Michael William via users wrote:
>
> What happens if you specify -mtl ofi ?
>
> -----Original Message-----
> From: users On Behalf Of Patrick Begou via users
> Sent: Monday, January 25, 2021 12:54 PM
> To: users@lists.open-mpi.org
> Cc: Patrick Begou
> Subject: Re: [OMPI users] OpenMPI 4.0.5 error with Omni-path
>
> Hi Howard and Michael,
>
> thanks for your feedback. I did not want to write too long a mail with
> irrelevant information, so I just showed how the two different builds give
> different results. I'm using a small test case based on my large code, the
> same one used to show the memory leak with mpi_Alltoallv calls, but running
> just 2 iterations. It is a 2D case, and data storage is moved from a
> distribution "along X axis" to one "along Y axis" with mpi_Alltoallv and
> subarray types. Data initialization is based on the location in the array,
> to allow checking for correct exchanges.
>
> When the program runs correctly (on 4 processes in my test) it must only
> show the max rss size of the processes. When it fails it shows the invalid
> locations. I've drastically reduced the size of the problem, to nx=5 and ny=7.
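Spelled out, the corrected option would sit on the command line roughly as in the sketch below. The process count and binary name are assumptions matching the test runs earlier in the thread, and the run can only succeed if this Open MPI build actually contains the ofi MTL. The command is printed, not executed.

```shell
# Assumed values for illustration: 4 ranks and the test binary from the thread.
NP=4
APP=./test_layout_array

# Ask for the OFI MTL explicitly instead of letting Open MPI auto-select psm2,
# and keep the MTL verbosity on to confirm which component was really chosen.
CMD="mpirun -np $NP --mca mtl ofi --mca mtl_base_verbose 99 $APP"

# Print the command for review instead of executing it.
echo "$CMD"
```

If the verbose output still lists only psm2 as a loaded component, the ofi MTL was most likely not built into this installation.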
Re: [OMPI users] MCA parameter "orte_base_help_aggregate"
There should have been an error message right above that - all this is saying is that the same error message was output by 7 more processes besides the one that was output. It then indicates that process 3 (which has pid 0?) was killed.

Looking at the help message tag, it looks like no NICs were found on the host. You might want to post the full error output.

> On Jan 25, 2021, at 10:25 AM, Paul Cizmas via users wrote:
>
> Hello:
>
> I am testing a rather large code on several computers. It works fine on all
> except for a Linux Pop!_OS machine. I tried both OpenMPI 2.1.1 and 4.0.5. I
> fear there is an issue because of the Pop!_OS but before I contact System76 I
> would like to explore things further.
>
> I get the following message while running the code on a box called jp1:
>
> [jp1:3331418] 7 more processes have sent help message help-mpi-btl-base.txt / btl:no-nics
> [jp1:3331418] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
>
> and then
>
> mpirun noticed that process rank 3 with PID 0 on node jp1 exited on signal 9 (Killed).
>
> It seems I should set this MCA parameter "orte_base_help_aggregate" to 0 in
> order to see the error messages.
>
> How can I do this? I suppose I should do it before running the code. Is
> this correct?
>
> Thank you,
> Paul
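On the question of how to set the parameter: an MCA parameter can be given through the environment before the run, on the mpirun command line, or in a per-user parameter file. The sketch below sets it through the environment; the program name and process count in the commented alternative are hypothetical placeholders, not taken from the original message.

```shell
# Option 1 (environment): any MCA parameter can be set as OMPI_MCA_<name>
# before mpirun is invoked from the same shell.
export OMPI_MCA_orte_base_help_aggregate=0
echo "OMPI_MCA_orte_base_help_aggregate=$OMPI_MCA_orte_base_help_aggregate"

# Option 2 (command line), shown as a comment because ./my_program and the
# process count are hypothetical:
#   mpirun --mca orte_base_help_aggregate 0 -np 8 ./my_program

# Option 3 (per-user file): add this line to $HOME/.openmpi/mca-params.conf
#   orte_base_help_aggregate = 0
```

With aggregation switched off, each process prints its own copy of the help message, so the full text of the btl:no-nics message should appear in the output.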
Re: [OMPI users] OpenMPI 4.0.5 error with Omni-path
Patrick,

is your application multi-threaded? PSM2 was not originally designed for multiple threads per process.

I do know that the OSU alltoallV test does pass when I try it.

> On Jan 25, 2021, at 12:57 PM, Patrick Begou via users wrote:
>
> Hi Howard and Michael,
>
> thanks for your feedback. I did not want to write too long a mail with
> irrelevant information, so I just showed how the two different builds
> give different results. I'm using a small test case based on my large
> code, the same one used to show the memory leak with mpi_Alltoallv calls,
> but running just 2 iterations. It is a 2D case, and data storage is moved
> from a distribution "along X axis" to one "along Y axis" with mpi_Alltoallv
> and subarray types. Data initialization is based on the location in
> the array, to allow checking for correct exchanges.
>
> When the program runs correctly (on 4 processes in my test) it must only
> show the max rss size of the processes. When it fails it shows the invalid
> locations. I've drastically reduced the size of the problem, to nx=5
> and ny=7.