Patrick, is your application multi-threaded? PSM2 was not originally designed for multiple threads per process.
I do know that the OSU alltoallv test does pass when I try it.

> On Jan 25, 2021, at 12:57 PM, Patrick Begou via users <users@lists.open-mpi.org> wrote:
>
> Hi Howard and Michael,
>
> thanks for your feedback. I did not want to write a too long mail with
> non-pertinent information, so I just showed how the two different builds
> give different results. I'm using a small test case based on my large
> code, the same one used to show the memory leak with mpi_Alltoallv calls,
> but just running 2 iterations. It is a 2D case and data storage is moved
> from distributions "along X axis" to "along Y axis" with mpi_Alltoallv
> and subarray types. Data initialization is based on the location in
> the array to allow checking for correct exchanges.
>
> When the program runs correctly (on 4 processes in my test), it should
> show only the max RSS size of the processes. When it fails, it shows the
> invalid locations. I've drastically reduced the size of the problem with
> nx=5 and ny=7.
>
> Launching the non-working setup with more details shows:
>
> dahu138 : mpirun -np 4 -mca mtl_base_verbose 99 ./test_layout_array
> [dahu138:115761] mca: base: components_register: registering framework mtl components
> [dahu138:115763] mca: base: components_register: registering framework mtl components
> [dahu138:115763] mca: base: components_register: found loaded component psm2
> [dahu138:115763] mca: base: components_register: component psm2 register function successful
> [dahu138:115763] mca: base: components_open: opening mtl components
> [dahu138:115763] mca: base: components_open: found loaded component psm2
> [dahu138:115761] mca: base: components_register: found loaded component psm2
> [dahu138:115763] mca: base: components_open: component psm2 open function successful
> [dahu138:115761] mca: base: components_register: component psm2 register function successful
> [dahu138:115761] mca: base: components_open: opening mtl components
> [dahu138:115761] mca: base: components_open: found loaded component psm2
> [dahu138:115761] mca: base: components_open: component psm2 open function successful
> [dahu138:115760] mca: base: components_register: registering framework mtl components
> [dahu138:115760] mca: base: components_register: found loaded component psm2
> [dahu138:115760] mca: base: components_register: component psm2 register function successful
> [dahu138:115760] mca: base: components_open: opening mtl components
> [dahu138:115760] mca: base: components_open: found loaded component psm2
> [dahu138:115762] mca: base: components_register: registering framework mtl components
> [dahu138:115762] mca: base: components_register: found loaded component psm2
> [dahu138:115760] mca: base: components_open: component psm2 open function successful
> [dahu138:115762] mca: base: components_register: component psm2 register function successful
> [dahu138:115762] mca: base: components_open: opening mtl components
> [dahu138:115762] mca: base: components_open: found loaded component psm2
> [dahu138:115762] mca: base: components_open: component psm2 open function successful
> [dahu138:115760] mca:base:select: Auto-selecting mtl components
> [dahu138:115760] mca:base:select:( mtl) Querying component [psm2]
> [dahu138:115760] mca:base:select:( mtl) Query of component [psm2] set priority to 40
> [dahu138:115761] mca:base:select: Auto-selecting mtl components
> [dahu138:115762] mca:base:select: Auto-selecting mtl components
> [dahu138:115762] mca:base:select:( mtl) Querying component [psm2]
> [dahu138:115762] mca:base:select:( mtl) Query of component [psm2] set priority to 40
> [dahu138:115762] mca:base:select:( mtl) Selected component [psm2]
> [dahu138:115762] select: initializing mtl component psm2
> [dahu138:115761] mca:base:select:( mtl) Querying component [psm2]
> [dahu138:115761] mca:base:select:( mtl) Query of component [psm2] set priority to 40
> [dahu138:115761] mca:base:select:( mtl) Selected component [psm2]
> [dahu138:115761] select: initializing mtl component psm2
> [dahu138:115760] mca:base:select:( mtl) Selected component [psm2]
> [dahu138:115760] select: initializing mtl component psm2
> [dahu138:115763] mca:base:select: Auto-selecting mtl components
> [dahu138:115763] mca:base:select:( mtl) Querying component [psm2]
> [dahu138:115763] mca:base:select:( mtl) Query of component [psm2] set priority to 40
> [dahu138:115763] mca:base:select:( mtl) Selected component [psm2]
> [dahu138:115763] select: initializing mtl component psm2
> [dahu138:115761] select: init returned success
> [dahu138:115761] select: component psm2 selected
> [dahu138:115762] select: init returned success
> [dahu138:115762] select: component psm2 selected
> [dahu138:115763] select: init returned success
> [dahu138:115763] select: component psm2 selected
> [dahu138:115760] select: init returned success
> [dahu138:115760] select: component psm2 selected
> On 1 found 1007 but expect 3007
> On 2 found 1007 but expect 4007
>
> and with this setup the code freezes with this dimension of the problem.
>
> Below is the same code with my no-ib setup of OpenMPI on the same node:
>
> dahu138 : mpirun -np 4 -mca mtl_base_verbose 99 ./test_layout_array
> [dahu138:116723] mca: base: components_register: registering framework mtl components
> [dahu138:116723] mca: base: components_open: opening mtl components
> [dahu138:116724] mca: base: components_register: registering framework mtl components
> [dahu138:116724] mca: base: components_open: opening mtl components
> [dahu138:116726] mca: base: components_register: registering framework mtl components
> [dahu138:116726] mca: base: components_open: opening mtl components
> [dahu138:116725] mca: base: components_register: registering framework mtl components
> [dahu138:116725] mca: base: components_open: opening mtl components
> [INFO MEMORY] : processor 0 uses 9948 kb max of resident memory
> [INFO MEMORY] : processor 0 uses 9948 kb max of resident memory
>
> The test case used is provided as an attachment, but as it runs on many
> OS/OpenMPI/hardware combinations I do not think the problem is in the
> test case, even if that is also a possibility.
>
> Patrick
>
> <test_layout_array.tar.gz>
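
For readers following the thread without the attachment, the kind of check Patrick describes (initialize each cell from its location, redistribute from an X-split to a Y-split, then compare every cell with its expected value) can be sketched without MPI. This is only an illustration under stated assumptions, not the attached test: the block decomposition, the `encode` formula (guessed from values such as 3007 in the output), and the names `block_range`, `redistribute_x_to_y` and `check` are all hypothetical, and the exchange is simulated in plain Python rather than done with mpi_Alltoallv and subarray types.

```python
def block_range(n, nprocs, rank):
    """Half-open [lo, hi) index range owned by `rank` in a block split of n items."""
    base, extra = divmod(n, nprocs)
    lo = rank * base + min(rank, extra)
    hi = lo + base + (1 if rank < extra else 0)
    return lo, hi


def encode(i, j):
    # Hypothetical location-based value: 1-based x index in the thousands,
    # 1-based y index in the units (an assumption, not the real test's formula).
    return 1000 * (i + 1) + (j + 1)


def redistribute_x_to_y(nx, ny, nprocs):
    """Simulate moving a 2D array from an X-split layout to a Y-split layout."""
    # Initial layout: each rank owns a block of x indices and every y index.
    x_layout = []
    for rank in range(nprocs):
        xlo, xhi = block_range(nx, nprocs, rank)
        x_layout.append({(i, j): encode(i, j)
                         for i in range(xlo, xhi) for j in range(ny)})

    # Simulated all-to-all: every source rank sends each destination rank
    # the cells whose y index falls in the destination's y block.
    y_layout = [dict() for _ in range(nprocs)]
    for src in range(nprocs):
        for dst in range(nprocs):
            ylo, yhi = block_range(ny, nprocs, dst)
            for (i, j), value in x_layout[src].items():
                if ylo <= j < yhi:
                    y_layout[dst][(i, j)] = value
    return y_layout


def check(y_layout, nx, ny, nprocs):
    """Return (rank, found, expected) for every cell that is wrong or missing."""
    errors = []
    for rank in range(nprocs):
        ylo, yhi = block_range(ny, nprocs, rank)
        for i in range(nx):
            for j in range(ylo, yhi):
                found = y_layout[rank].get((i, j))
                if found != encode(i, j):
                    errors.append((rank, found, encode(i, j)))
    return errors


if __name__ == "__main__":
    layout = redistribute_x_to_y(nx=5, ny=7, nprocs=4)
    for rank, found, expected in check(layout, 5, 7, 4):
        print(f"On {rank} found {found} but expect {expected}")
```

Because each value encodes its own coordinates, a misplaced cell identifies both where the data came from and where it should have gone, which is what makes the "found ... but expect ..." lines in the failing run informative.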