Patrick, is your application multi-threaded? PSM2 was not originally designed 
for multiple threads per process.

I do know that the OSU alltoallv test passes when I try it.

> On Jan 25, 2021, at 12:57 PM, Patrick Begou via users 
> <users@lists.open-mpi.org> wrote:
> 
> Hi Howard and Michael,
> 
> thanks for your feedback. I did not want to write a too-long mail with
> irrelevant information, so I just showed how the two different builds
> give different results. I'm using a small test case based on my large
> code, the same one used to show the memory leak with mpi_Alltoallv calls,
> but running just 2 iterations. It is a 2D case, and data storage is moved
> from a distribution "along X axis" to "along Y axis" with mpi_Alltoallv
> and subarray types. Data initialization is based on the location in
> the array, which allows checking for correct exchanges.
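
A minimal sketch of the location-based check described above, in plain
Python with no MPI, in case it helps readers following the thread. It is
not the attached test case. The encoding 1000*(i+1) + (j+1), the block
decomposition, and all function names are assumptions made for
illustration; the real code uses mpi_Alltoallv with subarray types.

```python
def block_range(n, nprocs, rank):
    """Half-open range [lo, hi) of the block of n items owned by rank."""
    base, extra = divmod(n, nprocs)
    lo = rank * base + min(rank, extra)
    hi = lo + base + (1 if rank < extra else 0)
    return lo, hi


def encode(i, j):
    # Hypothetical encoding: the value identifies its global location.
    return 1000 * (i + 1) + (j + 1)


def redistribute(nx, ny, nprocs):
    """Move data from slabs along X to slabs along Y (simulated exchange)."""
    # Initial layout: rank r stores x_slabs[r][i_local][j] for all j.
    x_slabs = []
    for r in range(nprocs):
        xlo, xhi = block_range(nx, nprocs, r)
        x_slabs.append([[encode(i, j) for j in range(ny)]
                        for i in range(xlo, xhi)])

    # Final layout: rank r stores y_slabs[r][i][j_local] for all i.
    y_slabs = []
    for r in range(nprocs):
        ylo, yhi = block_range(ny, nprocs, r)
        y_slabs.append([[None] * (yhi - ylo) for _ in range(nx)])

    # Each (src, dst) pair exchanges one rectangular sub-block.
    for src in range(nprocs):
        xlo, xhi = block_range(nx, nprocs, src)
        for dst in range(nprocs):
            ylo, yhi = block_range(ny, nprocs, dst)
            for i in range(xlo, xhi):
                for j in range(ylo, yhi):
                    y_slabs[dst][i][j - ylo] = x_slabs[src][i - xlo][j]
    return y_slabs


def check(nx, ny, nprocs, y_slabs):
    """Return (rank, i, j, found, expected) for every misplaced value."""
    errors = []
    for r in range(nprocs):
        ylo, yhi = block_range(ny, nprocs, r)
        for i in range(nx):
            for j in range(ylo, yhi):
                found = y_slabs[r][i][j - ylo]
                if found != encode(i, j):
                    errors.append((r, i, j, found, encode(i, j)))
    return errors
```

With a correct exchange, check() returns an empty list. Any value that
lands in the wrong place is reported with the rank, the global indices,
and the found and expected values.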
> 
> When the program runs correctly (on 4 processes in my test) it should
> only show the max RSS size of the processes. When it fails it shows the
> invalid locations. I've drastically reduced the size of the problem to
> nx=5 and ny=7.
> 
> Launching the non-working setup with more verbose output shows:
> 
> dahu138 : mpirun -np 4 -mca mtl_base_verbose 99 ./test_layout_array
> [dahu138:115761] mca: base: components_register: registering framework
> mtl components
> [dahu138:115763] mca: base: components_register: registering framework
> mtl components
> [dahu138:115763] mca: base: components_register: found loaded component psm2
> [dahu138:115763] mca: base: components_register: component psm2 register
> function successful
> [dahu138:115763] mca: base: components_open: opening mtl components
> [dahu138:115763] mca: base: components_open: found loaded component psm2
> [dahu138:115761] mca: base: components_register: found loaded component psm2
> [dahu138:115763] mca: base: components_open: component psm2 open
> function successful
> [dahu138:115761] mca: base: components_register: component psm2 register
> function successful
> [dahu138:115761] mca: base: components_open: opening mtl components
> [dahu138:115761] mca: base: components_open: found loaded component psm2
> [dahu138:115761] mca: base: components_open: component psm2 open
> function successful
> [dahu138:115760] mca: base: components_register: registering framework
> mtl components
> [dahu138:115760] mca: base: components_register: found loaded component psm2
> [dahu138:115760] mca: base: components_register: component psm2 register
> function successful
> [dahu138:115760] mca: base: components_open: opening mtl components
> [dahu138:115760] mca: base: components_open: found loaded component psm2
> [dahu138:115762] mca: base: components_register: registering framework
> mtl components
> [dahu138:115762] mca: base: components_register: found loaded component psm2
> [dahu138:115760] mca: base: components_open: component psm2 open
> function successful
> [dahu138:115762] mca: base: components_register: component psm2 register
> function successful
> [dahu138:115762] mca: base: components_open: opening mtl components
> [dahu138:115762] mca: base: components_open: found loaded component psm2
> [dahu138:115762] mca: base: components_open: component psm2 open
> function successful
> [dahu138:115760] mca:base:select: Auto-selecting mtl components
> [dahu138:115760] mca:base:select:(  mtl) Querying component [psm2]
> [dahu138:115760] mca:base:select:(  mtl) Query of component [psm2] set
> priority to 40
> [dahu138:115761] mca:base:select: Auto-selecting mtl components
> [dahu138:115762] mca:base:select: Auto-selecting mtl components
> [dahu138:115762] mca:base:select:(  mtl) Querying component [psm2]
> [dahu138:115762] mca:base:select:(  mtl) Query of component [psm2] set
> priority to 40
> [dahu138:115762] mca:base:select:(  mtl) Selected component [psm2]
> [dahu138:115762] select: initializing mtl component psm2
> [dahu138:115761] mca:base:select:(  mtl) Querying component [psm2]
> [dahu138:115761] mca:base:select:(  mtl) Query of component [psm2] set
> priority to 40
> [dahu138:115761] mca:base:select:(  mtl) Selected component [psm2]
> [dahu138:115761] select: initializing mtl component psm2
> [dahu138:115760] mca:base:select:(  mtl) Selected component [psm2]
> [dahu138:115760] select: initializing mtl component psm2
> [dahu138:115763] mca:base:select: Auto-selecting mtl components
> [dahu138:115763] mca:base:select:(  mtl) Querying component [psm2]
> [dahu138:115763] mca:base:select:(  mtl) Query of component [psm2] set
> priority to 40
> [dahu138:115763] mca:base:select:(  mtl) Selected component [psm2]
> [dahu138:115763] select: initializing mtl component psm2
> [dahu138:115761] select: init returned success
> [dahu138:115761] select: component psm2 selected
> [dahu138:115762] select: init returned success
> [dahu138:115762] select: component psm2 selected
> [dahu138:115763] select: init returned success
> [dahu138:115763] select: component psm2 selected
> [dahu138:115760] select: init returned success
> [dahu138:115760] select: component psm2 selected
> On 1 found 1007 but expect 3007
> On 2 found 1007 but expect 4007
> 
> and with this setup the code freezes at this problem size.
> 
> 
> Below is the same code with my no-ib build of OpenMPI on the same node:
> 
> dahu138 : mpirun -np 4 -mca mtl_base_verbose 99 ./test_layout_array
> [dahu138:116723] mca: base: components_register: registering framework
> mtl components
> [dahu138:116723] mca: base: components_open: opening mtl components
> [dahu138:116724] mca: base: components_register: registering framework
> mtl components
> [dahu138:116724] mca: base: components_open: opening mtl components
> [dahu138:116726] mca: base: components_register: registering framework
> mtl components
> [dahu138:116726] mca: base: components_open: opening mtl components
> [dahu138:116725] mca: base: components_register: registering framework
> mtl components
> [dahu138:116725] mca: base: components_open: opening mtl components
> [INFO MEMORY] : processor 0 uses  9948 kb max of resident memory
> [INFO MEMORY] : processor 0 uses  9948 kb max of resident memory
> 
> The test case used is provided as an attachment, but since it runs on
> many OS/OpenMPI/hardware combinations I do not think the problem is in
> the test case, even if that remains a possibility.
> 
> Patrick
> 
> <test_layout_array.tar.gz>