Hi Patrick,

Glad to hear you are now able to move forward.

Please keep in mind this is not a fix but a temporary workaround.
At first glance, I did not spot any issue in the current code.
It just turned out that the memory leak disappeared when doing things differently.

Cheers,

Gilles

On Mon, Dec 14, 2020 at 7:11 PM Patrick Bégou via users <
users@lists.open-mpi.org> wrote:

> Hi Gilles,
>
> you caught the bug! With this patch, on a single node, the memory leak
> disappears. The cluster is actually overloaded; as soon as possible I will
> launch a multi-node test.
> Below is the memory used by rank 0 before (blue) and after (red) the patch.
>
> Thanks
>
> Patrick
>
>
> Le 10/12/2020 à 10:15, Gilles Gouaillardet via users a écrit :
>
> Patrick,
>
>
> First, thank you very much for sharing the reproducer.
>
>
> Yes, please open a github issue so we can track this.
>
>
> I do not yet fully understand where the leak is coming from, but so far:
>
>  - the code fails on master built with --enable-debug (the data engine
> reports an error) but not with the v3.1.x branch
>
>   (this suggests there could be an error in the latest Open MPI ... or in
> the code)
>
>  - the attached patch seems to have a positive effect; can you please give
> it a try?
>
>
> Cheers,
>
>
> Gilles
>
>
>
> On 12/7/2020 6:15 PM, Patrick Bégou via users wrote:
>
> Hi,
>
> I've written a small piece of code to show the problem. It is based on my
> application, but in 2D and using integer arrays for testing.
> The figure below shows the max RSS size of the rank 0 process over 20000
> iterations on 8 and 16 cores, with the openib and tcp drivers.
> The more processes I have, the larger the memory leak. I use the same
> binaries for the 4 runs and OpenMPI 3.1 (same behavior with 4.0.5).
> The code is in attachment. I'll try to check type deallocation as soon as
> possible.
>
> Patrick
>
>
>
>
> Le 04/12/2020 à 01:34, Gilles Gouaillardet via users a écrit :
>
> Patrick,
>
>
> based on George's idea, a simpler check is to retrieve the Fortran index
> via the (standard) MPI_Type_c2f() function after you create a derived
> datatype.
>
>
> If the index keeps growing forever even after you MPI_Type_free(), then
> this clearly indicates a leak.
>
> Unfortunately, this simple test cannot be used to definitively rule out any
> memory leak.
>
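> For example, a minimal sketch of that check (assuming the 'mpi' Fortran
> module, where the INTEGER handle returned for a new datatype is itself the
> Fortran index, so there is no need to call MPI_Type_c2f() from Fortran)
> could look like:
>
>   program check_type_index
>     use mpi
>     implicit none
>     integer :: ierr, newtype, it
>     integer :: sizes(2), subsizes(2), starts(2)
>
>     call MPI_Init(ierr)
>     sizes    = (/ 64, 64 /)
>     subsizes = (/ 16, 16 /)
>     starts   = (/  0,  0 /)
>
>     do it = 1, 10
>        call MPI_Type_create_subarray(2, sizes, subsizes, starts, &
>             MPI_ORDER_FORTRAN, MPI_INTEGER, newtype, ierr)
>        call MPI_Type_commit(newtype, ierr)
>        ! if this value keeps growing even though MPI_Type_free is called
>        ! below, the datatype object is not being released
>        print *, 'iteration', it, 'datatype index =', newtype
>        call MPI_Type_free(newtype, ierr)
>     end do
>
>     call MPI_Finalize(ierr)
>   end program check_type_index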
>
> Note you can also
>
> mpirun --mca pml ob1 --mca btl tcp,self ...
>
> in order to force communications over TCP/IP and hence rule out any memory
> leak that could be triggered by your fast interconnect.
>
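> For example (the process count and executable name are placeholders):
>
>   mpirun --mca pml ob1 --mca btl tcp,self -np 16 ./reproducer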
>
>
> In any case, a reproducer will greatly help us debugging this issue.
>
>
> Cheers,
>
>
> Gilles
>
>
>
> On 12/4/2020 7:20 AM, George Bosilca via users wrote:
>
> Patrick,
>
> I'm afraid there is no simple way to check this. The main reason is that
> OMPI uses handles for MPI objects, and these handles are not tracked by the
> library; they are supposed to be provided by the user for each call. In your
> case, as you have already called MPI_Type_free on the datatype, you cannot
> produce a valid handle.
>
> There might be a trick. If the datatype is manipulated with any Fortran
> MPI functions, then we convert the handle (which in fact is a pointer) to
> an index into a pointer array structure. Thus, the index remains in use,
> and can therefore be used to convert back into a valid datatype pointer,
> until OMPI completely releases the datatype. Look into
> the ompi_datatype_f_to_c_table table to see the datatypes that exist and
> get their pointers, and then use these pointers as arguments to
> ompi_datatype_dump() to see if any of these existing datatypes are the ones
> you define.
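>
> As a rough sketch from a debugger (the index 42 and the field names of the
> OPAL pointer array are assumptions and may differ between Open MPI versions):
>
>   (gdb) print ompi_datatype_f_to_c_table.size
>   (gdb) print ompi_datatype_f_to_c_table.addr[42]
>   (gdb) call ompi_datatype_dump((ompi_datatype_t *) ompi_datatype_f_to_c_table.addr[42])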
>
> George.
>
>
>
>
> On Thu, Dec 3, 2020 at 4:44 PM Patrick Bégou via users <
> users@lists.open-mpi.org> wrote:
>
>     Hi,
>
>     I'm trying to solve a memory leak that appeared with my new
>     implementation of communications based on MPI_Alltoallw and
>     MPI_Type_create_subarray calls. Arrays of subarray types are
>     created/destroyed at each time step and used for communications.
>
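>     To give an idea, the per-time-step pattern looks roughly like the
>     sketch below (sizes, names and the decomposition are placeholders,
>     not my actual code):
>
>       program leak_pattern
>         use mpi
>         implicit none
>         integer, parameter :: n = 8, nsteps = 20000
>         integer :: ierr, rank, nprocs, p, step
>         integer :: gsizes(2), subsizes(2), starts(2)
>         integer, allocatable :: sbuf(:,:), rbuf(:,:)
>         integer, allocatable :: stypes(:), rtypes(:), counts(:), displs(:)
>
>         call MPI_Init(ierr)
>         call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
>         call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)
>
>         allocate(sbuf(n, n*nprocs), rbuf(n, n*nprocs))
>         allocate(stypes(nprocs), rtypes(nprocs), counts(nprocs), displs(nprocs))
>         sbuf   = rank
>         counts = 1        ! one derived type per peer
>         displs = 0        ! offsets are carried by the subarray types
>
>         gsizes   = (/ n, n*nprocs /)
>         subsizes = (/ n, n /)
>
>         do step = 1, nsteps
>            ! one send and one recv subarray type per peer, rebuilt every step
>            do p = 1, nprocs
>               starts = (/ 0, (p-1)*n /)
>               call MPI_Type_create_subarray(2, gsizes, subsizes, starts, &
>                    MPI_ORDER_FORTRAN, MPI_INTEGER, stypes(p), ierr)
>               call MPI_Type_commit(stypes(p), ierr)
>               call MPI_Type_create_subarray(2, gsizes, subsizes, starts, &
>                    MPI_ORDER_FORTRAN, MPI_INTEGER, rtypes(p), ierr)
>               call MPI_Type_commit(rtypes(p), ierr)
>            end do
>
>            call MPI_Alltoallw(sbuf, counts, displs, stypes, &
>                               rbuf, counts, displs, rtypes, MPI_COMM_WORLD, ierr)
>
>            ! every type is freed each step, yet RSS keeps growing on the cluster
>            do p = 1, nprocs
>               call MPI_Type_free(stypes(p), ierr)
>               call MPI_Type_free(rtypes(p), ierr)
>            end do
>         end do
>
>         call MPI_Finalize(ierr)
>       end program leak_pattern
>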
>     On my laptop the code runs fine (running for 15000 temporal
>     iterations on 32 processes with oversubscription), but on our
>     cluster the memory used by the code increases until the OOM killer
>     stops the job. On the cluster we use IB QDR for communications.
>
>     Same Gcc/Gfortran 7.3 (built from sources), same sources of
>     OpenMPI (3.1 or 4.0.5 tested), same sources of the Fortran code on
>     the laptop and on the cluster.
>
>     Using Gcc/Gfortran 4.8 and OpenMPI 1.7.3 on the cluster does not
>     show the problem (resident memory does not increase and we ran
>     100000 temporal iterations).
>
>     The MPI_Type_free manual says that it "Marks the datatype object
>     associated with datatype for deallocation". But how can I check
>     that the deallocation is really done?
>
>     Thanks for any suggestions.
>
>     Patrick
>
>
>
>
