https://gcc.gnu.org/bugzilla/show_bug.cgi?id=121760

            Bug ID: 121760
           Summary: libgomp: Trying to map into device
                    [0x7fffa34613f0..0x7fffa3461420) object when
                    [0x7fffa34613d0..0x7fffa3461400) is already mapped
           Product: gcc
           Version: 15.2.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: libgomp
          Assignee: unassigned at gcc dot gnu.org
          Reporter: schulz.benjamin at googlemail dot com
                CC: jakub at gcc dot gnu.org
  Target Milestone: ---

Created attachment 62268
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=62268&action=edit
strassen_algorithm

Hi there, according to the OpenMP standard, mapping an object twice should just
increase its reference counter.
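
As a minimal sketch of the semantics I rely on (buf and refcount_demo are just
made-up names for this illustration):

void refcount_demo(int dev)
{
    double buf[64];
    #pragma omp target enter data map(to: buf[0:64]) device(dev)      // refcount 0 -> 1
    #pragma omp target enter data map(to: buf[0:64]) device(dev)      // refcount 1 -> 2, no new mapping
    #pragma omp target exit data map(release: buf[0:64]) device(dev)  // refcount 2 -> 1
    #pragma omp target exit data map(release: buf[0:64]) device(dev)  // refcount 1 -> 0, unmapped
}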

I have now written a little project which has the Strassen algorithm (a
matrix-multiplication algorithm) implemented on device. I've written a class
called datastruct, which contains fields with pointer variables for the pdata,
the pextents, the pstrides, and also a field which indicates whether the data
is a device pointer. I've written update and alloc functions that recognize
whether the data field is a device pointer; in that case they just update the
strides and extents...


In my algorithm, I do not call OpenMP runtime functions to check whether an
object is already mapped.

I think that's a waste of time, since OpenMP already has these nice reference
counters that should do this job for you...

Well, the Strassen algorithm needs many temporary buffers. I create these with
omp_target_alloc, and then set the dpdata_is_devptr field to true.
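
Roughly sketched, such a temporary is set up like this (make_temp is just a
made-up name for illustration; the fields are those of my datastruct class):

#include <omp.h>

// Allocate the data of a temporary directly on the device and flag it,
// so that the mapping helpers later leave the pdata field alone.
template <typename T>
void make_temp(datastruct<T>& tmp, size_t n, int devicenum)
{
    tmp.dpdata = static_cast<T*>(omp_target_alloc(n * sizeof(T), devicenum));
    tmp.dpdatalength = n;
    tmp.dpdata_is_devptr = true;
}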

On unified shared memory, my code works, of course. On CPU, it also works.

On separate device memory, it yields this funny output with gcc-15.2:

libgomp: Trying to map into device [0x7ffdb0fa1010..0x7ffdb0fa1040) object when
[0x7ffdb0fa0ff0..0x7ffdb0fa1020) is already mapped

I traced the cause to a call to matrix_multiply_dot inside strassen_multiply_h,
at line 388 of mathfunctions_mpi.h.

Here is the snippet:

    bool separate_device_memory;
    if(ongpu)
    {
#if defined(Unified_Shared_Memory)
        separate_device_memory=false;
#else
        separate_device_memory=true;
#endif
    }
    else
    {
        separate_device_memory=false;
    }

    // Base case: if no dimension is divisible by 2, use standard multiplication
    if ((n%2!=0) || (m%2!=0) || (p%2!=0) || m<=2 || n<=2 || p<=2 || !policy.should_use_recursion(n*p))
    {
        if(ongpu)
        {
            GPU_Math_Functions<T>::matrix_multiply_dot_g(A,B,C,policy.devicenum,policy.update_host);
            return;
        }
        else
        {
            In_Kernel_Mathfunctions<T>::matrix_multiply_dot_w(A, B, C);
            return;
        }
    }

    //now the rest of the strassen algorithm begins...

    if(separate_device_memory)
    {
        Datastruct_GPU_Memory_Functions<T>::create_in_struct(A,policy.devicenum);
        Datastruct_GPU_Memory_Functions<T>::create_in_struct(B,policy.devicenum);
        Datastruct_GPU_Memory_Functions<T>::create_out_struct(C,policy.devicenum);
    }



The function matrix_multiply_dot_g has the following code:

template <typename T>
void GPU_Math_Functions<T>::matrix_multiply_dot_g(datastruct<T>& A, datastruct<T>& B, datastruct<T>& C, int dev, bool update_host)
{
    const size_t rows=A.dpextents[0];
    const size_t cols=B.dpextents[1];
    const size_t inner_dim=A.dpextents[1];

    //these functions check isdevptr to see whether the data was allocated with
    //malloc. They only offload if that is not the case.
    typename Datastruct_GPU_Memory_Functions<T>::OffloadHelper offloadA(A, dev, false, false);
    typename Datastruct_GPU_Memory_Functions<T>::OffloadHelper offloadB(B, dev, false, false);
    typename Datastruct_GPU_Memory_Functions<T>::OffloadHelper offloadC(C, dev, true, update_host);

    #pragma omp target teams distribute parallel for collapse(2) shared(A,B,C,rows,cols,inner_dim) device(dev)
    for (size_t i = 0; i < rows; ++i)
    {
        for (size_t j = 0; j < cols; ++j)
        {
            T sum = 0;
            #pragma omp simd reduction(+:sum)
            for (size_t k = 0; k < inner_dim; ++k)
            {
                sum += A(i,k)*B(k,j);
            }
            C(i,j) = sum;
        }
    }
}

The created OffloadHelper object is a temporary that lives for the duration of
the enclosing block: its constructor calls the mapping functions and its
destructor calls the release functions:

class OffloadHelper
{
protected:
    bool pupdate_host;
    datastruct<T> pdL;
    int pdevicenum;
public:
    inline OffloadHelper(datastruct<T>& dL, int devicenum, bool just_alloc, bool update_host_on_exit)
        :pupdate_host(update_host_on_exit), pdL(dL), pdevicenum(devicenum)
    {
#if !defined(Unified_Shared_Memory)
        if (just_alloc)
            Datastruct_GPU_Memory_Functions::create_out_struct(dL, devicenum);
        else
            Datastruct_GPU_Memory_Functions::create_in_struct(dL, devicenum);
#endif
    }

    inline ~OffloadHelper()
    {
#if !defined(Unified_Shared_Memory)
        if (pupdate_host)
        {
            Datastruct_GPU_Memory_Functions::update_host(pdL, pdevicenum);
        }
        Datastruct_GPU_Memory_Functions::release_struct(pdL, pdevicenum);
#endif
    }

    OffloadHelper(const OffloadHelper&) = delete;
    OffloadHelper& operator=(const OffloadHelper&) = delete;
};


where create_in_struct maps with map(to:), create_out_struct maps the data
field with map(alloc:), update_host calls target update from, and
release_struct issues target exit data with map(release:), which should just
decrease the reference counter. All of these functions leave the pdata field
alone if it is a device pointer:


template<typename T>
bool Datastruct_GPU_Memory_Functions<T>::update_host_data(datastruct<T>& dL, int devicenum)
{
#if !defined(Unified_Shared_Memory)
    size_t l=dL.dpdatalength;

    if(!dL.dpdata_is_devptr)
    {
        #pragma omp target update from(dL.dpdata[0:l]) device(devicenum)
        return true;
    }
    else
        return false;
#endif
    return true;
}


template<typename T>
bool Datastruct_GPU_Memory_Functions<T>::update_host(datastruct<T>& dL, int devicenum)
{
#if !defined(Unified_Shared_Memory)
    size_t l=dL.dpdatalength;
    size_t r=dL.dprank;

    #pragma omp target update from(dL) device(devicenum)
    #pragma omp target update from(dL.dpstrides[0:r]) device(devicenum)
    #pragma omp target update from(dL.dpextents[0:r]) device(devicenum)
    if(!dL.dpdata_is_devptr)
    {
        #pragma omp target update from(dL.dpdata[0:l]) device(devicenum)
        return true;
    }
    else
        return false;
#endif
    return true;
}




template<typename T>
void Datastruct_GPU_Memory_Functions<T>::create_out_struct(datastruct<T>& dA, int devicenum)
{
#if !defined(Unified_Shared_Memory)
    size_t l=dA.dpdatalength;
    size_t r=dA.dprank;
    #pragma omp target enter data map(to: dA) device(devicenum)
    if(!dA.dpdata_is_devptr)
    {
        #pragma omp target enter data map(alloc: dA.dpdata[0:l]) device(devicenum)
    }
    #pragma omp target enter data map(to: dA.dpextents[0:r]) device(devicenum)
    #pragma omp target enter data map(to: dA.dpstrides[0:r]) device(devicenum)
#endif
}

template<typename T>
void Datastruct_GPU_Memory_Functions<T>::create_in_struct(datastruct<T>& dA, int devicenum)
{
#if !defined(Unified_Shared_Memory)
    size_t l=dA.dpdatalength;
    size_t r=dA.dprank;
    #pragma omp target enter data map(to: dA) device(devicenum)
    if(!dA.dpdata_is_devptr)
    {
        #pragma omp target enter data map(to: dA.dpdata[0:l]) device(devicenum)
    }
    #pragma omp target enter data map(to: dA.dpextents[0:r]) device(devicenum)
    #pragma omp target enter data map(to: dA.dpstrides[0:r]) device(devicenum)
#endif
}



So, all in all, the call to

GPU_Math_Functions<T>::matrix_multiply_dot_g

will indeed map the objects A, B, and C. When this happens in a recursive call
where the temporary matrices' data were allocated by omp_target_alloc, as in
my implementation of the Strassen algorithm, the extents and the strides get
mapped twice. That should just be a reference-counter increase; after the
naive multiplication the counter should be decreased again, while the pdata
field should be left as is...
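
Distilled, the sequence that I read from my own code, and that I think should
be legal, is this (the reference counts in the comments are what I expect from
the spec):

// Distilled sequence for one matrix A in the recursive case. dev stands for
// policy.devicenum; the expected reference counts refer to A.dpextents and
// A.dpstrides (A.dpdata is a device pointer and is skipped by the helpers).
Datastruct_GPU_Memory_Functions<T>::create_in_struct(A, dev);   // refcount 0 -> 1
{
    // inside matrix_multiply_dot_g, OffloadHelper maps A a second time:
    typename Datastruct_GPU_Memory_Functions<T>::OffloadHelper
        offloadA(A, dev, false, false);                         // refcount 1 -> 2
    // ... target region runs ...
} // ~OffloadHelper calls release_struct: refcount 2 -> 1
Datastruct_GPU_Memory_Functions<T>::release_struct(A, dev);     // refcount 1 -> 0, unmapped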

Interestingly, the error message only appears a few computations later, in
other functions..



But if I inline the code of matrix_multiply_dot_g into the calling function,
i.e., if I write

    bool ongpu=policy.should_use_gpu(A,B,C,Math_Functions_Policy::default_cubic_treshold,7);

    bool separate_device_memory;
    if(ongpu)
    {
#if defined(Unified_Shared_Memory)
        separate_device_memory=false;
#else
        separate_device_memory=true;
#endif
    }
    else
    {
        separate_device_memory=false;
    }

    if(separate_device_memory)
    {
        Datastruct_GPU_Memory_Functions<T>::create_in_struct(A,policy.devicenum);
        Datastruct_GPU_Memory_Functions<T>::create_in_struct(B,policy.devicenum);
        Datastruct_GPU_Memory_Functions<T>::create_out_struct(C,policy.devicenum);
    }
    // Base case: if no dimension is divisible by 2, use standard multiplication
    if ((n%2!=0) || (m%2!=0) || (p%2!=0) || m<=2 || n<=2 || p<=2 || !policy.should_use_recursion(n*p))
    {
        if(ongpu)
        {
            const size_t rows=A.dpextents[0];
            const size_t cols=B.dpextents[1];
            const size_t inner_dim=A.dpextents[1];
            #pragma omp target teams distribute parallel for collapse(2) shared(A,B,C,rows,cols,inner_dim) device(policy.devicenum)
            for (size_t i = 0; i < rows; ++i)
            {
                for (size_t j = 0; j < cols; ++j)
                {
                    T sum = 0;
                    #pragma omp simd reduction(+:sum)
                    for (size_t k = 0; k < inner_dim; ++k)
                    {
                        sum += A(i,k)*B(k,j);
                    }
                    C(i,j) = sum;
                }
            }

            if(separate_device_memory)
            {
                Datastruct_GPU_Memory_Functions<T>::release_struct(A,policy.devicenum);
                Datastruct_GPU_Memory_Functions<T>::release_struct(B,policy.devicenum);

                if(policy.update_host)
                {
                    Datastruct_GPU_Memory_Functions<T>::update_host_data(C,policy.devicenum);
                }
                Datastruct_GPU_Memory_Functions<T>::release_struct(C,policy.devicenum);
            }

            return;
        }
        else
        {
            In_Kernel_Mathfunctions<T>::matrix_multiply_dot_w(A, B, C);
            return;
        }
    }


//now the strassen algorithm can proceed further....




Then suddenly the error message is gone...

But again, mapping an object twice should be allowed with gcc-15.2, and it
should not cost anything. I don't want to have to call omp_get_mapped_ptr
before every mapping and check whether the data is already mapped.
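
For reference, such a presence check before each mapping would have to look
roughly like this (map_extents_if_absent is a made-up name for illustration;
omp_get_mapped_ptr is the OpenMP 5.1 runtime routine):

#include <omp.h>

// Only map the extents if they are not already present on device dev.
template <typename T>
void map_extents_if_absent(datastruct<T>& dA, int dev)
{
    size_t r = dA.dprank;
    if (omp_get_mapped_ptr(dA.dpextents, dev) == nullptr)
    {
        #pragma omp target enter data map(to: dA.dpextents[0:r]) device(dev)
    }
}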




I will now attach the files of my project... The Strassen algorithm can use
Open MPI, so the header is required, but for compiling and executing the test
application, I've disabled the usage of MPI.

The CMakeLists.txt will create two applications. The crash with the error
message appears in the created arraytest executable with gcc-15.2 and the
settings

-fopenmp -foffload=nvptx-none -fno-stack-protector -fno-math-errno
-fno-trapping-math -Wall

with this linker command:

    target_link_libraries(arraytest PRIVATE rt m c stdc++ mpi)



I am sorry for the large test case. Please believe me that I've tried to
reproduce it with smaller test cases; there, it did not appear.

Note that the problem also does not appear when I reduce main_omp.cpp to just
do the multiplication... So it seems that the problem only appears when
further computations are done afterwards. Yet the issue seems to be this call.

Interestingly, the error also only appears if, long after the Strassen
algorithm, I make the call

    cout<<"On gpu"<<std::endl;
    p5.mode=Math_Functions_Policy::GPU_ONLY;
    p5.memmapped_files=false;

    Math_Functions<double>::qr_decomposition(A4,Q4,R4,&p5);
    Q4.printmatrix();
    R4.printmatrix();

in main_omp.cpp. If I switch here to CPU_ONLY, then the error message
disappears. But I did not find any problems in the qr_decomposition.

So, after getting desperate, I inlined the naive multiplication call in the
Strassen algorithm, since that was mostly where objects were mapped several
times, and then the issue disappeared.

Note that the strides and extents are still mapped twice, just not inside a
function call. But then the error is gone.
