https://gcc.gnu.org/bugzilla/show_bug.cgi?id=121760
            Bug ID: 121760
           Summary: libgomp: Trying to map into device
                    [0x7fffa34613f0..0x7fffa3461420) object when
                    [0x7fffa34613d0..0x7fffa3461400) is already mapped
           Product: gcc
           Version: 15.2.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: libgomp
          Assignee: unassigned at gcc dot gnu.org
          Reporter: schulz.benjamin at googlemail dot com
                CC: jakub at gcc dot gnu.org
  Target Milestone: ---

Created attachment 62268
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=62268&action=edit
strassen_algorithm

Hi there,

according to the OpenMP standard, mapping an object twice should just increase its reference counter.

I have written a small project that implements the Strassen algorithm, a recursive matrix multiplication, on device. It uses a class called datastruct, which has pointer fields pdata, pextents and pstrides, plus a field that indicates whether the data is a device pointer. My update and alloc functions recognize whether the data field is a device pointer; in that case they only update the strides and extents.

In my algorithm, I do not call OpenMP runtime functions to check whether an object is already mapped. I think that is a waste of time, since OpenMP already has these nice reference counters that should do this job for you.

The Strassen algorithm needs many temporary matrices. I allocate their data with omp_target_alloc and then set the device-pointer field to true. With unified shared memory my code works, of course, and on the CPU it also works. With separate device memory, however, GCC 15.2 yields this output:

libgomp: Trying to map into device [0x7ffdb0fa1010..0x7ffdb0fa1040) object when [0x7ffdb0fa0ff0..0x7ffdb0fa1020) is already mapped

I traced the cause to the call of matrix_multiply_dot in strassen_multiply_h, at line 388 of mathfunctions_mpi.h. Here is the snippet:

bool separate_device_memory;
if (ongpu)
{
#if defined(Unified_Shared_Memory)
    separate_device_memory = false;
#else
    separate_device_memory = true;
#endif
}
else
{
    separate_device_memory = false;
}

// Base case: if no dimension is divisible by 2, use standard multiplication
if ((n % 2 != 0) || (m % 2 != 0) || (p % 2 != 0) || m <= 2 || n <= 2 || p <= 2
    || !policy.should_use_recursion(n * p))
{
    if (ongpu)
    {
        GPU_Math_Functions<T>::matrix_multiply_dot_g(A, B, C, policy.devicenum, policy.update_host);
        return;
    }
    else
    {
        In_Kernel_Mathfunctions<T>::matrix_multiply_dot_w(A, B, C);
        return;
    }
}

// now the rest of the Strassen algorithm begins...
if (separate_device_memory)
{
    Datastruct_GPU_Memory_Functions<T>::create_in_struct(A, policy.devicenum);
    Datastruct_GPU_Memory_Functions<T>::create_in_struct(B, policy.devicenum);
    Datastruct_GPU_Memory_Functions<T>::create_out_struct(C, policy.devicenum);
}
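Before showing the mapping helpers, let me state the behavior I am relying on as a minimal, standalone sketch. This is my reading of the OpenMP enter/exit data semantics; it is not code from the attachment:

#include <omp.h>
#include <cstdio>

int main()
{
    double x[8] = {0};

    // First mapping: allocates device storage, reference count becomes 1.
    #pragma omp target enter data map(to: x[0:8])

    // Second mapping of the same range: the data is already present, so no
    // new allocation should happen; the reference count just becomes 2.
    #pragma omp target enter data map(to: x[0:8])

    // Each release decrements the count; the device storage is only freed
    // when the count drops to zero.
    #pragma omp target exit data map(release: x[0:8])
    #pragma omp target exit data map(release: x[0:8])

    std::printf("done\n");
    return 0;
}

This is exactly the pattern that the helper functions below produce implicitly when the recursion maps the same extents and strides twice.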
The function matrix_multiply_dot_g has the following code:

template <typename T>
void GPU_Math_Functions<T>::matrix_multiply_dot_g(datastruct<T>& A, datastruct<T>& B,
                                                  datastruct<T>& C, int dev, bool update_host)
{
    const size_t rows = A.dpextents[0];
    const size_t cols = B.dpextents[1];
    const size_t inner_dim = A.dpextents[1];

    // These helpers check the device-pointer flag to see whether the data was
    // allocated with malloc on the host; they only offload the data field if
    // it is not already a device pointer.
    typename Datastruct_GPU_Memory_Functions<T>::OffloadHelper offloadA(A, dev, false, false);
    typename Datastruct_GPU_Memory_Functions<T>::OffloadHelper offloadB(B, dev, false, false);
    typename Datastruct_GPU_Memory_Functions<T>::OffloadHelper offloadC(C, dev, true, update_host);

    #pragma omp target teams distribute parallel for collapse(2) shared(A,B,C,rows,cols,inner_dim) device(dev)
    for (size_t i = 0; i < rows; ++i)
    {
        for (size_t j = 0; j < cols; ++j)
        {
            T sum = 0;
            #pragma omp simd reduction(+:sum)
            for (size_t k = 0; k < inner_dim; ++k)
            {
                sum += A(i,k) * B(k,j);
            }
            C(i,j) = sum;
        }
    }
}

The created OffloadHelper object is just a temporary that lives for the enclosing block; its constructor calls the mapping functions and its destructor calls the release functions:

class OffloadHelper
{
protected:
    bool pupdate_host;
    datastruct<T> pdL;
    int pdevicenum;
public:
    inline OffloadHelper(datastruct<T>& dL, int devicenum, bool just_alloc, bool update_host_on_exit)
        : pupdate_host(update_host_on_exit), pdL(dL), pdevicenum(devicenum)
    {
#if !defined(Unified_Shared_Memory)
        if (just_alloc)
            Datastruct_GPU_Memory_Functions::create_out_struct(dL, devicenum);
        else
            Datastruct_GPU_Memory_Functions::create_in_struct(dL, devicenum);
#endif
    }
    inline ~OffloadHelper()
    {
#if !defined(Unified_Shared_Memory)
        if (pupdate_host)
        {
            Datastruct_GPU_Memory_Functions::update_host(pdL, pdevicenum);
        }
        Datastruct_GPU_Memory_Functions::release_struct(pdL, pdevicenum);
#endif
    }
    OffloadHelper(const OffloadHelper&) = delete;
    OffloadHelper& operator=(const OffloadHelper&) = delete;
};

Here create_in_struct maps the data with map(to:), create_out_struct maps it with map(alloc:), update_host issues target update from, and release_struct issues target exit data with map(release:), which should just decrease the reference counter. All of these functions leave the pdata field alone if it is a device pointer:

template<typename T>
bool Datastruct_GPU_Memory_Functions<T>::update_host_data(datastruct<T>& dL, int devicenum)
{
#if !defined(Unified_Shared_Memory)
    size_t l = dL.dpdatalength;
    if (!dL.dpdata_is_devptr)
    {
        #pragma omp target update from(dL.dpdata[0:l]) device(devicenum)
        return true;
    }
    else
        return false;
#endif
    return true;
}

template<typename T>
bool Datastruct_GPU_Memory_Functions<T>::update_host(datastruct<T>& dL, int devicenum)
{
#if !defined(Unified_Shared_Memory)
    size_t l = dL.dpdatalength;
    size_t r = dL.dprank;
    #pragma omp target update from(dL) device(devicenum)
    #pragma omp target update from(dL.dpstrides[0:r]) device(devicenum)
    #pragma omp target update from(dL.dpextents[0:r]) device(devicenum)
    if (!dL.dpdata_is_devptr)
    {
        #pragma omp target update from(dL.dpdata[0:l]) device(devicenum)
        return true;
    }
    else
        return false;
#endif
    return true;
}

template<typename T>
void Datastruct_GPU_Memory_Functions<T>::create_out_struct(datastruct<T>& dA, int devicenum)
{
#if !defined(Unified_Shared_Memory)
    size_t l = dA.dpdatalength;
    size_t r = dA.dprank;
    #pragma omp target enter data map(to: dA) device(devicenum)
    if (!dA.dpdata_is_devptr)
    {
        #pragma omp target enter data map(alloc: dA.dpdata[0:l]) device(devicenum)
    }
    #pragma omp target enter data map(to: dA.dpextents[0:r]) device(devicenum)
    #pragma omp target enter data map(to: dA.dpstrides[0:r]) device(devicenum)
#endif
}

template<typename T>
void Datastruct_GPU_Memory_Functions<T>::create_in_struct(datastruct<T>& dA, int devicenum)
{
#if !defined(Unified_Shared_Memory)
    size_t l = dA.dpdatalength;
    size_t r = dA.dprank;
    #pragma omp target enter data map(to: dA) device(devicenum)
    if (!dA.dpdata_is_devptr)
    {
        #pragma omp target enter data map(to: dA.dpdata[0:l]) device(devicenum)
    }
    #pragma omp target enter data map(to: dA.dpextents[0:r]) device(devicenum)
    #pragma omp target enter data map(to: dA.dpstrides[0:r]) device(devicenum)
#endif
}
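Putting these helpers together, the pattern that the Strassen recursion produces for the extents and strides arrays is, distilled into a standalone sketch (all names here are invented for illustration; the real code is in the attachment):

#include <omp.h>
#include <cstddef>

// Invented, minimal stand-in for datastruct<T>.
struct view
{
    double* pdata;          // allocated with omp_target_alloc in my code
    std::size_t* pextents;  // host arrays, mapped by the helpers
    std::size_t* pstrides;  // handled identically to pextents
};

// Stands for matrix_multiply_dot_g: maps the same arrays a second time.
void multiply_leaf(view& v, std::size_t r, int dev)
{
    #pragma omp target enter data map(to: v.pextents[0:r]) device(dev)     // refcount 1 -> 2
    // ... kernel ...
    #pragma omp target exit data map(release: v.pextents[0:r]) device(dev) // refcount 2 -> 1
}

// Stands for strassen_multiply_h: maps first, then calls the leaf.
void strassen_step(view& v, std::size_t r, int dev)
{
    #pragma omp target enter data map(to: v.pextents[0:r]) device(dev)     // refcount 0 -> 1
    multiply_leaf(v, r, dev);
    #pragma omp target exit data map(release: v.pextents[0:r]) device(dev) // refcount 1 -> 0
}

As far as I can tell, this nesting is legal; the next paragraphs describe what happens in the real code.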
So, all in all, the call to GPU_Math_Functions<T>::matrix_multiply_dot_g will indeed map the objects A, B and C. When this happens in a recursive call in which the temporary matrices' data was allocated by omp_target_alloc, as in my implementation of the Strassen algorithm, the extents and strides are mapped twice. That should just increase the reference counter, and after the naive multiplication the counter should be decremented again, while the pdata field should be left as it is.

Interestingly, the error message only appears a few computations of other functions later.

But if I write the code of matrix_multiply_dot_g inline in the function, i.e. if I write

bool ongpu = policy.should_use_gpu(A, B, C, Math_Functions_Policy::default_cubic_treshold, 7);
bool separate_device_memory;
if (ongpu)
{
#if defined(Unified_Shared_Memory)
    separate_device_memory = false;
#else
    separate_device_memory = true;
#endif
}
else
{
    separate_device_memory = false;
}

if (separate_device_memory)
{
    Datastruct_GPU_Memory_Functions<T>::create_in_struct(A, policy.devicenum);
    Datastruct_GPU_Memory_Functions<T>::create_in_struct(B, policy.devicenum);
    Datastruct_GPU_Memory_Functions<T>::create_out_struct(C, policy.devicenum);
}

// Base case: if no dimension is divisible by 2, use standard multiplication
if ((n % 2 != 0) || (m % 2 != 0) || (p % 2 != 0) || m <= 2 || n <= 2 || p <= 2
    || !policy.should_use_recursion(n * p))
{
    if (ongpu)
    {
        const size_t rows = A.dpextents[0];
        const size_t cols = B.dpextents[1];
        const size_t inner_dim = A.dpextents[1];

        #pragma omp target teams distribute parallel for collapse(2) shared(A,B,C,rows,cols,inner_dim) device(policy.devicenum)
        for (size_t i = 0; i < rows; ++i)
        {
            for (size_t j = 0; j < cols; ++j)
            {
                T sum = 0;
                #pragma omp simd reduction(+:sum)
                for (size_t k = 0; k < inner_dim; ++k)
                {
                    sum += A(i,k) * B(k,j);
                }
                C(i,j) = sum;
            }
        }

        if (separate_device_memory)
        {
            Datastruct_GPU_Memory_Functions<T>::release_struct(A, policy.devicenum);
            Datastruct_GPU_Memory_Functions<T>::release_struct(B, policy.devicenum);
            if (policy.update_host)
            {
                Datastruct_GPU_Memory_Functions<T>::update_host_data(C, policy.devicenum);
            }
            Datastruct_GPU_Memory_Functions<T>::release_struct(C, policy.devicenum);
        }
        return;
    }
    else
    {
        In_Kernel_Mathfunctions<T>::matrix_multiply_dot_w(A, B, C);
        return;
    }
}

// now the Strassen algorithm can proceed further...

then suddenly the error message is gone.

But again, mapping an object twice should be allowed with GCC 15.2, and it should not cost anything. I do not want to always call omp_get_mapped_ptr or omp_target_is_present and check whether the data is mapped; a sketch of that workaround is given below.
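The manual presence check would look roughly like this (a sketch only; the helper name is invented, while omp_target_is_present is the standard runtime query):

#include <omp.h>
#include <cstddef>

// Hypothetical helper illustrating the manual check I want to avoid:
// only map the range if it does not already have device storage.
static void map_extents_if_absent(std::size_t* pextents, std::size_t r, int dev)
{
    // omp_target_is_present returns nonzero if ptr is already mapped
    // on the given device.
    if (!omp_target_is_present(pextents, dev))
    {
        #pragma omp target enter data map(to: pextents[0:r]) device(dev)
    }
}

With the reference counting working as specified, none of this should be necessary.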
I will now attach the files of my project. The Strassen algorithm can use Open MPI, so the header is required, but for compiling and executing the test application I have disabled the usage of MPI. The CMakeLists.txt will create two applications; the error message appears in the created arraytest executable with GCC 15.2 and the settings

-fopenmp -foffload=nvptx-none -fno-stack-protector -fno-math-errno -fno-trapping-math -Wall

with this linker command:

target_link_libraries(arraytest PRIVATE rt m c stdc++ mpi)

I am sorry for the large test case. Please believe me that I have tried to reproduce it with smaller test cases; the error did not appear. Note that the problem also does not appear when I reduce main_omp.cpp to just the multiplication. So it seems that the problem only appears when computations are done afterwards, yet the issue seems to be this call.

Interestingly, the error also only appears if, long after the Strassen algorithm, I make the call

cout << "On gpu" << std::endl;
p5.mode = Math_Functions_Policy::GPU_ONLY;
p5.memmapped_files = false;
Math_Functions<double>::qr_decomposition(A4, Q4, R4, &p5);
Q4.printmatrix();
R4.printmatrix();

in main_omp.cpp. If I switch here to CPU_ONLY, the error message disappears. But I did not find any problems in the qr_decomposition.

So, after getting desperate, I inlined the naive multiplication call in the Strassen algorithm, since this was mostly where objects are mapped several times, and the issue disappeared. Note that the strides and extents are still mapped twice, just not inside a function call. But then the error is gone.