https://gcc.gnu.org/bugzilla/show_bug.cgi?id=120814
--- Comment #3 from Benjamin Schulz <schulz.benjamin at googlemail dot com> ---
I want to note that if one comments out
//A.device_upload(true);
//B.device_upload(true);
//C.device_alloc(true);
and
// C.host_update(true);
in
bool matrix_multiply_dot(mdspan<T, CA>& A, mdspan<T, CB>& B, mdspan<T, CC>& C,
bool on_gpu=false, bool default_device=true, int devicenum=0)
and replaces these calls by
device_datastruct_upload(dA,devicenum);
device_datastruct_upload(dB,devicenum);
device_datastruct_alloc(dC,devicenum);
and
host_datastruct_update(dC,devicenum);
then the loop of the matrix multiplication recognizes that dA, dB, dC have been
offloaded and works fine on the GPU.
It only fails when the member functions of A, B, C are called instead:
//A.device_upload(true);
//B.device_upload(true);
//C.device_alloc(true);
and
// C.host_update(true);
which, however, do nothing other than set the default device number and then call
device_datastruct_upload(dA,devicenum);
device_datastruct_upload(dB,devicenum);
device_datastruct_alloc(dC,devicenum);
and
host_datastruct_update(dC,devicenum);
with a reference to A.pdatastruct, which is the same object as dA later, and
likewise for B and C.
So, in both cases, the
#pragma omp target teams loop should definitely recognize from the addresses
that dA, dB, dC have been offloaded, by using omp_has_device_addr....
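
A minimal sketch of the two call paths compared above, not the code attached to this bug. Only the names device_datastruct_upload, device_upload and pdatastruct come from the report. The struct layout, the container type, the device_upload signature, device_datastruct_release, is_present_on_device and dot_on_device are assumptions made for illustration. A dot product stands in for the matrix multiplication, and omp_target_is_present is used as the runtime presence query, which may not be the mechanism meant above.

```cpp
#include <cassert>
#include <cstddef>
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical stand-in for the descriptor behind A.pdatastruct / dA.
struct datastruct {
    double*     pdata;   // host pointer to the elements
    std::size_t length;  // number of elements
};

// Free function: map the buffer onto device `devicenum`.
inline void device_datastruct_upload(datastruct& d, int devicenum) {
    double* p = d.pdata;
    std::size_t n = d.length;
    #pragma omp target enter data map(to: p[0:n]) device(devicenum)
    (void)p; (void)n; (void)devicenum;  // silence warnings without OpenMP
}

// Hypothetical counterpart that drops the mapping again.
inline void device_datastruct_release(datastruct& d, int devicenum) {
    double* p = d.pdata;
    std::size_t n = d.length;
    #pragma omp target exit data map(release: p[0:n]) device(devicenum)
    (void)p; (void)n; (void)devicenum;
}

// Hypothetical container: the member only picks the device and forwards
// the very same descriptor object to the free function.
struct container {
    datastruct pdatastruct;

    // Returns the device number that was actually used.
    int device_upload(bool default_device, int devicenum = 0) {
#ifdef _OPENMP
        if (default_device) devicenum = omp_get_default_device();
#else
        (void)default_device;
#endif
        device_datastruct_upload(pdatastruct, devicenum);
        return devicenum;
    }
};

// Presence query by host address; always false when built without OpenMP.
inline bool is_present_on_device(const datastruct& d, int devicenum) {
#ifdef _OPENMP
    return omp_target_is_present(d.pdata, devicenum) != 0;
#else
    (void)d; (void)devicenum;
    return false;
#endif
}

// Offloaded loop. For data that is already present, the map clauses only
// raise the reference count and do not copy the data again.
inline double dot_on_device(const datastruct& a, const datastruct& b,
                            int devicenum) {
    assert(a.length == b.length);
    const double* pa = a.pdata;
    const double* pb = b.pdata;
    std::size_t n = a.length;
    double sum = 0.0;
    #pragma omp target teams loop reduction(+:sum) \
        map(to: pa[0:n], pb[0:n]) map(tofrom: sum) device(devicenum)
    for (std::size_t i = 0; i < n; ++i)
        sum += pa[i] * pb[i];
    (void)devicenum;
    return sum;
}
```

Both A.device_upload(true) and device_datastruct_upload(A.pdatastruct, devicenum) end up mapping the same host address range, so a presence lookup by address in the offloaded loop would be expected to behave the same on either path. Without an offload device the pragmas fall back to the host and the sketch still computes the same result.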