Thanks - AmgX will accept unordered column indices.
________________________________
From: Barry Smith <[email protected]>
Sent: Saturday, 25 June 2022, 14:39
To: Mark Adams <[email protected]>
Cc: Matthew Martineau <[email protected]>; For users of the development version of PETSc <[email protected]>
Subject: Re: [petsc-dev] MatMPIAIJGetLocalMat problem with GPUs
External email: Use caution opening links or attachments
Does AMGX require sorted column indices? (Python indentation notation below)
If not
just use MatMPIAIJGetLocalMatMerge instead of MatMPIAIJGetLocalMat.
If yes,
on the first call
Mat tmplocal;
PetscCall(MatMPIAIJGetLocalMat(Pmat, MAT_INITIAL_MATRIX, &tmplocal));
PetscCall(MatConvert(tmplocal, MATSEQAIJCUSPARSE, MAT_INITIAL_MATRIX, &amgx->localA));
PetscCall(MatDestroy(&tmplocal));
leave the later calls as is with
PetscCall(MatMPIAIJGetLocalMat(Pmat, MAT_REUSE_MATRIX, &amgx->localA));
Eventually, someone will need to buckle down and write
MatMPIAIJGetLocalMat_SeqAIJCUSPARSE, but that can be done later.
Barry
On Jun 25, 2022, at 9:13 AM, Mark Adams <[email protected]> wrote:
On Fri, Jun 24, 2022 at 1:54 PM Barry Smith <[email protected]> wrote:
On Jun 24, 2022, at 1:38 PM, Mark Adams <[email protected]> wrote:
I am rearranging the code for clarity from the repo but I have:
PetscBool is_dev_ptrs;
PetscCall(MatMPIAIJGetLocalMat(Pmat, MAT_INITIAL_MATRIX, &amgx->localA));
PetscCall(PetscObjectTypeCompareAny((PetscObject)amgx->localA, &is_dev_ptrs, MATAIJCUSPARSE, MATSEQAIJCUSPARSE, MATMPIAIJCUSPARSE, ""));
PetscCall(PetscPrintf(PETSC_COMM_SELF, "checking against mataijcusparse amgx->localA = %d\n", is_dev_ptrs));
PetscCall(PetscObjectTypeCompareAny((PetscObject)Pmat, &is_dev_ptrs, MATAIJCUSPARSE, MATSEQAIJCUSPARSE, MATMPIAIJCUSPARSE, ""));
PetscCall(PetscPrintf(PETSC_COMM_SELF, "checking against mataijcusparse Pmat = %d\n", is_dev_ptrs));
And it seems to show that MatMPIAIJGetLocalMat takes an mpiaijcusparse Mat and
returns a seqaij Mat (see below):
Yes, this is how it currently behaves, as Stefano has indicated. Thus it is not
currently directly suitable for use with GPUs. As Stefano has indicated, it has
to be revised to handle mpiaijcusparse matrices correctly, in the same way that
MatMPIAIJGetLocalMatMerge has been revised for GPUs.
OK, sorry, I did not understand that this is not supported. We need a
MatMPIAIJCusparseGetLocalMatMerge (I read Stefano as saying this is supported
in an "hstack" format, with unsorted columns?).
What is the best way to proceed? Should we just convert amgx->localA to
cusparse if Pmat is a cusparse matrix?
If so, should this code go in amgx or in MatMPIAIJGetLocalMat(MAT_INITIAL_MATRIX)?
Or should I add a MatMPIAIJCusparseGetLocalMatMerge that simply wraps these two
calls for now?
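Such a wrapper might look roughly like this. It is a sketch only: the name, the use of MatConvert, and the reuse path follow Barry's suggestion above; this is not an existing PETSc API, and it assumes a PETSc build with CUSPARSE support.

```c
#include <petscmat.h>

/* Hypothetical wrapper Mark proposes: fetch the local part of an MPI
 * matrix and hand back a CUSPARSE sequential matrix. On the first call
 * the host-side local matrix is converted to MATSEQAIJCUSPARSE; later
 * calls reuse it, per Barry's suggestion in this thread. */
static PetscErrorCode MatMPIAIJCusparseGetLocalMat(Mat A, MatReuse reuse, Mat *A_loc)
{
  PetscFunctionBegin;
  if (reuse == MAT_INITIAL_MATRIX) {
    Mat tmplocal;
    PetscCall(MatMPIAIJGetLocalMat(A, MAT_INITIAL_MATRIX, &tmplocal));
    PetscCall(MatConvert(tmplocal, MATSEQAIJCUSPARSE, MAT_INITIAL_MATRIX, A_loc));
    PetscCall(MatDestroy(&tmplocal));
  } else {
    /* Updates the values of the existing local matrix on the host; they
     * are copied to the device on next device access. */
    PetscCall(MatMPIAIJGetLocalMat(A, MAT_REUSE_MATRIX, A_loc));
  }
  PetscFunctionReturn(0);
}
```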
Thanks,
Mark
AMGX version 2.2.0.132-opensource
Built on Jun 24 2022, 09:21:43
Compiled with CUDA Runtime 11.5, using CUDA driver 11.5
checking against mataijcusparse amgx->localA = 0
checking against mataijcusparse Pmat = 1
localA_name seqaij
Pmat_name mpiaijcusparse
Matt's existing testing code (below), plus the prints that I added, then shows
the types, which are consistent with these tests.
// XXX DEBUG REMOVE
const char *localA_name;
PetscCall(PetscObjectGetType((PetscObject)amgx->localA, &localA_name));
PetscCall(PetscPrintf(PETSC_COMM_SELF, "localA_name %s\n", localA_name));
const char *Pmat_name;
PetscCall(PetscObjectGetType((PetscObject)Pmat, &Pmat_name));
PetscCall(PetscPrintf(PETSC_COMM_SELF, "Pmat_name %s\n", Pmat_name));
On Fri, Jun 24, 2022 at 10:00 AM Barry Smith <[email protected]> wrote:
On Jun 24, 2022, at 8:58 AM, Mark Adams <[email protected]> wrote:
And before we move to the MR, I think Matt found a clear problem:
* PetscCall(MatMPIAIJGetLocalMat(Pmat, MAT_REUSE_MATRIX, &amgx->localA));
returns "localA seqaij"
* And, oddly, PetscCall(MatSeqAIJGetArrayRead(amgx->localA, &amgx->values));
returns:
"it seems to detect that the pointer is a device mapped pointer but that it is
invalid"
It does not return a device mapped pointer; it returns a valid host pointer
only. MatSeqAIJGetArrayRead() is intended to return only a host pointer; it
cannot return a device pointer. MatSeqAIJCUSPARSEGetArrayRead() returns device
pointers and should be used for this purpose.
Matt, let's just comment out the REUSE line and add another INITIAL line
(destroying the old Mat of course), and let's press on.
Looking at the code, there is no way that simply using INITIAL instead of
REUSE will make code that does not work on the GPU run on the GPU.
MatMPIAIJGetLocalMat() returns only a MATSEQAIJ matrix regardless of
INITIAL versus REUSE, and one can never get a device pointer from a non-GPU
matrix.
As noted by Stefano, the code either needs to use MatMPIAIJGetLocalMatMerge(),
which does return a CUSPARSE matrix (but the column indices are not sorted), or
MatMPIAIJGetLocalMat() needs to be updated to return a CUSPARSE matrix when the
input MPI matrix is a CUSPARSE matrix.
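A rough sketch of dispatching between the two options Barry describes. The dispatch itself is hypothetical glue code, not an existing PETSc routine; the amgx context follows the thread's naming.

```c
/* Use the GPU-aware merge when the parallel matrix is CUSPARSE,
 * otherwise fall back to the host path. */
PetscBool is_cusparse;
PetscCall(PetscObjectTypeCompare((PetscObject)Pmat, MATMPIAIJCUSPARSE, &is_cusparse));
if (is_cusparse) {
  /* Returns a MATSEQAIJCUSPARSE matrix; column indices may be unsorted,
   * which AmgX accepts per the reply at the top of this thread. */
  PetscCall(MatMPIAIJGetLocalMatMerge(Pmat, MAT_INITIAL_MATRIX, NULL, &amgx->localA));
} else {
  PetscCall(MatMPIAIJGetLocalMat(Pmat, MAT_INITIAL_MATRIX, &amgx->localA));
}
```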
We can keep the debugging code for now.
We (PETSc) can work on this independently,
Thanks,
Mark
On Fri, Jun 24, 2022 at 8:51 AM Mark Adams <[email protected]> wrote:
I am not seeing this response, I see my "hstack" comment last.
https://gitlab.com/petsc/petsc/-/merge_requests/4323
On Thu, Jun 23, 2022 at 4:37 PM Barry Smith <[email protected]> wrote:
I have responded in the MR, which has all the context and the code. Please
move this conversation from petsc-dev to the MR. Note you can use the little
cartoon cloud symbol (upper right of the subwindow with my text) to reply to
my post and keep everything in a thread for clarity.
We are confused because it seems you are trying a variety of things, and we
don't know how the different things you tried resulted in the multiple errors
you reported.
On Jun 23, 2022, at 3:59 PM, Matthew Martineau <[email protected]> wrote:
I checked in the changes and some debugging statements.
PetscCall(MatMPIAIJGetLocalMat(Pmat, MAT_REUSE_MATRIX, &amgx->localA));
PetscCall(PetscObjectTypeCompareAny((PetscObject)amgx->localA, &is_dev_ptrs,
MATAIJCUSPARSE, MATSEQAIJCUSPARSE, MATMPIAIJCUSPARSE, ""));
Then the call returns false. If we instead call PetscObjectTypeCompareAny on
Pmat then it returns true. If you print the type of the matrices:
localA seqaij
Pmat mpiaijcusparse
If you subsequently call MatSeqAIJCUSPARSEGetArrayRead on localA then it errors
(presumably because of the type mismatch).
If we call MatSeqAIJGetArrayRead on localA and then pass the `values` to AmgX
it seems to detect that the pointer is a device mapped pointer but that it is
invalid.
PetscCall(MatMPIAIJGetLocalMat(Pmat, MAT_REUSE_MATRIX, &amgx->localA));
PetscCall(MatSeqAIJGetArrayRead(amgx->localA, &amgx->values)); // Seems to
return invalid pointer, but I’ll investigate more
This doesn’t reproduce if we call:
PetscCall(MatMPIAIJGetLocalMat(Pmat, MAT_INITIAL_MATRIX, &amgx->localA));
PetscCall(MatSeqAIJGetArrayRead(amgx->localA, &amgx->values)); // Pointer
appears to be valid and we converge
Essentially all I want to achieve is that when we are parallel, we fetch the
local part of A and the device pointer to the matrix values from that
structure, so that we can pass them to AmgX, preferring whichever API calls
are the most efficient.
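That intent could be sketched as follows. This assumes localA has already been obtained as (or converted to) a MATSEQAIJCUSPARSE matrix; the AmgX hand-off is indicated only by a comment.

```c
/* Fetch a device pointer to the local matrix values for AmgX.
 * MatSeqAIJCUSPARSEGetArrayRead copies host data to the device if
 * needed and returns a device pointer. */
const PetscScalar *values;
PetscCall(MatSeqAIJCUSPARSEGetArrayRead(amgx->localA, &values));
/* ... pass `values` to the AmgX matrix upload/replace call here ... */
PetscCall(MatSeqAIJCUSPARSERestoreArrayRead(amgx->localA, &values));
```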
From: Stefano Zampini <[email protected]>
Sent: 23 June 2022 20:55
To: Mark Adams <[email protected]>
Cc: Barry Smith <[email protected]>; For users of the development version of PETSc <[email protected]>; Matthew Martineau <[email protected]>
Subject: Re: [petsc-dev] MatMPIAIJGetLocalMat problem with GPUs
The logic is wrong. It should check for MATSEQAIJCUSPARSE.
On Thu, Jun 23, 2022, 21:36 Mark Adams <[email protected]> wrote:
On Thu, Jun 23, 2022 at 3:02 PM Barry Smith <[email protected]> wrote:
It looks like the current code copies the nonzero values to the CPU from the
MPI matrix (with the calls PetscCall(MatSeqAIJGetArrayRead(mpimat->A, &aav));
and PetscCall(MatSeqAIJGetArrayRead(mpimat->B, &bav));), then copies them into
the CPU memory of the Seq matrix. When the matrix entries are next accessed on
the GPU, they should automatically be copied down to the GPU, so the code
looks ok even for GPUs. We'll need to see the full error message with what the
"invalid pointer" is.
I showed Matt how to peek into offloadmask and he found that it is a host
state, but this is not the issue. The access method should do the copy to the
device.
I am thinking the logic here might be wrong. (Matt fixed "VEC" --> "MAT" in the
comparison below).
Matt, is the issue that you are calling MatSeqAIJCUSPARSEGetArrayRead and
getting a host pointer?
I think the state of amgx->localA after the call to
MatSeqAIJCUSPARSEGetArrayRead should be "BOTH", because that call copies the
data to the device, so both copies are valid and you should have device data.
PetscBool is_dev_ptrs;
PetscCall(PetscObjectTypeCompareAny((PetscObject)amgx->localA, &is_dev_ptrs, VECCUDA, VECMPICUDA, VECSEQCUDA, ""));

if (is_dev_ptrs) {
  PetscCall(MatSeqAIJCUSPARSEGetArrayRead(amgx->localA, &amgx->values));
} else {
  PetscCall(MatSeqAIJGetArrayRead(amgx->localA, &amgx->values));
}
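With Stefano's correction applied, the check would compare matrix types rather than vector types, roughly:

```c
/* Corrected type check per Stefano: localA is a sequential Mat, so
 * compare against MATSEQAIJCUSPARSE; the VECCUDA comparisons above can
 * never match a matrix, so the device branch was never taken. */
PetscBool is_dev_ptrs;
PetscCall(PetscObjectTypeCompare((PetscObject)amgx->localA, MATSEQAIJCUSPARSE, &is_dev_ptrs));
if (is_dev_ptrs) {
  PetscCall(MatSeqAIJCUSPARSEGetArrayRead(amgx->localA, &amgx->values));
} else {
  PetscCall(MatSeqAIJGetArrayRead(amgx->localA, &amgx->values));
}
```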
Barry
Yes this routine is terribly inefficient for GPU matrices, it needs to be
specialized to not use the GPU memory but that is a separate issue from there
being bugs in the current code.
The code also seems to implicitly assume the parallel matrix has the same
nonzero pattern with a reuse. This should be checked with each use by stashing
the nonzero state of the parallel matrix into the sequential matrix and making
sure the parallel matrix has that same stashed value each time. Currently, if
one changes the nonzero pattern of the parallel matrix, one is likely to get
random confusing crashes due to memory corruption. But that is likely not the
problem here.
On Jun 23, 2022, at 2:23 PM, Mark Adams <[email protected]> wrote:
We have a bug in the AMGx test snes_tests-ex13_amgx in parallel.
Matt Martineau found that MatMPIAIJGetLocalMat worked in the first pass in the
code below, where the local matrix is created (INITIAL), but in the next pass,
when "REUSE" is used, he sees an invalid pointer.
Matt found that it does have offloadmask == CPU.
Maybe it is missing logic to put the output in the same state as the input?
Any ideas on this or should I just dig into it?
Thanks,
bool partial_setup_allowed = (pc->setupcalled && pc->flag != DIFFERENT_NONZERO_PATTERN);
if (amgx->nranks > 1) {
  if (partial_setup_allowed) {
    PetscCall(MatMPIAIJGetLocalMat(Pmat, MAT_REUSE_MATRIX, &amgx->localA)); // This path doesn't seem to work by the time we reach the AmgX API
  } else {
    PetscCall(MatMPIAIJGetLocalMat(Pmat, MAT_INITIAL_MATRIX, &amgx->localA)); // This path works
  }
} else {
  amgx->localA = Pmat;
}