Barry If I remember correctly, the ctable business was a limitation of Mark's initial implementation, which I removed here https://gitlab.com/petsc/petsc/-/merge_requests/3411/diffs?commit_id=36173707431f5583889720868beae9046e6e1fd2. Here colmap was "int" already. We made a mistake letting this branch fall so behind main. I will add my observations in the MR
Il giorno mer 2 giu 2021 alle ore 09:20 Barry Smith <[email protected]> ha scritto: > > Switching the thread to petsc-dev since users don't need this. > > Had a nasty failure on one the CI, turned out the GPU colmap was > declared as PetscInt* but it was copied down from the CPU into the GPU > memory as int. The power of void* interface. Thus ex5cu.cu which was > never CI tested was crashing very strangely. By declaring it properly as > int* I got the example to run with 64 bit indices. Submitted a new pipeline > to see what else pops up. > > I am a bit worried about the csr2csc data structures and code, one line > declaring csr2csc_i inside the struct disappeared in the rebase on main, I > stuck it in but don't understand that code at all. I'll report further when > the CI is done or crashes. I will also need to make sure it is compiled pre > and post PETSC_PKG_CUDA_VERSION_GE(11,0,0) > > I will also investigate this CTABLE business, the GPU doesn't need to > know or care if the CPU is using a CTABLE or not; it just needs the colmap > array format that it gets from spreading out the garray anyway to be copied > to the GPU so PETSc can likely be built with CTABLES as normal; this will > improve the testing a lot. > > > > Barry > > > On May 30, 2021, at 11:54 AM, Barry Smith <[email protected]> wrote: > > > I believe I have finally successfully rebased the branch > barry/2020-11-11/cleanup-matsetvaluesdevice > against main and cleaned up all the issues. Please read the commit message. > > I have submitted a CI pipeline with ctables turned off temporarily for > testing of the MatSetValuesDevice(). If it works hopefully Mark can maybe > run a few additional tests of his Landau code that are not in the usual > testing to verify and we can finally get the branch into main. > > Mark, > > Since this change is involved, it is likely your Landau mass matrix > branch may not rebase cleanly. Let me know if you would like me to do the > rebase and testing of your Landau mass matrix branch. I can get it ready to > work with the results of barry/2020-11-11/cleanup-matsetvaluesdevice and > then hand it back to you for further development. > > Barry > > > On May 29, 2021, at 11:32 AM, Barry Smith <[email protected]> wrote: > > > I am working away on this branch, making some progress, also cleaning > things up with some small simplifications. Hope I can succeed, a bunch of > stuff got moved around and some structs had changes, the merge could not > handle some of these so I have to do a good amount of code wrangling to fix > it. > > I'll let you know as I progress. > > Barry > > > On May 28, 2021, at 10:53 PM, Barry Smith <[email protected]> wrote: > > > I have rebased and tried to fix everything. I am now fixing the issues > of --download-openmpi and cuda, once that is done I will test, rebase with > main again if needed and restart the MR and get it into main. > > Barry > > I was stupid to let the MR lay fallow, I should have figured out a > solution to the openmpi and cuda issue instead of punting and waiting for a > dream fix. > > > > On May 28, 2021, at 2:39 PM, Mark Adams <[email protected]> wrote: > > Thanks, > > I did not intend to make any (real) changes. > The only thing that I did not intend to use from Barry's branch, that > conflicted, was the help and comment block at the top of ex5cu.cu > > * I ended up with two declarations of PetscSplitCSRDataStructure > * I added some includes to fix errors like this: > /ccs/home/adams/petsc/include/../src/mat/impls/aij/seq/seqcusparse/cusparsematimpl.h(263): > error: incomplete type is not allowed > * I end ended not having csr2csc_i in Mat_SeqAIJCUSPARSE so I get: > /autofs/nccs-svm1_home1/adams/petsc/src/mat/impls/aij/seq/seqcusparse/ > aijcusparse.cu(1348): error: class "Mat_SeqAIJCUSPARSE" has no member > "csr2csc_i" > > > > > On Fri, May 28, 2021 at 3:13 PM Stefano Zampini <[email protected]> > wrote: > >> I can take a quick look at it tomorrow, what are the main changes you >> made since then? >> >> On May 28, 2021, at 9:51 PM, Mark Adams <[email protected]> wrote: >> >> I am getting messed up in trying to resolve conflicts in rebasing over >> main. >> Is there a better way of doing this? >> Can I just tell git to use Barry's version and then test it? >> Or should I just try it again? >> >> On Fri, May 28, 2021 at 2:15 PM Mark Adams <[email protected]> wrote: >> >>> I am rebasing over main and its a bit of a mess. I must have missed >>> something. I get this. I think the _n_SplitCSRMat must be wrong. >>> >>> >>> In file included from >>> /autofs/nccs-svm1_home1/adams/petsc/src/vec/is/sf/impls/basic/sfbasic.c:128:0: >>> /ccs/home/adams/petsc/include/petscmat.h:1976:32: error: conflicting >>> types for 'PetscSplitCSRDataStructure' >>> typedef struct _n_SplitCSRMat *PetscSplitCSRDataStructure; >>> ^~~~~~~~~~~~~~~~~~~~~~~~~~ >>> /ccs/home/adams/petsc/include/petscmat.h:1922:31: note: previous >>> declaration of 'PetscSplitCSRDataStructure' was here >>> typedef struct _p_SplitCSRMat PetscSplitCSRDataStructure; >>> ^~~~~~~~~~~~~~~~~~~~~~~~~~ >>> CC arch-summit-opt-gnu-cuda/obj/vec/vec/impls/seq/dvec2.o >>> >>> On Fri, May 28, 2021 at 1:50 PM Stefano Zampini < >>> [email protected]> wrote: >>> >>>> OpenMPI.py depends on cuda.py in that, if cuda is present, configures >>>> using cuda. MPI.py or MPICH.py do not depend on cuda.py (MPICH, only >>>> weakly, it adds a print if cuda is present) >>>> Since eventually the MPI distro will only need a hint to be configured >>>> with CUDA, why not removing the dependency at all and add only a flag >>>> —download-openmpi-use-cuda? >>>> >>>> On May 28, 2021, at 8:44 PM, Barry Smith <[email protected]> wrote: >>>> >>>> >>>> Stefano, who has a far better memory than me, wrote >>>> >>>> > Or probably remove —download-openmpi ? Or, just for the moment, why >>>> can’t we just tell configure that mpi is a weak dependence of cuda.py, so >>>> that it will be forced to be configured later? >>>> >>>> MPI.py depends on cuda.py so we cannot also have cuda.py depend on >>>> MPI.py using the generic dependencies of configure/packages >>>> >>>> but perhaps we can just hardwire the rerunning of cuda.py when the >>>> MPI compilers are reset. I will try that now and if I can get it to work we >>>> should be able to move those old fix branches along as MR. >>>> >>>> Barry >>>> >>>> >>>> >>>> On May 28, 2021, at 12:41 PM, Mark Adams <[email protected]> wrote: >>>> >>>> OK, I will try to rebase and test Barry's branch. >>>> >>>> On Fri, May 28, 2021 at 1:26 PM Stefano Zampini < >>>> [email protected]> wrote: >>>> >>>>> Yes, it is the branch I was using before force pushing to >>>>> Barry’s barry/2020-11-11/cleanup-matsetvaluesdevice >>>>> You can use both I guess >>>>> >>>>> On May 28, 2021, at 8:25 PM, Mark Adams <[email protected]> wrote: >>>>> >>>>> Is this the correct branch? It conflicted with ex5cu so I assume it is. >>>>> >>>>> >>>>> stefanozampini/simplify-setvalues-device >>>>> <https://gitlab.com/petsc/petsc/-/tree/stefanozampini/simplify-setvalues-device> >>>>> >>>>> On Fri, May 28, 2021 at 1:24 PM Mark Adams <[email protected]> wrote: >>>>> >>>>>> I am fixing rebasing this branch over main. >>>>>> >>>>>> On Fri, May 28, 2021 at 1:16 PM Stefano Zampini < >>>>>> [email protected]> wrote: >>>>>> >>>>>>> Or probably remove —download-openmpi ? Or, just for the moment, why >>>>>>> can’t we just tell configure that mpi is a weak dependence of cuda.py, >>>>>>> so >>>>>>> that it will be forced to be configured later? >>>>>>> >>>>>>> On May 28, 2021, at 8:12 PM, Stefano Zampini < >>>>>>> [email protected]> wrote: >>>>>>> >>>>>>> That branch provides a fix for MatSetValuesDevice but it never got >>>>>>> merged because of the CI issues with the —download-openmpi. We can >>>>>>> probably >>>>>>> try to skip the test in that specific configuration? >>>>>>> >>>>>>> On May 28, 2021, at 7:45 PM, Barry Smith <[email protected]> wrote: >>>>>>> >>>>>>> >>>>>>> ~/petsc/src/mat/tutorials* >>>>>>> (barry/2021-05-28/robustify-cuda-gencodearch-check=)* >>>>>>> arch-robustify-cuda-gencodearch-check >>>>>>> $ ./ex5cu >>>>>>> terminate called after throwing an instance of >>>>>>> 'thrust::system::system_error' >>>>>>> what(): fill_n: failed to synchronize: cudaErrorIllegalAddress: >>>>>>> an illegal memory access was encountered >>>>>>> Aborted (core dumped) >>>>>>> >>>>>>> requires: cuda !define(PETSC_USE_CTABLE) >>>>>>> >>>>>>> CI does not test with CUDA and no ctable. The code is still >>>>>>> broken as it was six months ago in the discussion Stefano pointed to. >>>>>>> It is >>>>>>> clear why just no one has had the time to clean things up. >>>>>>> >>>>>>> Barry >>>>>>> >>>>>>> >>>>>>> On May 28, 2021, at 11:13 AM, Mark Adams <[email protected]> wrote: >>>>>>> >>>>>>> >>>>>>> >>>>>>> On Fri, May 28, 2021 at 11:57 AM Stefano Zampini < >>>>>>> [email protected]> wrote: >>>>>>> >>>>>>>> If you are referring to your device set values, I guess it is not >>>>>>>> currently tested >>>>>>>> >>>>>>> >>>>>>> No. There is a test for that (ex5cu). >>>>>>> I have a user that is getting a segv in MatSetValues with >>>>>>> aijcusparse. I suspect there is memory corruption but I'm trying to >>>>>>> cover >>>>>>> all the bases. >>>>>>> I have added a cuda test to ksp/ex56 that works. I can do an MR for >>>>>>> it if such a test does not exist. >>>>>>> >>>>>>> >>>>>>>> See the discussions here >>>>>>>> https://gitlab.com/petsc/petsc/-/merge_requests/3411 >>>>>>>> I started cleaning up the code to prepare for testing but we never >>>>>>>> finished it >>>>>>>> https://gitlab.com/petsc/petsc/-/commits/stefanozampini/simplify-setvalues-device/ >>>>>>>> >>>>>>>> >>>>>>>> On May 28, 2021, at 6:53 PM, Mark Adams <[email protected]> wrote: >>>>>>>> >>>>>>>> Is there a test with MatSetValues and CUDA? >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>> >>>> >>>> >> > > > > -- Stefano
