Switching the thread to petsc-dev since users don't need this.

  Had a nasty failure on one the CI, turned out the GPU colmap was declared as 
PetscInt* but it was copied down from the CPU into the GPU memory as int. The 
power of void* interface. Thus ex5cu.cu which was never CI tested was crashing 
very strangely. By declaring it properly as int* I got the example to run with 
64 bit indices. Submitted a new pipeline to see what else pops up. 

   I am a bit worried about the csr2csc data structures and code, one line 
declaring csr2csc_i inside the struct disappeared in the rebase on main, I 
stuck it in but don't understand that code at all. I'll report further when the 
CI is done or crashes. I will also need to make sure it is compiled pre and 
post PETSC_PKG_CUDA_VERSION_GE(11,0,0)

   I will also investigate this CTABLE business, the GPU doesn't need to know 
or care if the CPU is using a CTABLE or not; it just needs the colmap array 
format that it gets from spreading out the garray anyway to be copied to the 
GPU so PETSc can likely be built with CTABLES as normal; this will improve the 
testing a lot. 



Barry


> On May 30, 2021, at 11:54 AM, Barry Smith <[email protected]> wrote:
> 
> 
>    I believe I have finally successfully rebased the branch 
> barry/2020-11-11/cleanup-matsetvaluesdevice against main and cleaned up all 
> the issues. Please read the commit message.
> 
>   I have submitted a CI pipeline with ctables turned off temporarily for 
> testing of the MatSetValuesDevice(). If it works hopefully Mark can maybe run 
> a few additional tests of his Landau code that are not in the usual testing 
> to verify and we can finally get the branch into main.
> 
>   Mark,
> 
>      Since this change is involved, it is likely your Landau mass matrix 
> branch may not rebase cleanly. Let me know if you would like me to do the 
> rebase and testing of your Landau mass matrix branch. I can get it ready to 
> work with the results of barry/2020-11-11/cleanup-matsetvaluesdevice and then 
> hand it back to you for further development.
> 
>   Barry
> 
> 
>> On May 29, 2021, at 11:32 AM, Barry Smith <[email protected] 
>> <mailto:[email protected]>> wrote:
>> 
>> 
>>   I am working away on this branch, making some progress, also cleaning 
>> things up with some small simplifications. Hope I can succeed, a bunch of 
>> stuff got moved around and some structs had changes, the merge could not 
>> handle some of these so I have to do a good amount of code wrangling to fix 
>> it.
>> 
>>   I'll let you know as I progress.
>> 
>>   Barry
>> 
>> 
>>> On May 28, 2021, at 10:53 PM, Barry Smith <[email protected] 
>>> <mailto:[email protected]>> wrote:
>>> 
>>> 
>>>   I have rebased and tried to fix everything. I am now fixing the issues of 
>>> --download-openmpi and cuda, once that is done I will test, rebase with 
>>> main again if needed and restart the MR and get it into main.
>>> 
>>>   Barry
>>> 
>>> I was stupid to let the MR lay fallow, I should have figured out a solution 
>>> to the openmpi and cuda issue instead of punting and waiting for a dream 
>>> fix.
>>> 
>>> 
>>> 
>>>> On May 28, 2021, at 2:39 PM, Mark Adams <[email protected] 
>>>> <mailto:[email protected]>> wrote:
>>>> 
>>>> Thanks,
>>>> 
>>>> I did not intend to make any (real) changes. 
>>>> The only thing that I did not intend to use from Barry's branch, that 
>>>> conflicted, was the help and comment block at the top of ex5cu.cu 
>>>> <http://ex5cu.cu/>
>>>> 
>>>> * I ended up with two declarations of PetscSplitCSRDataStructure
>>>> * I added some includes to fix errors like this:
>>>> /ccs/home/adams/petsc/include/../src/mat/impls/aij/seq/seqcusparse/cusparsematimpl.h(263):
>>>>  error: incomplete type is not allowed
>>>> * I end ended not having csr2csc_i in Mat_SeqAIJCUSPARSE so I get: 
>>>> /autofs/nccs-svm1_home1/adams/petsc/src/mat/impls/aij/seq/seqcusparse/aijcusparse.cu
>>>>  <http://aijcusparse.cu/>(1348): error: class "Mat_SeqAIJCUSPARSE" has no 
>>>> member "csr2csc_i"
>>>> 
>>>> 
>>>> 
>>>> 
>>>> On Fri, May 28, 2021 at 3:13 PM Stefano Zampini <[email protected] 
>>>> <mailto:[email protected]>> wrote:
>>>> I can take a quick look at it tomorrow, what are the main changes you made 
>>>> since then?
>>>> 
>>>>> On May 28, 2021, at 9:51 PM, Mark Adams <[email protected] 
>>>>> <mailto:[email protected]>> wrote:
>>>>> 
>>>>> I am getting messed up in trying to resolve conflicts in rebasing over 
>>>>> main.
>>>>> Is there a better way of doing this?
>>>>> Can I just tell git to use Barry's version and then test it?
>>>>> Or should I just try it again?
>>>>> 
>>>>> On Fri, May 28, 2021 at 2:15 PM Mark Adams <[email protected] 
>>>>> <mailto:[email protected]>> wrote:
>>>>> I am rebasing over main and its a bit of a mess. I must have missed 
>>>>> something. I get this. I think the _n_SplitCSRMat must be wrong.
>>>>> 
>>>>> 
>>>>> In file included from 
>>>>> /autofs/nccs-svm1_home1/adams/petsc/src/vec/is/sf/impls/basic/sfbasic.c:128:0:
>>>>> /ccs/home/adams/petsc/include/petscmat.h:1976:32: error: conflicting 
>>>>> types for 'PetscSplitCSRDataStructure'
>>>>>  typedef struct _n_SplitCSRMat *PetscSplitCSRDataStructure;
>>>>>                                 ^~~~~~~~~~~~~~~~~~~~~~~~~~
>>>>> /ccs/home/adams/petsc/include/petscmat.h:1922:31: note: previous 
>>>>> declaration of 'PetscSplitCSRDataStructure' was here
>>>>>  typedef struct _p_SplitCSRMat PetscSplitCSRDataStructure;
>>>>>                                ^~~~~~~~~~~~~~~~~~~~~~~~~~
>>>>>           CC arch-summit-opt-gnu-cuda/obj/vec/vec/impls/seq/dvec2.o
>>>>> 
>>>>> On Fri, May 28, 2021 at 1:50 PM Stefano Zampini 
>>>>> <[email protected] <mailto:[email protected]>> wrote:
>>>>> OpenMPI.py depends on cuda.py in that, if cuda is present, configures 
>>>>> using cuda. MPI.py or MPICH.py do not depend on cuda.py (MPICH, only 
>>>>> weakly, it adds a print if cuda is present)
>>>>> Since eventually the MPI distro will only need a hint to be configured 
>>>>> with CUDA, why not removing the dependency at all and add only a flag 
>>>>> —download-openmpi-use-cuda?
>>>>> 
>>>>>> On May 28, 2021, at 8:44 PM, Barry Smith <[email protected] 
>>>>>> <mailto:[email protected]>> wrote:
>>>>>> 
>>>>>> 
>>>>>>  Stefano, who has a far better memory than me, wrote
>>>>>> 
>>>>>> > Or probably remove —download-openmpi ? Or, just for the moment, why 
>>>>>> > can’t we just tell configure that mpi is a weak dependence of cuda.py, 
>>>>>> > so that it will be forced to be configured later?
>>>>>> 
>>>>>>   MPI.py depends on cuda.py so we cannot also have cuda.py depend on 
>>>>>> MPI.py using the generic dependencies of configure/packages  
>>>>>> 
>>>>>>   but perhaps we can just hardwire the rerunning of cuda.py when the MPI 
>>>>>> compilers are reset. I will try that now and if I can get it to work we 
>>>>>> should be able to move those old fix branches along as MR.
>>>>>> 
>>>>>>   Barry
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>>> On May 28, 2021, at 12:41 PM, Mark Adams <[email protected] 
>>>>>>> <mailto:[email protected]>> wrote:
>>>>>>> 
>>>>>>> OK, I will try to rebase and test Barry's branch.
>>>>>>> 
>>>>>>> On Fri, May 28, 2021 at 1:26 PM Stefano Zampini 
>>>>>>> <[email protected] <mailto:[email protected]>> wrote:
>>>>>>> Yes, it is the branch I was using before force pushing to Barry’s 
>>>>>>> barry/2020-11-11/cleanup-matsetvaluesdevice
>>>>>>> You can use both I guess
>>>>>>> 
>>>>>>>> On May 28, 2021, at 8:25 PM, Mark Adams <[email protected] 
>>>>>>>> <mailto:[email protected]>> wrote:
>>>>>>>> 
>>>>>>>> Is this the correct branch? It conflicted with ex5cu so I assume it is.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> stefanozampini/simplify-setvalues-device 
>>>>>>>> <https://gitlab.com/petsc/petsc/-/tree/stefanozampini/simplify-setvalues-device>
>>>>>>>> 
>>>>>>>> On Fri, May 28, 2021 at 1:24 PM Mark Adams <[email protected] 
>>>>>>>> <mailto:[email protected]>> wrote:
>>>>>>>> I am fixing rebasing this branch over main.
>>>>>>>> 
>>>>>>>> On Fri, May 28, 2021 at 1:16 PM Stefano Zampini 
>>>>>>>> <[email protected] <mailto:[email protected]>> wrote:
>>>>>>>> Or probably remove —download-openmpi ? Or, just for the moment, why 
>>>>>>>> can’t we just tell configure that mpi is a weak dependence of cuda.py, 
>>>>>>>> so that it will be forced to be configured later?
>>>>>>>> 
>>>>>>>>> On May 28, 2021, at 8:12 PM, Stefano Zampini 
>>>>>>>>> <[email protected] <mailto:[email protected]>> wrote:
>>>>>>>>> 
>>>>>>>>> That branch provides a fix for MatSetValuesDevice but it never got 
>>>>>>>>> merged because of the CI issues with the —download-openmpi. We can 
>>>>>>>>> probably try to skip the test in that specific configuration?
>>>>>>>>> 
>>>>>>>>>> On May 28, 2021, at 7:45 PM, Barry Smith <[email protected] 
>>>>>>>>>> <mailto:[email protected]>> wrote:
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> ~/petsc/src/mat/tutorials 
>>>>>>>>>> (barry/2021-05-28/robustify-cuda-gencodearch-check=) 
>>>>>>>>>> arch-robustify-cuda-gencodearch-check
>>>>>>>>>> $ ./ex5cu
>>>>>>>>>> terminate called after throwing an instance of 
>>>>>>>>>> 'thrust::system::system_error'
>>>>>>>>>>   what():  fill_n: failed to synchronize: cudaErrorIllegalAddress: 
>>>>>>>>>> an illegal memory access was encountered
>>>>>>>>>> Aborted (core dumped)
>>>>>>>>>> 
>>>>>>>>>>         requires: cuda !define(PETSC_USE_CTABLE)
>>>>>>>>>> 
>>>>>>>>>>   CI does not test with CUDA and no ctable.  The code is still 
>>>>>>>>>> broken as it was six months ago in the discussion Stefano pointed 
>>>>>>>>>> to. It is clear why just no one has had the time to clean things up.
>>>>>>>>>> 
>>>>>>>>>>   Barry
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>>> On May 28, 2021, at 11:13 AM, Mark Adams <[email protected] 
>>>>>>>>>>> <mailto:[email protected]>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> On Fri, May 28, 2021 at 11:57 AM Stefano Zampini 
>>>>>>>>>>> <[email protected] <mailto:[email protected]>> 
>>>>>>>>>>> wrote:
>>>>>>>>>>> If you are referring to your device set values, I guess it is not 
>>>>>>>>>>> currently tested
>>>>>>>>>>> 
>>>>>>>>>>> No. There is a test for that (ex5cu).
>>>>>>>>>>> I have a user that is getting a segv in MatSetValues with 
>>>>>>>>>>> aijcusparse. I suspect there is memory corruption but I'm trying to 
>>>>>>>>>>> cover all the bases.
>>>>>>>>>>> I have added a cuda test to ksp/ex56 that works. I can do an MR for 
>>>>>>>>>>> it if such a test does not exist.
>>>>>>>>>>>  
>>>>>>>>>>> See the discussions here 
>>>>>>>>>>> https://gitlab.com/petsc/petsc/-/merge_requests/3411 
>>>>>>>>>>> <https://gitlab.com/petsc/petsc/-/merge_requests/3411>
>>>>>>>>>>> I started cleaning up the code to prepare for testing but we never 
>>>>>>>>>>> finished it 
>>>>>>>>>>> https://gitlab.com/petsc/petsc/-/commits/stefanozampini/simplify-setvalues-device/
>>>>>>>>>>>  
>>>>>>>>>>> <https://gitlab.com/petsc/petsc/-/commits/stefanozampini/simplify-setvalues-device/>
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>>> On May 28, 2021, at 6:53 PM, Mark Adams <[email protected] 
>>>>>>>>>>>> <mailto:[email protected]>> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>> Is there a test with MatSetValues and CUDA? 
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>> 
>> 
> 

Reply via email to