estewart08 added a comment.

In D99432#2728788 <https://reviews.llvm.org/D99432#2728788>, @ABataev wrote:
> In D99432#2726997 <https://reviews.llvm.org/D99432#2726997>, @estewart08 wrote:
>
>> In D99432#2726845 <https://reviews.llvm.org/D99432#2726845>, @ABataev wrote:
>>
>>> In D99432#2726588 <https://reviews.llvm.org/D99432#2726588>, @estewart08 wrote:
>>>
>>>> In D99432#2726391 <https://reviews.llvm.org/D99432#2726391>, @ABataev wrote:
>>>>
>>>>> In D99432#2726337 <https://reviews.llvm.org/D99432#2726337>, @estewart08 wrote:
>>>>>
>>>>>> In D99432#2726060 <https://reviews.llvm.org/D99432#2726060>, @ABataev wrote:
>>>>>>
>>>>>>> In D99432#2726050 <https://reviews.llvm.org/D99432#2726050>, @estewart08 wrote:
>>>>>>>
>>>>>>>> In D99432#2726025 <https://reviews.llvm.org/D99432#2726025>, @ABataev wrote:
>>>>>>>>
>>>>>>>>> In D99432#2726019 <https://reviews.llvm.org/D99432#2726019>, @estewart08 wrote:
>>>>>>>>>
>>>>>>>>>> In reference to https://bugs.llvm.org/show_bug.cgi?id=48851, I do not see how this helps SPMD mode with team privatization of declarations in-between target teams and parallel regions.
>>>>>>>>>
>>>>>>>>> Did you try the reproducer with the applied patch?
>>>>>>>>
>>>>>>>> Yes, I still saw the test fail, although it was not with the latest llvm-project. Are you saying the reproducer passes for you?
>>>>>>>
>>>>>>> I don't have CUDA installed, but from what I see in the LLVM IR it should pass. Do you have a debug log? Does it crash or produce incorrect results?
>>>>>>
>>>>>> This is on an AMDGPU, but I assume the behavior would be similar for NVPTX.
>>>>>>
>>>>>> It produces incorrect/incomplete results in the dist[0] index after a manual reduction, and in turn the final global gpu_results array is incorrect. When thread 0 does a reduction into dist[0], it has no knowledge of dist[1] having been updated by thread 1, which tells me the array is still thread-private.
>>>>>> Adding some printfs, looking at one team's output:
>>>>>>
>>>>>> SPMD:
>>>>>>
>>>>>>   Thread 0: dist[0]: 1
>>>>>>   Thread 0: dist[1]: 0              // This should be 1
>>>>>>   After reduction into dist[0]: 1   // This should be 2
>>>>>>   gpu_results = [1,1]               // [2,2] expected
>>>>>>
>>>>>> Generic mode:
>>>>>>
>>>>>>   Thread 0: dist[0]: 1
>>>>>>   Thread 0: dist[1]: 1
>>>>>>   After reduction into dist[0]: 2
>>>>>>   gpu_results = [2,2]
>>>>>
>>>>> Hmm, I would expect a crash if the array was allocated in the local memory. Could you try to add some more printfs (with data and addresses of the array) to check the results? Maybe there is a data race somewhere in the code?
>>>>
>>>> As a reminder, each thread updates a unique index in the dist array and each team updates a unique index in gpu_results.
>>>>
>>>> SPMD - shows each thread has a unique address for the dist array:
>>>>
>>>>   Team 0 Thread 1: dist[0]: 0, 0x7f92e24a8bf8
>>>>   Team 0 Thread 1: dist[1]: 1, 0x7f92e24a8bfc
>>>>
>>>>   Team 0 Thread 0: dist[0]: 1, 0x7f92e24a8bf0
>>>>   Team 0 Thread 0: dist[1]: 0, 0x7f92e24a8bf4
>>>>
>>>>   Team 0 Thread 0: After reduction into dist[0]: 1
>>>>   Team 0 Thread 0: gpu_results address: 0x7f92a5000000
>>>>   --------------------------------------------------
>>>>   Team 1 Thread 1: dist[0]: 0, 0x7f92f9ec5188
>>>>   Team 1 Thread 1: dist[1]: 1, 0x7f92f9ec518c
>>>>
>>>>   Team 1 Thread 0: dist[0]: 1, 0x7f92f9ec5180
>>>>   Team 1 Thread 0: dist[1]: 0, 0x7f92f9ec5184
>>>>
>>>>   Team 1 Thread 0: After reduction into dist[0]: 1
>>>>   Team 1 Thread 0: gpu_results address: 0x7f92a5000000
>>>>
>>>>   gpu_results[0]: 1
>>>>   gpu_results[1]: 1
>>>>
>>>> Generic - shows each team shares the dist array address among its threads:
>>>>
>>>>   Team 0 Thread 1: dist[0]: 1, 0x7fac01938880
>>>>   Team 0 Thread 1: dist[1]: 1, 0x7fac01938884
>>>>
>>>>   Team 0 Thread 0: dist[0]: 1, 0x7fac01938880
>>>>   Team 0 Thread 0: dist[1]: 1, 0x7fac01938884
>>>>
>>>>   Team 0 Thread 0: After reduction into dist[0]: 2
>>>>   Team 0 Thread 0: gpu_results address: 0x7fabc5000000
>>>>   --------------------------------------------------
>>>>   Team 1 Thread 1: dist[0]: 1, 0x7fac19354e10
>>>>   Team 1 Thread 1: dist[1]: 1, 0x7fac19354e14
>>>>
>>>>   Team 1 Thread 0: dist[0]: 1, 0x7fac19354e10
>>>>   Team 1 Thread 0: dist[1]: 1, 0x7fac19354e14
>>>>
>>>>   Team 1 Thread 0: After reduction into dist[0]: 2
>>>>   Team 1 Thread 0: gpu_results address: 0x7fabc5000000
>>>
>>> Could you check if it works with the `-fno-openmp-cuda-parallel-target-regions` option?
>>
>> Unfortunately that crashes:
>>
>>   llvm-project/llvm/lib/IR/Instructions.cpp:495: void llvm::CallInst::init(llvm::FunctionType*, llvm::Value*, llvm::ArrayRef<llvm::Value*>, llvm::ArrayRef<llvm::OperandBundleDefT<llvm::Value*> >, const llvm::Twine&): Assertion `(i >= FTy->getNumParams() || FTy->getParamType(i) == Args[i]->getType()) && "Calling a function with a bad signature!"' failed.
>
> Hmm, could you provide a full stack trace?

At this point I am not sure I want to dig into that crash, as our llvm branch is not caught up to trunk.
I did build trunk and ran some tests on an sm_70:

- Without this patch: code fails with incomplete results.
- Without this patch and with -fno-openmp-cuda-parallel-target-regions: code fails with incomplete results.
- With this patch: code fails with incomplete results (thread-private array):

    Team 0 Thread 1: dist[0]: 0, 0x7c1e800000a8
    Team 0 Thread 1: dist[1]: 1, 0x7c1e800000ac

    Team 0 Thread 0: dist[0]: 1, 0x7c1e800000a0
    Team 0 Thread 0: dist[1]: 0, 0x7c1e800000a4

    Team 0 Thread 0: After reduction into dist[0]: 1
    Team 0 Thread 0: gpu_results address: 0x7c1ebc800000

    Team 1 Thread 1: dist[0]: 0, 0x7c1e816f27c8
    Team 1 Thread 1: dist[1]: 1, 0x7c1e816f27cc

    Team 1 Thread 0: dist[0]: 1, 0x7c1e816f27c0
    Team 1 Thread 0: dist[1]: 0, 0x7c1e816f27c4

    Team 1 Thread 0: After reduction into dist[0]: 1
    Team 1 Thread 0: gpu_results address: 0x7c1ebc800000

    gpu_results[0]: 1
    gpu_results[1]: 1
    FAIL

- With this patch and with -fno-openmp-cuda-parallel-target-regions: Pass:

    Team 0 Thread 1: dist[0]: 1, 0x7a5b56000018
    Team 0 Thread 1: dist[1]: 1, 0x7a5b5600001c

    Team 0 Thread 0: dist[0]: 1, 0x7a5b56000018
    Team 0 Thread 0: dist[1]: 1, 0x7a5b5600001c

    Team 0 Thread 0: After reduction into dist[0]: 2
    Team 0 Thread 0: gpu_results address: 0x7a5afc800000

    Team 1 Thread 1: dist[0]: 1, 0x7a5b56000018
    Team 1 Thread 1: dist[1]: 1, 0x7a5b5600001c

    Team 1 Thread 0: dist[0]: 1, 0x7a5b56000018
    Team 1 Thread 0: dist[1]: 1, 0x7a5b5600001c

    Team 1 Thread 0: After reduction into dist[0]: 2
    Team 1 Thread 0: gpu_results address: 0x7a5afc800000

    gpu_results[0]: 2
    gpu_results[1]: 2
    PASS

I am concerned about team 0 and team 1 having the same address for the dist array here.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D99432/new/

https://reviews.llvm.org/D99432

_______________________________________________
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits