JonChesterfield wrote:

This patch probably does too many things, and it's implemented in an unconventional fashion. I think that does a decent job of offering something to upset most people who have looked at it. I note that reviewers are primarily complaining about disjoint aspects of the patch, so a more optimistic reading could be that most of you love some of it and hate part of it, and that the links to the RFCs may have been too subtle.
The functionality I need here, independent of the implementation, is:

1. Represent GPU SIMT constructs in LLVM IR without specifying a target ISA up front

What I _want_ beyond that is:

1. Be able to lower that to target-specific IR straight out of clang, so the intrinsics are free to use when the target is known
2. Easy, clean ways to test the lowering (i.e. please let me do this in IR)

If I can't have cheap lowering, the openmp crowd will insist on amdgpu- and nvptx-specific paths that duplicate the lowering llvm already knows how to do, and so will the hip crowd, because expanding the intrinsics in the backend will induce overhead. That mostly stops me removing the target-specific dispatch from the language implementations. Not on the critical path, it just stops me simplifying things.

This patch is conceptually a few things:

1. My first step towards making spirv64-unknown-unknown usable across vendors
2. Target-independent intrinsics for the read_first_lane, ballot etc. primitives of the SIMT model
3. Intrinsics for the various grid model accessors
4. A few misc things (exit, suspend) which are kind of niche
5. Doing target-specific lowering in an IR pass
6. Doing lowering for _two different_ targets in the _same_ IR pass

I want the first point. Compile code once, then JIT it onto some arbitrary GPU architecture? Love it. No one else seems keen, but that's fine; I think some end users will like that functionality. It seems to have been some of the motivation behind spirv too.

Intrinsics for ballot are necessary in the same way that convergent needs to be represented in the IR. Presently the different targets have their own intrinsics for these, because they really should be intrinsics. That, I think, is absolutely uncontentious. It's absurd that we have multiple different intrinsics that mean the same thing for a common concept we want to manipulate in IR.

The grid builtins are convenient for the opencl/cuda programming model. They don't matter much to IR manipulation. I want them here because the cost is tiny, most map directly to a register read and the others to some trivial code sequence, and it frees one from the endless squabble over which special compiler-rt is the bestest.

The exit/suspend stuff is niche. Maybe I can't get those through review. They're here because I've gone with "what would it take to run libc on Intel GPUs, assuming Intel won't put target-specific code in upstream". I can make them both nops under spirv for the time being.

I've done the lowering in IR because that's the right and proper thing to do, fully aware that most people disagree with me on that. We get trivial testing, composability, not writing the thing twice for SDag and GlobalISel, and the ability to lower right out of clang where the target is known early. So OpenMP that uses these intrinsics and targets nvptx64-- directly pays zero cost. Likewise with amdgpu-- as the target: the generic intrinsics immediately turn into the target intrinsics and interact with the amdgpu attributor as if this pass was never here. The pass should be scheduled early after clang as an optimisation and run unconditionally in the backend. Some people might recognise this strategy from my variadic lowering pass.

I've lowered amdgpu and nvptx in the same IR pass, which is heresy, because I want to emphasise that these are not semantically clever or novel things, and that they can be rewritten to target-specific forms whenever is convenient. That seems to have been misread above.
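To make the ballot case concrete, here is a minimal sketch of what that early lowering looks like for one intrinsic. The generic name (llvm.gpu.ballot) and the helper below are illustrative spellings I'm assuming for this email, not necessarily what the patch uses; the target intrinsics they collapse to (llvm.amdgcn.ballot, llvm.nvvm.vote.ballot.sync) are the existing ones.

```cpp
// Sketch only: lower a call to a generic ballot intrinsic to the target's
// own intrinsic once the triple is known. The generic intrinsic name and
// this helper are assumptions for illustration.
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/Instructions.h"
#include "llvm/IR/IntrinsicsAMDGPU.h"
#include "llvm/IR/IntrinsicsNVPTX.h"
#include "llvm/TargetParser/Triple.h"

using namespace llvm;

// CI is a call to the target-independent ballot, e.g. i64 llvm.gpu.ballot(i1).
static Value *lowerBallot(CallInst *CI, const Triple &TT) {
  IRBuilder<> B(CI);
  Value *Pred = CI->getArgOperand(0); // the i1 predicate being balloted
  if (TT.isAMDGCN())
    // Wave-wide mask; i64 shown for wave64. A real pass would pick the
    // width from the subtarget and adjust to the generic return type.
    return B.CreateIntrinsic(Intrinsic::amdgcn_ballot, {B.getInt64Ty()},
                             {Pred});
  if (TT.isNVPTX())
    // Full-warp membermask; convergence token plumbing elided.
    return B.CreateIntrinsic(Intrinsic::nvvm_vote_ballot_sync, {},
                             {B.getInt32(~0u), Pred});
  // Unknown target (e.g. spirv64): leave the generic call alone for later.
  return nullptr;
}
// A real pass would replace all uses of the generic call and erase it.
```

The same shape applies to the grid accessors: llvm.amdgcn.workitem.id.x and llvm.nvvm.read.ptx.sreg.tid.x are the existing per-target forms a generic thread-id intrinsic would map onto, i.e. a single register read each.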
I also want it to be very obvious to Intel exactly where they need to add a column of function pointers to have this work on their backend. An extra target gets a column, an extra intrinsic gets a row, fill in all the grid spaces. I like dumb code that walks over tables; there's a rough sketch of that shape at the end of this message.

However. This has not sailed cleanly through review, and I'm totally willing to compromise the pretty implementation to get the functionality of build once, JIT to any GPU. If I have to copy & paste files across directories and delete parts, so be it. Likewise I'll cheerfully leave a load of conceptually redundant target-dispatch cruft in libc and openmp if you guys won't let me lower these in an IR pass shortly after clang.

https://github.com/llvm/llvm-project/pull/131190
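For concreteness, a rough sketch of that row-per-intrinsic, column-per-target table. Every identifier here is made up for illustration and not taken from the patch; the per-target callbacks would look like the ballot helper sketched earlier.

```cpp
// Sketch of a lowering table: one row per generic intrinsic, one column of
// function pointers per supported target. All names are illustrative.
#include "llvm/ADT/StringRef.h"
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/Instructions.h"

using namespace llvm;

// One column per target: a callback that expands the generic call in place.
using LowerFn = Value *(*)(IRBuilder<> &, CallInst *);

struct GenericIntrinsicRow {
  StringRef GenericName; // e.g. "llvm.gpu.ballot" (illustrative spelling)
  LowerFn AMDGPU;        // column: amdgcn lowering
  LowerFn NVPTX;         // column: nvptx lowering
  // LowerFn SPIRV;      // <- where another backend slots in its column
};

// Stubs standing in for per-target expansions so the table compiles on its own.
static Value *ballotAMDGPU(IRBuilder<> &, CallInst *) { return nullptr; }
static Value *ballotNVPTX(IRBuilder<> &, CallInst *) { return nullptr; }
static Value *tidAMDGPU(IRBuilder<> &, CallInst *) { return nullptr; }
static Value *tidNVPTX(IRBuilder<> &, CallInst *) { return nullptr; }

static const GenericIntrinsicRow LoweringTable[] = {
    {"llvm.gpu.ballot", ballotAMDGPU, ballotNVPTX},
    {"llvm.gpu.thread.id.x", tidAMDGPU, tidNVPTX},
    // An extra intrinsic is one more row; fill in every column.
};
```

The pass then walks the table for each call site and invokes the column matching the triple; a triple with no column leaves the generic intrinsic in place for whatever runs later.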