JonChesterfield wrote:

This patch probably does too many things and it's implemented in an 
unconventional fashion. I think that does a decent job of having something to 
upset most people who have looked at it. I note that reviewers are primarily 
complaining about disjoint aspects of the patch, so a more optimistic reading 
could be that most of you love some of it and hate part of it, and that the 
links to the RFCs may have been too subtle.


The functionality I need here, independent of the implementation, is:

1. Represent GPU SIMT constructs in LLVM IR without specifying a target ISA up 
front

What I _want_ beyond that is:

1. Be able to lower that to target-specific form right out of clang, so the 
intrinsics are free to use when targeting a specific GPU
2. Easy, clean ways to test the lowering (i.e. please let me do this in IR)

If I can't have cheap lowering, the openmp crowd will insist on amdgpu- and 
nvptx-specific paths that duplicate the lowering that llvm already knows how to 
do, and so will the hip crowd, because expanding the intrinsics in the backend 
will induce overhead. That would mostly stop me removing the target-specific 
dispatch from the language implementations. It's not on the critical path, it 
just stops me simplifying things.

This patch is conceptually a few things.

1. My first step towards making spirv64-unknown-unknown usable across vendors
2. Target-independent intrinsics for the read_first_lane, ballot, etc. 
primitives of the SIMT model
3. Intrinsics for the various grid model accessors
4. A few misc things (exit, suspend) which are kind of niche
5. Doing target-specific lowering in an IR pass
6. Doing lowering for _two different_ targets in the _same_ IR pass


I want the first point. Compile code once, then JIT it onto some arbitrary GPU 
architecture? Love it. No one else seems keen, but that's fine; I think some 
end users will like that functionality. It seems to have been part of the 
motivation behind spirv too.

Intrinsics for ballot are necessary in the same way that convergent needs to be 
represented in the IR. Presently the different targets have their own 
intrinsics for these, because they really should be intrinsics. That, I think, 
is absolutely uncontentious. It's absurd that we have multiple different 
intrinsics that mean the same thing for a common concept we want to manipulate 
in IR.
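
To make that concrete, here are the two target-specific spellings of the same 
ballot question as they exist today (signatures as in the current intrinsic 
definitions); the unified spelling in the comment is a stand-in, not 
necessarily the exact name or signature the patch proposes:

```llvm
; Today: one concept, two target-specific intrinsics.
declare i64 @llvm.amdgcn.ballot.i64(i1)           ; AMDGPU, wave64 ballot
declare i32 @llvm.nvvm.vote.ballot.sync(i32, i1)  ; NVPTX, explicit member mask

; What's argued for here is a single target-independent spelling
; (illustrative only), e.g. something shaped like
;   declare i64 @llvm.gpu.ballot(i64, i1) convergent
; that an IR pass or the backend rewrites into the declarations above.
```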

The grid builtins are convenient for the opencl/cuda programming model. They 
don't matter much to IR manipulation. I want them here because the cost is 
tiny: most map directly to a register read and the rest to a trivial code 
sequence, and it frees one from the endless squabble over which special 
compiler-rt is the bestest.
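
For a sense of how cheap these are, the existing per-target accessors for, say, 
the x thread id within a workgroup/CTA are plain register reads (real 
intrinsics, shown only for comparison):

```llvm
; The x thread id within the workgroup / CTA, per target today:
declare i32 @llvm.amdgcn.workitem.id.x()      ; AMDGPU: reads a VGPR set up at launch
declare i32 @llvm.nvvm.read.ptx.sreg.tid.x()  ; NVPTX: reads the %tid.x special register
```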

The exit/suspend stuff is niche. Maybe I can't get those through review. 
They're here because I've gone with "what would it take to run libc on Intel 
GPUs, assuming Intel won't put target-specific code upstream". I can make them 
both nops under spirv for the time being.

I've done the lowering in IR because that's the right and proper thing to do, 
fully aware that most people disagree with me on that. We get trivial testing, 
composability, freedom from writing the thing twice for SDag and GlobalISel, 
and the ability to lower right out of clang where the target is known early. 
So OpenMP code that uses these intrinsics and targets nvptx64-- directly pays 
zero cost. Likewise, with amdgpu-- as the target they immediately turn into the 
target intrinsics and interact with the amdgpu attributor as if this pass was 
never here. The pass should be scheduled early after clang as an optimisation 
and run unconditionally in the backend. Some people might recognise this 
strategy from my variadic lowering pass.
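
The "trivial testing" claim is the usual opt-plus-FileCheck pattern; a sketch 
of what such a test could look like (the pass name below is a placeholder, not 
the one in the patch):

```llvm
; RUN: opt -S -mtriple=amdgcn-amd-amdhsa   -passes=<gpu-lowering-pass> %s | FileCheck %s --check-prefix=AMDGCN
; RUN: opt -S -mtriple=nvptx64-nvidia-cuda -passes=<gpu-lowering-pass> %s | FileCheck %s --check-prefix=NVPTX
;
; One input file, one pass, a CHECK prefix per target: the same IR is
; expected to turn into llvm.amdgcn.* or llvm.nvvm.* calls respectively.
```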

I've lowered amdgpu and nvptx in the same IR pass, which is heresy, because I 
want to emphasise that these are not semantically clever or novel constructs: 
they can be rewritten to target-specific ones whenever is convenient. That 
seems to have been misread above. I also want it to be very obvious to Intel 
exactly where they need to add a column of function pointers to have this work 
on their backend. An extra target gets a column, an extra intrinsic gets a row, 
fill in all the grid spaces. I like dumb code that walks over tables.


However. This has not sailed cleanly through review, and I'm totally willing to 
compromise the pretty implementation to get the functionality of build once, 
JIT to any GPU. If I have to copy&paste files across directories and delete 
parts, so be it. Likewise I'll cheerfully leave a load of conceptually 
redundant target-dispatch cruft in libc and openmp if you guys won't let me 
lower these in an IR pass shortly after clang.

https://github.com/llvm/llvm-project/pull/131190