rjmccall added inline comments.
================
Comment at: clang/lib/Sema/SemaOverload.cpp:9749
+  if (isBetterMultiversionCandidate(Cand1, Cand2))
+    return true;
+
----------------
tra wrote:
> rjmccall wrote:
> > erichkeane wrote:
> > > yaxunl wrote:
> > > > echristo wrote:
> > > > > rjmccall wrote:
> > > > > > yaxunl wrote:
> > > > > > > rjmccall wrote:
> > > > > > > > If we move anything below this check, it needs to figure out a tri-state so that it can return false if `Cand2` is a better candidate than `Cand1`. Now, that only matters if multiversion functions are supported under CUDA, but if you're relying on them not being supported, that should at least be commented on.
> > > > > > > Multiversion host functions are orthogonal to CUDA and therefore should be supported. Multiversion device, host device, and global functions are not supported. However, this change does not make things worse, and it should continue to work if they are supported.
> > > > > > >
> > > > > > > Host/device-based overload resolution is mostly for determining the viability of a function. If two functions are both viable, other factors should take precedence. This general rule has been applied to cases other than multiversion; I think it should also apply to multiversion.
> > > > > > >
> > > > > > > I will make isBetterMultiversionCandidate three states.
> > > > > > > This general rule has been applied to cases other than multiversion; I think it should also apply to multiversion.
> > > > > > Well, but the multiversion people could say the same: that multiversioning is for picking an alternative among otherwise-identical functions, and HD and H functions are not otherwise-identical.
> > > > > >
> > > > > > CC'ing @echristo for his thoughts on the right ordering here.
> > > > > Adding @erichkeane here as well.
> > > > >
> > > > > I think this makes sense, but I can see a reason to multiversion a function that will run on host and device. A version of some matrix multiply that takes advantage of 3 host architectures and one CUDA one? Am I missing something here?
> > > > My understanding is that a multiversion function is for a specific CPU (or GPU). Let's say we want to have a function f for gfx900, gfx906, sandybridge, and ivybridge; shouldn't they be more like
> > > >
> > > > ```
> > > > __host__ __attribute__((cpu_specific(sandybridge))) f();
> > > > __host__ __attribute__((cpu_specific(ivybridge))) f();
> > > > __device__ __attribute__((cpu_specific(gfx900))) f();
> > > > __device__ __attribute__((cpu_specific(gfx906))) f();
> > > > ```
> > > > instead of all of them being `__device__ __host__` functions?
> > > IMO, it doesn't make sense for functions to be BOTH host and device; they'd have to be just one. Otherwise I'm not sure how the resolver behavior is supposed to work. The whole idea is that the definition is chosen at runtime.
> > >
> > > Unless `__host__ __device__ void foo();` is TWO declaration chains (meaning two separate AST entries), it doesn't make sense to have multiversioning on it (and then, how it would be spelled is awkward/confusing to me).
> > >
> > > In the above case, if those 4 declarations are not 2 separate root AST nodes, multiversioning won't work.
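For context, the ICC-style cpu_specific/cpu_dispatch multiversioning being discussed looks roughly like the host-only sketch below. This is illustrative only, not code from this patch: the function name and CPU list are invented, and the resolver behavior described in the comments follows the attribute documentation as I understand it.

```
// Illustrative only, not part of this patch: plain host C++ on x86 showing
// ICC-style function multiversioning with cpu_specific/cpu_dispatch.
#include <cstdio>

// The variants: one definition per CPU the function is specialized for.
__attribute__((cpu_specific(generic)))     void say_cpu(void) { std::puts("generic"); }
__attribute__((cpu_specific(sandybridge))) void say_cpu(void) { std::puts("sandybridge"); }
__attribute__((cpu_specific(ivybridge)))   void say_cpu(void) { std::puts("ivybridge"); }

// The dispatcher: an empty-bodied function listing the variants; the compiler
// emits a resolver that picks one of them for the CPU the program runs on.
__attribute__((cpu_dispatch(ivybridge, sandybridge, generic))) void say_cpu(void) {}

int main() {
  say_cpu(); // Name lookup at the call site sees all of the declarations above;
             // preferring the cpu_dispatch function over the cpu_specific
             // variants is the job of isBetterMultiversionCandidate.
  return 0;
}
```

The question in this thread is how that preference should interact with `__host__`/`__device__` overloading when such declarations appear in CUDA code.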
> > There are certainly functions that ought to be usable from either host or device context — any inline function that just does ordinary language things should be in that category. Also, IIUC, many declarations are *inferred* to be `__host__ __device__`, or can be mass-annotated with pragmas, and those reasons are probably the main ones where this might matter — we might include a header in CUDA mode that declares a multi-versioned function, and we should handle it right.
> >
> > My read of how CUDA programmers expect this to work is that they see the `__host__` / `__device__` attributes primarily as a mechanism for catching problems where you're using the wrong functions for the current configuration. That is, while we allow overloading by `__host__`/`__device__`-ness, users expect those attributes to mostly be used as a filter for what's "really there" rather than strictly segregating the namespace. So I would say that CUDA programmers would probably expect the interaction with multiversioning to be:
> >
> > - Programmers can put `__host__`, `__device__`, or both on a variant depending on where it is usable.
> > - Dispatches should simply ignore any variants that aren't usable for the current configuration.
> >
> > And specifically they would not expect e.g. a `__host__` dispatch function to only consider `__host__` variants — it should be able to dispatch to anything available, which is to say, it should also include `__host__ __device__` variants. Similarly (and probably more usefully), a `__host__ __device__` dispatch function being compiled for the device should also consider pure `__device__` functions, and so on.
> >
> > If we accept that, then I think it gives us a much better idea of how to resolve the priority of the overload rules. The main impact of `isBetterMultiversionCandidate` is to try to ensure that we're looking at the `__attribute__((cpu_dispatch))` function instead of one of the `__attribute__((cpu_specific))` variants. (It has no effect on `__attribute__((target))` multi-versioning, mostly because it doesn't need to: target-specific variants don't show up in lookup with `__attribute__((target))`.) That rule should take precedence over the CUDA preference for exact matches, because e.g. if we're compiling this:
> >
> > ```
> > __host__ __device__ int magic(void) __attribute__((cpu_dispatch("...")));
> > __host__ __device__ int magic(void) __attribute__((cpu_specific(generic)));
> > __host__ int magic(void) __attribute__((cpu_specific(mmx)));
> > __host__ int magic(void) __attribute__((cpu_specific(sse)));
> > __device__ int magic(void) __attribute__((cpu_specific(some_device_feature)));
> > __device__ int magic(void) __attribute__((cpu_specific(some_other_device_feature)));
> > ```
> >
> > then we don't want the compiler to prefer a CPU-specific variant over the dispatch function just because one of the variants was marked `__host__`.
> It's a bit more complicated and a bit less straightforward than that. :-(
> https://goo.gl/EXnymm
> Handling of target attributes is where clang is very different from NVCC, so no matter which mental model of "CUDA programmer" you pick, there's another one that will not match.
>
> In existing code, `__host__ __device__` is commonly used as a sledgehammer to work around NVCC's limitations.
> It does not allow attribute-based overloading, so the only way you can specialize a function for host/device is via something like this:
> ```
> __host__ __device__ void foo() {
> #if __CUDA_ARCH__ > 0
>   // GPU code
> #else
>   // CPU code.
> #endif
> }
> ```
>
> With clang you can write separate overloaded functions and we'll do our best to pick the one you meant to call. Alas, there are cases where it's ambiguous and depends on the callee's attributes, which may depend on theirs. When something ends up being called from different contexts, interesting things start happening. With more functions becoming constexpr (those are implicitly HD), we'll be running into such impossible-to-do-the-right-thing situations more often. The only reliable way to avoid such ambiguity is to 'clone' HD functions into separate H & D functions and do overload resolution considering only same-side functions, which will, in effect, completely separate the host and device name spaces.
>
> Run-time dispatch is also somewhat irrelevant to CUDA. Sort of. On one hand, kernel launch is already a form of runtime dispatch, only it's the CUDA runtime that does the dispatching, based on the GPU one attempts to run the kernel on. `__device__` functions are always compiled for a specific GPU variant. Also, GPU variants often have different instruction sets and can't be mixed in the same object file at all, so there are no variants left to choose from once we're running the code: it's already compiled for precisely the GPU we're running on. Almost. Technically, GPUs in the same family do share the same instruction set, but I'm not sure runtime dispatch would buy us much there, as the hardware differences are relatively minor.

> The only reliable way to avoid such ambiguity is to 'clone' HD functions into separate H & D functions and do overload resolution considering only same-side functions, which will, in effect, completely separate the host and device name spaces.

Okay. Well, even if you completely split host and device functions, I think we'd still want to prefer dispatch functions over variant functions before preferring H over HD. Although... I suppose we *do* want to consider H vs. HD before looking at the more arbitrary factors that `isBetterMultiversionCandidate` looks at, like the number of architectures in the dispatch. Honestly, though, those just seem like bad rules that we should drop from the code.

> Run-time dispatch is also somewhat irrelevant to CUDA. Sort of.

I understand that there's very little reason (or even ability) to use multiversioning in device code, but it can certainly happen in host code, right? Still, I guess the easiest thing would just be to forbid multiversioned functions on the device.
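To make the clang behavior described above concrete, here is a small CUDA sketch (illustrative only, not from this patch; the names are invented) of the attribute-based overloading that clang permits and NVCC does not:

```
// Illustrative only, not part of this patch. Built as CUDA with clang, which
// (unlike NVCC) allows overloading on __host__/__device__-ness.
#include <cstdio>

__host__   int magic() { return 1; }  // host-only implementation
__device__ int magic() { return 2; }  // device-only implementation

// An HD function is compiled once for the host and once for each GPU target;
// in each compilation the call below resolves to the same-side magic().
__host__ __device__ int twice_magic() { return 2 * magic(); }

__global__ void kernel(int *out) { *out = twice_magic(); }  // device side: 4

int main() {
  std::printf("twice_magic() on the host = %d\n", twice_magic());  // host side: 2
  return 0;
}
```

Under the ordering discussed here, if magic() were additionally multiversioned with cpu_specific variants and a cpu_dispatch function, the dispatch-over-variant preference would be applied before the H-versus-HD preference.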