rjmccall added inline comments.
================
Comment at: clang/lib/Sema/SemaOverload.cpp:9749
+  if (isBetterMultiversionCandidate(Cand1, Cand2))
+    return true;
+
----------------
tra wrote:
> rjmccall wrote:
> > erichkeane wrote:
> > > yaxunl wrote:
> > > > echristo wrote:
> > > > > rjmccall wrote:
> > > > > > yaxunl wrote:
> > > > > > > rjmccall wrote:
> > > > > > > > If we move anything below this check, it needs to figure out a tri-state so that it can return false if `Cand2` is a better candidate than `Cand1`. Now, that only matters if multiversion functions are supported under CUDA, but if you're relying on them not being supported, that should at least be commented on.
> > > > > > > Multiversion host functions are orthogonal to CUDA and therefore should be supported. Multiversion device, host device, and global functions are not supported. However, this change does not make things worse, and it should continue to work if they are supported.
> > > > > > >
> > > > > > > Host/device-based overload resolution is mostly for determining the viability of a function. If two functions are both viable, other factors should take precedence. This general rule has been applied to cases other than multiversion; I think it should also apply to multiversion.
> > > > > > >
> > > > > > > I will make isBetterMultiversionCandidate three states.
> > > > > > > This general rule has been applied to cases other than multiversion; I think it should also apply to multiversion.
> > > > > > Well, but the multiversion people could say the same: that multiversioning is for picking an alternative among otherwise-identical functions, and HD and H functions are not otherwise-identical.
> > > > > >
> > > > > > CC'ing @echristo for his thoughts on the right ordering here.
> > > > > Adding @erichkeane here as well.
> > > > >
> > > > > I think this makes sense, but I can see a reason to multiversion a function that will run on host and device. A version of some matrix multiply that takes advantage of 3 host architectures and one CUDA one? Am I missing something here?
> > > > My understanding is that a multiversion function is for a specific CPU (or GPU). Let's say we want to have a function f for gfx900, gfx906, sandybridge, and ivybridge; shouldn't they be more like
> > > >
> > > > ```
> > > > __host__ __attribute__((cpu_specific(sandybridge))) f();
> > > > __host__ __attribute__((cpu_specific(ivybridge))) f();
> > > > __device__ __attribute__((cpu_specific(gfx900))) f();
> > > > __device__ __attribute__((cpu_specific(gfx906))) f();
> > > > ```
> > > > instead of all of them being `__device__ __host__` functions?
> > > IMO, it doesn't make sense for functions to be BOTH host and device; they'd have to be just one. Otherwise I'm not sure how the resolver behavior is supposed to work. The whole idea is that the definition is chosen at runtime.
> > >
> > > Unless `__host__ __device__ void foo();` is TWO declaration chains (meaning two separate AST entries), it doesn't make sense to have multiversioning on it (and then, how it would be spelled is awkward/confusing to me).
> > >
> > > In the above case, if those 4 declarations are not 2 separate root AST nodes, multiversioning won't work.
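For context, the ICC-style cpu_specific/cpu_dispatch multiversioning being discussed looks roughly like the host-only sketch below. This is illustrative only, not code from this patch: the function name and CPU list are invented, and the resolver behavior described in the comments follows the attribute documentation as I understand it.

```
// Illustrative only, not part of this patch: plain host C++ on x86 showing
// ICC-style function multiversioning with cpu_specific/cpu_dispatch.
#include <cstdio>

// The variants: one definition per CPU the function is specialized for.
__attribute__((cpu_specific(generic)))     void say_cpu(void) { std::puts("generic"); }
__attribute__((cpu_specific(sandybridge))) void say_cpu(void) { std::puts("sandybridge"); }
__attribute__((cpu_specific(ivybridge)))   void say_cpu(void) { std::puts("ivybridge"); }

// The dispatcher: an empty-bodied function listing the variants; the compiler
// emits a resolver that picks one of them for the CPU the program runs on.
__attribute__((cpu_dispatch(ivybridge, sandybridge, generic))) void say_cpu(void) {}

int main() {
  say_cpu(); // Name lookup at the call site sees all of the declarations above;
             // preferring the cpu_dispatch function over the cpu_specific
             // variants is the job of isBetterMultiversionCandidate.
  return 0;
}
```

The question in this thread is how that preference should interact with `__host__`/`__device__` overloading when such declarations appear in CUDA code.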
> > There are certainly functions that ought to be usable from either host or device context — any inline function that just does ordinary language things should be in that category. Also, IIUC, many declarations are *inferred* to be `__host__ __device__`, or can be mass-annotated with pragmas, and those reasons are probably the main ones where this might matter — we might include a header in CUDA mode that declares a multi-versioned function, and we should handle it right.
> >
> > My read of how CUDA programmers expect this to work is that they see the `__host__` / `__device__` attributes primarily as a mechanism for catching problems where you're using the wrong functions for the current configuration. That is, while we allow overloading by `__host__`/`__device__`-ness, users expect those attributes to mostly be used as a filter for what's "really there" rather than strictly segregating the namespace. So I would say that CUDA programmers would probably expect the interaction with multiversioning to be:
> >
> > - Programmers can put `__host__`, `__device__`, or both on a variant depending on where it is usable.
> > - Dispatches should simply ignore any variants that aren't usable for the current configuration.
> >
> > And specifically they would not expect e.g. a `__host__` dispatch function to only consider `__host__` variants — it should be able to dispatch to anything available, which is to say, it should also include `__host__ __device__` variants. Similarly (and probably more usefully), a `__host__ __device__` dispatch function being compiled for the device should also consider pure `__device__` functions, and so on.
> >
> > If we accept that, then I think it gives us a much better idea of how to resolve the priority of the overload rules. The main impact of `isBetterMultiversionCandidate` is to try to ensure that we're looking at the `__attribute__((cpu_dispatch))` function instead of one of the `__attribute__((cpu_specific))` variants. (It has no effect on `__attribute__((target))` multi-versioning, mostly because it doesn't need to: target-specific variants don't show up in lookup with `__attribute__((target))`.) That rule should take precedence over the CUDA preference for exact matches, because e.g. if we're compiling this:
> >
> > ```
> > __host__ __device__ int magic(void) __attribute__((cpu_dispatch("...")));
> > __host__ __device__ int magic(void) __attribute__((cpu_specific(generic)));
> > __host__ int magic(void) __attribute__((cpu_specific(mmx)));
> > __host__ int magic(void) __attribute__((cpu_specific(sse)));
> > __device__ int magic(void) __attribute__((cpu_specific(some_device_feature)));
> > __device__ int magic(void) __attribute__((cpu_specific(some_other_device_feature)));
> > ```
> >
> > then we don't want the compiler to prefer a CPU-specific variant over the dispatch function just because one of the variants was marked `__host__`.
> It's a bit more complicated and a bit less straightforward than that. :-(
> https://goo.gl/EXnymm
> Handling of target attributes is where clang is very different from NVCC, so no matter which mental model of "CUDA programmer" you pick, there's another one that will not match.
>
> In existing code, `__host__ __device__` is commonly used as a sledgehammer to work around NVCC's limitations.
> It does not allow attribute-based overloading, so the only way you can specialize a function for host/device is via something like this:
> ```
> __host__ __device__ void foo() {
> #if __CUDA_ARCH__ > 0
>   // GPU code
> #else
>   // CPU code.
> #endif
> }
> ```
>
> With clang you can write separate overloaded functions and we'll do our best to pick the one you meant to call. Alas, there are cases where it's ambiguous and depends on the callee's attributes, which may depend on theirs. When something ends up being called from different contexts, interesting things start happening. With more functions becoming constexpr (those are implicitly HD), we'll be running into such impossible-to-do-the-right-thing situations more often. The only reliable way to avoid such ambiguity is to 'clone' HD functions into separate H & D functions and do overload resolution considering only same-side functions, which will, in effect, completely separate the host and device name spaces.
>
> Run-time dispatch is also somewhat irrelevant to CUDA. Sort of. On one hand, kernel launch is already a form of runtime dispatch, only it's the CUDA runtime that does the dispatching, based on the GPU one attempts to run the kernel on. `__device__` functions are always compiled for a specific GPU variant. Also, GPU variants often have different instruction sets and can't be mixed in the same object file at all, so there are no variants left to choose from once we're running the code: it's already compiled for precisely the GPU we're running on. Almost. Technically, GPUs in the same family do share the same instruction set, but I'm not sure runtime dispatch would buy us much there, as the hardware differences are relatively minor.

> The only reliable way to avoid such ambiguity is to 'clone' HD functions into separate H & D functions and do overload resolution considering only same-side functions, which will, in effect, completely separate the host and device name spaces.

Okay. Well, even if you completely split host and device functions, I think we'd still want to prefer dispatch functions over variant functions before preferring H over HD. Although... I suppose we *do* want to consider H vs. HD before looking at the more arbitrary factors that `isBetterMultiversionCandidate` looks at, like the number of architectures in the dispatch. Honestly, though, those just seem like bad rules that we should drop from the code.

> Run-time dispatch is also somewhat irrelevant to CUDA. Sort of.

I understand that there's very little reason (or even ability) to use multiversioning in device code, but it can certainly happen in host code, right? Still, I guess the easiest thing would just be to forbid multiversioned functions on the device.
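To make the clang behavior described above concrete, here is a small CUDA sketch (illustrative only, not from this patch; the names are invented) of the attribute-based overloading that clang permits and NVCC does not:

```
// Illustrative only, not part of this patch. Built as CUDA with clang, which
// (unlike NVCC) allows overloading on __host__/__device__-ness.
#include <cstdio>

__host__   int magic() { return 1; }  // host-only implementation
__device__ int magic() { return 2; }  // device-only implementation

// An HD function is compiled once for the host and once for each GPU target;
// in each compilation the call below resolves to the same-side magic().
__host__ __device__ int twice_magic() { return 2 * magic(); }

__global__ void kernel(int *out) { *out = twice_magic(); }  // device side: 4

int main() {
  std::printf("twice_magic() on the host = %d\n", twice_magic());  // host side: 2
  return 0;
}
```

Under the ordering discussed here, if magic() were additionally multiversioned with cpu_specific variants and a cpu_dispatch function, the dispatch-over-variant preference would be applied before the H-versus-HD preference.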