sfantao added a comment.

In http://reviews.llvm.org/D9888#262325, @tra wrote:

> In http://reviews.llvm.org/D9888#257904, @sfantao wrote:
>
> > This diff refactors the original patch and is rebased on top of the latest
> > offloading changes added for CUDA.
> >
> > Here I don't touch the CUDA support. I tried, however, to make the
> > implementation modular enough that it could eventually be combined with
> > the CUDA implementation. In my view OpenMP offloading is more general in
> > the sense that it does not refer to a given toolchain; instead, it uses
> > existing toolchains to generate code for offloading devices. So, I believe
> > that a toolchain (which I did not include in this patch) targeting NVPTX
> > will be able to handle both the CUDA and OpenMP offloading models.
>
> What do you mean by "does not refer to a given toolchain"? Do you have the 
> toolchain patch available?


I mean not having to create a toolchain for a specific offloading model. OpenMP
offloading is meant for any target, and possibly many different targets
simultaneously, so having a toolchain for each combination would be
overwhelming.

I don't have a patch for the toolchain out for review yet. I'm planning to port
what we have in clang-omp for the NVPTX toolchain once I have the host
functionality in place. There
(https://github.com/clang-omp/clang_trunk/tree/master/lib/Driver) the Driver is
implemented in a different way; I think the version I'm proposing here is much
cleaner. However, the ToolChains shouldn't be that different. All the tweaking
is moved to the `Tool` itself, and I imagine I can drive that using the
`ToolChain` offloading kind I'm proposing here. In
https://github.com/clang-omp/clang_trunk/blob/master/lib/Driver/Tools.cpp I
basically pick some arguments to forward to the tool and do some tricks to
include libdevice in the compilation when required; a sketch of the idea
follows below. Do you think something like that could also work for CUDA?
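
For illustration, here is a self-contained toy sketch (not code from the
patch; all names are invented) of what I mean by driving a tool's argument
selection with the toolchain's offloading kind:

```cpp
// Toy model only -- not Clang code. A tool picks the arguments to forward
// based on the offloading kind of the toolchain it builds jobs for.
#include <iostream>
#include <string>
#include <vector>

enum class OffloadKind { None, OpenMP, CUDA }; // hypothetical

struct ToolChainInfo {
  OffloadKind Kind;
};

std::vector<std::string> buildDeviceArgs(const ToolChainInfo &TC) {
  std::vector<std::string> Args = {"-triple", "nvptx64-nvidia-cuda"};
  if (TC.Kind == OffloadKind::OpenMP) {
    // The "trick" mentioned above: pull libdevice into the device-side
    // compilation when the offloading kind asks for it.
    Args.push_back("-mlink-bitcode-file=libdevice.bc");
  }
  return Args;
}

int main() {
  for (const auto &A : buildDeviceArgs({OffloadKind::OpenMP}))
    std::cout << A << ' ';
  std::cout << '\n';
}
```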

> Creating a separate toolchain for CUDA was a crutch that was available to
> craft an appropriate cc1 command line for device-side compilation using an
> existing toolchain. It works, but it's a rather rigid arrangement. Creating
> an NVPTX toolchain which can be parameterized to produce CUDA or OpenMP
> would be an improvement.
>
> Ideally toolchain tweaking should probably be done outside of the toolchain 
> itself so that it can be used with any combination of {CUDA or OpenMP target 
> tweaks}x{toolchains capable of generating target code}.


I agree. I decided to move all the offloading tweaking to the tools, given
that this is what the clang tool already does: it customizes the arguments
based on the `ToolChain` that is passed to it.

> > b) The building of the driver actions is unchanged.
> >
> > I don't create device-specific actions. Instead, only the
> > bundling/unbundling actions are inserted as the first or last action if
> > the file type requires that.
>
> Could you elaborate on that? The way I read it, the driver sees a linear
> chain of compilation steps plus bundling/unbundling at the beginning/end,
> and each action would result in multiple compiler invocations, presumably
> one per target.
>
> If that's the case, then it may present a bit of a challenge in case one
> part of the compilation depends on the results of another. That's the case
> for CUDA, where the results of device-side compilation must be present for
> host-side compilation so that we can generate additional code to initialize
> it at runtime.


That's right. I try to tackle the challenge of passing host/device results to
device/host jobs by using a cache of results, as described in d). The goal
here is to add the flexibility required to accommodate different offloading
models: in OpenMP we use host compile results in device compile jobs and
device link results in host link jobs, whereas in CUDA the assemble result is
used in the compile job. I believe that cache can include whatever
information is required to suit all needs; a toy sketch follows below.
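
For concreteness, here is a self-contained toy sketch of the cache shape I
have in mind (nothing here is code from the patch; all names are invented):

```cpp
// Toy sketch of the job-results cache: results are recorded per action and
// per host/device side, so a job on one side can pick up the result
// produced on the other side.
#include <map>
#include <string>
#include <utility>

enum class Side { Host, Device }; // hypothetical

using ActionID = unsigned;
// (action, side) -> filename of the result produced by that job.
using ResultCache = std::map<std::pair<ActionID, Side>, std::string>;

// E.g. an OpenMP device compile job fetching the matching host compile
// result so it can be passed along as an extra input.
std::string hostResultFor(const ResultCache &Cache, ActionID A) {
  auto It = Cache.find({A, Side::Host});
  return It == Cache.end() ? std::string() : It->second;
}

int main() {
  ResultCache Cache;
  Cache[{0, Side::Host}] = "foo.host.bc";
  return hostResultFor(Cache, 0).empty() ? 1 : 0; // 0: host result found
}
```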

> > c) Add an offloading kind to `ToolChain`.
> >
> > Offloading does not require a new toolchain to be created. Existing
> > toolchains are used, and the offloading kind is used to drive specific
> > behavior in each toolchain so that valid device code is generated.
> >
> > This is a major difference from what is currently done for CUDA. But I
> > guess the CUDA implementation easily fits this design, and the NVIDIA GPU
> > toolchain could be reused for both CUDA and OpenMP offloading.
>
> Sounds good. I'd be happy to make the changes necessary for CUDA support to
> use it.


Great! Thanks.

> > d) Use a job-results cache to easily use host results in device actions
> > and vice-versa.
> >
> > An array of the results for each job is kept so that a device job can use
> > the result previously generated for the host as its input, or vice-versa.
>
> Nice. That's something that will be handy for CUDA and may help to avoid
> passing bits of info about other jobs explicitly throughout the driver.
>
> > The result cache can also be updated to keep the information required by
> > the CUDA implementation to decide how host/device binaries are combined
> > (injection is the term used in the code). I don't have a concrete proposal
> > for that, however, given that it is not clear to me what the plans are for
> > CUDA to support separate compilation; I understand that the CUDA binary is
> > inserted directly into the host IR (Art, can you shed some light on this?).
>
> Currently CUDA depends on libcudart, which assumes that GPU code and its
> initialization are done the way nvcc does them. We currently include the PTX
> assembly (as in readable text) generated on the device side in the host-side
> IR *and* generate some host data structures and init code to register GPU
> binaries with libcudart. I haven't figured out a way to compile the
> host/device sides of CUDA without host-side compilation depending on device
> results.
>
> Long-term we're considering implementing CUDA runtime support based on the
> plain driver interface, which would give us more control over where we keep
> GPU code and how we initialize it. Then we could simplify things and, for
> example, incorporate GPU code via a linker script. Alas, for the time being
> we're stuck with libcudart and sequential device and host compilation
> phases.
>
> As for separate compilation -- the compilation part is doable. It's using
> the results of such a compilation that becomes tricky. CUDA's triple-bracket
> kernel launch syntax depends on libcudart and will not work, because we
> would not generate the init code. You can still launch kernels manually
> using the raw driver API, but it's quite a bit more convoluted.


Ok, I see. I am not aware of exactly what libcudart does, but I can elaborate
on what the OpenMP offloading implementation we have in place does:

We have a descriptor that is registered with the runtime library (we generate
a function for that, called before any global initializers are executed).
This descriptor has (among other things) fields that are initialized with the
symbols defined by the linker script (so that the runtime library can
immediately get the CUDA module), and also the names of the kernels (in
OpenMP we don't have user-defined names for these kernels, so we generate
some mangling to make sure they are unique). When launching a kernel, the
runtime gets a pointer from which it can easily retrieve the name, and the
CUDA driver API is used to get the CUDA function to be launched. We have been
successfully generating a CUDA module that works well with separate
compilation using ptxas and nvlink. A rough sketch of this mechanism follows
below.
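
To make that more concrete, here is a rough hand-written sketch; the
descriptor layout and the __omp_device_img_* symbol names are invented for
illustration (the real ones live in the clang-omp runtime), and only the cu*
calls are actual CUDA driver API:

```cpp
// Illustrative sketch only: invented descriptor and symbol names;
// cuModuleLoadData/cuModuleGetFunction are real CUDA driver API calls.
#include <cuda.h>

// Symbols the linker script defines around the embedded CUDA module image
// (hypothetical names).
extern "C" char __omp_device_img_start[];
extern "C" char __omp_device_img_end[];

// Descriptor the compiler emits; it is registered with the runtime library
// before any global initializers run.
struct OffloadDescriptor {
  const char *ImgStart;           // start of the embedded CUDA module
  const char *ImgEnd;             // end of the embedded CUDA module
  const char *const *KernelNames; // mangled, unique kernel names
  unsigned NumKernels;
};

// At launch time the runtime maps the host-side pointer back to a kernel
// name and asks the driver API for the corresponding function. (A real
// runtime would load the module once and cache it; this is simplified.)
CUfunction getKernel(const OffloadDescriptor &D, unsigned KernelIdx) {
  CUmodule M;
  cuModuleLoadData(&M, D.ImgStart);                     // load embedded module
  CUfunction F;
  cuModuleGetFunction(&F, M, D.KernelNames[KernelIdx]); // look up by name
  return F;
}
```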

Part of my work is also to port the runtime library in clang-omp to the LLVM
OpenMP project. I see CUDA as a simplified version of what OpenMP does, given
that the user controls the data mappings explicitly, so I am sure we can find
some synergies in the runtime library too, and you may be able to reuse
something that we already have in there.

Thanks!
Samuel

> --Artem



http://reviews.llvm.org/D9888


