tra added a comment.

In http://reviews.llvm.org/D9888#257904, @sfantao wrote:

> This diff refactors the original patch and is rebased on top of the latest 
> offloading changes inserted for CUDA.
>
> Here I don't touch the CUDA support. I tried, however, to have the 
> implementation modular enough so that it could eventually be combined with 
> the CUDA implementation. In my view OpenMP offloading is more general in the 
> sense that it does not refer to a given tool chain, instead it uses existing 
> toolchains to generate code for offloading devices. So, I believe that a tool 
> chain (which I did not include in this patch) targeting NVPTX will be able to 
> handle both CUDA and OpenMP offloading models.


What do you mean by "does not refer to a given toolchain"? Do you have the 
toolchain patch available?

Creating a separate toolchain for CUDA was a crutch that let us craft an 
appropriate cc1 command line for device-side compilation using an existing 
toolchain. It works, but it's a rather rigid arrangement. Creating an NVPTX 
toolchain which can be parameterized to produce either CUDA or OpenMP code 
would be an improvement.

Ideally toolchain tweaking should probably be done outside of the toolchain 
itself so that it can be used with any combination of {CUDA or OpenMP target 
tweaks}x{toolchains capable of generating target code}.
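
To make it concrete, here's roughly the kind of parameterization I have in 
mind -- a sketch only; none of these names exist in the patch, and the OpenMP 
flag is just a placeholder:

  #include <string>
  #include <vector>

  // Hypothetical sketch: keep the offloading "flavor" outside of the
  // toolchain itself and pass it in, so one NVPTX toolchain can serve both.
  enum class OffloadKind { None, CUDA, OpenMP };

  struct OffloadTweaks {
    OffloadKind Kind;

    // Adjust the device-side cc1 command line for the chosen offload model.
    void addDeviceArgs(std::vector<std::string> &CC1Args) const {
      switch (Kind) {
      case OffloadKind::CUDA:
        CC1Args.push_back("-fcuda-is-device");
        break;
      case OffloadKind::OpenMP:
        CC1Args.push_back("-fopenmp-is-device"); // assumed flag name
        break;
      case OffloadKind::None:
        break;
      }
    }
  };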

> b) The building of the driver actions is unchanged.
>
> I don't create device-specific actions. Instead, only the bundling/unbundling
> actions are inserted as the first or last action if the file type requires
> that.


Could you elaborate on that? The way I read it, the driver sees a linear chain 
of compilation steps plus bundling/unbundling at the beginning/end, and each 
action would result in multiple compiler invocations, presumably one per 
target.

If that's the case, then it may present a bit of a challenge when one part of 
the compilation depends on the results of another. That's the case for CUDA, 
where the results of device-side compilation must be available during 
host-side compilation so that we can generate additional code to initialize 
the GPU code at runtime.

> c) Add offloading kind to `ToolChain`
>
> Offloading does not require a new toolchain to be created. Existing
> toolchains are used, and the offloading kind is used to drive specific
> behavior in each toolchain so that valid device code is generated.
>
> This is a major difference from what is currently done for CUDA. But I guess
> the CUDA implementation easily fits this design and the Nvidia GPU toolchain
> could be reused for both CUDA and OpenMP offloading.


Sounds good. I'd be happy to make the necessary changes so that CUDA support 
uses it.

> d) Use Job results cache to easily use host results in device actions and
> vice-versa.
>
> An array of the results for each job is kept so that the device job can use
> the result previously generated for the host and use it as input, or
> vice-versa.


Nice. That's something that will be handy for CUDA and may help to avoid 
passing bits of info about other jobs explicitly throughout the driver.
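
For CUDA, something along these lines would probably be enough -- a sketch 
with made-up names, keyed by the driver action that produced the results:

  #include <map>
  #include <string>
  #include <vector>

  // Hypothetical sketch: per driver action, remember the result produced for
  // the host and for each offload device, so that a device job can pick up
  // the host result as an extra input and vice-versa.
  using ActionKey = const void *;           // stands in for the driver's Action*
  struct OffloadResults {
    std::string HostResult;                 // e.g. host-side object or IR file
    std::vector<std::string> DeviceResults; // one entry per offload target
  };
  std::map<ActionKey, OffloadResults> OffloadingResultsCache;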

> The result cache can also be updated to keep the required information for the
> CUDA implementation to decide how host/device binaries are combined (injection
> is the term used in the code). I don't have a concrete proposal for that,
> however, given that it is not clear to me what the plans are for CUDA to
> support separate compilation; I understand that the CUDA binary is inserted
> directly into the host IR (Art, can you shed some light on this?).


Currently CUDA depends on libcudart, which assumes that GPU code and its 
initialization are done the way nvcc does them. Today we include the PTX 
assembly (as in readable text) generated on the device side into the host-side 
IR *and* generate some host data structures and init code to register GPU 
binaries with libcudart. I haven't figured out a way to compile the host and 
device sides of CUDA without the host-side compilation depending on device 
results.

Long-term, we're considering implementing CUDA runtime support based on the 
plain driver interface, which would give us more control over where we keep 
GPU code and how we initialize it. Then we could simplify things and, for 
example, incorporate GPU code via a linker script. Alas, for the time being 
we're stuck with libcudart and sequential device and host compilation phases.
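
To illustrate the linker-script idea: if the GPU image were linked into the 
host binary as a plain data blob (via a linker script or objcopy), the runtime 
support could locate it through start/end symbols. A sketch, with the usual 
objcopy-style symbol names assumed for illustration:

  #include <cstddef>

  // Sketch: locate a GPU image linked in as raw data (e.g. via
  // "objcopy -I binary" or an equivalent linker script). The _binary_* names
  // follow the usual objcopy convention for a file called kernel.ptx and are
  // assumed here.
  extern "C" const char _binary_kernel_ptx_start[];
  extern "C" const char _binary_kernel_ptx_end[];

  const char *gpuImageBegin() { return _binary_kernel_ptx_start; }
  std::size_t gpuImageSize() {
    return static_cast<std::size_t>(_binary_kernel_ptx_end -
                                    _binary_kernel_ptx_start);
  }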

As for separate compilation -- the compilation part is doable. It's using the 
results of such a compilation that becomes tricky. CUDA's triple-bracket 
kernel launch syntax depends on libcudart and will not work, because we would 
not generate the init code. You can still launch kernels manually using the 
raw driver API, but it's quite a bit more convoluted.
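
For reference, a manual launch through the driver API looks roughly like this. 
A minimal sketch: error checking is omitted, and the kernel name "kern", its 
single int parameter, and the PTX string are all assumptions:

  #include <cuda.h>

  // Minimal sketch of launching a kernel via the raw CUDA driver API instead
  // of the <<<...>>> syntax. Error checking omitted for brevity.
  void launchManually(const char *PtxImage) {
    CUdevice Dev;
    CUcontext Ctx;
    CUmodule Mod;
    CUfunction Fn;

    cuInit(0);
    cuDeviceGet(&Dev, 0);
    cuCtxCreate(&Ctx, 0, Dev);

    // This is roughly the work libcudart's registration/init code does for us
    // behind the scenes in the nvcc-style flow.
    cuModuleLoadData(&Mod, PtxImage);
    cuModuleGetFunction(&Fn, Mod, "kern");  // assumed kernel name

    int Arg = 42;                           // assume kern takes a single int
    void *Params[] = {&Arg};
    cuLaunchKernel(Fn, /*gridDim=*/1, 1, 1, /*blockDim=*/1, 1, 1,
                   /*sharedMemBytes=*/0, /*stream=*/nullptr, Params, nullptr);
    cuCtxSynchronize();
  }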

--Artem


================
Comment at: include/clang/Driver/Driver.h:208
@@ +207,3 @@
+  /// CreateUnbundledOffloadingResult - Create a command to unbundle the input
+  /// and use the resulting input info. If there re inputs already cached in
+  /// OffloadingHostResults for that action use them instead. If no offloading
----------------
re -> are

================
Comment at: include/clang/Driver/Driver.h:210
@@ +209,3 @@
+  /// OffloadingHostResults for that action use them instead. If no offloading
+  /// is being support just return the provided input info.
+  InputInfo CreateUnbundledOffloadingResult(
----------------
"If offloading is not supported" perhaps?

================
Comment at: lib/Driver/Driver.cpp:2090
@@ +2089,3 @@
+          dyn_cast<OffloadUnbundlingJobAction>(A)) {
+    // The input of the unbundling job has to a single input non-source file,
+    // so we do not consider it having multiple architectures. We just use the
----------------
"has to be"


http://reviews.llvm.org/D9888


