jhuber6 created this revision. jhuber6 added reviewers: jdoerfert, JonChesterfield, ronlieb, yaxunl, tra. Herald added subscribers: carlosgalvezp, dexonsmith. Herald added a project: All. jhuber6 requested review of this revision. Herald added subscribers: cfe-commits, sstefan1, MaskRay. Herald added a project: clang.
The changes made in D123460 <https://reviews.llvm.org/D123460> generalized the code generation for OpenMP's offloading entries. We can use the same scheme to register globals for CUDA code. This patch adds the code generation to create these offloading entries when compiling using the new offloading driver mode. The offloading entries are simple structs that contain the information necessary to register the global. The struct used is as follows: Type struct __tgt_offload_entry { void *addr; // Pointer to the offload entry info. // (function or global) char *name; // Name of the function or global. size_t size; // Size of the entry info (0 if it a function). int32_t flags; int32_t reserved; }; Currently CUDA handles RDC code generation by deferring the registration of globals in the current TU to a callback function containing the modules ID. Later all the module IDs will be used to register all of the globals at once. Rather than mimic this, offloading entries allow us to mimic the way OpenMP registers globals. That is, we create a simple global struct for each device global to be registered. These are placed at a special section `cuda_offloading_entires`. Because this section is a valid C-identifier, the linker will profide a `__start` and `__stop` pointer that we can use to iterate and register all globals at runtime. the registration requires a flag variable to indicate which registration function to use. I have assigned the flags somewhat arbitrarily, but these use the following values. Kernel: 0 Variable: 0 Managed: 1 Surface: 2 Texture: 4 Depends on D120272 <https://reviews.llvm.org/D120272> Repository: rG LLVM Github Monorepo https://reviews.llvm.org/D123471 Files: clang/include/clang/Basic/LangOptions.def clang/include/clang/Driver/Options.td clang/lib/CodeGen/CGCUDANV.cpp clang/lib/CodeGen/CGCUDARuntime.h clang/lib/Driver/ToolChains/Clang.cpp clang/test/CodeGenCUDA/offloading-entries.cu
Index: clang/test/CodeGenCUDA/offloading-entries.cu =================================================================== --- /dev/null +++ clang/test/CodeGenCUDA/offloading-entries.cu @@ -0,0 +1,33 @@ +// NOTE: Assertions have been autogenerated by utils/update_cc_test_checks.py UTC_ARGS: --check-globals +// RUN: %clang_cc1 -std=c++11 -triple x86_64-unknown-linux-gnu \ +// RUN: -foffload-new-driver -emit-llvm -o - -x cuda %s | FileCheck \ +// RUN: --check-prefix=HOST %s + +#include "Inputs/cuda.h" + +//. +// HOST: @x = internal global i32 undef, align 4 +// HOST: @.omp_offloading.entry_name = internal unnamed_addr constant [8 x i8] c"_Z3foov\00" +// HOST: @.omp_offloading.entry._Z3foov = weak constant %struct.__tgt_offload_entry { i8* bitcast (void ()* @_Z18__device_stub__foov to i8*), i8* getelementptr inbounds ([8 x i8], [8 x i8]* @.omp_offloading.entry_name, i32 0, i32 0), i64 0, i32 0, i32 0 }, section "cuda_offloading_entries", align 1 +// HOST: @.omp_offloading.entry_name.1 = internal unnamed_addr constant [8 x i8] c"_Z3barv\00" +// HOST: @.omp_offloading.entry._Z3barv = weak constant %struct.__tgt_offload_entry { i8* bitcast (void ()* @_Z18__device_stub__barv to i8*), i8* getelementptr inbounds ([8 x i8], [8 x i8]* @.omp_offloading.entry_name.1, i32 0, i32 0), i64 0, i32 0, i32 0 }, section "cuda_offloading_entries", align 1 +// HOST: @.omp_offloading.entry_name.2 = internal unnamed_addr constant [2 x i8] c"x\00" +// HOST: @.omp_offloading.entry.x = weak constant %struct.__tgt_offload_entry { i8* bitcast (i32* @x to i8*), i8* getelementptr inbounds ([2 x i8], [2 x i8]* @.omp_offloading.entry_name.2, i32 0, i32 0), i64 4, i32 0, i32 0 }, section "cuda_offloading_entries", align 1 +//. +// HOST-LABEL: @_Z18__device_stub__foov( +// HOST-NEXT: entry: +// HOST-NEXT: [[TMP0:%.*]] = call i32 @cudaLaunch(i8* bitcast (void ()* @_Z18__device_stub__foov to i8*)) +// HOST-NEXT: br label [[SETUP_END:%.*]] +// HOST: setup.end: +// HOST-NEXT: ret void +// +__global__ void foo() {} +// HOST-LABEL: @_Z18__device_stub__barv( +// HOST-NEXT: entry: +// HOST-NEXT: [[TMP0:%.*]] = call i32 @cudaLaunch(i8* bitcast (void ()* @_Z18__device_stub__barv to i8*)) +// HOST-NEXT: br label [[SETUP_END:%.*]] +// HOST: setup.end: +// HOST-NEXT: ret void +// +__global__ void bar() {} +__device__ int x = 1; Index: clang/lib/Driver/ToolChains/Clang.cpp =================================================================== --- clang/lib/Driver/ToolChains/Clang.cpp +++ clang/lib/Driver/ToolChains/Clang.cpp @@ -6083,6 +6083,10 @@ CmdArgs.push_back("-fno-openmp-extensions"); } + // Forward the new driver to change offloading code generation. + if (Args.hasArg(options::OPT_foffload_new_driver)) + CmdArgs.push_back("-foffload-new-driver"); + SanitizeArgs.addArgs(TC, Args, CmdArgs, InputType); const XRayArgs &XRay = TC.getXRayArgs(); Index: clang/lib/CodeGen/CGCUDARuntime.h =================================================================== --- clang/lib/CodeGen/CGCUDARuntime.h +++ clang/lib/CodeGen/CGCUDARuntime.h @@ -52,6 +52,24 @@ Texture, // Builtin texture }; + /// The kind flag of the target region entry. + enum OffloadRegionEntryKindFlag : uint32_t { + /// Mark the region entry as a kernel. + OffloadRegionKernelEntry = 0x0, + }; + + /// The kind flag of the global variable entry. + enum OffloadVarEntryKindFlag : uint32_t { + /// Mark the entry as a global variable. + OffloadGlobalVarEntry = 0x0, + /// Mark the entry as a managed global variable. + OffloadGlobalManagedEntry = 0x1, + /// Mark the entry as a surface variable. + OffloadGlobalSurfaceEntry = 0x2, + /// Mark the entry as a texture variable. + OffloadGlobalTextureEntry = 0x4, + }; + private: unsigned Kind : 2; unsigned Extern : 1; Index: clang/lib/CodeGen/CGCUDANV.cpp =================================================================== --- clang/lib/CodeGen/CGCUDANV.cpp +++ clang/lib/CodeGen/CGCUDANV.cpp @@ -157,6 +157,8 @@ llvm::Function *makeModuleDtorFunction(); /// Transform managed variables for device compilation. void transformManagedVars(); + /// Create offloading entries to register globals in RDC mode. + void createOffloadingEntries(); public: CGNVCUDARuntime(CodeGenModule &CGM); @@ -210,7 +212,8 @@ CGNVCUDARuntime::CGNVCUDARuntime(CodeGenModule &CGM) : CGCUDARuntime(CGM), Context(CGM.getLLVMContext()), TheModule(CGM.getModule()), - RelocatableDeviceCode(CGM.getLangOpts().GPURelocatableDeviceCode), + RelocatableDeviceCode(CGM.getLangOpts().GPURelocatableDeviceCode || + CGM.getLangOpts().OffloadingNewDriver), DeviceMC(InitDeviceMC(CGM)) { CodeGen::CodeGenTypes &Types = CGM.getTypes(); ASTContext &Ctx = CGM.getContext(); @@ -1107,6 +1110,40 @@ } } +// Creates offloading entries for all the kernels and globals that must be +// registered. The linker will provide a pointer to this section so we can +// register the symbols with the linked device image. +void CGNVCUDARuntime::createOffloadingEntries() { + llvm::OpenMPIRBuilder OMPBuilder(CGM.getModule()); + OMPBuilder.initialize(); + + StringRef Section = "cuda_offloading_entries"; + for (KernelInfo &I : EmittedKernels) + OMPBuilder.emitOffloadingEntry( + KernelHandles[I.Kernel], getDeviceSideName(cast<NamedDecl>(I.D)), 0, + DeviceVarFlags::OffloadRegionKernelEntry, Section); + + for (VarInfo &I : DeviceVars) { + uint64_t VarSize = + CGM.getDataLayout().getTypeAllocSize(I.Var->getValueType()); + if (I.Flags.getKind() == DeviceVarFlags::Variable) { + OMPBuilder.emitOffloadingEntry( + I.Var, getDeviceSideName(I.D), VarSize, + I.Flags.isManaged() ? DeviceVarFlags::OffloadGlobalManagedEntry + : DeviceVarFlags::OffloadGlobalVarEntry, + Section); + } else if (I.Flags.getKind() == DeviceVarFlags::Surface) { + OMPBuilder.emitOffloadingEntry(I.Var, getDeviceSideName(I.D), VarSize, + DeviceVarFlags::OffloadGlobalSurfaceEntry, + Section); + } else if (I.Flags.getKind() == DeviceVarFlags::Texture) { + OMPBuilder.emitOffloadingEntry(I.Var, getDeviceSideName(I.D), VarSize, + DeviceVarFlags::OffloadGlobalTextureEntry, + Section); + } + } +} + // Returns module constructor to be added. llvm::Function *CGNVCUDARuntime::finalizeModule() { if (CGM.getLangOpts().CUDAIsDevice) { @@ -1135,7 +1172,11 @@ } return nullptr; } - return makeModuleCtorFunction(); + if (!(CGM.getLangOpts().OffloadingNewDriver && RelocatableDeviceCode)) + return makeModuleCtorFunction(); + + createOffloadingEntries(); + return nullptr; } llvm::GlobalValue *CGNVCUDARuntime::getKernelHandle(llvm::Function *F, Index: clang/include/clang/Driver/Options.td =================================================================== --- clang/include/clang/Driver/Options.td +++ clang/include/clang/Driver/Options.td @@ -2513,10 +2513,10 @@ PosFlag<SetTrue, [CC1Option]>, NegFlag<SetFalse>, BothFlags<[NoArgumentUnused, HelpHidden]>>; def static_openmp: Flag<["-"], "static-openmp">, HelpText<"Use the static host OpenMP runtime while linking.">; -def fopenmp_new_driver : Flag<["-"], "fopenmp-new-driver">, Flags<[CC1Option]>, Group<Action_Group>, +def fopenmp_new_driver : Flag<["-"], "fopenmp-new-driver">, Flags<[CC1Option]>, Group<f_Group>, HelpText<"Use the new driver for OpenMP offloading.">; -def foffload_new_driver : Flag<["-"], "foffload-new-driver">, Flags<[CC1Option]>, Group<Action_Group>, - HelpText<"Use the new driver for offloading.">; +def foffload_new_driver : Flag<["-"], "foffload-new-driver">, Flags<[CC1Option]>, Group<f_Group>, + MarshallingInfoFlag<LangOpts<"OffloadingNewDriver">>, HelpText<"Use the new driver for offloading.">; def fno_optimize_sibling_calls : Flag<["-"], "fno-optimize-sibling-calls">, Group<f_Group>; def foptimize_sibling_calls : Flag<["-"], "foptimize-sibling-calls">, Group<f_Group>; defm escaping_block_tail_calls : BoolFOption<"escaping-block-tail-calls", Index: clang/include/clang/Basic/LangOptions.def =================================================================== --- clang/include/clang/Basic/LangOptions.def +++ clang/include/clang/Basic/LangOptions.def @@ -266,6 +266,7 @@ LANGOPT(GPUMaxThreadsPerBlock, 32, 1024, "default max threads per block for kernel launch bounds for HIP") LANGOPT(GPUDeferDiag, 1, 0, "defer host/device related diagnostic messages for CUDA/HIP") LANGOPT(GPUExcludeWrongSideOverloads, 1, 0, "always exclude wrong side overloads in overloading resolution for CUDA/HIP") +LANGOPT(OffloadingNewDriver, 1, 0, "use the new driver for generating offloading code.") LANGOPT(SYCLIsDevice , 1, 0, "Generate code for SYCL device") LANGOPT(SYCLIsHost , 1, 0, "SYCL host compilation")
_______________________________________________ cfe-commits mailing list cfe-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits