[PATCH] D44435: CUDA ctor/dtor Module-Unique Symbol Name

2018-05-09 Thread Simeon Ehrig via Phabricator via cfe-commits
SimeonEhrig added a comment.

Okay, I will close the request and thank you very much for your help and your 
hints.


https://reviews.llvm.org/D44435



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D44435: Add the module name to __cuda_module_ctor and __cuda_module_dtor for unique function names

2018-03-13 Thread Simeon Ehrig via Phabricator via cfe-commits
SimeonEhrig created this revision.
SimeonEhrig added reviewers: karies, v.g.vassilev, rsmith, rjmccall.

This allows multi-module / incremental compilation environments to have unique 
global CUDA constructor and destructor function names.


Repository:
  rC Clang

https://reviews.llvm.org/D44435

Files:
  lib/CodeGen/CGCUDANV.cpp
  unittests/CodeGen/IncrementalProcessingTest.cpp

Index: unittests/CodeGen/IncrementalProcessingTest.cpp
===
--- unittests/CodeGen/IncrementalProcessingTest.cpp
+++ unittests/CodeGen/IncrementalProcessingTest.cpp
@@ -21,9 +21,11 @@
 #include "llvm/IR/Module.h"
 #include "llvm/Support/Host.h"
 #include "llvm/Support/MemoryBuffer.h"
+#include "llvm/Target/TargetOptions.h"
 #include "gtest/gtest.h"
 
 #include <memory>
+#include <array>
 
 using namespace llvm;
 using namespace clang;
@@ -171,4 +173,122 @@
 
 }
 
+
+// If CUDA mode in the compiler instance is enabled, a CUDA ctor or dtor will
+// be generated for every statement, if a fatbinary file exists.
+const char CUDATestProgram1[] =
+"void cudaFunc1(){}\n";
+
+const char CUDATestProgram2[] =
+"void cudaFunc2(){}\n";
+
+const Function* getCUDActor(llvm::Module& M) {
+  for (const auto& Func: M)
+if (Func.hasName() && Func.getName().startswith("__cuda_module_ctor-"))
+  return &Func;
+
+  return nullptr;
+}
+
+const Function* getCUDAdtor(llvm::Module& M) {
+  for (const auto& Func: M)
+if (Func.hasName() && Func.getName().startswith("__cuda_module_dtor-"))
+  return &Func;
+
+  return nullptr;
+}
+
+TEST(IncrementalProcessing, EmitCUDAGlobalInitFunc) {
+LLVMContext Context;
+CompilerInstance compiler;
+
+compiler.createDiagnostics();
+compiler.getLangOpts().CPlusPlus = 1;
+compiler.getLangOpts().CPlusPlus11 = 1;
+compiler.getLangOpts().CUDA = 1;
+
+compiler.getTargetOpts().Triple = llvm::Triple::normalize(
+llvm::sys::getProcessTriple());
+compiler.setTarget(clang::TargetInfo::CreateTargetInfo(
+  compiler.getDiagnostics(),
+  std::make_shared<clang::TargetOptions>(
+compiler.getTargetOpts())));
+
+// To enable the generation of CUDA host code, the auxTriple needs to be
+// set up.
+llvm::Triple hostTriple(llvm::sys::getProcessTriple());
+compiler.getFrontendOpts().AuxTriple =
+hostTriple.isArch64Bit() ? "nvptx64-nvidia-cuda" : "nvptx-nvidia-cuda";
+auto targetOptions = std::make_shared<clang::TargetOptions>();
+targetOptions->Triple = compiler.getFrontendOpts().AuxTriple;
+targetOptions->HostTriple = compiler.getTarget().getTriple().str();
+compiler.setAuxTarget(clang::TargetInfo::CreateTargetInfo(
+compiler.getDiagnostics(), targetOptions));
+
+// A fatbinary file is necessary so that the code generator generates the
+// ctor and dtor.
+auto tmpFatbinFileOrError = llvm::sys::fs::TempFile::create("dummy.fatbin");
+ASSERT_TRUE((bool)tmpFatbinFileOrError);
+auto tmpFatbinFile = std::move(*tmpFatbinFileOrError);
+compiler.getCodeGenOpts().CudaGpuBinaryFileName = tmpFatbinFile.TmpName;
+
+compiler.createFileManager();
+compiler.createSourceManager(compiler.getFileManager());
+compiler.createPreprocessor(clang::TU_Prefix);
+compiler.getPreprocessor().enableIncrementalProcessing();
+
+compiler.createASTContext();
+
+CodeGenerator* CG =
+CreateLLVMCodeGen(
+compiler.getDiagnostics(),
+"main-module",
+compiler.getHeaderSearchOpts(),
+compiler.getPreprocessorOpts(),
+compiler.getCodeGenOpts(),
+Context);
+
+compiler.setASTConsumer(std::unique_ptr<ASTConsumer>(CG));
+compiler.createSema(clang::TU_Prefix, nullptr);
+Sema& S = compiler.getSema();
+
+std::unique_ptr<Parser> ParseOP(new Parser(S.getPreprocessor(), S,
+   /*SkipFunctionBodies*/ false));
+Parser &P = *ParseOP.get();
+
+std::array<std::unique_ptr<llvm::Module>, 3> M;
+M[0] = IncrementalParseAST(compiler, P, *CG, nullptr);
+ASSERT_TRUE(M[0]);
+
+M[1] = IncrementalParseAST(compiler, P, *CG, CUDATestProgram1);
+ASSERT_TRUE(M[1]);
+ASSERT_TRUE(M[1]->getFunction("_Z9cudaFunc1v"));
+
+M[2] = IncrementalParseAST(compiler, P, *CG, CUDATestProgram2);
+ASSERT_TRUE(M[2]);
+ASSERT_TRUE(M[2]->getFunction("_Z9cudaFunc2v"));
+// First code should not end up in second module:
+ASSERT_FALSE(M[2]->getFunction("_Z9cudaFunc1v"));
+
+// Make sure that the CUDA ctors and dtors exist:
+const Function* CUDActor1 = getCUDActor(*M[1]);
+ASSERT_TRUE(CUDActor1);
+
+const Function* CUDActor2 = getCUDActor(*M[2]);
+ASSERT_TRUE(CUDActor2);
+
+const Function* CUDAdtor1 = getCUDAdtor(*M[1]);
+ASSERT_TRUE(CUDAdtor1);
+
+const Function* CUDAdtor2 = getCUDAdtor(*M[2]);
+ASSERT_TRUE(CUDAdtor2);
+
+// Compare the names of the ctors and dtors to check that they are
+// unique.
+ASSERT_FALSE(CUDActor1->getName() == CUDActor2->getName());
+ASSERT_FALSE(CUDAdtor1->getName() == CUDAdtor2->getName());

[PATCH] D44435: Add the module name to __cuda_module_ctor and __cuda_module_dtor for unique function names

2018-03-13 Thread Simeon Ehrig via Phabricator via cfe-commits
SimeonEhrig updated this revision to Diff 138224.
SimeonEhrig added a comment.

Changed the comment of the example function for TEST(IncrementalProcessing,
EmitCUDAGlobalInitFunc).


Repository:
  rC Clang

https://reviews.llvm.org/D44435

Files:
  lib/CodeGen/CGCUDANV.cpp
  unittests/CodeGen/IncrementalProcessingTest.cpp

Index: unittests/CodeGen/IncrementalProcessingTest.cpp
===
--- unittests/CodeGen/IncrementalProcessingTest.cpp
+++ unittests/CodeGen/IncrementalProcessingTest.cpp
@@ -21,9 +21,11 @@
 #include "llvm/IR/Module.h"
 #include "llvm/Support/Host.h"
 #include "llvm/Support/MemoryBuffer.h"
+#include "llvm/Target/TargetOptions.h"
 #include "gtest/gtest.h"
 
 #include <memory>
+#include <array>
 
 using namespace llvm;
 using namespace clang;
@@ -171,4 +173,122 @@
 
 }
 
+
+// In CUDA incremental processing, a CUDA ctor or dtor will be generated for 
+// every statement if a fatbinary file exists.
+const char CUDATestProgram1[] =
+"void cudaFunc1(){}\n";
+
+const char CUDATestProgram2[] =
+"void cudaFunc2(){}\n";
+
+const Function* getCUDActor(llvm::Module& M) {
+  for (const auto& Func: M)
+if (Func.hasName() && Func.getName().startswith("__cuda_module_ctor-"))
+  return &Func;
+
+  return nullptr;
+}
+
+const Function* getCUDAdtor(llvm::Module& M) {
+  for (const auto& Func: M)
+if (Func.hasName() && Func.getName().startswith("__cuda_module_dtor-"))
+  return &Func;
+
+  return nullptr;
+}
+
+TEST(IncrementalProcessing, EmitCUDAGlobalInitFunc) {
+LLVMContext Context;
+CompilerInstance compiler;
+
+compiler.createDiagnostics();
+compiler.getLangOpts().CPlusPlus = 1;
+compiler.getLangOpts().CPlusPlus11 = 1;
+compiler.getLangOpts().CUDA = 1;
+
+compiler.getTargetOpts().Triple = llvm::Triple::normalize(
+llvm::sys::getProcessTriple());
+compiler.setTarget(clang::TargetInfo::CreateTargetInfo(
+  compiler.getDiagnostics(),
+  std::make_shared<clang::TargetOptions>(
+compiler.getTargetOpts())));
+
+// To enable the generation of CUDA host code, the auxTriple needs to be
+// set up.
+llvm::Triple hostTriple(llvm::sys::getProcessTriple());
+compiler.getFrontendOpts().AuxTriple =
+hostTriple.isArch64Bit() ? "nvptx64-nvidia-cuda" : "nvptx-nvidia-cuda";
+auto targetOptions = std::make_shared<clang::TargetOptions>();
+targetOptions->Triple = compiler.getFrontendOpts().AuxTriple;
+targetOptions->HostTriple = compiler.getTarget().getTriple().str();
+compiler.setAuxTarget(clang::TargetInfo::CreateTargetInfo(
+compiler.getDiagnostics(), targetOptions));
+
+// A fatbinary file is necessary so that the code generator generates the
+// ctor and dtor.
+auto tmpFatbinFileOrError = llvm::sys::fs::TempFile::create("dummy.fatbin");
+ASSERT_TRUE((bool)tmpFatbinFileOrError);
+auto tmpFatbinFile = std::move(*tmpFatbinFileOrError);
+compiler.getCodeGenOpts().CudaGpuBinaryFileName = tmpFatbinFile.TmpName;
+
+compiler.createFileManager();
+compiler.createSourceManager(compiler.getFileManager());
+compiler.createPreprocessor(clang::TU_Prefix);
+compiler.getPreprocessor().enableIncrementalProcessing();
+
+compiler.createASTContext();
+
+CodeGenerator* CG =
+CreateLLVMCodeGen(
+compiler.getDiagnostics(),
+"main-module",
+compiler.getHeaderSearchOpts(),
+compiler.getPreprocessorOpts(),
+compiler.getCodeGenOpts(),
+Context);
+
+compiler.setASTConsumer(std::unique_ptr<ASTConsumer>(CG));
+compiler.createSema(clang::TU_Prefix, nullptr);
+Sema& S = compiler.getSema();
+
+std::unique_ptr<Parser> ParseOP(new Parser(S.getPreprocessor(), S,
+   /*SkipFunctionBodies*/ false));
+Parser &P = *ParseOP.get();
+
+std::array<std::unique_ptr<llvm::Module>, 3> M;
+M[0] = IncrementalParseAST(compiler, P, *CG, nullptr);
+ASSERT_TRUE(M[0]);
+
+M[1] = IncrementalParseAST(compiler, P, *CG, CUDATestProgram1);
+ASSERT_TRUE(M[1]);
+ASSERT_TRUE(M[1]->getFunction("_Z9cudaFunc1v"));
+
+M[2] = IncrementalParseAST(compiler, P, *CG, CUDATestProgram2);
+ASSERT_TRUE(M[2]);
+ASSERT_TRUE(M[2]->getFunction("_Z9cudaFunc2v"));
+// First code should not end up in second module:
+ASSERT_FALSE(M[2]->getFunction("_Z9cudaFunc1v"));
+
+// Make sure that the CUDA ctors and dtors exist:
+const Function* CUDActor1 = getCUDActor(*M[1]);
+ASSERT_TRUE(CUDActor1);
+
+const Function* CUDActor2 = getCUDActor(*M[2]);
+ASSERT_TRUE(CUDActor2);
+
+const Function* CUDAdtor1 = getCUDAdtor(*M[1]);
+ASSERT_TRUE(CUDAdtor1);
+
+const Function* CUDAdtor2 = getCUDAdtor(*M[2]);
+ASSERT_TRUE(CUDAdtor2);
+
+// Compare the names of the ctors and dtors to check that they are
+// unique.
+ASSERT_FALSE(CUDActor1->getName() == CUDActor2->getName());
+ASSERT_FALSE(CUDAdtor1->getName() == CUDAdtor2->getName());
+
+ASSERT_FALSE((bool)tmpFatbinFile.discard());

[PATCH] D44435: Add the module name to __cuda_module_ctor and __cuda_module_dtor for unique function names

2018-03-14 Thread Simeon Ehrig via Phabricator via cfe-commits
SimeonEhrig marked an inline comment as done.
SimeonEhrig added inline comments.



Comment at: lib/CodeGen/CGCUDANV.cpp:281
 
+  // get name from the module to generate unique ctor name for every module
+  SmallString<128> ModuleName

rjmccall wrote:
> Please explain in the comment *why* you're doing this.  It's just for 
> debugging, right?  So that it's known which object file the constructor 
> function comes from.
The motivation is the same as in this review: https://reviews.llvm.org/D34059
We are trying to enable incremental compilation of CUDA runtime code, so we
need unique ctor/dtor names to handle the CUDA device code across different
modules.



Comment at: lib/CodeGen/CGCUDANV.cpp:281
 
+  // get name from the module to generate unique ctor name for every module
+  SmallString<128> ModuleName

tra wrote:
> SimeonEhrig wrote:
> > rjmccall wrote:
> > > Please explain in the comment *why* you're doing this.  It's just for 
> > > debugging, right?  So that it's known which object file the constructor 
> > > function comes from.
> > The motivation is the same as in this review: https://reviews.llvm.org/D34059
> > We are trying to enable incremental compilation of CUDA runtime code, so we
> > need unique ctor/dtor names to handle the CUDA device code across different
> > modules.
> I'm also interested in the motivation for this change.
> 
> Also, if the goal is to have a unique module identifier, would compiling two
> different files with the same name be a problem? If the goal is to help
> identify a module, this may be OK, if not ideal. If you really need to
> have a unique name, then you may need to do something more elaborate. NVCC
> appears to use some random number (or hash of something?) for that.
We need this modification for our C++ interpreter Cling, which we want to
extend to interpret CUDA runtime code. Effectively, it is a JIT which reads in
the program code line by line. Every line gets its own llvm::Module. The
interpreter works with incremental and lazy compilation. Because of the lazy
compilation, we need this modification. In CUDA mode, clang generates a
__cuda_module_ctor and __cuda_module_dtor for every module, if the compiler
was started with a path to a fatbinary file. But the ctor also depends on the
source code that is translated to LLVM IR in the module. For example, if a
__global__ kernel is defined, CodeGen adds the function call
__cuda_register_globals() to the ctor. But the lazy compilation prevents us
from translating a function that has already been translated. Without the
modification, the interpreter thinks that the ctor is always the same and uses
the first translation of the function that was generated. Therefore, it is
impossible to add new kernels.
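
To make the failure mode concrete, here is a purely illustrative sketch (the
module names A and B are hypothetical):

```
// Without unique names, every incremental input yields a module that
// defines the same symbol:
//
//   module A: __cuda_module_ctor()   // registers the kernels of input 1
//   module B: __cuda_module_ctor()   // registers the kernels of input 2
//
// With lazy compilation the JIT resolves "__cuda_module_ctor" once, keeps
// module A's definition, and module B's ctor never runs -- so its kernels
// are never registered. Appending the module name yields distinct symbols:
//
//   module A: __cuda_module_ctor_A
//   module B: __cuda_module_ctor_B
```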



Comment at: unittests/CodeGen/IncrementalProcessingTest.cpp:176-178
+
+// In CUDA incremental processing, a CUDA ctor or dtor will be generated for 
+// every statement if a fatbinary file exists.

tra wrote:
> I don't understand the comment. What is 'CUDA incremental processing' and 
> what exactly is meant by 'statement' here? I'd appreciate if you could give 
> me more details. My understanding is that ctor/dtor are generated once per 
> TU. I suspect "incremental processing" may change that, but I have no idea 
> what exactly does it do.
A CUDA ctor/dtor is generated for every llvm::Module. The TU can also be
composed of many modules. In our interpreter, we add new code to our AST with
new modules at runtime.
The ctor/dtor generation depends on the fatbinary code. The CodeGen checks
whether a path to a fatbinary file is set. If it is, it generates a ctor with
at least a __cudaRegisterFatBinary() function call. So the generation is
independent of the source code in the module, and we can use any statement. A
statement can be an expression, a declaration, a definition, and so on.
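
For illustration, here is a rough C++ rendering of what the emitted host-side
ctor/dtor do (a sketch only: the real code is emitted as LLVM IR by CGCUDANV,
and the wrapper type and handle names here are illustrative stand-ins):

```
// Sketch: approximates the host-side registration code CGCUDANV emits.
// FatbinWrapperT and GpuBinaryHandle are stand-ins, not the real ABI names.
struct FatbinWrapperT { int Magic; int Version; const void *Data; void *Unused; };

extern "C" void **__cudaRegisterFatBinary(void *Wrapper);  // from cudart
extern "C" void __cudaUnregisterFatBinary(void **Handle);  // from cudart

static FatbinWrapperT Wrapper;          // would reference the embedded fatbin
static void **GpuBinaryHandle = nullptr;

static void sketch_cuda_module_ctor() {
  // Always emitted when a fatbinary path is set:
  GpuBinaryHandle = __cudaRegisterFatBinary(&Wrapper);
  // Emitted in addition if the module defines kernels or device globals:
  // __cuda_register_globals(GpuBinaryHandle);
}

static void sketch_cuda_module_dtor() {
  __cudaUnregisterFatBinary(GpuBinaryHandle);
}
```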


Repository:
  rC Clang

https://reviews.llvm.org/D44435



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D44435: Add the module name to __cuda_module_ctor and __cuda_module_dtor for unique function names

2018-03-15 Thread Simeon Ehrig via Phabricator via cfe-commits
SimeonEhrig added inline comments.



Comment at: unittests/CodeGen/IncrementalProcessingTest.cpp:176-178
+
+// In CUDA incremental processing, a CUDA ctor or dtor will be generated for 
+// every statement if a fatbinary file exists.

tra wrote:
> SimeonEhrig wrote:
> > tra wrote:
> > > I don't understand the comment. What is 'CUDA incremental processing' and 
> > > what exactly is meant by 'statement' here? I'd appreciate if you could 
> > > give me more details. My understanding is that ctor/dtor are generated 
> > > once per TU. I suspect "incremental processing" may change that, but I 
> > > have no idea what exactly does it do.
> > A CUDA ctor/dtor is generated for every llvm::Module. The TU can also be
> > composed of many modules. In our interpreter, we add new code to our AST
> > with new modules at runtime.
> > The ctor/dtor generation depends on the fatbinary code. The CodeGen checks
> > whether a path to a fatbinary file is set. If it is, it generates a ctor
> > with at least a __cudaRegisterFatBinary() function call. So the generation
> > is independent of the source code in the module, and we can use any
> > statement. A statement can be an expression, a declaration, a definition,
> > and so on.
> I still don't understand how it's going to work. Do you have some sort of 
> design document outlining how the interpreter is going to work with CUDA?
> 
> The purpose of the ctor/dtor is to stitch together host-side kernel launch 
> with the GPU-side kernel binary which resides in the GPU binary created by 
> device-side compilation. 
> 
> So, the question #1 -- if you pass GPU-side binary to the compiler, where did 
> you get it? Normally it's the result of device-side compilation of the same 
> TU. In your case it's not quite clear what exactly would that be, if you feed 
> the source to the compiler incrementally. I.e. do you somehow recompile 
> everything we've seen on device side so far for each new chunk of host-side 
> source you feed to the compiler? 
> 
> Next question is -- assuming that device side does have correct GPU-side 
> binary, when do you call those ctors/dtors? JIT model does not quite fit the 
> assumptions that drive regular CUDA compilation.
> 
> Let's consider this:
> ```
> __global__ void foo();
> __global__ void bar();
> 
> // If that's all we've  fed to compiler so far, we have no GPU code yet, so 
> there 
> // should be no fatbin file. If we do have it, what's in it?
> 
> void launch() {
>   foo<<<1,1>>>();
>   bar<<<1,1>>>();
> }
> // If you've generated ctors/dtors at this point they would be 
> // useless as no GPU code exists in the preceding code.
> 
> __global__ void foo() {}
> // Now we'd have some GPU code, but how can we need to retrofit it into 
> // all the ctors/dtors we've generated before. 
> __global__ void bar() {}
> // Does bar end up in its own fatbinary? Or is it combined into a new 
> // fatbin which contains both boo and bar?
> // If it's a new fatbin, you somehow need to update existing ctors/dtors, 
> // unless you want to leak CUDA resources fast.
> // If it's a separate fatbin, then you will need to at the very least change 
> the way 
> // ctors/dtors are generated by the 'launch' function, because now they need 
> to 
> // tie each kernel launch to a different fatbin.
> 
> ```
> 
> It looks to me that if you want to JIT CUDA code you will need to take over 
> GPU-side kernel management.
> ctors/dtors do that for full-TU compilation, but they rely on device-side 
> code being compiled and available during host-side compilation. For JIT, the 
> interpreter should be in charge of registering new kernels with the CUDA 
> runtime and unregistering/unloading them when a kernel goes away. This makes 
> ctors/dtors completely irrelevant.
At the moment there is no documentation because we are still developing the
feature. I will try to describe how it works.

The device-side compilation works with a second compiler (a normal clang),
which we start via syscall. In the interpreter, we check whether the input
line is a kernel definition or a kernel launch. Then we write the source code
to a file and compile it with clang to a PCH file. The PCH file is then
compiled to PTX and then to a fatbin. If we add a new kernel, we send the
source code together with the existing PCH file to the clang compiler. This
way we easily extend the AST and generate a PTX file with all defined kernels.

An implementation of this feature can be seen in my prototype: 

Running the ctor/dtor isn't hard. I look up the JITSymbol and generate a
function pointer. Then I can simply run it. This feature can also be seen in
my prototype. So we can run the ctor when new fatbin code is generated, and
the dtor beforehand if code was already registered. The CUDA runtime also
provides the possibility to run the (un)register functions many times.
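
For reference, a minimal sketch of looking up such a JIT'ed ctor and running
it (assuming a recent LLVM ORC LLJIT; this is illustrative, not Cling's
actual code):

```
#include "llvm/ExecutionEngine/Orc/LLJIT.h"

// Look up a JIT'ed ctor such as "__cuda_module_ctor_<module-name>" and run it.
llvm::Error runCUDAModuleCtor(llvm::orc::LLJIT &JIT, llvm::StringRef CtorName) {
  auto Sym = JIT.lookup(CtorName);        // resolve the symbol address
  if (!Sym)
    return Sym.takeError();
  auto *Ctor = Sym->toPtr<void (*)()>();  // convert the address to a pointer
  Ctor();                                 // runs __cudaRegisterFatBinary() etc.
  return llvm::Error::success();
}
```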


[PATCH] D44435: Add the module name to __cuda_module_ctor and __cuda_module_dtor for unique function names

2018-03-16 Thread Simeon Ehrig via Phabricator via cfe-commits
SimeonEhrig added inline comments.



Comment at: unittests/CodeGen/IncrementalProcessingTest.cpp:176-178
+
+// In CUDA incremental processing, a CUDA ctor or dtor will be generated for 
+// every statement if a fatbinary file exists.

tra wrote:
> SimeonEhrig wrote:
> > tra wrote:
> > > SimeonEhrig wrote:
> > > > tra wrote:
> > > > > I don't understand the comment. What is 'CUDA incremental processing' 
> > > > > and what exactly is meant by 'statement' here? I'd appreciate if you 
> > > > > could give me more details. My understanding is that ctor/dtor are 
> > > > > generated once per TU. I suspect "incremental processing" may change 
> > > > > that, but I have no idea what exactly does it do.
> > > > A CUDA ctor/dtor is generated for every llvm::Module. The TU can also
> > > > be composed of many modules. In our interpreter, we add new code to
> > > > our AST with new modules at runtime.
> > > > The ctor/dtor generation depends on the fatbinary code. The CodeGen
> > > > checks whether a path to a fatbinary file is set. If it is, it
> > > > generates a ctor with at least a __cudaRegisterFatBinary() function
> > > > call. So the generation is independent of the source code in the
> > > > module, and we can use any statement. A statement can be an
> > > > expression, a declaration, a definition, and so on.
> > > I still don't understand how it's going to work. Do you have some sort of 
> > > design document outlining how the interpreter is going to work with CUDA?
> > > 
> > > The purpose of the ctor/dtor is to stitch together host-side kernel 
> > > launch with the GPU-side kernel binary which resides in the GPU binary 
> > > created by device-side compilation. 
> > > 
> > > So, the question #1 -- if you pass GPU-side binary to the compiler, where 
> > > did you get it? Normally it's the result of device-side compilation of 
> > > the same TU. In your case it's not quite clear what exactly would that 
> > > be, if you feed the source to the compiler incrementally. I.e. do you 
> > > somehow recompile everything we've seen on device side so far for each 
> > > new chunk of host-side source you feed to the compiler? 
> > > 
> > > Next question is -- assuming that device side does have correct GPU-side 
> > > binary, when do you call those ctors/dtors? JIT model does not quite fit 
> > > the assumptions that drive regular CUDA compilation.
> > > 
> > > Let's consider this:
> > > ```
> > > __global__ void foo();
> > > __global__ void bar();
> > > 
> > > // If that's all we've  fed to compiler so far, we have no GPU code yet, 
> > > so there 
> > > // should be no fatbin file. If we do have it, what's in it?
> > > 
> > > void launch() {
> > >   foo<<<1,1>>>();
> > >   bar<<<1,1>>>();
> > > }
> > > // If you've generated ctors/dtors at this point they would be 
> > > // useless as no GPU code exists in the preceding code.
> > > 
> > > __global__ void foo() {}
> > > // Now we'd have some GPU code, but how can we need to retrofit it into 
> > > // all the ctors/dtors we've generated before. 
> > > __global__ void bar() {}
> > > // Does bar end up in its own fatbinary? Or is it combined into a new 
> > > // fatbin which contains both boo and bar?
> > > // If it's a new fatbin, you somehow need to update existing ctors/dtors, 
> > > // unless you want to leak CUDA resources fast.
> > > // If it's a separate fatbin, then you will need to at the very least 
> > > change the way 
> > > // ctors/dtors are generated by the 'launch' function, because now they 
> > > need to 
> > > // tie each kernel launch to a different fatbin.
> > > 
> > > ```
> > > 
> > > It looks to me that if you want to JIT CUDA code you will need to take 
> > > over GPU-side kernel management.
> > > ctors/dtors do that for full-TU compilation, but they rely on device-side 
> > > code being compiled and available during host-side compilation. For JIT, 
> > > the interpreter should be in charge of registering new kernels with the 
> > > CUDA runtime and unregistering/unloading them when a kernel goes away. 
> > > This makes ctors/dtors completely irrelevant.
> > At the moment there is no documentation because we are still developing
> > the feature. I will try to describe how it works.
> > 
> > The device-side compilation works with a second compiler (a normal clang),
> > which we start via syscall. In the interpreter, we check whether the input
> > line is a kernel definition or a kernel launch. Then we write the source
> > code to a file and compile it with clang to a PCH file. The PCH file is
> > then compiled to PTX and then to a fatbin. If we add a new kernel, we send
> > the source code together with the existing PCH file to the clang compiler.
> > This way we easily extend the AST and generate a PTX file with all defined
> > kernels.
> > 
> > An implementation of this feature can be seen in my prototype: 
> > 
> > Running the ctor/dtor isn't hard. I look up the JITSymbol and generate a
> > function pointer.

[PATCH] D44435: CUDA ctor/dtor Module-Unique Symbol Name

2018-04-18 Thread Simeon Ehrig via Phabricator via cfe-commits
SimeonEhrig updated this revision to Diff 142921.
SimeonEhrig added a comment.

Thank you everyone for your review comments!

We addressed the inline comments and improved the description of the change set 
for clarity and context.
Tests are updated as well.

This now implements the same fix as previously reviewed in 
https://reviews.llvm.org/D34059, but just for CUDA.


https://reviews.llvm.org/D44435

Files:
  lib/CodeGen/CGCUDANV.cpp
  unittests/CodeGen/IncrementalProcessingTest.cpp

Index: unittests/CodeGen/IncrementalProcessingTest.cpp
===
--- unittests/CodeGen/IncrementalProcessingTest.cpp
+++ unittests/CodeGen/IncrementalProcessingTest.cpp
@@ -21,9 +21,11 @@
 #include "llvm/IR/Module.h"
 #include "llvm/Support/Host.h"
 #include "llvm/Support/MemoryBuffer.h"
+#include "llvm/Target/TargetOptions.h"
 #include "gtest/gtest.h"
 
 #include <memory>
+#include <array>
 
 using namespace llvm;
 using namespace clang;
@@ -171,4 +173,122 @@
 
 }
 
+
+// In CUDA incremental processing, a CUDA ctor or dtor will be generated for
+// every statement if a fatbinary file exists.
+const char CUDATestProgram1[] =
+"void cudaFunc1(){}\n";
+
+const char CUDATestProgram2[] =
+"void cudaFunc2(){}\n";
+
+const Function* getCUDActor(llvm::Module& M) {
+  for (const auto& Func: M)
+if (Func.hasName() && Func.getName().startswith("__cuda_module_ctor_"))
+  return &Func;
+
+  return nullptr;
+}
+
+const Function* getCUDAdtor(llvm::Module& M) {
+  for (const auto& Func: M)
+if (Func.hasName() && Func.getName().startswith("__cuda_module_dtor_"))
+  return &Func;
+
+  return nullptr;
+}
+
+TEST(IncrementalProcessing, EmitCUDAGlobalInitFunc) {
+LLVMContext Context;
+CompilerInstance compiler;
+
+compiler.createDiagnostics();
+compiler.getLangOpts().CPlusPlus = 1;
+compiler.getLangOpts().CPlusPlus11 = 1;
+compiler.getLangOpts().CUDA = 1;
+
+compiler.getTargetOpts().Triple = llvm::Triple::normalize(
+llvm::sys::getProcessTriple());
+compiler.setTarget(clang::TargetInfo::CreateTargetInfo(
+  compiler.getDiagnostics(),
+  std::make_shared<clang::TargetOptions>(
+compiler.getTargetOpts())));
+
+// To enable the generation of CUDA host code, the auxTriple needs to be
+// set up.
+llvm::Triple hostTriple(llvm::sys::getProcessTriple());
+compiler.getFrontendOpts().AuxTriple =
+hostTriple.isArch64Bit() ? "nvptx64-nvidia-cuda" : "nvptx-nvidia-cuda";
+auto targetOptions = std::make_shared<clang::TargetOptions>();
+targetOptions->Triple = compiler.getFrontendOpts().AuxTriple;
+targetOptions->HostTriple = compiler.getTarget().getTriple().str();
+compiler.setAuxTarget(clang::TargetInfo::CreateTargetInfo(
+compiler.getDiagnostics(), targetOptions));
+
+// A fatbinary file is necessary so that the code generator generates the
+// ctor and dtor.
+auto tmpFatbinFileOrError = llvm::sys::fs::TempFile::create("dummy.fatbin");
+ASSERT_TRUE((bool)tmpFatbinFileOrError);
+auto tmpFatbinFile = std::move(*tmpFatbinFileOrError);
+compiler.getCodeGenOpts().CudaGpuBinaryFileName = tmpFatbinFile.TmpName;
+
+compiler.createFileManager();
+compiler.createSourceManager(compiler.getFileManager());
+compiler.createPreprocessor(clang::TU_Prefix);
+compiler.getPreprocessor().enableIncrementalProcessing();
+
+compiler.createASTContext();
+
+CodeGenerator* CG =
+CreateLLVMCodeGen(
+compiler.getDiagnostics(),
+"main-module",
+compiler.getHeaderSearchOpts(),
+compiler.getPreprocessorOpts(),
+compiler.getCodeGenOpts(),
+Context);
+
+compiler.setASTConsumer(std::unique_ptr<ASTConsumer>(CG));
+compiler.createSema(clang::TU_Prefix, nullptr);
+Sema& S = compiler.getSema();
+
+std::unique_ptr<Parser> ParseOP(new Parser(S.getPreprocessor(), S,
+   /*SkipFunctionBodies*/ false));
+Parser &P = *ParseOP.get();
+
+std::array<std::unique_ptr<llvm::Module>, 3> M;
+M[0] = IncrementalParseAST(compiler, P, *CG, nullptr);
+ASSERT_TRUE(M[0]);
+
+M[1] = IncrementalParseAST(compiler, P, *CG, CUDATestProgram1);
+ASSERT_TRUE(M[1]);
+ASSERT_TRUE(M[1]->getFunction("_Z9cudaFunc1v"));
+
+M[2] = IncrementalParseAST(compiler, P, *CG, CUDATestProgram2);
+ASSERT_TRUE(M[2]);
+ASSERT_TRUE(M[2]->getFunction("_Z9cudaFunc2v"));
+// First code should not end up in second module:
+ASSERT_FALSE(M[2]->getFunction("_Z9cudaFunc1v"));
+
+// Make sure that the CUDA ctors and dtors exist:
+const Function* CUDActor1 = getCUDActor(*M[1]);
+ASSERT_TRUE(CUDActor1);
+
+const Function* CUDActor2 = getCUDActor(*M[2]);
+ASSERT_TRUE(CUDActor2);
+
+const Function* CUDAdtor1 = getCUDAdtor(*M[1]);
+ASSERT_TRUE(CUDAdtor1);
+
+const Function* CUDAdtor2 = getCUDAdtor(*M[2]);
+ASSERT_TRUE(CUDAdtor2);
+
+// Compare the names of the ctors and dtors to check that they are
+// unique.

[PATCH] D44435: CUDA ctor/dtor Module-Unique Symbol Name

2018-04-18 Thread Simeon Ehrig via Phabricator via cfe-commits
SimeonEhrig marked 2 inline comments as done.
SimeonEhrig added inline comments.



Comment at: lib/CodeGen/CGCUDANV.cpp:358
+  if (ModuleName.empty())
+ModuleName = "";
+

rjmccall wrote:
> This doesn't actually seem more useful than the empty string.
We improved the implementation. If there is no module name, it will not append 
any suffix and the symbol is just '__cuda_module_ctor'.
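
A minimal sketch of that naming scheme (illustrative only, not the exact
patch code):

```
#include "llvm/ADT/SmallString.h"
#include "llvm/ADT/StringRef.h"

// Build "__cuda_module_ctor" or "__cuda_module_ctor_<module-name>".
llvm::SmallString<128> makeCtorName(llvm::StringRef ModuleName) {
  llvm::SmallString<128> CtorName("__cuda_module_ctor");
  if (!ModuleName.empty()) {
    CtorName.append("_");
    CtorName.append(ModuleName);
  }
  return CtorName;
}
```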


https://reviews.llvm.org/D44435



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D44435: CUDA ctor/dtor Module-Unique Symbol Name

2018-04-20 Thread Simeon Ehrig via Phabricator via cfe-commits
SimeonEhrig updated this revision to Diff 143246.
SimeonEhrig added a comment.

Add full context with -U99 to diff.


https://reviews.llvm.org/D44435

Files:
  lib/CodeGen/CGCUDANV.cpp
  unittests/CodeGen/IncrementalProcessingTest.cpp

Index: unittests/CodeGen/IncrementalProcessingTest.cpp
===
--- unittests/CodeGen/IncrementalProcessingTest.cpp
+++ unittests/CodeGen/IncrementalProcessingTest.cpp
@@ -21,9 +21,11 @@
 #include "llvm/IR/Module.h"
 #include "llvm/Support/Host.h"
 #include "llvm/Support/MemoryBuffer.h"
+#include "llvm/Target/TargetOptions.h"
 #include "gtest/gtest.h"
 
 #include <memory>
+#include <array>
 
 using namespace llvm;
 using namespace clang;
@@ -171,4 +173,122 @@
 
 }
 
+
+// In CUDA incremental processing, a CUDA ctor or dtor will be generated for
+// every statement if a fatbinary file exists.
+const char CUDATestProgram1[] =
+"void cudaFunc1(){}\n";
+
+const char CUDATestProgram2[] =
+"void cudaFunc2(){}\n";
+
+const Function* getCUDActor(llvm::Module& M) {
+  for (const auto& Func: M)
+if (Func.hasName() && Func.getName().startswith("__cuda_module_ctor_"))
+  return &Func;
+
+  return nullptr;
+}
+
+const Function* getCUDAdtor(llvm::Module& M) {
+  for (const auto& Func: M)
+if (Func.hasName() && Func.getName().startswith("__cuda_module_dtor_"))
+  return &Func;
+
+  return nullptr;
+}
+
+TEST(IncrementalProcessing, EmitCUDAGlobalInitFunc) {
+LLVMContext Context;
+CompilerInstance compiler;
+
+compiler.createDiagnostics();
+compiler.getLangOpts().CPlusPlus = 1;
+compiler.getLangOpts().CPlusPlus11 = 1;
+compiler.getLangOpts().CUDA = 1;
+
+compiler.getTargetOpts().Triple = llvm::Triple::normalize(
+llvm::sys::getProcessTriple());
+compiler.setTarget(clang::TargetInfo::CreateTargetInfo(
+  compiler.getDiagnostics(),
+  std::make_shared<clang::TargetOptions>(
+compiler.getTargetOpts())));
+
+// To enable the generation of CUDA host code, the auxTriple needs to be
+// set up.
+llvm::Triple hostTriple(llvm::sys::getProcessTriple());
+compiler.getFrontendOpts().AuxTriple =
+hostTriple.isArch64Bit() ? "nvptx64-nvidia-cuda" : "nvptx-nvidia-cuda";
+auto targetOptions = std::make_shared<clang::TargetOptions>();
+targetOptions->Triple = compiler.getFrontendOpts().AuxTriple;
+targetOptions->HostTriple = compiler.getTarget().getTriple().str();
+compiler.setAuxTarget(clang::TargetInfo::CreateTargetInfo(
+compiler.getDiagnostics(), targetOptions));
+
+// A fatbinary file is necessary so that the code generator generates the
+// ctor and dtor.
+auto tmpFatbinFileOrError = llvm::sys::fs::TempFile::create("dummy.fatbin");
+ASSERT_TRUE((bool)tmpFatbinFileOrError);
+auto tmpFatbinFile = std::move(*tmpFatbinFileOrError);
+compiler.getCodeGenOpts().CudaGpuBinaryFileName = tmpFatbinFile.TmpName;
+
+compiler.createFileManager();
+compiler.createSourceManager(compiler.getFileManager());
+compiler.createPreprocessor(clang::TU_Prefix);
+compiler.getPreprocessor().enableIncrementalProcessing();
+
+compiler.createASTContext();
+
+CodeGenerator* CG =
+CreateLLVMCodeGen(
+compiler.getDiagnostics(),
+"main-module",
+compiler.getHeaderSearchOpts(),
+compiler.getPreprocessorOpts(),
+compiler.getCodeGenOpts(),
+Context);
+
+compiler.setASTConsumer(std::unique_ptr<ASTConsumer>(CG));
+compiler.createSema(clang::TU_Prefix, nullptr);
+Sema& S = compiler.getSema();
+
+std::unique_ptr<Parser> ParseOP(new Parser(S.getPreprocessor(), S,
+   /*SkipFunctionBodies*/ false));
+Parser &P = *ParseOP.get();
+
+std::array<std::unique_ptr<llvm::Module>, 3> M;
+M[0] = IncrementalParseAST(compiler, P, *CG, nullptr);
+ASSERT_TRUE(M[0]);
+
+M[1] = IncrementalParseAST(compiler, P, *CG, CUDATestProgram1);
+ASSERT_TRUE(M[1]);
+ASSERT_TRUE(M[1]->getFunction("_Z9cudaFunc1v"));
+
+M[2] = IncrementalParseAST(compiler, P, *CG, CUDATestProgram2);
+ASSERT_TRUE(M[2]);
+ASSERT_TRUE(M[2]->getFunction("_Z9cudaFunc2v"));
+// First code should not end up in second module:
+ASSERT_FALSE(M[2]->getFunction("_Z9cudaFunc1v"));
+
+// Make sure that the CUDA ctors and dtors exist:
+const Function* CUDActor1 = getCUDActor(*M[1]);
+ASSERT_TRUE(CUDActor1);
+
+const Function* CUDActor2 = getCUDActor(*M[2]);
+ASSERT_TRUE(CUDActor2);
+
+const Function* CUDAdtor1 = getCUDAdtor(*M[1]);
+ASSERT_TRUE(CUDAdtor1);
+
+const Function* CUDAdtor2 = getCUDAdtor(*M[2]);
+ASSERT_TRUE(CUDAdtor2);
+
+// Compare the names of the ctors and dtors to check that they are
+// unique.
+ASSERT_FALSE(CUDActor1->getName() == CUDActor2->getName());
+ASSERT_FALSE(CUDAdtor1->getName() == CUDAdtor2->getName());
+
+ASSERT_FALSE((bool)tmpFatbinFile.discard());
+}
+
 } // end anonymous namespace
Index: lib/CodeGen/CGCUDANV.cpp

[PATCH] D44435: CUDA ctor/dtor Module-Unique Symbol Name

2018-04-20 Thread Simeon Ehrig via Phabricator via cfe-commits
SimeonEhrig added inline comments.



Comment at: lib/CodeGen/CGCUDANV.cpp:287
+CtorSuffix.append("_");
+CtorSuffix.append(ModuleName);
+  }

tra wrote:
> There is a general problem with this approach. File name can contain the 
> characters that PTX does not allow.
> We currently only deal with '.' and '@', but that's not enough here.
> You may want to either mangle the name somehow to avoid/convert illegal 
> characters or use some other way to provide unique suffix. Hex-encoded hash 
> of the file name would avoid this problem, for example.
> 
> 
> 
Maybe I'm wrong, but I think that should be no problem, because the generation
of a CUDA ctor/dtor has nothing to do with the PTX generation.

The function 'makeModuleCtorFunction' should just generate LLVM IR code for
the host (e.g. x86_64).

If I'm wrong, could you please tell me where in the source code
'makeModuleCtorFunction' affects the PTX generation?


https://reviews.llvm.org/D44435



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D44435: CUDA ctor/dtor Module-Unique Symbol Name

2018-04-24 Thread Simeon Ehrig via Phabricator via cfe-commits
SimeonEhrig updated this revision to Diff 143706.
SimeonEhrig added a comment.

Add a comment which explains the need for a unique ctor/dtor name.


https://reviews.llvm.org/D44435

Files:
  lib/CodeGen/CGCUDANV.cpp
  unittests/CodeGen/IncrementalProcessingTest.cpp

Index: unittests/CodeGen/IncrementalProcessingTest.cpp
===
--- unittests/CodeGen/IncrementalProcessingTest.cpp
+++ unittests/CodeGen/IncrementalProcessingTest.cpp
@@ -21,9 +21,11 @@
 #include "llvm/IR/Module.h"
 #include "llvm/Support/Host.h"
 #include "llvm/Support/MemoryBuffer.h"
+#include "llvm/Target/TargetOptions.h"
 #include "gtest/gtest.h"
 
 #include <memory>
+#include <array>
 
 using namespace llvm;
 using namespace clang;
@@ -171,4 +173,122 @@
 
 }
 
+
+// In CUDA incremental processing, a CUDA ctor or dtor will be generated for
+// every statement if a fatbinary file exists.
+const char CUDATestProgram1[] =
+"void cudaFunc1(){}\n";
+
+const char CUDATestProgram2[] =
+"void cudaFunc2(){}\n";
+
+const Function* getCUDActor(llvm::Module& M) {
+  for (const auto& Func: M)
+if (Func.hasName() && Func.getName().startswith("__cuda_module_ctor_"))
+  return &Func;
+
+  return nullptr;
+}
+
+const Function* getCUDAdtor(llvm::Module& M) {
+  for (const auto& Func: M)
+if (Func.hasName() && Func.getName().startswith("__cuda_module_dtor_"))
+  return &Func;
+
+  return nullptr;
+}
+
+TEST(IncrementalProcessing, EmitCUDAGlobalInitFunc) {
+LLVMContext Context;
+CompilerInstance compiler;
+
+compiler.createDiagnostics();
+compiler.getLangOpts().CPlusPlus = 1;
+compiler.getLangOpts().CPlusPlus11 = 1;
+compiler.getLangOpts().CUDA = 1;
+
+compiler.getTargetOpts().Triple = llvm::Triple::normalize(
+llvm::sys::getProcessTriple());
+compiler.setTarget(clang::TargetInfo::CreateTargetInfo(
+  compiler.getDiagnostics(),
+  std::make_shared<clang::TargetOptions>(
+compiler.getTargetOpts())));
+
+// To enable the generation of CUDA host code, the auxTriple needs to be
+// set up.
+llvm::Triple hostTriple(llvm::sys::getProcessTriple());
+compiler.getFrontendOpts().AuxTriple =
+hostTriple.isArch64Bit() ? "nvptx64-nvidia-cuda" : "nvptx-nvidia-cuda";
+auto targetOptions = std::make_shared<clang::TargetOptions>();
+targetOptions->Triple = compiler.getFrontendOpts().AuxTriple;
+targetOptions->HostTriple = compiler.getTarget().getTriple().str();
+compiler.setAuxTarget(clang::TargetInfo::CreateTargetInfo(
+compiler.getDiagnostics(), targetOptions));
+
+// A fatbinary file is necessary so that the code generator generates the
+// ctor and dtor.
+auto tmpFatbinFileOrError = llvm::sys::fs::TempFile::create("dummy.fatbin");
+ASSERT_TRUE((bool)tmpFatbinFileOrError);
+auto tmpFatbinFile = std::move(*tmpFatbinFileOrError);
+compiler.getCodeGenOpts().CudaGpuBinaryFileName = tmpFatbinFile.TmpName;
+
+compiler.createFileManager();
+compiler.createSourceManager(compiler.getFileManager());
+compiler.createPreprocessor(clang::TU_Prefix);
+compiler.getPreprocessor().enableIncrementalProcessing();
+
+compiler.createASTContext();
+
+CodeGenerator* CG =
+CreateLLVMCodeGen(
+compiler.getDiagnostics(),
+"main-module",
+compiler.getHeaderSearchOpts(),
+compiler.getPreprocessorOpts(),
+compiler.getCodeGenOpts(),
+Context);
+
+compiler.setASTConsumer(std::unique_ptr<ASTConsumer>(CG));
+compiler.createSema(clang::TU_Prefix, nullptr);
+Sema& S = compiler.getSema();
+
+std::unique_ptr<Parser> ParseOP(new Parser(S.getPreprocessor(), S,
+   /*SkipFunctionBodies*/ false));
+Parser &P = *ParseOP.get();
+
+std::array<std::unique_ptr<llvm::Module>, 3> M;
+M[0] = IncrementalParseAST(compiler, P, *CG, nullptr);
+ASSERT_TRUE(M[0]);
+
+M[1] = IncrementalParseAST(compiler, P, *CG, CUDATestProgram1);
+ASSERT_TRUE(M[1]);
+ASSERT_TRUE(M[1]->getFunction("_Z9cudaFunc1v"));
+
+M[2] = IncrementalParseAST(compiler, P, *CG, CUDATestProgram2);
+ASSERT_TRUE(M[2]);
+ASSERT_TRUE(M[2]->getFunction("_Z9cudaFunc2v"));
+// First code should not end up in second module:
+ASSERT_FALSE(M[2]->getFunction("_Z9cudaFunc1v"));
+
+// Make sure that the CUDA ctors and dtors exist:
+const Function* CUDActor1 = getCUDActor(*M[1]);
+ASSERT_TRUE(CUDActor1);
+
+const Function* CUDActor2 = getCUDActor(*M[2]);
+ASSERT_TRUE(CUDActor2);
+
+const Function* CUDAdtor1 = getCUDAdtor(*M[1]);
+ASSERT_TRUE(CUDAdtor1);
+
+const Function* CUDAdtor2 = getCUDAdtor(*M[2]);
+ASSERT_TRUE(CUDAdtor2);
+
+// Compare the names of the ctors and dtors to check that they are
+// unique.
+ASSERT_FALSE(CUDActor1->getName() == CUDActor2->getName());
+ASSERT_FALSE(CUDAdtor1->getName() == CUDAdtor2->getName());
+
+ASSERT_FALSE((bool)tmpFatbinFile.discard());
+}
+
 } // end anonymous namespace

[PATCH] D44435: CUDA ctor/dtor Module-Unique Symbol Name

2018-04-25 Thread Simeon Ehrig via Phabricator via cfe-commits
SimeonEhrig added inline comments.



Comment at: lib/CodeGen/CGCUDANV.cpp:287
+CtorSuffix.append("_");
+CtorSuffix.append(ModuleName);
+  }

tra wrote:
> SimeonEhrig wrote:
> > tra wrote:
> > > There is a general problem with this approach. File name can contain the 
> > > characters that PTX does not allow.
> > > We currently only deal with '.' and '@', but that's not enough here.
> > > You may want to either mangle the name somehow to avoid/convert illegal 
> > > characters or use some other way to provide unique suffix. Hex-encoded 
> > > hash of the file name would avoid this problem, for example.
> > > 
> > > 
> > > 
> > Maybe I'm wrong, but I think that should be no problem, because the
> > generation of a CUDA ctor/dtor has nothing to do with the PTX generation.
> > 
> > The function 'makeModuleCtorFunction' should just generate LLVM IR code
> > for the host (e.g. x86_64).
> > 
> > If I'm wrong, could you please tell me where in the source code
> > 'makeModuleCtorFunction' affects the PTX generation?
> You are correct that PTX is irrelevant here. I've completely missed that this 
> will be generated for the host, which is more forgiving. 
> 
> That said, I'm still not completely sure whether we're guaranteed that using 
> arbitrary characters in a symbol name is OK on x86 and, potentially, other 
> host platforms. As an experiment, try using a module which has a space in its 
> name.
At lines 295 and 380 in CGCUDANV.cpp I use a sanitizer function which replaces
all characters outside [a-zA-Z0-9._] with a '_'. It's the same solution as in
D34059. So I think it would work in general.

Just for information: I tested it with a module name that includes a
whitespace, and without the sanitizer. It works on Linux x86 with the ELF
format. There was a whitespace in the symbol of the CUDA module ctor (I
checked it with readelf).

In general, do you think my solution approach is technically okay? Your answer
will be really helpful for internal usage in our cling project. At the moment
I am developing the cling CUDA interpreter based on this patch, and it would
help a lot if I could say that the patch doesn't cause any problems with the
CUDA environment.
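
For reference, a sanitizer of that shape could look like the following sketch
(assuming the same [a-zA-Z0-9._] whitelist; not the exact patch code):

```
#include <cctype>
#include <string>

// Replace every character outside [a-zA-Z0-9._] with '_'.
std::string sanitizeSymbolName(std::string Name) {
  for (char &C : Name)
    if (!std::isalnum(static_cast<unsigned char>(C)) && C != '.' && C != '_')
      C = '_';
  return Name;
}
```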


https://reviews.llvm.org/D44435



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D44435: CUDA ctor/dtor Module-Unique Symbol Name

2018-05-07 Thread Simeon Ehrig via Phabricator via cfe-commits
SimeonEhrig added a comment.

In https://reviews.llvm.org/D44435#1088019, @tra wrote:

> Perhaps we should take a step back and consider whether this is the right 
> approach to solve your problem.
>
> If I understand it correctly, the real issue is that you repeatedly recompile 
> the same module and cling will only use the function from the first module 
> it's seen it in. Unlike regular functions that presumably remain the same in 
> all the modules they are present in, CUDA constructors do change and you need 
> cling to grab the one from the most recent module.
>
> This patch deals with the issue by attempting to add a unique suffix. 
> Presumably cling will then generate some sort of unique module name and will 
> get unique constructor name in return. The down side of this approach is that 
> module name is something that is derived from the file name and the 
> functionality you're changing is in the shared code, so you need to make sure 
> that whatever you implement makes sense for LLVM in general and that it does 
> what it claims it does. AFAICT, LLVM has no pressing need for the unique 
> constructor name -- it's a function with internal linkage and, if we ever 
> need to generate more than one, LLVM is capable of generating unique names 
> within the module all by itself. The patch currently does not fulfill the 
> "unique" part either.
>
> Perhaps you should consider a different approach which could handle the issue 
> completely in cling. E.g. You could rename the constructor in the module's IR 
> before passing it to JIT. Or you could rename it in PTX (it's just text after 
> all) before passing it to driver or PTXAS.


You are right, the clang commit is not the best solution. So we searched for
another solution and found one. The solution is similar to your suggestion. We
found a way to integrate an LLVM module pass which detects the symbols
`__cuda_module_ctor` and `__cuda_module_dtor` and appends the module name to
the symbols before the LLVM IR is handed on to code generation. So we were
able to move the solution from clang to cling, which is better for both
projects.
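
A minimal sketch of such a pass (assuming the legacy ModulePass interface of
that LLVM era; the class name is illustrative, not Cling's actual code):

```
#include "llvm/IR/Function.h"
#include "llvm/IR/Module.h"
#include "llvm/Pass.h"

namespace {
// Renames __cuda_module_ctor/__cuda_module_dtor by appending the module name,
// so every incrementally JIT'ed module gets unique registration symbols.
struct UniqueCUDACtorDtorNames : public llvm::ModulePass {
  static char ID;
  UniqueCUDACtorDtorNames() : llvm::ModulePass(ID) {}

  bool runOnModule(llvm::Module &M) override {
    bool Changed = false;
    for (llvm::Function &F : M)
      if (F.getName() == "__cuda_module_ctor" ||
          F.getName() == "__cuda_module_dtor") {
        F.setName(F.getName() + "_" + M.getName());
        Changed = true;
      }
    return Changed;
  }
};
} // namespace

char UniqueCUDACtorDtorNames::ID = 0;
```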


https://reviews.llvm.org/D44435



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D44435: Add the module name to __cuda_module_ctor and __cuda_module_dtor for unique function names

2018-03-20 Thread Simeon Ehrig via Phabricator via cfe-commits
SimeonEhrig added inline comments.



Comment at: unittests/CodeGen/IncrementalProcessingTest.cpp:176-178
+
+// In CUDA incremental processing, a CUDA ctor or dtor will be generated for 
+// every statement if a fatbinary file exists.

tra wrote:
> SimeonEhrig wrote:
> > tra wrote:
> > > SimeonEhrig wrote:
> > > > tra wrote:
> > > > > SimeonEhrig wrote:
> > > > > > tra wrote:
> > > > > > > I don't understand the comment. What is 'CUDA incremental 
> > > > > > > processing' and what exactly is meant by 'statement' here? I'd 
> > > > > > > appreciate if you could give me more details. My understanding is 
> > > > > > > that ctor/dtor are generated once per TU. I suspect "incremental 
> > > > > > > processing" may change that, but I have no idea what exactly does 
> > > > > > > it do.
> > > > > > A CUDA ctor/dtor is generated for every llvm::Module. The TU can
> > > > > > also be composed of many modules. In our interpreter, we add new
> > > > > > code to our AST with new modules at runtime.
> > > > > > The ctor/dtor generation depends on the fatbinary code. The
> > > > > > CodeGen checks whether a path to a fatbinary file is set. If it
> > > > > > is, it generates a ctor with at least a __cudaRegisterFatBinary()
> > > > > > function call. So the generation is independent of the source code
> > > > > > in the module, and we can use any statement. A statement can be an
> > > > > > expression, a declaration, a definition, and so on.
> > > > > I still don't understand how it's going to work. Do you have some 
> > > > > sort of design document outlining how the interpreter is going to 
> > > > > work with CUDA?
> > > > > 
> > > > > The purpose of the ctor/dtor is to stitch together host-side kernel 
> > > > > launch with the GPU-side kernel binary which resides in the GPU 
> > > > > binary created by device-side compilation. 
> > > > > 
> > > > > So, the question #1 -- if you pass GPU-side binary to the compiler, 
> > > > > where did you get it? Normally it's the result of device-side 
> > > > > compilation of the same TU. In your case it's not quite clear what 
> > > > > exactly would that be, if you feed the source to the compiler 
> > > > > incrementally. I.e. do you somehow recompile everything we've seen on 
> > > > > device side so far for each new chunk of host-side source you feed to 
> > > > > the compiler? 
> > > > > 
> > > > > Next question is -- assuming that device side does have correct 
> > > > > GPU-side binary, when do you call those ctors/dtors? JIT model does 
> > > > > not quite fit the assumptions that drive regular CUDA compilation.
> > > > > 
> > > > > Let's consider this:
> > > > > ```
> > > > > __global__ void foo();
> > > > > __global__ void bar();
> > > > > 
> > > > > // If that's all we've  fed to compiler so far, we have no GPU code 
> > > > > yet, so there 
> > > > > // should be no fatbin file. If we do have it, what's in it?
> > > > > 
> > > > > void launch() {
> > > > >   foo<<<1,1>>>();
> > > > >   bar<<<1,1>>>();
> > > > > }
> > > > > // If you've generated ctors/dtors at this point they would be 
> > > > > // useless as no GPU code exists in the preceding code.
> > > > > 
> > > > > __global__ void foo() {}
> > > > > // Now we'd have some GPU code, but how can we need to retrofit it 
> > > > > into 
> > > > > // all the ctors/dtors we've generated before. 
> > > > > __global__ void bar() {}
> > > > > // Does bar end up in its own fatbinary? Or is it combined into a new 
> > > > > // fatbin which contains both boo and bar?
> > > > > // If it's a new fatbin, you somehow need to update existing 
> > > > > ctors/dtors, 
> > > > > // unless you want to leak CUDA resources fast.
> > > > > // If it's a separate fatbin, then you will need to at the very least 
> > > > > change the way 
> > > > > // ctors/dtors are generated by the 'launch' function, because now 
> > > > > they need to 
> > > > > // tie each kernel launch to a different fatbin.
> > > > > 
> > > > > ```
> > > > > 
> > > > > It looks to me that if you want to JIT CUDA code you will need to 
> > > > > take over GPU-side kernel management.
> > > > > ctors/dtors do that for full-TU compilation, but they rely on 
> > > > > device-side code being compiled and available during host-side 
> > > > > compilation. For JIT, the interpreter should be in charge of 
> > > > > registering new kernels with the CUDA runtime and 
> > > > > unregistering/unloading them when a kernel goes away. This makes 
> > > > > ctors/dtors completely irrelevant.
> > > > At the moment there is no documentation because we are still
> > > > developing the feature. I will try to describe how it works.
> > > > 
> > > > The device-side compilation works with a second compiler (a normal
> > > > clang), which we start via syscall. In the interpreter, we check
> > > > whether the input line is a kernel definition or a kernel launch. Then
> > > > we write the source code to a file and compile it with clang to a
> > > > PCH file.

[PATCH] D146389: [clang-repl][CUDA] Initial interactive CUDA support for clang-repl

2023-04-04 Thread Simeon Ehrig via Phabricator via cfe-commits
SimeonEhrig added inline comments.



Comment at: clang/tools/clang-repl/ClangRepl.cpp:135
+std::move(CI), std::move(DeviceCI), OffloadArch,
+"/tmp/clang-repl.fatbin"));
+

v.g.vassilev wrote:
> To cover the case where platforms have no `/tmp` we could use 
> `fs::createTemporaryFile`. However, some platforms have read-only file 
> systems. What do we do there?
Actually, we can avoid temporary files completely. The reason why the fatbinary 
code is written to a file is the following code in the code generator of the 
CUDA runtime functions:

https://github.com/llvm/llvm-project/blob/d9d840cdaf51a9795930750d1b91d614a3849137/clang/lib/CodeGen/CGCUDANV.cpp#L722-L732

In the past, I avoided changing the code because it meant an extra Clang patch 
for Cling.

Maybe we can use the LLVM virtual file system: 
https://llvm.org/doxygen/classllvm_1_1vfs_1_1InMemoryFileSystem.html
But this is just an idea. I have no experience with whether this would work for us.
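
A small sketch of that idea (assuming the llvm::vfs API; illustrative only):

```
#include "llvm/Support/MemoryBuffer.h"
#include "llvm/Support/VirtualFileSystem.h"

// Register the generated fatbin bytes under a virtual path, so they can be
// read back without touching the real file system.
llvm::IntrusiveRefCntPtr<llvm::vfs::InMemoryFileSystem>
makeFatbinFS(llvm::StringRef FatbinBytes) {
  auto FS = llvm::makeIntrusiveRefCnt<llvm::vfs::InMemoryFileSystem>();
  FS->addFile("/virtual/clang-repl.fatbin", /*ModificationTime=*/0,
              llvm::MemoryBuffer::getMemBufferCopy(FatbinBytes));
  return FS;
}
```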


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D146389/new/

https://reviews.llvm.org/D146389

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D146389: [clang-repl][CUDA] Initial interactive CUDA support for clang-repl

2023-04-11 Thread Simeon Ehrig via Phabricator via cfe-commits
SimeonEhrig added a comment.

Except for using an in-memory solution for the generated fatbin code, the code
looks good to me.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D146389/new/

https://reviews.llvm.org/D146389

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D146389: [clang-repl][CUDA] Initial interactive CUDA support for clang-repl

2023-04-28 Thread Simeon Ehrig via Phabricator via cfe-commits
SimeonEhrig added a comment.

In D146389#4292984 , @tra wrote:

> lib/CodeGen changes look OK to me.

I can confirm the code change in CodeGen works as expected. `clang-repl` does 
not generate temporary files anymore when a CUDA kernel is compiled.

Compiling a simple CUDA application still works, and saving the generated PTX 
and fatbin code via `clang++ ../helloWorld.cu -o helloWorld 
-L/usr/local/cuda/lib64 -lcudart_static --save-temps` also still works.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D146389/new/

https://reviews.llvm.org/D146389

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits