yaxunl added a comment.

In D99683#2672668 <https://reviews.llvm.org/D99683#2672668>, @tejohnson wrote:
> In D99683#2672578 <https://reviews.llvm.org/D99683#2672578>, @yaxunl wrote:
>
>> In D99683#2672554 <https://reviews.llvm.org/D99683#2672554>, @tejohnson wrote:
>>
>>> This raises some higher-level questions for me:
>>>
>>> First, how will you deal with other corner cases that won't or cannot be imported right now? While enabling importing of noinline functions and cranking up the threshold will get the majority of functions imported, there are cases that we still won't import (functions/vars that are interposable, certain funcs/vars that cannot be renamed, most non-const variables with non-trivial initializers).
>>
>> We will document the limitations of thinLTO support in the HIP toolchain and recommend that users not use thinLTO in those corner cases.
>>
>>> Second, force-importing everything transitively referenced defeats the purpose of ThinLTO and would probably make it worse than regular LTO. The main entry module will need to import everything transitively referenced from there, i.e. everything not dead in the binary, which should make that module post-importing equivalent to a regular LTO module. In addition, every other module needs to transitively import everything referenced from those modules, making them very large depending on how many leaf vs. non-leaf functions and variables they contain. What is the goal of doing ThinLTO in this case?
>>
>> The objective is to improve optimization/codegen time by using thinLTO's multiple backend threads. For example, suppose I have 10 modules, each containing one kernel. With full LTO linking, I get one big module containing all 10 kernels with all functions inlined, and a single thread does optimization/codegen. With thinLTO, I get one kernel in each module, with all functions inlined; AMDGPU internalization and global DCE then remove the functions not used by that kernel in each module. I get 10 threads, each doing optimization/codegen for one kernel.
>> Theoretically, there could be a 10x speedup.
>
> That will work as long as there are no dependence edges anywhere between the kernels. Is this a library that has a bunch of totally independent kernels only called externally?

There are no dependence edges between the kernels, since kernels cannot call each other. The HIP device compilation output is always a shared library containing multiple independent kernels that can be launched by a HIP program.


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D99683/new/

https://reviews.llvm.org/D99683

_______________________________________________
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits