yaxunl added a comment.

In D99683#2672668 <https://reviews.llvm.org/D99683#2672668>, @tejohnson wrote:

> In D99683#2672578 <https://reviews.llvm.org/D99683#2672578>, @yaxunl wrote:
>
>> In D99683#2672554 <https://reviews.llvm.org/D99683#2672554>, @tejohnson 
>> wrote:
>>
>>> This raises some higher level questions for me:
>>>
>>> First, how will you deal with other corner cases that won't or cannot be 
>>> imported right now? While enabling importing of noinline functions and 
>>> cranking up the threshold will get the majority of functions imported, 
>>> there are cases that we still won't import (functions/vars that are 
>>> interposable, certain funcs/vars that cannot be renamed, most non-const 
>>> variables with non-trivial initializers).
>>
>> We will document the limitations of ThinLTO support in the HIP toolchain and 
>> recommend that users not use ThinLTO in those corner cases.
>>
>>> Second, force importing of everything transitively referenced defeats the 
>>> purpose of ThinLTO and would probably make it worse than regular LTO. The 
>>> main entry module will need to import everything transitively referenced 
>>> from there, so everything not dead in the binary, which should make that 
>>> module post importing equivalent to a regular LTO module. In addition, 
>>> every other module needs to transitively import everything referenced from 
>>> those modules, making them very large depending on how many leaf vs 
>>> non-leaf functions and variables they contain. What is the goal of doing 
>>> ThinLTO in this case?
>>
>> The objective is to improve optimization/codegen time by exploiting ThinLTO's 
>> thread-level parallelism. For example, say I have 10 modules, each containing 
>> one kernel. With full LTO linking, I get one big module containing all 10 
>> kernels with all functions inlined, and a single thread does the 
>> optimization/codegen. With ThinLTO, I get one kernel per module, with all 
>> functions inlined. AMDGPU internalization and global DCE then remove the 
>> functions not used by that module's kernel. I get 10 threads, each doing 
>> optimization/codegen for one kernel, so theoretically there could be a 10x 
>> speedup.
>
> That will work as long as there are no dependence edges anywhere between the 
> kernels. Is this a library that has a bunch of totally independent kernels 
> only called externally?

There are no dependence edges between the kernels, since kernels cannot call 
each other. The HIP device compilation output is always a shared library 
containing multiple independent kernels that can be launched by a HIP program.
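The independence argument above can be illustrated with a toy model (plain
Python, not LLVM's actual importing code; the call graph below is hypothetical):
because no kernel references another, the transitive closure of functions each
kernel's module must retain after internalization and global DCE contains only
that kernel's own callees, so each module can be optimized on its own thread.

```python
# Toy model of per-kernel importing followed by global DCE.
# Each kernel's module keeps only the transitive closure of what
# that kernel references; everything else is dropped as dead.

def reachable(root, call_graph):
    """Return the set of functions transitively referenced from root."""
    seen, stack = set(), [root]
    while stack:
        f = stack.pop()
        if f in seen:
            continue
        seen.add(f)
        stack.extend(call_graph.get(f, ()))
    return seen

# Hypothetical call graph: two kernels that never call each other,
# sharing one common helper.
call_graph = {
    "kernel_a": ["helper_common", "helper_a"],
    "kernel_b": ["helper_common", "helper_b"],
    "helper_a": [],
    "helper_b": [],
    "helper_common": [],
}

# Post-DCE contents of each kernel's module: disjoint except for the
# shared helper, and neither module pulls in the other kernel.
module_a = reachable("kernel_a", call_graph)
module_b = reachable("kernel_b", call_graph)
```

Since `module_a` never contains `kernel_b` (and vice versa), the two modules
can go through optimization/codegen independently, which is where the
parallel speedup comes from.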


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D99683/new/

https://reviews.llvm.org/D99683

_______________________________________________
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits