ABataev added inline comments.

================
Comment at: lib/CodeGen/CGOpenMPRuntime.cpp:8739
+///     // For each component specified by this mapper:
+///     if (currentComponent.hasMapper())
+///       (*currentComponent.Mapper())(rt_mapper_handle, arg_base, arg_begin,
----------------
lildmh wrote:
> ABataev wrote:
> > lildmh wrote:
> > > ABataev wrote:
> > > > lildmh wrote:
> > > > > ABataev wrote:
> > > > > > lildmh wrote:
> > > > > > > ABataev wrote:
> > > > > > > > lildmh wrote:
> > > > > > > > > ABataev wrote:
> > > > > > > > > > lildmh wrote:
> > > > > > > > > > > ABataev wrote:
> > > > > > > > > > > > lildmh wrote:
> > > > > > > > > > > > > ABataev wrote:
> > > > > > > > > > > > > > lildmh wrote:
> > > > > > > > > > > > > > > ABataev wrote:
> > > > > > > > > > > > > > > > lildmh wrote:
> > > > > > > > > > > > > > > > > ABataev wrote:
> > > > > > > > > > > > > > > > > > lildmh wrote:
> > > > > > > > > > > > > > > > > > > ABataev wrote:
> > > > > > > > > > > > > > > > > > > > Currently `currentComponent` is 
> > > > > > > > > > > > > > > > > > > > generated by the compiler. But can we 
> > > > > > > > > > > > > > > > > > > > instead pass this data as an extra 
> > > > > > > > > > > > > > > > > > > > parameter to this `omp_mapper` function.
> > > > > > > > > > > > > > > > > > > Emm, I think this scheme will be very 
> > > > > > > > > > > > > > > > > > > difficult and inefficient. If we pass 
> > > > > > > > > > > > > > > > > > > components as an argument of `omp_mapper` 
> > > > > > > > > > > > > > > > > > > function, it means that the runtime needs 
> > > > > > > > > > > > > > > > > > > to generate all components related to a 
> > > > > > > > > > > > > > > > > > > map clause. I don't think the runtime is 
> > > > > > > > > > > > > > > > > > > able to do that efficiently. On the other 
> > > > > > > > > > > > > > > > > > > hand, in the current scheme, these 
> > > > > > > > > > > > > > > > > > > components are naturally generated by the 
> > > > > > > > > > > > > > > > > > > compiler, and the runtime only needs to 
> > > > > > > > > > > > > > > > > > > know the base pointer, pointer, type, 
> > > > > > > > > > > > > > > > > > > size. etc.
> > > > > > > > > > > > > > > > > > With the current scheme, we may end with 
> > > > > > > > > > > > > > > > > > the code blowout. We need to generate very 
> > > > > > > > > > > > > > > > > > similar code for different types and 
> > > > > > > > > > > > > > > > > > variables. The worst thing here is that we 
> > > > > > > > > > > > > > > > > > will be unable to optimize this huge amount 
> > > > > > > > > > > > > > > > > > of code because the codegen relies on the 
> > > > > > > > > > > > > > > > > > runtime functions and the code cannot be 
> > > > > > > > > > > > > > > > > > inlined. That's why I would like to move as 
> > > > > > > > > > > > > > > > > > much as possible code to the runtime rather 
> > > > > > > > > > > > > > > > > > than to emit it in the compiler. 
> > > > > > > > > > > > > > > > > I understand your concerns. I think this is 
> > > > > > > > > > > > > > > > > the best we can do right now.
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > The most worrisome case will be when we have 
> > > > > > > > > > > > > > > > > nested mappers within each other. In this 
> > > > > > > > > > > > > > > > > case, a mapper function will call another 
> > > > > > > > > > > > > > > > > mapper function. We can inline the inner 
> > > > > > > > > > > > > > > > > mapper functions in this scenario, so that 
> > > > > > > > > > > > > > > > > these mapper function can be properly 
> > > > > > > > > > > > > > > > > optimized. As a result, I think the 
> > > > > > > > > > > > > > > > > performance should be fine.
> > > > > > > > > > > > > > > > Instead, we can use indirect function calls 
> > > > > > > > > > > > > > > > passed in the array to the runtime. Do you 
> > > > > > > > > > > > > > > > think it is going to be slower? In your current 
> > > > > > > > > > > > > > > > scheme, we generate many runtime calls instead. 
> > > > > > > > > > > > > > > > Could you try to estimate the number of calls 
> > > > > > > > > > > > > > > > in cases if we'll call the mappers through the 
> > > > > > > > > > > > > > > > indirect function calls and in your cuurent 
> > > > > > > > > > > > > > > > scheme, where we need to call the runtime 
> > > > > > > > > > > > > > > > functions many times in each particular mapper?
> > > > > > > > > > > > > > > Hi Alexey,
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > Sorry I don't understand your idea. What indirect 
> > > > > > > > > > > > > > > function calls do you propose to be passed to the 
> > > > > > > > > > > > > > > runtime? What are these functions supposed to do?
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > The number of function calls will be exactly 
> > > > > > > > > > > > > > > equal to the number of components mapped, no 
> > > > > > > > > > > > > > > matter whether there are nested mappers or not. 
> > > > > > > > > > > > > > > The number of components depend on the program. 
> > > > > > > > > > > > > > > E.g., if we map a large array section, then there 
> > > > > > > > > > > > > > > will be many more function calls.
> > > > > > > > > > > > > > I mean the pointers to the mapper function, 
> > > > > > > > > > > > > > generated by the compiler. In your comment, it is 
> > > > > > > > > > > > > > `c.Mapper()`
> > > > > > > > > > > > > If we pass nested mapper functions to the runtime, I 
> > > > > > > > > > > > > think it will slow down execution because of the 
> > > > > > > > > > > > > extra level of indirect function calls. E.g., the 
> > > > > > > > > > > > > runtime will call `omp_mapper1`, which calls the 
> > > > > > > > > > > > > runtime back, which calls `omp_mapper2`, .... This 
> > > > > > > > > > > > > can result in a deep call stack.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > I think the current implementation will be more 
> > > > > > > > > > > > > efficient, which doesn't pass nested mappers to the 
> > > > > > > > > > > > > runtime. One call to the outer most mapper function 
> > > > > > > > > > > > > will have all data mapping done. The call stack will 
> > > > > > > > > > > > > be 2 level deep (the first level is the mapper 
> > > > > > > > > > > > > function, and the second level is 
> > > > > > > > > > > > > `__tgt_push_mapper_component`) in this case from the 
> > > > > > > > > > > > > runtime. There are also more compiler optimization 
> > > > > > > > > > > > > space when we inline all nested mapper functions.
> > > > > > > > > > > > Yes, if we leave it as is. But if instead of the bunch 
> > > > > > > > > > > > unique functions we'll have the common one, that accept 
> > > > > > > > > > > > list if indirect pointers to functions additionally, 
> > > > > > > > > > > > and move it to the runtime library, we won't need those 
> > > > > > > > > > > > 2 functions we have currently. We'll have full access 
> > > > > > > > > > > > to the mapping data vector in the runtime library and 
> > > > > > > > > > > > won't need to use those 2 accessors we have currently. 
> > > > > > > > > > > > Instead, we'll need just one runtime functions, which 
> > > > > > > > > > > > implements the whole mapping logic. We still need to 
> > > > > > > > > > > > call it recursively, but I assume the number of calls 
> > > > > > > > > > > > will remain the same as in the current scheme. Did you 
> > > > > > > > > > > > understand the idea? If yes, it would good if you coild 
> > > > > > > > > > > > try to estimate the number of function calls in current 
> > > > > > > > > > > > scheme and in this new scheme to estimate possible pros 
> > > > > > > > > > > > and cons.
> > > > > > > > > > > Hi Alexey,
> > > > > > > > > > > 
> > > > > > > > > > > Could you give an example for this scheme? 1) I don't 
> > > > > > > > > > > understand how the mapper function can have full access 
> > > > > > > > > > > to the mapping data vector without providing these 2 
> > > > > > > > > > > accessors. 2) I don't think it is possible to have a 
> > > > > > > > > > > common function instead of bunch of unique functions for 
> > > > > > > > > > > each mapper declared.
> > > > > > > > > > Hi Lingda, something like this.
> > > > > > > > > > ```
> > > > > > > > > > void __tgt_mapper(void *base, void *begin, size_t size, 
> > > > > > > > > > int64_t type, auto components[]) {
> > > > > > > > > >   // Allocate space for an array section first.
> > > > > > > > > >   if (size > 1 && !maptype.IsDelete)
> > > > > > > > > >      <push>(base, begin, size*sizeof(Ty), 
> > > > > > > > > > clearToFrom(type));
> > > > > > > > > >    
> > > > > > > > > >   // Map members.
> > > > > > > > > >   for (unsigned i = 0; i < size; i++) {
> > > > > > > > > >      // For each component specified by this mapper:
> > > > > > > > > >      for (auto c : components) {
> > > > > > > > > >        if (c.hasMapper())
> > > > > > > > > >          (*c.Mapper())(c.arg_base, c.arg_begin, c.arg_size, 
> > > > > > > > > > c.arg_type);
> > > > > > > > > >        else
> > > > > > > > > >          <push>(c.arg_base, c.arg_begin, c.arg_size, 
> > > > > > > > > > c.arg_type);
> > > > > > > > > >      }
> > > > > > > > > >   }
> > > > > > > > > >   // Delete the array section.
> > > > > > > > > >   if (size > 1 && maptype.IsDelete)
> > > > > > > > > >     <push>(base, begin, size*sizeof(Ty), clearToFrom(type));
> > > > > > > > > > }
> > > > > > > > > > 
> > > > > > > > > > void <type>.mapper(void *base, void *begin, size_t size, 
> > > > > > > > > > int64_t type) {
> > > > > > > > > >  auto sub_components[] = {...};
> > > > > > > > > >  __tgt_mapper(base, begin, size, type, sub_components);
> > > > > > > > > > }
> > > > > > > > > > ```
> > > > > > > > > > 
> > > > > > > > > Hi Alexey,
> > > > > > > > > 
> > > > > > > > > I don't think this scheme is more efficient than the current 
> > > > > > > > > scheme. My reasons are:
> > > > > > > > > 
> > > > > > > > > 1) Most code here is essentially to generate `components`, 
> > > > > > > > > i.e., we need to generate `c.arg_base, c.arg_begin, 
> > > > > > > > > c.arg_size, c.arg_type` for each `c` in `components`, so 
> > > > > > > > > there will still be a lot of code in `<type>.mapper`. It will 
> > > > > > > > > not reduce the mapper function code, i.e., we will still have 
> > > > > > > > > a bunch of unique mapper functions.
> > > > > > > > > 
> > > > > > > > > 2) This scheme will prevent a lot of compiler optimization 
> > > > > > > > > from happening. In reality, a lot of computation should be 
> > > > > > > > > redundant. E.g., for two components `c1` and `c2`, `c1`'s 
> > > > > > > > > base may be the same as `c2`'s begin, so the compiler will be 
> > > > > > > > > able to eliminate these reduction computation, especially 
> > > > > > > > > when we inline all nested mapper functions together. If we 
> > > > > > > > > move these computation into the runtime, the compiler will 
> > > > > > > > > not be able to do such optimization.
> > > > > > > > > 
> > > > > > > > > 3) In terms of the number of `push` function calls, this 
> > > > > > > > > scheme has the exact same number of calls as the current 
> > > > > > > > > scheme, so I don't think this scheme can bring performance 
> > > > > > > > > benefits. The scheme should perform worse than the current 
> > > > > > > > > scheme, because it reduces the opportunities of compiler 
> > > > > > > > > optimization as mentioned above.
> > > > > > > > Hi Lingda, I'm trying to simplify the code generated by clang 
> > > > > > > > and avoid some unnecessary code duplications. If the complexity 
> > > > > > > > of this scheme is the same as proposed by you, I would prefer 
> > > > > > > > to use this scheme unless there are some other opinions.
> > > > > > > > 1. It is not a problem. This code is unique and is not 
> > > > > > > > duplicated in the different mappers.
> > > > > > > > 2. Inlining is no solution here. We still generate to much 
> > > > > > > > code, which is almost the same in many cases and it will lead 
> > > > > > > > to very ineffective codegen because we still end up with a lot 
> > > > > > > > of almost the same code. This also might lead to poor 
> > > > > > > > performance.
> > > > > > > > 3. Yes, the number of pushes is always the same, in all 
> > > > > > > > possible schemes. It would be good to compare somehow the 
> > > > > > > > performance of both schemes, at least preliminary.
> > > > > > > > 
> > > > > > > > Also, this solution reduces the number of required runtime 
> > > > > > > > functions, instead of 2 we need just 1 and, thus, we need to 
> > > > > > > > make fewer runtime functions calls.
> > > > > > > > 
> > > > > > > > I think it would better to propose this scheme as an alternate 
> > > > > > > > design and discuss it in the OpenMP telecon. What do you think? 
> > > > > > > > Or we can try to discuss it in the offline mode via the e-mail 
> > > > > > > > with other members.
> > > > > > > > I'm not trying to convince you to implement this scheme right 
> > > > > > > > now, but it would be good to discuss it. Maybe it will lead to 
> > > > > > > > some better ideas from others?
> > > > > > > Hi Alexey,
> > > > > > > 
> > > > > > > I still prefer the current scheme, because:
> > > > > > > 1) I don't like recursive mapper calls, which goes back to my 
> > > > > > > original scheme a little bit. I really think inlining can make a 
> > > > > > > big difference when we have nested mappers. These compiler 
> > > > > > > optimizations are the keys to have better performance for mappers.
> > > > > > > 2) I don't think the codegen here is inefficient. Yes there is 
> > > > > > > duplicated code across different mapper functions, but why that 
> > > > > > > will lead to poor performance?
> > > > > > > 3) Although we have 2 runtime functions now, the 
> > > > > > > `__tgt_mapper_num_components` is called only once per mapper. It 
> > > > > > > should have very negligible performance impact.
> > > > > > > 
> > > > > > > But if you have a different option, we can discuss it next time 
> > > > > > > in the meeting. I do have a time constraint to work on the mapper 
> > > > > > > implementation. I'll no longer work in this project starting this 
> > > > > > > September, and I have about 30% of my time working on it until 
> > > > > > > then.
> > > > > > Lingda, 
> > > > > > 1. We have recursive (actually, not recursive, because you cannot 
> > > > > > use types recursively) mappers calls anyway, it is nature of 
> > > > > > struсtures/classes.
> > > > > > 2. We have a lot of similar code. And I'm not sure that it can be 
> > > > > > optimized out.
> > > > > > 3. Yes, but it means that we have n extra runtime calls, where n is 
> > > > > > the number of branches in the structure/class tree.
> > > > > > 
> > > > > > I see :(. I understand your concern. In this case, we could try to 
> > > > > > discuss it offline, in the mailing list, to make it a little bit 
> > > > > > faster. We just need to hear other opinions on this matter, maybe 
> > > > > > there are some other pros and cons for these schemes.
> > > > > Hi Alexey,
> > > > > 
> > > > > Sure, let's discuss this in the mailing list. I'll summarize it and 
> > > > > send it to the mailing list later.
> > > > > 
> > > > > > We have recursive (actually, not recursive, because you cannot use 
> > > > > > types recursively) mappers calls anyway, it is nature of 
> > > > > > struсtures/classes.
> > > > > We won't have recursive calls with inlining.
> > > > > 
> > > > > > We have a lot of similar code. And I'm not sure that it can be 
> > > > > > optimized out.
> > > > > I think it's even harder to optimized these code out when we move 
> > > > > them into the runtime.
> > > > > 
> > > > > > Yes, but it means that we have n extra runtime calls, where n is 
> > > > > > the number of branches in the structure/class tree.
> > > > > I don't quite understand. It's still equal to the number of mappers 
> > > > > in any case.
> > > > > Sure, let's discuss this in the mailing list. I'll summarize it and 
> > > > > send it to the mailing list later.
> > > > 
> > > > Good, thanks!
> > > > 
> > > > > We won't have recursive calls with inlining.
> > > > 
> > > > We won't have recursive calls anyway (recursive types are not allowed). 
> > > > Plus, I'm not sure that inlining is the best option here. We have a lot 
> > > > of code for each mapper and I'm not sure that the optimizer will be 
> > > > able to squash it effectively.
> > > > 
> > > > > I think it's even harder to optimized these code out when we move 
> > > > > them into the runtime.
> > > > 
> > > > Definitely not, unless we use LTO or inlined runtime.
> > > > 
> > > > We won't have recursive calls anyway (recursive types are not allowed). 
> > > > Plus, I'm not sure that inlining is the best option here. We have a lot 
> > > > of code for each mapper and I'm not sure that the optimizer will be 
> > > > able to squash it effectively.
> > > Sorry I should not say recursive calls. Here it needs to "recursively" 
> > > call other mapper functions in case of nested mappers, but we don't need 
> > > it in case of inlining.
> > > 
> > > > Definitely not, unless we use LTO or inlined runtime.
> > > But you are proposing to move many code to the runtime here, right? That 
> > > doesn't make sense to me.
> > > 
> > > But you are proposing to move much code to the runtime here, right? That 
> > > doesn't make sense to me.
> > 
> > I'm just not sure that there going be significant problems with the 
> > performance because of that. And it significantly simplifies codegen in the 
> > compiler and moves the common part into a single function.
> > 
> > Plus, if in future we'll need to modify this functionality for some reason, 
> > 2 different versions of the compiler will produce incompatible code. With 
> > my scheme, you still can use old runtime and have the same functionality as 
> > the old compiler and the new one.
> Hi Alexey,
> 
> I think more carefully about your scheme, and I don't think we can solve the 
> 2 problems below with this scheme:
> 
> 1. In the example you gave before, the compiler needs to generate all map 
> types and pass them to `__tgt_mapper` through `sub_components`. But in this 
> case, the compiler won't be able to generate the correct `MEMBER_OF` field in 
> map type. As a result, the runtime has to fix it using the mechanism we 
> already have here: `__tgt_mapper_num_components`. This not only increases 
> complexity, but also, it means the runtime needs further manipulation of the 
> map type, which creates locality issues. While in the current scheme, the map 
> type is generated by compiler once, so the data locality will be very good in 
> this case.
> 
> 2. `sub_components` includes all components that should be mapped. If we are 
> mapping an array, this means we need to map many components, which will need 
> to allocate memory for `sub_components` in the heap. This creates further 
> memory management burden and is not an efficient way to use memory.
> 
> Based on these reasons, I think the current scheme is still more preferable.
Hi Lingda, 
1. Actually, I thought that the runtime function `__tgt_mapper` will do this, 
not the compiler.
2. Why do we need to allocate it on the heap? We can allocate it on the stack.


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D59474/new/

https://reviews.llvm.org/D59474



_______________________________________________
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits

Reply via email to