[PATCH] D59474: [OpenMP 5.0] Codegen support for user-defined mappers

Lingda Li via Phabricator via cfe-commits Thu, 27 Jun 2019 11:05:53 -0700

lildmh marked an inline comment as done.
lildmh added inline comments.


================
Comment at: lib/CodeGen/CGOpenMPRuntime.cpp:8739
+///     // For each component specified by this mapper:
+///     if (currentComponent.hasMapper())
+///       (*currentComponent.Mapper())(rt_mapper_handle, arg_base, arg_begin,
----------------
ABataev wrote:
> lildmh wrote:
> > ABataev wrote:
> > > lildmh wrote:
> > > > ABataev wrote:
> > > > > lildmh wrote:
> > > > > > ABataev wrote:
> > > > > > > lildmh wrote:
> > > > > > > > ABataev wrote:
> > > > > > > > > lildmh wrote:
> > > > > > > > > > ABataev wrote:
> > > > > > > > > > > lildmh wrote:
> > > > > > > > > > > > ABataev wrote:
> > > > > > > > > > > > > lildmh wrote:
> > > > > > > > > > > > > > ABataev wrote:
> > > > > > > > > > > > > > > lildmh wrote:
> > > > > > > > > > > > > > > > ABataev wrote:
> > > > > > > > > > > > > > > > > lildmh wrote:
> > > > > > > > > > > > > > > > > > ABataev wrote:
> > > > > > > > > > > > > > > > > > > lildmh wrote:
> > > > > > > > > > > > > > > > > > > > ABataev wrote:
> > > > > > > > > > > > > > > > > > > > > lildmh wrote:
> > > > > > > > > > > > > > > > > > > > > > ABataev wrote:
> > > > > > > > > > > > > > > > > > > > > > > lildmh wrote:
> > > > > > > > > > > > > > > > > > > > > > > > ABataev wrote:
> > > > > > > > > > > > > > > > > > > > > > > > > lildmh wrote:
> > > > > > > > > > > > > > > > > > > > > > > > > > ABataev wrote:
> > > > > > > > > > > > > > > > > > > > > > > > > > > Currently 
> > > > > > > > > > > > > > > > > > > > > > > > > > > `currentComponent` is 
> > > > > > > > > > > > > > > > > > > > > > > > > > > generated by the 
> > > > > > > > > > > > > > > > > > > > > > > > > > > compiler. But can we 
> > > > > > > > > > > > > > > > > > > > > > > > > > > instead pass this data as 
> > > > > > > > > > > > > > > > > > > > > > > > > > > an extra parameter to 
> > > > > > > > > > > > > > > > > > > > > > > > > > > this `omp_mapper` 
> > > > > > > > > > > > > > > > > > > > > > > > > > > function.
> > > > > > > > > > > > > > > > > > > > > > > > > > Emm, I think this scheme 
> > > > > > > > > > > > > > > > > > > > > > > > > > will be very difficult and 
> > > > > > > > > > > > > > > > > > > > > > > > > > inefficient. If we pass 
> > > > > > > > > > > > > > > > > > > > > > > > > > components as an argument 
> > > > > > > > > > > > > > > > > > > > > > > > > > of `omp_mapper` function, 
> > > > > > > > > > > > > > > > > > > > > > > > > > it means that the runtime 
> > > > > > > > > > > > > > > > > > > > > > > > > > needs to generate all 
> > > > > > > > > > > > > > > > > > > > > > > > > > components related to a map 
> > > > > > > > > > > > > > > > > > > > > > > > > > clause. I don't think the 
> > > > > > > > > > > > > > > > > > > > > > > > > > runtime is able to do that 
> > > > > > > > > > > > > > > > > > > > > > > > > > efficiently. On the other 
> > > > > > > > > > > > > > > > > > > > > > > > > > hand, in the current 
> > > > > > > > > > > > > > > > > > > > > > > > > > scheme, these components 
> > > > > > > > > > > > > > > > > > > > > > > > > > are naturally generated by 
> > > > > > > > > > > > > > > > > > > > > > > > > > the compiler, and the 
> > > > > > > > > > > > > > > > > > > > > > > > > > runtime only needs to know 
> > > > > > > > > > > > > > > > > > > > > > > > > > the base pointer, pointer, 
> > > > > > > > > > > > > > > > > > > > > > > > > > type, size. etc.
> > > > > > > > > > > > > > > > > > > > > > > > > With the current scheme, we 
> > > > > > > > > > > > > > > > > > > > > > > > > may end with the code 
> > > > > > > > > > > > > > > > > > > > > > > > > blowout. We need to generate 
> > > > > > > > > > > > > > > > > > > > > > > > > very similar code for 
> > > > > > > > > > > > > > > > > > > > > > > > > different types and 
> > > > > > > > > > > > > > > > > > > > > > > > > variables. The worst thing 
> > > > > > > > > > > > > > > > > > > > > > > > > here is that we will be 
> > > > > > > > > > > > > > > > > > > > > > > > > unable to optimize this huge 
> > > > > > > > > > > > > > > > > > > > > > > > > amount of code because the 
> > > > > > > > > > > > > > > > > > > > > > > > > codegen relies on the runtime 
> > > > > > > > > > > > > > > > > > > > > > > > > functions and the code cannot 
> > > > > > > > > > > > > > > > > > > > > > > > > be inlined. That's why I 
> > > > > > > > > > > > > > > > > > > > > > > > > would like to move as much as 
> > > > > > > > > > > > > > > > > > > > > > > > > possible code to the runtime 
> > > > > > > > > > > > > > > > > > > > > > > > > rather than to emit it in the 
> > > > > > > > > > > > > > > > > > > > > > > > > compiler. 
> > > > > > > > > > > > > > > > > > > > > > > > I understand your concerns. I 
> > > > > > > > > > > > > > > > > > > > > > > > think this is the best we can 
> > > > > > > > > > > > > > > > > > > > > > > > do right now.
> > > > > > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > > > > > The most worrisome case will be 
> > > > > > > > > > > > > > > > > > > > > > > > when we have nested mappers 
> > > > > > > > > > > > > > > > > > > > > > > > within each other. In this 
> > > > > > > > > > > > > > > > > > > > > > > > case, a mapper function will 
> > > > > > > > > > > > > > > > > > > > > > > > call another mapper function. 
> > > > > > > > > > > > > > > > > > > > > > > > We can inline the inner mapper 
> > > > > > > > > > > > > > > > > > > > > > > > functions in this scenario, so 
> > > > > > > > > > > > > > > > > > > > > > > > that these mapper function can 
> > > > > > > > > > > > > > > > > > > > > > > > be properly optimized. As a 
> > > > > > > > > > > > > > > > > > > > > > > > result, I think the performance 
> > > > > > > > > > > > > > > > > > > > > > > > should be fine.
> > > > > > > > > > > > > > > > > > > > > > > Instead, we can use indirect 
> > > > > > > > > > > > > > > > > > > > > > > function calls passed in the 
> > > > > > > > > > > > > > > > > > > > > > > array to the runtime. Do you 
> > > > > > > > > > > > > > > > > > > > > > > think it is going to be slower? 
> > > > > > > > > > > > > > > > > > > > > > > In your current scheme, we 
> > > > > > > > > > > > > > > > > > > > > > > generate many runtime calls 
> > > > > > > > > > > > > > > > > > > > > > > instead. Could you try to 
> > > > > > > > > > > > > > > > > > > > > > > estimate the number of calls in 
> > > > > > > > > > > > > > > > > > > > > > > cases if we'll call the mappers 
> > > > > > > > > > > > > > > > > > > > > > > through the indirect function 
> > > > > > > > > > > > > > > > > > > > > > > calls and in your cuurent scheme, 
> > > > > > > > > > > > > > > > > > > > > > > where we need to call the runtime 
> > > > > > > > > > > > > > > > > > > > > > > functions many times in each 
> > > > > > > > > > > > > > > > > > > > > > > particular mapper?
> > > > > > > > > > > > > > > > > > > > > > Hi Alexey,
> > > > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > > > Sorry I don't understand your idea. 
> > > > > > > > > > > > > > > > > > > > > > What indirect function calls do you 
> > > > > > > > > > > > > > > > > > > > > > propose to be passed to the 
> > > > > > > > > > > > > > > > > > > > > > runtime? What are these functions 
> > > > > > > > > > > > > > > > > > > > > > supposed to do?
> > > > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > > > The number of function calls will 
> > > > > > > > > > > > > > > > > > > > > > be exactly equal to the number of 
> > > > > > > > > > > > > > > > > > > > > > components mapped, no matter 
> > > > > > > > > > > > > > > > > > > > > > whether there are nested mappers or 
> > > > > > > > > > > > > > > > > > > > > > not. The number of components 
> > > > > > > > > > > > > > > > > > > > > > depend on the program. E.g., if we 
> > > > > > > > > > > > > > > > > > > > > > map a large array section, then 
> > > > > > > > > > > > > > > > > > > > > > there will be many more function 
> > > > > > > > > > > > > > > > > > > > > > calls.
> > > > > > > > > > > > > > > > > > > > > I mean the pointers to the mapper 
> > > > > > > > > > > > > > > > > > > > > function, generated by the compiler. 
> > > > > > > > > > > > > > > > > > > > > In your comment, it is `c.Mapper()`
> > > > > > > > > > > > > > > > > > > > If we pass nested mapper functions to 
> > > > > > > > > > > > > > > > > > > > the runtime, I think it will slow down 
> > > > > > > > > > > > > > > > > > > > execution because of the extra level of 
> > > > > > > > > > > > > > > > > > > > indirect function calls. E.g., the 
> > > > > > > > > > > > > > > > > > > > runtime will call `omp_mapper1`, which 
> > > > > > > > > > > > > > > > > > > > calls the runtime back, which calls 
> > > > > > > > > > > > > > > > > > > > `omp_mapper2`, .... This can result in 
> > > > > > > > > > > > > > > > > > > > a deep call stack.
> > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > I think the current implementation will 
> > > > > > > > > > > > > > > > > > > > be more efficient, which doesn't pass 
> > > > > > > > > > > > > > > > > > > > nested mappers to the runtime. One call 
> > > > > > > > > > > > > > > > > > > > to the outer most mapper function will 
> > > > > > > > > > > > > > > > > > > > have all data mapping done. The call 
> > > > > > > > > > > > > > > > > > > > stack will be 2 level deep (the first 
> > > > > > > > > > > > > > > > > > > > level is the mapper function, and the 
> > > > > > > > > > > > > > > > > > > > second level is 
> > > > > > > > > > > > > > > > > > > > `__tgt_push_mapper_component`) in this 
> > > > > > > > > > > > > > > > > > > > case from the runtime. There are also 
> > > > > > > > > > > > > > > > > > > > more compiler optimization space when 
> > > > > > > > > > > > > > > > > > > > we inline all nested mapper functions.
> > > > > > > > > > > > > > > > > > > Yes, if we leave it as is. But if instead 
> > > > > > > > > > > > > > > > > > > of the bunch unique functions we'll have 
> > > > > > > > > > > > > > > > > > > the common one, that accept list if 
> > > > > > > > > > > > > > > > > > > indirect pointers to functions 
> > > > > > > > > > > > > > > > > > > additionally, and move it to the runtime 
> > > > > > > > > > > > > > > > > > > library, we won't need those 2 functions 
> > > > > > > > > > > > > > > > > > > we have currently. We'll have full access 
> > > > > > > > > > > > > > > > > > > to the mapping data vector in the runtime 
> > > > > > > > > > > > > > > > > > > library and won't need to use those 2 
> > > > > > > > > > > > > > > > > > > accessors we have currently. Instead, 
> > > > > > > > > > > > > > > > > > > we'll need just one runtime functions, 
> > > > > > > > > > > > > > > > > > > which implements the whole mapping logic. 
> > > > > > > > > > > > > > > > > > > We still need to call it recursively, but 
> > > > > > > > > > > > > > > > > > > I assume the number of calls will remain 
> > > > > > > > > > > > > > > > > > > the same as in the current scheme. Did 
> > > > > > > > > > > > > > > > > > > you understand the idea? If yes, it would 
> > > > > > > > > > > > > > > > > > > good if you coild try to estimate the 
> > > > > > > > > > > > > > > > > > > number of function calls in current 
> > > > > > > > > > > > > > > > > > > scheme and in this new scheme to estimate 
> > > > > > > > > > > > > > > > > > > possible pros and cons.
> > > > > > > > > > > > > > > > > > Hi Alexey,
> > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > Could you give an example for this scheme? 
> > > > > > > > > > > > > > > > > > 1) I don't understand how the mapper 
> > > > > > > > > > > > > > > > > > function can have full access to the 
> > > > > > > > > > > > > > > > > > mapping data vector without providing these 
> > > > > > > > > > > > > > > > > > 2 accessors. 2) I don't think it is 
> > > > > > > > > > > > > > > > > > possible to have a common function instead 
> > > > > > > > > > > > > > > > > > of bunch of unique functions for each 
> > > > > > > > > > > > > > > > > > mapper declared.
> > > > > > > > > > > > > > > > > Hi Lingda, something like this.
> > > > > > > > > > > > > > > > > ```
> > > > > > > > > > > > > > > > > void __tgt_mapper(void *base, void *begin, 
> > > > > > > > > > > > > > > > > size_t size, int64_t type, auto components[]) 
> > > > > > > > > > > > > > > > > {
> > > > > > > > > > > > > > > > >   // Allocate space for an array section 
> > > > > > > > > > > > > > > > > first.
> > > > > > > > > > > > > > > > >   if (size > 1 && !maptype.IsDelete)
> > > > > > > > > > > > > > > > >      <push>(base, begin, size*sizeof(Ty), 
> > > > > > > > > > > > > > > > > clearToFrom(type));
> > > > > > > > > > > > > > > > >    
> > > > > > > > > > > > > > > > >   // Map members.
> > > > > > > > > > > > > > > > >   for (unsigned i = 0; i < size; i++) {
> > > > > > > > > > > > > > > > >      // For each component specified by this 
> > > > > > > > > > > > > > > > > mapper:
> > > > > > > > > > > > > > > > >      for (auto c : components) {
> > > > > > > > > > > > > > > > >        if (c.hasMapper())
> > > > > > > > > > > > > > > > >          (*c.Mapper())(c.arg_base, 
> > > > > > > > > > > > > > > > > c.arg_begin, c.arg_size, c.arg_type);
> > > > > > > > > > > > > > > > >        else
> > > > > > > > > > > > > > > > >          <push>(c.arg_base, c.arg_begin, 
> > > > > > > > > > > > > > > > > c.arg_size, c.arg_type);
> > > > > > > > > > > > > > > > >      }
> > > > > > > > > > > > > > > > >   }
> > > > > > > > > > > > > > > > >   // Delete the array section.
> > > > > > > > > > > > > > > > >   if (size > 1 && maptype.IsDelete)
> > > > > > > > > > > > > > > > >     <push>(base, begin, size*sizeof(Ty), 
> > > > > > > > > > > > > > > > > clearToFrom(type));
> > > > > > > > > > > > > > > > > }
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > void <type>.mapper(void *base, void *begin, 
> > > > > > > > > > > > > > > > > size_t size, int64_t type) {
> > > > > > > > > > > > > > > > >  auto sub_components[] = {...};
> > > > > > > > > > > > > > > > >  __tgt_mapper(base, begin, size, type, 
> > > > > > > > > > > > > > > > > sub_components);
> > > > > > > > > > > > > > > > > }
> > > > > > > > > > > > > > > > > ```
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > Hi Alexey,
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > I don't think this scheme is more efficient 
> > > > > > > > > > > > > > > > than the current scheme. My reasons are:
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > 1) Most code here is essentially to generate 
> > > > > > > > > > > > > > > > `components`, i.e., we need to generate 
> > > > > > > > > > > > > > > > `c.arg_base, c.arg_begin, c.arg_size, 
> > > > > > > > > > > > > > > > c.arg_type` for each `c` in `components`, so 
> > > > > > > > > > > > > > > > there will still be a lot of code in 
> > > > > > > > > > > > > > > > `<type>.mapper`. It will not reduce the mapper 
> > > > > > > > > > > > > > > > function code, i.e., we will still have a bunch 
> > > > > > > > > > > > > > > > of unique mapper functions.
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > 2) This scheme will prevent a lot of compiler 
> > > > > > > > > > > > > > > > optimization from happening. In reality, a lot 
> > > > > > > > > > > > > > > > of computation should be redundant. E.g., for 
> > > > > > > > > > > > > > > > two components `c1` and `c2`, `c1`'s base may 
> > > > > > > > > > > > > > > > be the same as `c2`'s begin, so the compiler 
> > > > > > > > > > > > > > > > will be able to eliminate these reduction 
> > > > > > > > > > > > > > > > computation, especially when we inline all 
> > > > > > > > > > > > > > > > nested mapper functions together. If we move 
> > > > > > > > > > > > > > > > these computation into the runtime, the 
> > > > > > > > > > > > > > > > compiler will not be able to do such 
> > > > > > > > > > > > > > > > optimization.
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > 3) In terms of the number of `push` function 
> > > > > > > > > > > > > > > > calls, this scheme has the exact same number of 
> > > > > > > > > > > > > > > > calls as the current scheme, so I don't think 
> > > > > > > > > > > > > > > > this scheme can bring performance benefits. The 
> > > > > > > > > > > > > > > > scheme should perform worse than the current 
> > > > > > > > > > > > > > > > scheme, because it reduces the opportunities of 
> > > > > > > > > > > > > > > > compiler optimization as mentioned above.
> > > > > > > > > > > > > > > Hi Lingda, I'm trying to simplify the code 
> > > > > > > > > > > > > > > generated by clang and avoid some unnecessary 
> > > > > > > > > > > > > > > code duplications. If the complexity of this 
> > > > > > > > > > > > > > > scheme is the same as proposed by you, I would 
> > > > > > > > > > > > > > > prefer to use this scheme unless there are some 
> > > > > > > > > > > > > > > other opinions.
> > > > > > > > > > > > > > > 1. It is not a problem. This code is unique and 
> > > > > > > > > > > > > > > is not duplicated in the different mappers.
> > > > > > > > > > > > > > > 2. Inlining is no solution here. We still 
> > > > > > > > > > > > > > > generate to much code, which is almost the same 
> > > > > > > > > > > > > > > in many cases and it will lead to very 
> > > > > > > > > > > > > > > ineffective codegen because we still end up with 
> > > > > > > > > > > > > > > a lot of almost the same code. This also might 
> > > > > > > > > > > > > > > lead to poor performance.
> > > > > > > > > > > > > > > 3. Yes, the number of pushes is always the same, 
> > > > > > > > > > > > > > > in all possible schemes. It would be good to 
> > > > > > > > > > > > > > > compare somehow the performance of both schemes, 
> > > > > > > > > > > > > > > at least preliminary.
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > Also, this solution reduces the number of 
> > > > > > > > > > > > > > > required runtime functions, instead of 2 we need 
> > > > > > > > > > > > > > > just 1 and, thus, we need to make fewer runtime 
> > > > > > > > > > > > > > > functions calls.
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > I think it would better to propose this scheme as 
> > > > > > > > > > > > > > > an alternate design and discuss it in the OpenMP 
> > > > > > > > > > > > > > > telecon. What do you think? Or we can try to 
> > > > > > > > > > > > > > > discuss it in the offline mode via the e-mail 
> > > > > > > > > > > > > > > with other members.
> > > > > > > > > > > > > > > I'm not trying to convince you to implement this 
> > > > > > > > > > > > > > > scheme right now, but it would be good to discuss 
> > > > > > > > > > > > > > > it. Maybe it will lead to some better ideas from 
> > > > > > > > > > > > > > > others?
> > > > > > > > > > > > > > Hi Alexey,
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > I still prefer the current scheme, because:
> > > > > > > > > > > > > > 1) I don't like recursive mapper calls, which goes 
> > > > > > > > > > > > > > back to my original scheme a little bit. I really 
> > > > > > > > > > > > > > think inlining can make a big difference when we 
> > > > > > > > > > > > > > have nested mappers. These compiler optimizations 
> > > > > > > > > > > > > > are the keys to have better performance for mappers.
> > > > > > > > > > > > > > 2) I don't think the codegen here is inefficient. 
> > > > > > > > > > > > > > Yes there is duplicated code across different 
> > > > > > > > > > > > > > mapper functions, but why that will lead to poor 
> > > > > > > > > > > > > > performance?
> > > > > > > > > > > > > > 3) Although we have 2 runtime functions now, the 
> > > > > > > > > > > > > > `__tgt_mapper_num_components` is called only once 
> > > > > > > > > > > > > > per mapper. It should have very negligible 
> > > > > > > > > > > > > > performance impact.
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > But if you have a different option, we can discuss 
> > > > > > > > > > > > > > it next time in the meeting. I do have a time 
> > > > > > > > > > > > > > constraint to work on the mapper implementation. 
> > > > > > > > > > > > > > I'll no longer work in this project starting this 
> > > > > > > > > > > > > > September, and I have about 30% of my time working 
> > > > > > > > > > > > > > on it until then.
> > > > > > > > > > > > > Lingda, 
> > > > > > > > > > > > > 1. We have recursive (actually, not recursive, 
> > > > > > > > > > > > > because you cannot use types recursively) mappers 
> > > > > > > > > > > > > calls anyway, it is nature of struсtures/classes.
> > > > > > > > > > > > > 2. We have a lot of similar code. And I'm not sure 
> > > > > > > > > > > > > that it can be optimized out.
> > > > > > > > > > > > > 3. Yes, but it means that we have n extra runtime 
> > > > > > > > > > > > > calls, where n is the number of branches in the 
> > > > > > > > > > > > > structure/class tree.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > I see :(. I understand your concern. In this case, we 
> > > > > > > > > > > > > could try to discuss it offline, in the mailing list, 
> > > > > > > > > > > > > to make it a little bit faster. We just need to hear 
> > > > > > > > > > > > > other opinions on this matter, maybe there are some 
> > > > > > > > > > > > > other pros and cons for these schemes.
> > > > > > > > > > > > Hi Alexey,
> > > > > > > > > > > > 
> > > > > > > > > > > > Sure, let's discuss this in the mailing list. I'll 
> > > > > > > > > > > > summarize it and send it to the mailing list later.
> > > > > > > > > > > > 
> > > > > > > > > > > > > We have recursive (actually, not recursive, because 
> > > > > > > > > > > > > you cannot use types recursively) mappers calls 
> > > > > > > > > > > > > anyway, it is nature of struсtures/classes.
> > > > > > > > > > > > We won't have recursive calls with inlining.
> > > > > > > > > > > > 
> > > > > > > > > > > > > We have a lot of similar code. And I'm not sure that 
> > > > > > > > > > > > > it can be optimized out.
> > > > > > > > > > > > I think it's even harder to optimized these code out 
> > > > > > > > > > > > when we move them into the runtime.
> > > > > > > > > > > > 
> > > > > > > > > > > > > Yes, but it means that we have n extra runtime calls, 
> > > > > > > > > > > > > where n is the number of branches in the 
> > > > > > > > > > > > > structure/class tree.
> > > > > > > > > > > > I don't quite understand. It's still equal to the 
> > > > > > > > > > > > number of mappers in any case.
> > > > > > > > > > > > Sure, let's discuss this in the mailing list. I'll 
> > > > > > > > > > > > summarize it and send it to the mailing list later.
> > > > > > > > > > > 
> > > > > > > > > > > Good, thanks!
> > > > > > > > > > > 
> > > > > > > > > > > > We won't have recursive calls with inlining.
> > > > > > > > > > > 
> > > > > > > > > > > We won't have recursive calls anyway (recursive types are 
> > > > > > > > > > > not allowed). Plus, I'm not sure that inlining is the 
> > > > > > > > > > > best option here. We have a lot of code for each mapper 
> > > > > > > > > > > and I'm not sure that the optimizer will be able to 
> > > > > > > > > > > squash it effectively.
> > > > > > > > > > > 
> > > > > > > > > > > > I think it's even harder to optimized these code out 
> > > > > > > > > > > > when we move them into the runtime.
> > > > > > > > > > > 
> > > > > > > > > > > Definitely not, unless we use LTO or inlined runtime.
> > > > > > > > > > > 
> > > > > > > > > > > We won't have recursive calls anyway (recursive types are 
> > > > > > > > > > > not allowed). Plus, I'm not sure that inlining is the 
> > > > > > > > > > > best option here. We have a lot of code for each mapper 
> > > > > > > > > > > and I'm not sure that the optimizer will be able to 
> > > > > > > > > > > squash it effectively.
> > > > > > > > > > Sorry I should not say recursive calls. Here it needs to 
> > > > > > > > > > "recursively" call other mapper functions in case of nested 
> > > > > > > > > > mappers, but we don't need it in case of inlining.
> > > > > > > > > > 
> > > > > > > > > > > Definitely not, unless we use LTO or inlined runtime.
> > > > > > > > > > But you are proposing to move many code to the runtime 
> > > > > > > > > > here, right? That doesn't make sense to me.
> > > > > > > > > > 
> > > > > > > > > > But you are proposing to move much code to the runtime 
> > > > > > > > > > here, right? That doesn't make sense to me.
> > > > > > > > > 
> > > > > > > > > I'm just not sure that there going be significant problems 
> > > > > > > > > with the performance because of that. And it significantly 
> > > > > > > > > simplifies codegen in the compiler and moves the common part 
> > > > > > > > > into a single function.
> > > > > > > > > 
> > > > > > > > > Plus, if in future we'll need to modify this functionality 
> > > > > > > > > for some reason, 2 different versions of the compiler will 
> > > > > > > > > produce incompatible code. With my scheme, you still can use 
> > > > > > > > > old runtime and have the same functionality as the old 
> > > > > > > > > compiler and the new one.
> > > > > > > > Hi Alexey,
> > > > > > > > 
> > > > > > > > I think more carefully about your scheme, and I don't think we 
> > > > > > > > can solve the 2 problems below with this scheme:
> > > > > > > > 
> > > > > > > > 1. In the example you gave before, the compiler needs to 
> > > > > > > > generate all map types and pass them to `__tgt_mapper` through 
> > > > > > > > `sub_components`. But in this case, the compiler won't be able 
> > > > > > > > to generate the correct `MEMBER_OF` field in map type. As a 
> > > > > > > > result, the runtime has to fix it using the mechanism we 
> > > > > > > > already have here: `__tgt_mapper_num_components`. This not only 
> > > > > > > > increases complexity, but also, it means the runtime needs 
> > > > > > > > further manipulation of the map type, which creates locality 
> > > > > > > > issues. While in the current scheme, the map type is generated 
> > > > > > > > by compiler once, so the data locality will be very good in 
> > > > > > > > this case.
> > > > > > > > 
> > > > > > > > 2. `sub_components` includes all components that should be 
> > > > > > > > mapped. If we are mapping an array, this means we need to map 
> > > > > > > > many components, which will need to allocate memory for 
> > > > > > > > `sub_components` in the heap. This creates further memory 
> > > > > > > > management burden and is not an efficient way to use memory.
> > > > > > > > 
> > > > > > > > Based on these reasons, I think the current scheme is still 
> > > > > > > > more preferable.
> > > > > > > Hi Lingda, 
> > > > > > > 1. Actually, I thought that the runtime function `__tgt_mapper` 
> > > > > > > will do this, not the compiler.
> > > > > > > 2. Why do we need to allocate it on the heap? We can allocate it 
> > > > > > > on the stack.
> > > > > > 1. In your scheme, both compiler and `__tgt_mapper` need to do 
> > > > > > this: the compiler will generate other parts in the type, e.g., 
> > > > > > `TO` `FROM` bits and basic `MEMBER_OF` bits. Then `__tgt_mapper` 
> > > > > > needs to modify the `MEMBER_OF` bits later. Since there are a lot 
> > > > > > of other memory accesses between the compiler and `__tgt_mapper` 
> > > > > > operations to the same map type, it's very likely the map type will 
> > > > > > not stay in the cache, which causes locality problem.
> > > > > > 
> > > > > > 2. Assume we are mapping an array with 1000000 elements, and each 
> > > > > > elements have 5 components. For each component, we need `base, 
> > > > > > begin_ptr, size, type, mapper`, which are 40 bytes. Together, we 
> > > > > > will need 1000000 * 5 * 40 = 200MB of space for this array, which 
> > > > > > stack cannnot handle. 
> > > > > 1. I don't think it is a big problem, this part of the code is 
> > > > > executed on the CPU and I don't think it will lead to significant 
> > > > > overhead.
> > > > > 2. When we map an array, we do not map it element-by-element, so we 
> > > > > don't need 10000 records. Moreover, we try to merge contiguous parts 
> > > > > into single one, reducing the total number of elements.
> > > > 1. I think it is a problem. Beside, doing so is not elegant: why having 
> > > > a single thing (setting the map type) done in 2 places while we can do 
> > > > it in one place?
> > > > 
> > > > 2. We need to map them element by element because it is not always 
> > > > possible to merge contiguous parts together (there may be no contiguous 
> > > > parts based on the mapper). And merging parts together will be a 
> > > > complex procedure: I don't think it can be done in the runtime because 
> > > > many code is moved into the runtime now. In contrast, the compiler will 
> > > > have better opportunities to merge things.
> > > > 
> > > > Besides, I don't think there is a valid reason that the current scheme 
> > > > is not good. You mentioned it's complex codegen. But it only has less 
> > > > 200 loc here, I don't see why it is complex.
> > > 2. I rather doubt that we will need to map a record with 100000 fields 
> > > element-by-element.
> > > 
> > > I think it would be better to share it with others and listen to their 
> > > opinions. It is better to spend some extra time to provide good design.
> > > You can include your doubts in the description of the new scheme, of 
> > > course.
> > 2. The implementation should work for any case anyway. Besides, I think 
> > mapping a large array should be actually a common case. The users don't 
> > want to map 1000000 elements by themselves, so they will want to use mapper 
> > to let the system do it automatically.
> > 
> > Sure, I can release the discussion to the mailing list. I don't see a 
> > reason to use the new scheme now.
> Lingda, I meant to send the message to the OpenMP telecon list :) Could you 
> forward the same e-mail to Ravi and others, please?
Sure, probably no one in the public mailing list will care about it. I'll send 
it to the bi-weekly meeting list.


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D59474/new/

https://reviews.llvm.org/D59474



_______________________________________________
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits

[PATCH] D59474: [OpenMP 5.0] Codegen support for user-defined mappers

Reply via email to