ABataev added inline comments.
================ Comment at: lib/CodeGen/CGOpenMPRuntime.cpp:8739 +/// // For each component specified by this mapper: +/// if (currentComponent.hasMapper()) +/// (*currentComponent.Mapper())(rt_mapper_handle, arg_base, arg_begin, ---------------- lildmh wrote: > ABataev wrote: > > lildmh wrote: > > > ABataev wrote: > > > > lildmh wrote: > > > > > ABataev wrote: > > > > > > lildmh wrote: > > > > > > > ABataev wrote: > > > > > > > > lildmh wrote: > > > > > > > > > ABataev wrote: > > > > > > > > > > lildmh wrote: > > > > > > > > > > > ABataev wrote: > > > > > > > > > > > > lildmh wrote: > > > > > > > > > > > > > ABataev wrote: > > > > > > > > > > > > > > lildmh wrote: > > > > > > > > > > > > > > > ABataev wrote: > > > > > > > > > > > > > > > > lildmh wrote: > > > > > > > > > > > > > > > > > ABataev wrote: > > > > > > > > > > > > > > > > > > lildmh wrote: > > > > > > > > > > > > > > > > > > > ABataev wrote: > > > > > > > > > > > > > > > > > > > > lildmh wrote: > > > > > > > > > > > > > > > > > > > > > ABataev wrote: > > > > > > > > > > > > > > > > > > > > > > Currently `currentComponent` is > > > > > > > > > > > > > > > > > > > > > > generated by the compiler. But can > > > > > > > > > > > > > > > > > > > > > > we instead pass this data as an > > > > > > > > > > > > > > > > > > > > > > extra parameter to this > > > > > > > > > > > > > > > > > > > > > > `omp_mapper` function. > > > > > > > > > > > > > > > > > > > > > Emm, I think this scheme will be very > > > > > > > > > > > > > > > > > > > > > difficult and inefficient. If we pass > > > > > > > > > > > > > > > > > > > > > components as an argument of > > > > > > > > > > > > > > > > > > > > > `omp_mapper` function, it means that > > > > > > > > > > > > > > > > > > > > > the runtime needs to generate all > > > > > > > > > > > > > > > > > > > > > components related to a map clause. I > > > > > > > > > > > > > > > > > > > > > don't think the runtime is able to do > > > > > > > > > > > > > > > > > > > > > that efficiently. On the other hand, > > > > > > > > > > > > > > > > > > > > > in the current scheme, these > > > > > > > > > > > > > > > > > > > > > components are naturally generated by > > > > > > > > > > > > > > > > > > > > > the compiler, and the runtime only > > > > > > > > > > > > > > > > > > > > > needs to know the base pointer, > > > > > > > > > > > > > > > > > > > > > pointer, type, size. etc. > > > > > > > > > > > > > > > > > > > > With the current scheme, we may end > > > > > > > > > > > > > > > > > > > > with the code blowout. We need to > > > > > > > > > > > > > > > > > > > > generate very similar code for > > > > > > > > > > > > > > > > > > > > different types and variables. The > > > > > > > > > > > > > > > > > > > > worst thing here is that we will be > > > > > > > > > > > > > > > > > > > > unable to optimize this huge amount of > > > > > > > > > > > > > > > > > > > > code because the codegen relies on the > > > > > > > > > > > > > > > > > > > > runtime functions and the code cannot > > > > > > > > > > > > > > > > > > > > be inlined. That's why I would like to > > > > > > > > > > > > > > > > > > > > move as much as possible code to the > > > > > > > > > > > > > > > > > > > > runtime rather than to emit it in the > > > > > > > > > > > > > > > > > > > > compiler. > > > > > > > > > > > > > > > > > > > I understand your concerns. I think this > > > > > > > > > > > > > > > > > > > is the best we can do right now. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > The most worrisome case will be when we > > > > > > > > > > > > > > > > > > > have nested mappers within each other. In > > > > > > > > > > > > > > > > > > > this case, a mapper function will call > > > > > > > > > > > > > > > > > > > another mapper function. We can inline > > > > > > > > > > > > > > > > > > > the inner mapper functions in this > > > > > > > > > > > > > > > > > > > scenario, so that these mapper function > > > > > > > > > > > > > > > > > > > can be properly optimized. As a result, I > > > > > > > > > > > > > > > > > > > think the performance should be fine. > > > > > > > > > > > > > > > > > > Instead, we can use indirect function calls > > > > > > > > > > > > > > > > > > passed in the array to the runtime. Do you > > > > > > > > > > > > > > > > > > think it is going to be slower? In your > > > > > > > > > > > > > > > > > > current scheme, we generate many runtime > > > > > > > > > > > > > > > > > > calls instead. Could you try to estimate > > > > > > > > > > > > > > > > > > the number of calls in cases if we'll call > > > > > > > > > > > > > > > > > > the mappers through the indirect function > > > > > > > > > > > > > > > > > > calls and in your cuurent scheme, where we > > > > > > > > > > > > > > > > > > need to call the runtime functions many > > > > > > > > > > > > > > > > > > times in each particular mapper? > > > > > > > > > > > > > > > > > Hi Alexey, > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Sorry I don't understand your idea. What > > > > > > > > > > > > > > > > > indirect function calls do you propose to be > > > > > > > > > > > > > > > > > passed to the runtime? What are these > > > > > > > > > > > > > > > > > functions supposed to do? > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > The number of function calls will be exactly > > > > > > > > > > > > > > > > > equal to the number of components mapped, no > > > > > > > > > > > > > > > > > matter whether there are nested mappers or > > > > > > > > > > > > > > > > > not. The number of components depend on the > > > > > > > > > > > > > > > > > program. E.g., if we map a large array > > > > > > > > > > > > > > > > > section, then there will be many more > > > > > > > > > > > > > > > > > function calls. > > > > > > > > > > > > > > > > I mean the pointers to the mapper function, > > > > > > > > > > > > > > > > generated by the compiler. In your comment, it > > > > > > > > > > > > > > > > is `c.Mapper()` > > > > > > > > > > > > > > > If we pass nested mapper functions to the > > > > > > > > > > > > > > > runtime, I think it will slow down execution > > > > > > > > > > > > > > > because of the extra level of indirect function > > > > > > > > > > > > > > > calls. E.g., the runtime will call `omp_mapper1`, > > > > > > > > > > > > > > > which calls the runtime back, which calls > > > > > > > > > > > > > > > `omp_mapper2`, .... This can result in a deep > > > > > > > > > > > > > > > call stack. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > I think the current implementation will be more > > > > > > > > > > > > > > > efficient, which doesn't pass nested mappers to > > > > > > > > > > > > > > > the runtime. One call to the outer most mapper > > > > > > > > > > > > > > > function will have all data mapping done. The > > > > > > > > > > > > > > > call stack will be 2 level deep (the first level > > > > > > > > > > > > > > > is the mapper function, and the second level is > > > > > > > > > > > > > > > `__tgt_push_mapper_component`) in this case from > > > > > > > > > > > > > > > the runtime. There are also more compiler > > > > > > > > > > > > > > > optimization space when we inline all nested > > > > > > > > > > > > > > > mapper functions. > > > > > > > > > > > > > > Yes, if we leave it as is. But if instead of the > > > > > > > > > > > > > > bunch unique functions we'll have the common one, > > > > > > > > > > > > > > that accept list if indirect pointers to functions > > > > > > > > > > > > > > additionally, and move it to the runtime library, > > > > > > > > > > > > > > we won't need those 2 functions we have currently. > > > > > > > > > > > > > > We'll have full access to the mapping data vector > > > > > > > > > > > > > > in the runtime library and won't need to use those > > > > > > > > > > > > > > 2 accessors we have currently. Instead, we'll need > > > > > > > > > > > > > > just one runtime functions, which implements the > > > > > > > > > > > > > > whole mapping logic. We still need to call it > > > > > > > > > > > > > > recursively, but I assume the number of calls will > > > > > > > > > > > > > > remain the same as in the current scheme. Did you > > > > > > > > > > > > > > understand the idea? If yes, it would good if you > > > > > > > > > > > > > > coild try to estimate the number of function calls > > > > > > > > > > > > > > in current scheme and in this new scheme to > > > > > > > > > > > > > > estimate possible pros and cons. > > > > > > > > > > > > > Hi Alexey, > > > > > > > > > > > > > > > > > > > > > > > > > > Could you give an example for this scheme? 1) I don't > > > > > > > > > > > > > understand how the mapper function can have full > > > > > > > > > > > > > access to the mapping data vector without providing > > > > > > > > > > > > > these 2 accessors. 2) I don't think it is possible to > > > > > > > > > > > > > have a common function instead of bunch of unique > > > > > > > > > > > > > functions for each mapper declared. > > > > > > > > > > > > Hi Lingda, something like this. > > > > > > > > > > > > ``` > > > > > > > > > > > > void __tgt_mapper(void *base, void *begin, size_t size, > > > > > > > > > > > > int64_t type, auto components[]) { > > > > > > > > > > > > // Allocate space for an array section first. > > > > > > > > > > > > if (size > 1 && !maptype.IsDelete) > > > > > > > > > > > > <push>(base, begin, size*sizeof(Ty), > > > > > > > > > > > > clearToFrom(type)); > > > > > > > > > > > > > > > > > > > > > > > > // Map members. > > > > > > > > > > > > for (unsigned i = 0; i < size; i++) { > > > > > > > > > > > > // For each component specified by this mapper: > > > > > > > > > > > > for (auto c : components) { > > > > > > > > > > > > if (c.hasMapper()) > > > > > > > > > > > > (*c.Mapper())(c.arg_base, c.arg_begin, > > > > > > > > > > > > c.arg_size, c.arg_type); > > > > > > > > > > > > else > > > > > > > > > > > > <push>(c.arg_base, c.arg_begin, c.arg_size, > > > > > > > > > > > > c.arg_type); > > > > > > > > > > > > } > > > > > > > > > > > > } > > > > > > > > > > > > // Delete the array section. > > > > > > > > > > > > if (size > 1 && maptype.IsDelete) > > > > > > > > > > > > <push>(base, begin, size*sizeof(Ty), > > > > > > > > > > > > clearToFrom(type)); > > > > > > > > > > > > } > > > > > > > > > > > > > > > > > > > > > > > > void <type>.mapper(void *base, void *begin, size_t > > > > > > > > > > > > size, int64_t type) { > > > > > > > > > > > > auto sub_components[] = {...}; > > > > > > > > > > > > __tgt_mapper(base, begin, size, type, sub_components); > > > > > > > > > > > > } > > > > > > > > > > > > ``` > > > > > > > > > > > > > > > > > > > > > > > Hi Alexey, > > > > > > > > > > > > > > > > > > > > > > I don't think this scheme is more efficient than the > > > > > > > > > > > current scheme. My reasons are: > > > > > > > > > > > > > > > > > > > > > > 1) Most code here is essentially to generate > > > > > > > > > > > `components`, i.e., we need to generate `c.arg_base, > > > > > > > > > > > c.arg_begin, c.arg_size, c.arg_type` for each `c` in > > > > > > > > > > > `components`, so there will still be a lot of code in > > > > > > > > > > > `<type>.mapper`. It will not reduce the mapper function > > > > > > > > > > > code, i.e., we will still have a bunch of unique mapper > > > > > > > > > > > functions. > > > > > > > > > > > > > > > > > > > > > > 2) This scheme will prevent a lot of compiler > > > > > > > > > > > optimization from happening. In reality, a lot of > > > > > > > > > > > computation should be redundant. E.g., for two components > > > > > > > > > > > `c1` and `c2`, `c1`'s base may be the same as `c2`'s > > > > > > > > > > > begin, so the compiler will be able to eliminate these > > > > > > > > > > > reduction computation, especially when we inline all > > > > > > > > > > > nested mapper functions together. If we move these > > > > > > > > > > > computation into the runtime, the compiler will not be > > > > > > > > > > > able to do such optimization. > > > > > > > > > > > > > > > > > > > > > > 3) In terms of the number of `push` function calls, this > > > > > > > > > > > scheme has the exact same number of calls as the current > > > > > > > > > > > scheme, so I don't think this scheme can bring > > > > > > > > > > > performance benefits. The scheme should perform worse > > > > > > > > > > > than the current scheme, because it reduces the > > > > > > > > > > > opportunities of compiler optimization as mentioned above. > > > > > > > > > > Hi Lingda, I'm trying to simplify the code generated by > > > > > > > > > > clang and avoid some unnecessary code duplications. If the > > > > > > > > > > complexity of this scheme is the same as proposed by you, I > > > > > > > > > > would prefer to use this scheme unless there are some other > > > > > > > > > > opinions. > > > > > > > > > > 1. It is not a problem. This code is unique and is not > > > > > > > > > > duplicated in the different mappers. > > > > > > > > > > 2. Inlining is no solution here. We still generate to much > > > > > > > > > > code, which is almost the same in many cases and it will > > > > > > > > > > lead to very ineffective codegen because we still end up > > > > > > > > > > with a lot of almost the same code. This also might lead to > > > > > > > > > > poor performance. > > > > > > > > > > 3. Yes, the number of pushes is always the same, in all > > > > > > > > > > possible schemes. It would be good to compare somehow the > > > > > > > > > > performance of both schemes, at least preliminary. > > > > > > > > > > > > > > > > > > > > Also, this solution reduces the number of required runtime > > > > > > > > > > functions, instead of 2 we need just 1 and, thus, we need > > > > > > > > > > to make fewer runtime functions calls. > > > > > > > > > > > > > > > > > > > > I think it would better to propose this scheme as an > > > > > > > > > > alternate design and discuss it in the OpenMP telecon. What > > > > > > > > > > do you think? Or we can try to discuss it in the offline > > > > > > > > > > mode via the e-mail with other members. > > > > > > > > > > I'm not trying to convince you to implement this scheme > > > > > > > > > > right now, but it would be good to discuss it. Maybe it > > > > > > > > > > will lead to some better ideas from others? > > > > > > > > > Hi Alexey, > > > > > > > > > > > > > > > > > > I still prefer the current scheme, because: > > > > > > > > > 1) I don't like recursive mapper calls, which goes back to my > > > > > > > > > original scheme a little bit. I really think inlining can > > > > > > > > > make a big difference when we have nested mappers. These > > > > > > > > > compiler optimizations are the keys to have better > > > > > > > > > performance for mappers. > > > > > > > > > 2) I don't think the codegen here is inefficient. Yes there > > > > > > > > > is duplicated code across different mapper functions, but why > > > > > > > > > that will lead to poor performance? > > > > > > > > > 3) Although we have 2 runtime functions now, the > > > > > > > > > `__tgt_mapper_num_components` is called only once per mapper. > > > > > > > > > It should have very negligible performance impact. > > > > > > > > > > > > > > > > > > But if you have a different option, we can discuss it next > > > > > > > > > time in the meeting. I do have a time constraint to work on > > > > > > > > > the mapper implementation. I'll no longer work in this > > > > > > > > > project starting this September, and I have about 30% of my > > > > > > > > > time working on it until then. > > > > > > > > Lingda, > > > > > > > > 1. We have recursive (actually, not recursive, because you > > > > > > > > cannot use types recursively) mappers calls anyway, it is > > > > > > > > nature of struсtures/classes. > > > > > > > > 2. We have a lot of similar code. And I'm not sure that it can > > > > > > > > be optimized out. > > > > > > > > 3. Yes, but it means that we have n extra runtime calls, where > > > > > > > > n is the number of branches in the structure/class tree. > > > > > > > > > > > > > > > > I see :(. I understand your concern. In this case, we could try > > > > > > > > to discuss it offline, in the mailing list, to make it a little > > > > > > > > bit faster. We just need to hear other opinions on this matter, > > > > > > > > maybe there are some other pros and cons for these schemes. > > > > > > > Hi Alexey, > > > > > > > > > > > > > > Sure, let's discuss this in the mailing list. I'll summarize it > > > > > > > and send it to the mailing list later. > > > > > > > > > > > > > > > We have recursive (actually, not recursive, because you cannot > > > > > > > > use types recursively) mappers calls anyway, it is nature of > > > > > > > > struсtures/classes. > > > > > > > We won't have recursive calls with inlining. > > > > > > > > > > > > > > > We have a lot of similar code. And I'm not sure that it can be > > > > > > > > optimized out. > > > > > > > I think it's even harder to optimized these code out when we move > > > > > > > them into the runtime. > > > > > > > > > > > > > > > Yes, but it means that we have n extra runtime calls, where n > > > > > > > > is the number of branches in the structure/class tree. > > > > > > > I don't quite understand. It's still equal to the number of > > > > > > > mappers in any case. > > > > > > > Sure, let's discuss this in the mailing list. I'll summarize it > > > > > > > and send it to the mailing list later. > > > > > > > > > > > > Good, thanks! > > > > > > > > > > > > > We won't have recursive calls with inlining. > > > > > > > > > > > > We won't have recursive calls anyway (recursive types are not > > > > > > allowed). Plus, I'm not sure that inlining is the best option here. > > > > > > We have a lot of code for each mapper and I'm not sure that the > > > > > > optimizer will be able to squash it effectively. > > > > > > > > > > > > > I think it's even harder to optimized these code out when we move > > > > > > > them into the runtime. > > > > > > > > > > > > Definitely not, unless we use LTO or inlined runtime. > > > > > > > > > > > > We won't have recursive calls anyway (recursive types are not > > > > > > allowed). Plus, I'm not sure that inlining is the best option here. > > > > > > We have a lot of code for each mapper and I'm not sure that the > > > > > > optimizer will be able to squash it effectively. > > > > > Sorry I should not say recursive calls. Here it needs to > > > > > "recursively" call other mapper functions in case of nested mappers, > > > > > but we don't need it in case of inlining. > > > > > > > > > > > Definitely not, unless we use LTO or inlined runtime. > > > > > But you are proposing to move many code to the runtime here, right? > > > > > That doesn't make sense to me. > > > > > > > > > > But you are proposing to move much code to the runtime here, right? > > > > > That doesn't make sense to me. > > > > > > > > I'm just not sure that there going be significant problems with the > > > > performance because of that. And it significantly simplifies codegen in > > > > the compiler and moves the common part into a single function. > > > > > > > > Plus, if in future we'll need to modify this functionality for some > > > > reason, 2 different versions of the compiler will produce incompatible > > > > code. With my scheme, you still can use old runtime and have the same > > > > functionality as the old compiler and the new one. > > > Hi Alexey, > > > > > > I think more carefully about your scheme, and I don't think we can solve > > > the 2 problems below with this scheme: > > > > > > 1. In the example you gave before, the compiler needs to generate all map > > > types and pass them to `__tgt_mapper` through `sub_components`. But in > > > this case, the compiler won't be able to generate the correct `MEMBER_OF` > > > field in map type. As a result, the runtime has to fix it using the > > > mechanism we already have here: `__tgt_mapper_num_components`. This not > > > only increases complexity, but also, it means the runtime needs further > > > manipulation of the map type, which creates locality issues. While in the > > > current scheme, the map type is generated by compiler once, so the data > > > locality will be very good in this case. > > > > > > 2. `sub_components` includes all components that should be mapped. If we > > > are mapping an array, this means we need to map many components, which > > > will need to allocate memory for `sub_components` in the heap. This > > > creates further memory management burden and is not an efficient way to > > > use memory. > > > > > > Based on these reasons, I think the current scheme is still more > > > preferable. > > Hi Lingda, > > 1. Actually, I thought that the runtime function `__tgt_mapper` will do > > this, not the compiler. > > 2. Why do we need to allocate it on the heap? We can allocate it on the > > stack. > 1. In your scheme, both compiler and `__tgt_mapper` need to do this: the > compiler will generate other parts in the type, e.g., `TO` `FROM` bits and > basic `MEMBER_OF` bits. Then `__tgt_mapper` needs to modify the `MEMBER_OF` > bits later. Since there are a lot of other memory accesses between the > compiler and `__tgt_mapper` operations to the same map type, it's very likely > the map type will not stay in the cache, which causes locality problem. > > 2. Assume we are mapping an array with 1000000 elements, and each elements > have 5 components. For each component, we need `base, begin_ptr, size, type, > mapper`, which are 40 bytes. Together, we will need 1000000 * 5 * 40 = 200MB > of space for this array, which stack cannnot handle. 1. I don't think it is a big problem, this part of the code is executed on the CPU and I don't think it will lead to significant overhead. 2. When we map an array, we do not map it element-by-element, so we don't need 10000 records. Moreover, we try to merge contiguous parts into single one, reducing the total number of elements. CHANGES SINCE LAST ACTION https://reviews.llvm.org/D59474/new/ https://reviews.llvm.org/D59474 _______________________________________________ cfe-commits mailing list cfe-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits