ABataev added inline comments.
================ Comment at: lib/CodeGen/CGOpenMPRuntime.cpp:8739 +/// // For each component specified by this mapper: +/// if (currentComponent.hasMapper()) +/// (*currentComponent.Mapper())(rt_mapper_handle, arg_base, arg_begin, ---------------- lildmh wrote: > ABataev wrote: > > lildmh wrote: > > > ABataev wrote: > > > > lildmh wrote: > > > > > ABataev wrote: > > > > > > lildmh wrote: > > > > > > > ABataev wrote: > > > > > > > > lildmh wrote: > > > > > > > > > ABataev wrote: > > > > > > > > > > lildmh wrote: > > > > > > > > > > > ABataev wrote: > > > > > > > > > > > > lildmh wrote: > > > > > > > > > > > > > ABataev wrote: > > > > > > > > > > > > > > lildmh wrote: > > > > > > > > > > > > > > > ABataev wrote: > > > > > > > > > > > > > > > > lildmh wrote: > > > > > > > > > > > > > > > > > ABataev wrote: > > > > > > > > > > > > > > > > > > lildmh wrote: > > > > > > > > > > > > > > > > > > > ABataev wrote: > > > > > > > > > > > > > > > > > > > > Currently `currentComponent` is > > > > > > > > > > > > > > > > > > > > generated by the compiler. But can we > > > > > > > > > > > > > > > > > > > > instead pass this data as an extra > > > > > > > > > > > > > > > > > > > > parameter to this `omp_mapper` function. > > > > > > > > > > > > > > > > > > > Emm, I think this scheme will be very > > > > > > > > > > > > > > > > > > > difficult and inefficient. If we pass > > > > > > > > > > > > > > > > > > > components as an argument of `omp_mapper` > > > > > > > > > > > > > > > > > > > function, it means that the runtime needs > > > > > > > > > > > > > > > > > > > to generate all components related to a > > > > > > > > > > > > > > > > > > > map clause. I don't think the runtime is > > > > > > > > > > > > > > > > > > > able to do that efficiently. On the other > > > > > > > > > > > > > > > > > > > hand, in the current scheme, these > > > > > > > > > > > > > > > > > > > components are naturally generated by the > > > > > > > > > > > > > > > > > > > compiler, and the runtime only needs to > > > > > > > > > > > > > > > > > > > know the base pointer, pointer, type, > > > > > > > > > > > > > > > > > > > size. etc. > > > > > > > > > > > > > > > > > > With the current scheme, we may end with > > > > > > > > > > > > > > > > > > the code blowout. We need to generate very > > > > > > > > > > > > > > > > > > similar code for different types and > > > > > > > > > > > > > > > > > > variables. The worst thing here is that we > > > > > > > > > > > > > > > > > > will be unable to optimize this huge amount > > > > > > > > > > > > > > > > > > of code because the codegen relies on the > > > > > > > > > > > > > > > > > > runtime functions and the code cannot be > > > > > > > > > > > > > > > > > > inlined. That's why I would like to move as > > > > > > > > > > > > > > > > > > much as possible code to the runtime rather > > > > > > > > > > > > > > > > > > than to emit it in the compiler. > > > > > > > > > > > > > > > > > I understand your concerns. I think this is > > > > > > > > > > > > > > > > > the best we can do right now. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > The most worrisome case will be when we have > > > > > > > > > > > > > > > > > nested mappers within each other. In this > > > > > > > > > > > > > > > > > case, a mapper function will call another > > > > > > > > > > > > > > > > > mapper function. We can inline the inner > > > > > > > > > > > > > > > > > mapper functions in this scenario, so that > > > > > > > > > > > > > > > > > these mapper function can be properly > > > > > > > > > > > > > > > > > optimized. As a result, I think the > > > > > > > > > > > > > > > > > performance should be fine. > > > > > > > > > > > > > > > > Instead, we can use indirect function calls > > > > > > > > > > > > > > > > passed in the array to the runtime. Do you > > > > > > > > > > > > > > > > think it is going to be slower? In your current > > > > > > > > > > > > > > > > scheme, we generate many runtime calls instead. > > > > > > > > > > > > > > > > Could you try to estimate the number of calls > > > > > > > > > > > > > > > > in cases if we'll call the mappers through the > > > > > > > > > > > > > > > > indirect function calls and in your cuurent > > > > > > > > > > > > > > > > scheme, where we need to call the runtime > > > > > > > > > > > > > > > > functions many times in each particular mapper? > > > > > > > > > > > > > > > Hi Alexey, > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Sorry I don't understand your idea. What indirect > > > > > > > > > > > > > > > function calls do you propose to be passed to the > > > > > > > > > > > > > > > runtime? What are these functions supposed to do? > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > The number of function calls will be exactly > > > > > > > > > > > > > > > equal to the number of components mapped, no > > > > > > > > > > > > > > > matter whether there are nested mappers or not. > > > > > > > > > > > > > > > The number of components depend on the program. > > > > > > > > > > > > > > > E.g., if we map a large array section, then there > > > > > > > > > > > > > > > will be many more function calls. > > > > > > > > > > > > > > I mean the pointers to the mapper function, > > > > > > > > > > > > > > generated by the compiler. In your comment, it is > > > > > > > > > > > > > > `c.Mapper()` > > > > > > > > > > > > > If we pass nested mapper functions to the runtime, I > > > > > > > > > > > > > think it will slow down execution because of the > > > > > > > > > > > > > extra level of indirect function calls. E.g., the > > > > > > > > > > > > > runtime will call `omp_mapper1`, which calls the > > > > > > > > > > > > > runtime back, which calls `omp_mapper2`, .... This > > > > > > > > > > > > > can result in a deep call stack. > > > > > > > > > > > > > > > > > > > > > > > > > > I think the current implementation will be more > > > > > > > > > > > > > efficient, which doesn't pass nested mappers to the > > > > > > > > > > > > > runtime. One call to the outer most mapper function > > > > > > > > > > > > > will have all data mapping done. The call stack will > > > > > > > > > > > > > be 2 level deep (the first level is the mapper > > > > > > > > > > > > > function, and the second level is > > > > > > > > > > > > > `__tgt_push_mapper_component`) in this case from the > > > > > > > > > > > > > runtime. There are also more compiler optimization > > > > > > > > > > > > > space when we inline all nested mapper functions. > > > > > > > > > > > > Yes, if we leave it as is. But if instead of the bunch > > > > > > > > > > > > unique functions we'll have the common one, that accept > > > > > > > > > > > > list if indirect pointers to functions additionally, > > > > > > > > > > > > and move it to the runtime library, we won't need those > > > > > > > > > > > > 2 functions we have currently. We'll have full access > > > > > > > > > > > > to the mapping data vector in the runtime library and > > > > > > > > > > > > won't need to use those 2 accessors we have currently. > > > > > > > > > > > > Instead, we'll need just one runtime functions, which > > > > > > > > > > > > implements the whole mapping logic. We still need to > > > > > > > > > > > > call it recursively, but I assume the number of calls > > > > > > > > > > > > will remain the same as in the current scheme. Did you > > > > > > > > > > > > understand the idea? If yes, it would good if you coild > > > > > > > > > > > > try to estimate the number of function calls in current > > > > > > > > > > > > scheme and in this new scheme to estimate possible pros > > > > > > > > > > > > and cons. > > > > > > > > > > > Hi Alexey, > > > > > > > > > > > > > > > > > > > > > > Could you give an example for this scheme? 1) I don't > > > > > > > > > > > understand how the mapper function can have full access > > > > > > > > > > > to the mapping data vector without providing these 2 > > > > > > > > > > > accessors. 2) I don't think it is possible to have a > > > > > > > > > > > common function instead of bunch of unique functions for > > > > > > > > > > > each mapper declared. > > > > > > > > > > Hi Lingda, something like this. > > > > > > > > > > ``` > > > > > > > > > > void __tgt_mapper(void *base, void *begin, size_t size, > > > > > > > > > > int64_t type, auto components[]) { > > > > > > > > > > // Allocate space for an array section first. > > > > > > > > > > if (size > 1 && !maptype.IsDelete) > > > > > > > > > > <push>(base, begin, size*sizeof(Ty), > > > > > > > > > > clearToFrom(type)); > > > > > > > > > > > > > > > > > > > > // Map members. > > > > > > > > > > for (unsigned i = 0; i < size; i++) { > > > > > > > > > > // For each component specified by this mapper: > > > > > > > > > > for (auto c : components) { > > > > > > > > > > if (c.hasMapper()) > > > > > > > > > > (*c.Mapper())(c.arg_base, c.arg_begin, c.arg_size, > > > > > > > > > > c.arg_type); > > > > > > > > > > else > > > > > > > > > > <push>(c.arg_base, c.arg_begin, c.arg_size, > > > > > > > > > > c.arg_type); > > > > > > > > > > } > > > > > > > > > > } > > > > > > > > > > // Delete the array section. > > > > > > > > > > if (size > 1 && maptype.IsDelete) > > > > > > > > > > <push>(base, begin, size*sizeof(Ty), clearToFrom(type)); > > > > > > > > > > } > > > > > > > > > > > > > > > > > > > > void <type>.mapper(void *base, void *begin, size_t size, > > > > > > > > > > int64_t type) { > > > > > > > > > > auto sub_components[] = {...}; > > > > > > > > > > __tgt_mapper(base, begin, size, type, sub_components); > > > > > > > > > > } > > > > > > > > > > ``` > > > > > > > > > > > > > > > > > > > Hi Alexey, > > > > > > > > > > > > > > > > > > I don't think this scheme is more efficient than the current > > > > > > > > > scheme. My reasons are: > > > > > > > > > > > > > > > > > > 1) Most code here is essentially to generate `components`, > > > > > > > > > i.e., we need to generate `c.arg_base, c.arg_begin, > > > > > > > > > c.arg_size, c.arg_type` for each `c` in `components`, so > > > > > > > > > there will still be a lot of code in `<type>.mapper`. It will > > > > > > > > > not reduce the mapper function code, i.e., we will still have > > > > > > > > > a bunch of unique mapper functions. > > > > > > > > > > > > > > > > > > 2) This scheme will prevent a lot of compiler optimization > > > > > > > > > from happening. In reality, a lot of computation should be > > > > > > > > > redundant. E.g., for two components `c1` and `c2`, `c1`'s > > > > > > > > > base may be the same as `c2`'s begin, so the compiler will be > > > > > > > > > able to eliminate these reduction computation, especially > > > > > > > > > when we inline all nested mapper functions together. If we > > > > > > > > > move these computation into the runtime, the compiler will > > > > > > > > > not be able to do such optimization. > > > > > > > > > > > > > > > > > > 3) In terms of the number of `push` function calls, this > > > > > > > > > scheme has the exact same number of calls as the current > > > > > > > > > scheme, so I don't think this scheme can bring performance > > > > > > > > > benefits. The scheme should perform worse than the current > > > > > > > > > scheme, because it reduces the opportunities of compiler > > > > > > > > > optimization as mentioned above. > > > > > > > > Hi Lingda, I'm trying to simplify the code generated by clang > > > > > > > > and avoid some unnecessary code duplications. If the complexity > > > > > > > > of this scheme is the same as proposed by you, I would prefer > > > > > > > > to use this scheme unless there are some other opinions. > > > > > > > > 1. It is not a problem. This code is unique and is not > > > > > > > > duplicated in the different mappers. > > > > > > > > 2. Inlining is no solution here. We still generate to much > > > > > > > > code, which is almost the same in many cases and it will lead > > > > > > > > to very ineffective codegen because we still end up with a lot > > > > > > > > of almost the same code. This also might lead to poor > > > > > > > > performance. > > > > > > > > 3. Yes, the number of pushes is always the same, in all > > > > > > > > possible schemes. It would be good to compare somehow the > > > > > > > > performance of both schemes, at least preliminary. > > > > > > > > > > > > > > > > Also, this solution reduces the number of required runtime > > > > > > > > functions, instead of 2 we need just 1 and, thus, we need to > > > > > > > > make fewer runtime functions calls. > > > > > > > > > > > > > > > > I think it would better to propose this scheme as an alternate > > > > > > > > design and discuss it in the OpenMP telecon. What do you think? > > > > > > > > Or we can try to discuss it in the offline mode via the e-mail > > > > > > > > with other members. > > > > > > > > I'm not trying to convince you to implement this scheme right > > > > > > > > now, but it would be good to discuss it. Maybe it will lead to > > > > > > > > some better ideas from others? > > > > > > > Hi Alexey, > > > > > > > > > > > > > > I still prefer the current scheme, because: > > > > > > > 1) I don't like recursive mapper calls, which goes back to my > > > > > > > original scheme a little bit. I really think inlining can make a > > > > > > > big difference when we have nested mappers. These compiler > > > > > > > optimizations are the keys to have better performance for mappers. > > > > > > > 2) I don't think the codegen here is inefficient. Yes there is > > > > > > > duplicated code across different mapper functions, but why that > > > > > > > will lead to poor performance? > > > > > > > 3) Although we have 2 runtime functions now, the > > > > > > > `__tgt_mapper_num_components` is called only once per mapper. It > > > > > > > should have very negligible performance impact. > > > > > > > > > > > > > > But if you have a different option, we can discuss it next time > > > > > > > in the meeting. I do have a time constraint to work on the mapper > > > > > > > implementation. I'll no longer work in this project starting this > > > > > > > September, and I have about 30% of my time working on it until > > > > > > > then. > > > > > > Lingda, > > > > > > 1. We have recursive (actually, not recursive, because you cannot > > > > > > use types recursively) mappers calls anyway, it is nature of > > > > > > struсtures/classes. > > > > > > 2. We have a lot of similar code. And I'm not sure that it can be > > > > > > optimized out. > > > > > > 3. Yes, but it means that we have n extra runtime calls, where n is > > > > > > the number of branches in the structure/class tree. > > > > > > > > > > > > I see :(. I understand your concern. In this case, we could try to > > > > > > discuss it offline, in the mailing list, to make it a little bit > > > > > > faster. We just need to hear other opinions on this matter, maybe > > > > > > there are some other pros and cons for these schemes. > > > > > Hi Alexey, > > > > > > > > > > Sure, let's discuss this in the mailing list. I'll summarize it and > > > > > send it to the mailing list later. > > > > > > > > > > > We have recursive (actually, not recursive, because you cannot use > > > > > > types recursively) mappers calls anyway, it is nature of > > > > > > struсtures/classes. > > > > > We won't have recursive calls with inlining. > > > > > > > > > > > We have a lot of similar code. And I'm not sure that it can be > > > > > > optimized out. > > > > > I think it's even harder to optimized these code out when we move > > > > > them into the runtime. > > > > > > > > > > > Yes, but it means that we have n extra runtime calls, where n is > > > > > > the number of branches in the structure/class tree. > > > > > I don't quite understand. It's still equal to the number of mappers > > > > > in any case. > > > > > Sure, let's discuss this in the mailing list. I'll summarize it and > > > > > send it to the mailing list later. > > > > > > > > Good, thanks! > > > > > > > > > We won't have recursive calls with inlining. > > > > > > > > We won't have recursive calls anyway (recursive types are not allowed). > > > > Plus, I'm not sure that inlining is the best option here. We have a lot > > > > of code for each mapper and I'm not sure that the optimizer will be > > > > able to squash it effectively. > > > > > > > > > I think it's even harder to optimized these code out when we move > > > > > them into the runtime. > > > > > > > > Definitely not, unless we use LTO or inlined runtime. > > > > > > > > We won't have recursive calls anyway (recursive types are not allowed). > > > > Plus, I'm not sure that inlining is the best option here. We have a lot > > > > of code for each mapper and I'm not sure that the optimizer will be > > > > able to squash it effectively. > > > Sorry I should not say recursive calls. Here it needs to "recursively" > > > call other mapper functions in case of nested mappers, but we don't need > > > it in case of inlining. > > > > > > > Definitely not, unless we use LTO or inlined runtime. > > > But you are proposing to move many code to the runtime here, right? That > > > doesn't make sense to me. > > > > > > But you are proposing to move much code to the runtime here, right? That > > > doesn't make sense to me. > > > > I'm just not sure that there going be significant problems with the > > performance because of that. And it significantly simplifies codegen in the > > compiler and moves the common part into a single function. > > > > Plus, if in future we'll need to modify this functionality for some reason, > > 2 different versions of the compiler will produce incompatible code. With > > my scheme, you still can use old runtime and have the same functionality as > > the old compiler and the new one. > Hi Alexey, > > I think more carefully about your scheme, and I don't think we can solve the > 2 problems below with this scheme: > > 1. In the example you gave before, the compiler needs to generate all map > types and pass them to `__tgt_mapper` through `sub_components`. But in this > case, the compiler won't be able to generate the correct `MEMBER_OF` field in > map type. As a result, the runtime has to fix it using the mechanism we > already have here: `__tgt_mapper_num_components`. This not only increases > complexity, but also, it means the runtime needs further manipulation of the > map type, which creates locality issues. While in the current scheme, the map > type is generated by compiler once, so the data locality will be very good in > this case. > > 2. `sub_components` includes all components that should be mapped. If we are > mapping an array, this means we need to map many components, which will need > to allocate memory for `sub_components` in the heap. This creates further > memory management burden and is not an efficient way to use memory. > > Based on these reasons, I think the current scheme is still more preferable. Hi Lingda, 1. Actually, I thought that the runtime function `__tgt_mapper` will do this, not the compiler. 2. Why do we need to allocate it on the heap? We can allocate it on the stack. CHANGES SINCE LAST ACTION https://reviews.llvm.org/D59474/new/ https://reviews.llvm.org/D59474 _______________________________________________ cfe-commits mailing list cfe-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits