ABataev added inline comments.
================ Comment at: lib/CodeGen/CGOpenMPRuntime.cpp:8739 +/// // For each component specified by this mapper: +/// if (currentComponent.hasMapper()) +/// (*currentComponent.Mapper())(rt_mapper_handle, arg_base, arg_begin, ---------------- lildmh wrote: > ABataev wrote: > > lildmh wrote: > > > ABataev wrote: > > > > lildmh wrote: > > > > > ABataev wrote: > > > > > > lildmh wrote: > > > > > > > ABataev wrote: > > > > > > > > lildmh wrote: > > > > > > > > > ABataev wrote: > > > > > > > > > > lildmh wrote: > > > > > > > > > > > ABataev wrote: > > > > > > > > > > > > lildmh wrote: > > > > > > > > > > > > > ABataev wrote: > > > > > > > > > > > > > > lildmh wrote: > > > > > > > > > > > > > > > ABataev wrote: > > > > > > > > > > > > > > > > lildmh wrote: > > > > > > > > > > > > > > > > > ABataev wrote: > > > > > > > > > > > > > > > > > > lildmh wrote: > > > > > > > > > > > > > > > > > > > ABataev wrote: > > > > > > > > > > > > > > > > > > > > lildmh wrote: > > > > > > > > > > > > > > > > > > > > > ABataev wrote: > > > > > > > > > > > > > > > > > > > > > > lildmh wrote: > > > > > > > > > > > > > > > > > > > > > > > ABataev wrote: > > > > > > > > > > > > > > > > > > > > > > > > Currently `currentComponent` is > > > > > > > > > > > > > > > > > > > > > > > > generated by the compiler. But > > > > > > > > > > > > > > > > > > > > > > > > can we instead pass this data > > > > > > > > > > > > > > > > > > > > > > > > as an extra parameter to this > > > > > > > > > > > > > > > > > > > > > > > > `omp_mapper` function. > > > > > > > > > > > > > > > > > > > > > > > Emm, I think this scheme will be > > > > > > > > > > > > > > > > > > > > > > > very difficult and inefficient. > > > > > > > > > > > > > > > > > > > > > > > If we pass components as an > > > > > > > > > > > > > > > > > > > > > > > argument of `omp_mapper` > > > > > > > > > > > > > > > > > > > > > > > function, it means that the > > > > > > > > > > > > > > > > > > > > > > > runtime needs to generate all > > > > > > > > > > > > > > > > > > > > > > > components related to a map > > > > > > > > > > > > > > > > > > > > > > > clause. I don't think the runtime > > > > > > > > > > > > > > > > > > > > > > > is able to do that efficiently. > > > > > > > > > > > > > > > > > > > > > > > On the other hand, in the current > > > > > > > > > > > > > > > > > > > > > > > scheme, these components are > > > > > > > > > > > > > > > > > > > > > > > naturally generated by the > > > > > > > > > > > > > > > > > > > > > > > compiler, and the runtime only > > > > > > > > > > > > > > > > > > > > > > > needs to know the base pointer, > > > > > > > > > > > > > > > > > > > > > > > pointer, type, size. etc. > > > > > > > > > > > > > > > > > > > > > > With the current scheme, we may end > > > > > > > > > > > > > > > > > > > > > > with the code blowout. We need to > > > > > > > > > > > > > > > > > > > > > > generate very similar code for > > > > > > > > > > > > > > > > > > > > > > different types and variables. The > > > > > > > > > > > > > > > > > > > > > > worst thing here is that we will be > > > > > > > > > > > > > > > > > > > > > > unable to optimize this huge amount > > > > > > > > > > > > > > > > > > > > > > of code because the codegen relies > > > > > > > > > > > > > > > > > > > > > > on the runtime functions and the > > > > > > > > > > > > > > > > > > > > > > code cannot be inlined. That's why > > > > > > > > > > > > > > > > > > > > > > I would like to move as much as > > > > > > > > > > > > > > > > > > > > > > possible code to the runtime rather > > > > > > > > > > > > > > > > > > > > > > than to emit it in the compiler. > > > > > > > > > > > > > > > > > > > > > I understand your concerns. I think > > > > > > > > > > > > > > > > > > > > > this is the best we can do right now. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > The most worrisome case will be when > > > > > > > > > > > > > > > > > > > > > we have nested mappers within each > > > > > > > > > > > > > > > > > > > > > other. In this case, a mapper > > > > > > > > > > > > > > > > > > > > > function will call another mapper > > > > > > > > > > > > > > > > > > > > > function. We can inline the inner > > > > > > > > > > > > > > > > > > > > > mapper functions in this scenario, so > > > > > > > > > > > > > > > > > > > > > that these mapper function can be > > > > > > > > > > > > > > > > > > > > > properly optimized. As a result, I > > > > > > > > > > > > > > > > > > > > > think the performance should be fine. > > > > > > > > > > > > > > > > > > > > Instead, we can use indirect function > > > > > > > > > > > > > > > > > > > > calls passed in the array to the > > > > > > > > > > > > > > > > > > > > runtime. Do you think it is going to be > > > > > > > > > > > > > > > > > > > > slower? In your current scheme, we > > > > > > > > > > > > > > > > > > > > generate many runtime calls instead. > > > > > > > > > > > > > > > > > > > > Could you try to estimate the number of > > > > > > > > > > > > > > > > > > > > calls in cases if we'll call the > > > > > > > > > > > > > > > > > > > > mappers through the indirect function > > > > > > > > > > > > > > > > > > > > calls and in your cuurent scheme, where > > > > > > > > > > > > > > > > > > > > we need to call the runtime functions > > > > > > > > > > > > > > > > > > > > many times in each particular mapper? > > > > > > > > > > > > > > > > > > > Hi Alexey, > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Sorry I don't understand your idea. What > > > > > > > > > > > > > > > > > > > indirect function calls do you propose to > > > > > > > > > > > > > > > > > > > be passed to the runtime? What are these > > > > > > > > > > > > > > > > > > > functions supposed to do? > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > The number of function calls will be > > > > > > > > > > > > > > > > > > > exactly equal to the number of components > > > > > > > > > > > > > > > > > > > mapped, no matter whether there are > > > > > > > > > > > > > > > > > > > nested mappers or not. The number of > > > > > > > > > > > > > > > > > > > components depend on the program. E.g., > > > > > > > > > > > > > > > > > > > if we map a large array section, then > > > > > > > > > > > > > > > > > > > there will be many more function calls. > > > > > > > > > > > > > > > > > > I mean the pointers to the mapper function, > > > > > > > > > > > > > > > > > > generated by the compiler. In your comment, > > > > > > > > > > > > > > > > > > it is `c.Mapper()` > > > > > > > > > > > > > > > > > If we pass nested mapper functions to the > > > > > > > > > > > > > > > > > runtime, I think it will slow down execution > > > > > > > > > > > > > > > > > because of the extra level of indirect > > > > > > > > > > > > > > > > > function calls. E.g., the runtime will call > > > > > > > > > > > > > > > > > `omp_mapper1`, which calls the runtime back, > > > > > > > > > > > > > > > > > which calls `omp_mapper2`, .... This can > > > > > > > > > > > > > > > > > result in a deep call stack. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > I think the current implementation will be > > > > > > > > > > > > > > > > > more efficient, which doesn't pass nested > > > > > > > > > > > > > > > > > mappers to the runtime. One call to the outer > > > > > > > > > > > > > > > > > most mapper function will have all data > > > > > > > > > > > > > > > > > mapping done. The call stack will be 2 level > > > > > > > > > > > > > > > > > deep (the first level is the mapper function, > > > > > > > > > > > > > > > > > and the second level is > > > > > > > > > > > > > > > > > `__tgt_push_mapper_component`) in this case > > > > > > > > > > > > > > > > > from the runtime. There are also more > > > > > > > > > > > > > > > > > compiler optimization space when we inline > > > > > > > > > > > > > > > > > all nested mapper functions. > > > > > > > > > > > > > > > > Yes, if we leave it as is. But if instead of > > > > > > > > > > > > > > > > the bunch unique functions we'll have the > > > > > > > > > > > > > > > > common one, that accept list if indirect > > > > > > > > > > > > > > > > pointers to functions additionally, and move it > > > > > > > > > > > > > > > > to the runtime library, we won't need those 2 > > > > > > > > > > > > > > > > functions we have currently. We'll have full > > > > > > > > > > > > > > > > access to the mapping data vector in the > > > > > > > > > > > > > > > > runtime library and won't need to use those 2 > > > > > > > > > > > > > > > > accessors we have currently. Instead, we'll > > > > > > > > > > > > > > > > need just one runtime functions, which > > > > > > > > > > > > > > > > implements the whole mapping logic. We still > > > > > > > > > > > > > > > > need to call it recursively, but I assume the > > > > > > > > > > > > > > > > number of calls will remain the same as in the > > > > > > > > > > > > > > > > current scheme. Did you understand the idea? If > > > > > > > > > > > > > > > > yes, it would good if you coild try to estimate > > > > > > > > > > > > > > > > the number of function calls in current scheme > > > > > > > > > > > > > > > > and in this new scheme to estimate possible > > > > > > > > > > > > > > > > pros and cons. > > > > > > > > > > > > > > > Hi Alexey, > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Could you give an example for this scheme? 1) I > > > > > > > > > > > > > > > don't understand how the mapper function can have > > > > > > > > > > > > > > > full access to the mapping data vector without > > > > > > > > > > > > > > > providing these 2 accessors. 2) I don't think it > > > > > > > > > > > > > > > is possible to have a common function instead of > > > > > > > > > > > > > > > bunch of unique functions for each mapper > > > > > > > > > > > > > > > declared. > > > > > > > > > > > > > > Hi Lingda, something like this. > > > > > > > > > > > > > > ``` > > > > > > > > > > > > > > void __tgt_mapper(void *base, void *begin, size_t > > > > > > > > > > > > > > size, int64_t type, auto components[]) { > > > > > > > > > > > > > > // Allocate space for an array section first. > > > > > > > > > > > > > > if (size > 1 && !maptype.IsDelete) > > > > > > > > > > > > > > <push>(base, begin, size*sizeof(Ty), > > > > > > > > > > > > > > clearToFrom(type)); > > > > > > > > > > > > > > > > > > > > > > > > > > > > // Map members. > > > > > > > > > > > > > > for (unsigned i = 0; i < size; i++) { > > > > > > > > > > > > > > // For each component specified by this mapper: > > > > > > > > > > > > > > for (auto c : components) { > > > > > > > > > > > > > > if (c.hasMapper()) > > > > > > > > > > > > > > (*c.Mapper())(c.arg_base, c.arg_begin, > > > > > > > > > > > > > > c.arg_size, c.arg_type); > > > > > > > > > > > > > > else > > > > > > > > > > > > > > <push>(c.arg_base, c.arg_begin, > > > > > > > > > > > > > > c.arg_size, c.arg_type); > > > > > > > > > > > > > > } > > > > > > > > > > > > > > } > > > > > > > > > > > > > > // Delete the array section. > > > > > > > > > > > > > > if (size > 1 && maptype.IsDelete) > > > > > > > > > > > > > > <push>(base, begin, size*sizeof(Ty), > > > > > > > > > > > > > > clearToFrom(type)); > > > > > > > > > > > > > > } > > > > > > > > > > > > > > > > > > > > > > > > > > > > void <type>.mapper(void *base, void *begin, size_t > > > > > > > > > > > > > > size, int64_t type) { > > > > > > > > > > > > > > auto sub_components[] = {...}; > > > > > > > > > > > > > > __tgt_mapper(base, begin, size, type, > > > > > > > > > > > > > > sub_components); > > > > > > > > > > > > > > } > > > > > > > > > > > > > > ``` > > > > > > > > > > > > > > > > > > > > > > > > > > > Hi Alexey, > > > > > > > > > > > > > > > > > > > > > > > > > > I don't think this scheme is more efficient than the > > > > > > > > > > > > > current scheme. My reasons are: > > > > > > > > > > > > > > > > > > > > > > > > > > 1) Most code here is essentially to generate > > > > > > > > > > > > > `components`, i.e., we need to generate `c.arg_base, > > > > > > > > > > > > > c.arg_begin, c.arg_size, c.arg_type` for each `c` in > > > > > > > > > > > > > `components`, so there will still be a lot of code in > > > > > > > > > > > > > `<type>.mapper`. It will not reduce the mapper > > > > > > > > > > > > > function code, i.e., we will still have a bunch of > > > > > > > > > > > > > unique mapper functions. > > > > > > > > > > > > > > > > > > > > > > > > > > 2) This scheme will prevent a lot of compiler > > > > > > > > > > > > > optimization from happening. In reality, a lot of > > > > > > > > > > > > > computation should be redundant. E.g., for two > > > > > > > > > > > > > components `c1` and `c2`, `c1`'s base may be the same > > > > > > > > > > > > > as `c2`'s begin, so the compiler will be able to > > > > > > > > > > > > > eliminate these reduction computation, especially > > > > > > > > > > > > > when we inline all nested mapper functions together. > > > > > > > > > > > > > If we move these computation into the runtime, the > > > > > > > > > > > > > compiler will not be able to do such optimization. > > > > > > > > > > > > > > > > > > > > > > > > > > 3) In terms of the number of `push` function calls, > > > > > > > > > > > > > this scheme has the exact same number of calls as the > > > > > > > > > > > > > current scheme, so I don't think this scheme can > > > > > > > > > > > > > bring performance benefits. The scheme should perform > > > > > > > > > > > > > worse than the current scheme, because it reduces the > > > > > > > > > > > > > opportunities of compiler optimization as mentioned > > > > > > > > > > > > > above. > > > > > > > > > > > > Hi Lingda, I'm trying to simplify the code generated by > > > > > > > > > > > > clang and avoid some unnecessary code duplications. If > > > > > > > > > > > > the complexity of this scheme is the same as proposed > > > > > > > > > > > > by you, I would prefer to use this scheme unless there > > > > > > > > > > > > are some other opinions. > > > > > > > > > > > > 1. It is not a problem. This code is unique and is not > > > > > > > > > > > > duplicated in the different mappers. > > > > > > > > > > > > 2. Inlining is no solution here. We still generate to > > > > > > > > > > > > much code, which is almost the same in many cases and > > > > > > > > > > > > it will lead to very ineffective codegen because we > > > > > > > > > > > > still end up with a lot of almost the same code. This > > > > > > > > > > > > also might lead to poor performance. > > > > > > > > > > > > 3. Yes, the number of pushes is always the same, in all > > > > > > > > > > > > possible schemes. It would be good to compare somehow > > > > > > > > > > > > the performance of both schemes, at least preliminary. > > > > > > > > > > > > > > > > > > > > > > > > Also, this solution reduces the number of required > > > > > > > > > > > > runtime functions, instead of 2 we need just 1 and, > > > > > > > > > > > > thus, we need to make fewer runtime functions calls. > > > > > > > > > > > > > > > > > > > > > > > > I think it would better to propose this scheme as an > > > > > > > > > > > > alternate design and discuss it in the OpenMP telecon. > > > > > > > > > > > > What do you think? Or we can try to discuss it in the > > > > > > > > > > > > offline mode via the e-mail with other members. > > > > > > > > > > > > I'm not trying to convince you to implement this scheme > > > > > > > > > > > > right now, but it would be good to discuss it. Maybe it > > > > > > > > > > > > will lead to some better ideas from others? > > > > > > > > > > > Hi Alexey, > > > > > > > > > > > > > > > > > > > > > > I still prefer the current scheme, because: > > > > > > > > > > > 1) I don't like recursive mapper calls, which goes back > > > > > > > > > > > to my original scheme a little bit. I really think > > > > > > > > > > > inlining can make a big difference when we have nested > > > > > > > > > > > mappers. These compiler optimizations are the keys to > > > > > > > > > > > have better performance for mappers. > > > > > > > > > > > 2) I don't think the codegen here is inefficient. Yes > > > > > > > > > > > there is duplicated code across different mapper > > > > > > > > > > > functions, but why that will lead to poor performance? > > > > > > > > > > > 3) Although we have 2 runtime functions now, the > > > > > > > > > > > `__tgt_mapper_num_components` is called only once per > > > > > > > > > > > mapper. It should have very negligible performance impact. > > > > > > > > > > > > > > > > > > > > > > But if you have a different option, we can discuss it > > > > > > > > > > > next time in the meeting. I do have a time constraint to > > > > > > > > > > > work on the mapper implementation. I'll no longer work in > > > > > > > > > > > this project starting this September, and I have about > > > > > > > > > > > 30% of my time working on it until then. > > > > > > > > > > Lingda, > > > > > > > > > > 1. We have recursive (actually, not recursive, because you > > > > > > > > > > cannot use types recursively) mappers calls anyway, it is > > > > > > > > > > nature of struсtures/classes. > > > > > > > > > > 2. We have a lot of similar code. And I'm not sure that it > > > > > > > > > > can be optimized out. > > > > > > > > > > 3. Yes, but it means that we have n extra runtime calls, > > > > > > > > > > where n is the number of branches in the structure/class > > > > > > > > > > tree. > > > > > > > > > > > > > > > > > > > > I see :(. I understand your concern. In this case, we could > > > > > > > > > > try to discuss it offline, in the mailing list, to make it > > > > > > > > > > a little bit faster. We just need to hear other opinions on > > > > > > > > > > this matter, maybe there are some other pros and cons for > > > > > > > > > > these schemes. > > > > > > > > > Hi Alexey, > > > > > > > > > > > > > > > > > > Sure, let's discuss this in the mailing list. I'll summarize > > > > > > > > > it and send it to the mailing list later. > > > > > > > > > > > > > > > > > > > We have recursive (actually, not recursive, because you > > > > > > > > > > cannot use types recursively) mappers calls anyway, it is > > > > > > > > > > nature of struсtures/classes. > > > > > > > > > We won't have recursive calls with inlining. > > > > > > > > > > > > > > > > > > > We have a lot of similar code. And I'm not sure that it can > > > > > > > > > > be optimized out. > > > > > > > > > I think it's even harder to optimized these code out when we > > > > > > > > > move them into the runtime. > > > > > > > > > > > > > > > > > > > Yes, but it means that we have n extra runtime calls, where > > > > > > > > > > n is the number of branches in the structure/class tree. > > > > > > > > > I don't quite understand. It's still equal to the number of > > > > > > > > > mappers in any case. > > > > > > > > > Sure, let's discuss this in the mailing list. I'll summarize > > > > > > > > > it and send it to the mailing list later. > > > > > > > > > > > > > > > > Good, thanks! > > > > > > > > > > > > > > > > > We won't have recursive calls with inlining. > > > > > > > > > > > > > > > > We won't have recursive calls anyway (recursive types are not > > > > > > > > allowed). Plus, I'm not sure that inlining is the best option > > > > > > > > here. We have a lot of code for each mapper and I'm not sure > > > > > > > > that the optimizer will be able to squash it effectively. > > > > > > > > > > > > > > > > > I think it's even harder to optimized these code out when we > > > > > > > > > move them into the runtime. > > > > > > > > > > > > > > > > Definitely not, unless we use LTO or inlined runtime. > > > > > > > > > > > > > > > > We won't have recursive calls anyway (recursive types are not > > > > > > > > allowed). Plus, I'm not sure that inlining is the best option > > > > > > > > here. We have a lot of code for each mapper and I'm not sure > > > > > > > > that the optimizer will be able to squash it effectively. > > > > > > > Sorry I should not say recursive calls. Here it needs to > > > > > > > "recursively" call other mapper functions in case of nested > > > > > > > mappers, but we don't need it in case of inlining. > > > > > > > > > > > > > > > Definitely not, unless we use LTO or inlined runtime. > > > > > > > But you are proposing to move many code to the runtime here, > > > > > > > right? That doesn't make sense to me. > > > > > > > > > > > > > > But you are proposing to move much code to the runtime here, > > > > > > > right? That doesn't make sense to me. > > > > > > > > > > > > I'm just not sure that there going be significant problems with the > > > > > > performance because of that. And it significantly simplifies > > > > > > codegen in the compiler and moves the common part into a single > > > > > > function. > > > > > > > > > > > > Plus, if in future we'll need to modify this functionality for some > > > > > > reason, 2 different versions of the compiler will produce > > > > > > incompatible code. With my scheme, you still can use old runtime > > > > > > and have the same functionality as the old compiler and the new one. > > > > > Hi Alexey, > > > > > > > > > > I think more carefully about your scheme, and I don't think we can > > > > > solve the 2 problems below with this scheme: > > > > > > > > > > 1. In the example you gave before, the compiler needs to generate all > > > > > map types and pass them to `__tgt_mapper` through `sub_components`. > > > > > But in this case, the compiler won't be able to generate the correct > > > > > `MEMBER_OF` field in map type. As a result, the runtime has to fix it > > > > > using the mechanism we already have here: > > > > > `__tgt_mapper_num_components`. This not only increases complexity, > > > > > but also, it means the runtime needs further manipulation of the map > > > > > type, which creates locality issues. While in the current scheme, the > > > > > map type is generated by compiler once, so the data locality will be > > > > > very good in this case. > > > > > > > > > > 2. `sub_components` includes all components that should be mapped. If > > > > > we are mapping an array, this means we need to map many components, > > > > > which will need to allocate memory for `sub_components` in the heap. > > > > > This creates further memory management burden and is not an efficient > > > > > way to use memory. > > > > > > > > > > Based on these reasons, I think the current scheme is still more > > > > > preferable. > > > > Hi Lingda, > > > > 1. Actually, I thought that the runtime function `__tgt_mapper` will do > > > > this, not the compiler. > > > > 2. Why do we need to allocate it on the heap? We can allocate it on the > > > > stack. > > > 1. In your scheme, both compiler and `__tgt_mapper` need to do this: the > > > compiler will generate other parts in the type, e.g., `TO` `FROM` bits > > > and basic `MEMBER_OF` bits. Then `__tgt_mapper` needs to modify the > > > `MEMBER_OF` bits later. Since there are a lot of other memory accesses > > > between the compiler and `__tgt_mapper` operations to the same map type, > > > it's very likely the map type will not stay in the cache, which causes > > > locality problem. > > > > > > 2. Assume we are mapping an array with 1000000 elements, and each > > > elements have 5 components. For each component, we need `base, begin_ptr, > > > size, type, mapper`, which are 40 bytes. Together, we will need 1000000 * > > > 5 * 40 = 200MB of space for this array, which stack cannnot handle. > > 1. I don't think it is a big problem, this part of the code is executed on > > the CPU and I don't think it will lead to significant overhead. > > 2. When we map an array, we do not map it element-by-element, so we don't > > need 10000 records. Moreover, we try to merge contiguous parts into single > > one, reducing the total number of elements. > 1. I think it is a problem. Beside, doing so is not elegant: why having a > single thing (setting the map type) done in 2 places while we can do it in > one place? > > 2. We need to map them element by element because it is not always possible > to merge contiguous parts together (there may be no contiguous parts based on > the mapper). And merging parts together will be a complex procedure: I don't > think it can be done in the runtime because many code is moved into the > runtime now. In contrast, the compiler will have better opportunities to > merge things. > > Besides, I don't think there is a valid reason that the current scheme is not > good. You mentioned it's complex codegen. But it only has less 200 loc here, > I don't see why it is complex. 2. I rather doubt that we will need to map a record with 100000 fields element-by-element. I think it would be better to share it with others and listen to their opinions. It is better to spend some extra time to provide good design. You can include your doubts in the description of the new scheme, of course. CHANGES SINCE LAST ACTION https://reviews.llvm.org/D59474/new/ https://reviews.llvm.org/D59474 _______________________________________________ cfe-commits mailing list cfe-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits