ABataev added inline comments.
================ Comment at: lib/CodeGen/CGOpenMPRuntime.cpp:8739 +/// // For each component specified by this mapper: +/// if (currentComponent.hasMapper()) +/// (*currentComponent.Mapper())(rt_mapper_handle, arg_base, arg_begin, ---------------- lildmh wrote: > ABataev wrote: > > lildmh wrote: > > > ABataev wrote: > > > > lildmh wrote: > > > > > ABataev wrote: > > > > > > lildmh wrote: > > > > > > > ABataev wrote: > > > > > > > > lildmh wrote: > > > > > > > > > ABataev wrote: > > > > > > > > > > lildmh wrote: > > > > > > > > > > > ABataev wrote: > > > > > > > > > > > > lildmh wrote: > > > > > > > > > > > > > ABataev wrote: > > > > > > > > > > > > > > lildmh wrote: > > > > > > > > > > > > > > > ABataev wrote: > > > > > > > > > > > > > > > > lildmh wrote: > > > > > > > > > > > > > > > > > ABataev wrote: > > > > > > > > > > > > > > > > > > lildmh wrote: > > > > > > > > > > > > > > > > > > > ABataev wrote: > > > > > > > > > > > > > > > > > > > > lildmh wrote: > > > > > > > > > > > > > > > > > > > > > ABataev wrote: > > > > > > > > > > > > > > > > > > > > > > lildmh wrote: > > > > > > > > > > > > > > > > > > > > > > > ABataev wrote: > > > > > > > > > > > > > > > > > > > > > > > > lildmh wrote: > > > > > > > > > > > > > > > > > > > > > > > > > ABataev wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > Currently > > > > > > > > > > > > > > > > > > > > > > > > > > `currentComponent` is > > > > > > > > > > > > > > > > > > > > > > > > > > generated by the compiler. > > > > > > > > > > > > > > > > > > > > > > > > > > But can we instead pass > > > > > > > > > > > > > > > > > > > > > > > > > > this data as an extra > > > > > > > > > > > > > > > > > > > > > > > > > > parameter to this > > > > > > > > > > > > > > > > > > > > > > > > > > `omp_mapper` function. > > > > > > > > > > > > > > > > > > > > > > > > > Emm, I think this scheme will > > > > > > > > > > > > > > > > > > > > > > > > > be very difficult and > > > > > > > > > > > > > > > > > > > > > > > > > inefficient. If we pass > > > > > > > > > > > > > > > > > > > > > > > > > components as an argument of > > > > > > > > > > > > > > > > > > > > > > > > > `omp_mapper` function, it > > > > > > > > > > > > > > > > > > > > > > > > > means that the runtime needs > > > > > > > > > > > > > > > > > > > > > > > > > to generate all components > > > > > > > > > > > > > > > > > > > > > > > > > related to a map clause. I > > > > > > > > > > > > > > > > > > > > > > > > > don't think the runtime is > > > > > > > > > > > > > > > > > > > > > > > > > able to do that efficiently. > > > > > > > > > > > > > > > > > > > > > > > > > On the other hand, in the > > > > > > > > > > > > > > > > > > > > > > > > > current scheme, these > > > > > > > > > > > > > > > > > > > > > > > > > components are naturally > > > > > > > > > > > > > > > > > > > > > > > > > generated by the compiler, > > > > > > > > > > > > > > > > > > > > > > > > > and the runtime only needs to > > > > > > > > > > > > > > > > > > > > > > > > > know the base pointer, > > > > > > > > > > > > > > > > > > > > > > > > > pointer, type, size. etc. > > > > > > > > > > > > > > > > > > > > > > > > With the current scheme, we may > > > > > > > > > > > > > > > > > > > > > > > > end with the code blowout. We > > > > > > > > > > > > > > > > > > > > > > > > need to generate very similar > > > > > > > > > > > > > > > > > > > > > > > > code for different types and > > > > > > > > > > > > > > > > > > > > > > > > variables. The worst thing here > > > > > > > > > > > > > > > > > > > > > > > > is that we will be unable to > > > > > > > > > > > > > > > > > > > > > > > > optimize this huge amount of > > > > > > > > > > > > > > > > > > > > > > > > code because the codegen relies > > > > > > > > > > > > > > > > > > > > > > > > on the runtime functions and > > > > > > > > > > > > > > > > > > > > > > > > the code cannot be inlined. > > > > > > > > > > > > > > > > > > > > > > > > That's why I would like to move > > > > > > > > > > > > > > > > > > > > > > > > as much as possible code to the > > > > > > > > > > > > > > > > > > > > > > > > runtime rather than to emit it > > > > > > > > > > > > > > > > > > > > > > > > in the compiler. > > > > > > > > > > > > > > > > > > > > > > > I understand your concerns. I > > > > > > > > > > > > > > > > > > > > > > > think this is the best we can do > > > > > > > > > > > > > > > > > > > > > > > right now. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > The most worrisome case will be > > > > > > > > > > > > > > > > > > > > > > > when we have nested mappers > > > > > > > > > > > > > > > > > > > > > > > within each other. In this case, > > > > > > > > > > > > > > > > > > > > > > > a mapper function will call > > > > > > > > > > > > > > > > > > > > > > > another mapper function. We can > > > > > > > > > > > > > > > > > > > > > > > inline the inner mapper functions > > > > > > > > > > > > > > > > > > > > > > > in this scenario, so that these > > > > > > > > > > > > > > > > > > > > > > > mapper function can be properly > > > > > > > > > > > > > > > > > > > > > > > optimized. As a result, I think > > > > > > > > > > > > > > > > > > > > > > > the performance should be fine. > > > > > > > > > > > > > > > > > > > > > > Instead, we can use indirect > > > > > > > > > > > > > > > > > > > > > > function calls passed in the array > > > > > > > > > > > > > > > > > > > > > > to the runtime. Do you think it is > > > > > > > > > > > > > > > > > > > > > > going to be slower? In your current > > > > > > > > > > > > > > > > > > > > > > scheme, we generate many runtime > > > > > > > > > > > > > > > > > > > > > > calls instead. Could you try to > > > > > > > > > > > > > > > > > > > > > > estimate the number of calls in > > > > > > > > > > > > > > > > > > > > > > cases if we'll call the mappers > > > > > > > > > > > > > > > > > > > > > > through the indirect function calls > > > > > > > > > > > > > > > > > > > > > > and in your cuurent scheme, where > > > > > > > > > > > > > > > > > > > > > > we need to call the runtime > > > > > > > > > > > > > > > > > > > > > > functions many times in each > > > > > > > > > > > > > > > > > > > > > > particular mapper? > > > > > > > > > > > > > > > > > > > > > Hi Alexey, > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Sorry I don't understand your idea. > > > > > > > > > > > > > > > > > > > > > What indirect function calls do you > > > > > > > > > > > > > > > > > > > > > propose to be passed to the runtime? > > > > > > > > > > > > > > > > > > > > > What are these functions supposed to > > > > > > > > > > > > > > > > > > > > > do? > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > The number of function calls will be > > > > > > > > > > > > > > > > > > > > > exactly equal to the number of > > > > > > > > > > > > > > > > > > > > > components mapped, no matter whether > > > > > > > > > > > > > > > > > > > > > there are nested mappers or not. The > > > > > > > > > > > > > > > > > > > > > number of components depend on the > > > > > > > > > > > > > > > > > > > > > program. E.g., if we map a large > > > > > > > > > > > > > > > > > > > > > array section, then there will be > > > > > > > > > > > > > > > > > > > > > many more function calls. > > > > > > > > > > > > > > > > > > > > I mean the pointers to the mapper > > > > > > > > > > > > > > > > > > > > function, generated by the compiler. In > > > > > > > > > > > > > > > > > > > > your comment, it is `c.Mapper()` > > > > > > > > > > > > > > > > > > > If we pass nested mapper functions to the > > > > > > > > > > > > > > > > > > > runtime, I think it will slow down > > > > > > > > > > > > > > > > > > > execution because of the extra level of > > > > > > > > > > > > > > > > > > > indirect function calls. E.g., the > > > > > > > > > > > > > > > > > > > runtime will call `omp_mapper1`, which > > > > > > > > > > > > > > > > > > > calls the runtime back, which calls > > > > > > > > > > > > > > > > > > > `omp_mapper2`, .... This can result in a > > > > > > > > > > > > > > > > > > > deep call stack. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > I think the current implementation will > > > > > > > > > > > > > > > > > > > be more efficient, which doesn't pass > > > > > > > > > > > > > > > > > > > nested mappers to the runtime. One call > > > > > > > > > > > > > > > > > > > to the outer most mapper function will > > > > > > > > > > > > > > > > > > > have all data mapping done. The call > > > > > > > > > > > > > > > > > > > stack will be 2 level deep (the first > > > > > > > > > > > > > > > > > > > level is the mapper function, and the > > > > > > > > > > > > > > > > > > > second level is > > > > > > > > > > > > > > > > > > > `__tgt_push_mapper_component`) in this > > > > > > > > > > > > > > > > > > > case from the runtime. There are also > > > > > > > > > > > > > > > > > > > more compiler optimization space when we > > > > > > > > > > > > > > > > > > > inline all nested mapper functions. > > > > > > > > > > > > > > > > > > Yes, if we leave it as is. But if instead > > > > > > > > > > > > > > > > > > of the bunch unique functions we'll have > > > > > > > > > > > > > > > > > > the common one, that accept list if > > > > > > > > > > > > > > > > > > indirect pointers to functions > > > > > > > > > > > > > > > > > > additionally, and move it to the runtime > > > > > > > > > > > > > > > > > > library, we won't need those 2 functions we > > > > > > > > > > > > > > > > > > have currently. We'll have full access to > > > > > > > > > > > > > > > > > > the mapping data vector in the runtime > > > > > > > > > > > > > > > > > > library and won't need to use those 2 > > > > > > > > > > > > > > > > > > accessors we have currently. Instead, we'll > > > > > > > > > > > > > > > > > > need just one runtime functions, which > > > > > > > > > > > > > > > > > > implements the whole mapping logic. We > > > > > > > > > > > > > > > > > > still need to call it recursively, but I > > > > > > > > > > > > > > > > > > assume the number of calls will remain the > > > > > > > > > > > > > > > > > > same as in the current scheme. Did you > > > > > > > > > > > > > > > > > > understand the idea? If yes, it would good > > > > > > > > > > > > > > > > > > if you coild try to estimate the number of > > > > > > > > > > > > > > > > > > function calls in current scheme and in > > > > > > > > > > > > > > > > > > this new scheme to estimate possible pros > > > > > > > > > > > > > > > > > > and cons. > > > > > > > > > > > > > > > > > Hi Alexey, > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Could you give an example for this scheme? 1) > > > > > > > > > > > > > > > > > I don't understand how the mapper function > > > > > > > > > > > > > > > > > can have full access to the mapping data > > > > > > > > > > > > > > > > > vector without providing these 2 accessors. > > > > > > > > > > > > > > > > > 2) I don't think it is possible to have a > > > > > > > > > > > > > > > > > common function instead of bunch of unique > > > > > > > > > > > > > > > > > functions for each mapper declared. > > > > > > > > > > > > > > > > Hi Lingda, something like this. > > > > > > > > > > > > > > > > ``` > > > > > > > > > > > > > > > > void __tgt_mapper(void *base, void *begin, > > > > > > > > > > > > > > > > size_t size, int64_t type, auto components[]) { > > > > > > > > > > > > > > > > // Allocate space for an array section first. > > > > > > > > > > > > > > > > if (size > 1 && !maptype.IsDelete) > > > > > > > > > > > > > > > > <push>(base, begin, size*sizeof(Ty), > > > > > > > > > > > > > > > > clearToFrom(type)); > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > // Map members. > > > > > > > > > > > > > > > > for (unsigned i = 0; i < size; i++) { > > > > > > > > > > > > > > > > // For each component specified by this > > > > > > > > > > > > > > > > mapper: > > > > > > > > > > > > > > > > for (auto c : components) { > > > > > > > > > > > > > > > > if (c.hasMapper()) > > > > > > > > > > > > > > > > (*c.Mapper())(c.arg_base, c.arg_begin, > > > > > > > > > > > > > > > > c.arg_size, c.arg_type); > > > > > > > > > > > > > > > > else > > > > > > > > > > > > > > > > <push>(c.arg_base, c.arg_begin, > > > > > > > > > > > > > > > > c.arg_size, c.arg_type); > > > > > > > > > > > > > > > > } > > > > > > > > > > > > > > > > } > > > > > > > > > > > > > > > > // Delete the array section. > > > > > > > > > > > > > > > > if (size > 1 && maptype.IsDelete) > > > > > > > > > > > > > > > > <push>(base, begin, size*sizeof(Ty), > > > > > > > > > > > > > > > > clearToFrom(type)); > > > > > > > > > > > > > > > > } > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > void <type>.mapper(void *base, void *begin, > > > > > > > > > > > > > > > > size_t size, int64_t type) { > > > > > > > > > > > > > > > > auto sub_components[] = {...}; > > > > > > > > > > > > > > > > __tgt_mapper(base, begin, size, type, > > > > > > > > > > > > > > > > sub_components); > > > > > > > > > > > > > > > > } > > > > > > > > > > > > > > > > ``` > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Hi Alexey, > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > I don't think this scheme is more efficient than > > > > > > > > > > > > > > > the current scheme. My reasons are: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 1) Most code here is essentially to generate > > > > > > > > > > > > > > > `components`, i.e., we need to generate > > > > > > > > > > > > > > > `c.arg_base, c.arg_begin, c.arg_size, c.arg_type` > > > > > > > > > > > > > > > for each `c` in `components`, so there will still > > > > > > > > > > > > > > > be a lot of code in `<type>.mapper`. It will not > > > > > > > > > > > > > > > reduce the mapper function code, i.e., we will > > > > > > > > > > > > > > > still have a bunch of unique mapper functions. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 2) This scheme will prevent a lot of compiler > > > > > > > > > > > > > > > optimization from happening. In reality, a lot of > > > > > > > > > > > > > > > computation should be redundant. E.g., for two > > > > > > > > > > > > > > > components `c1` and `c2`, `c1`'s base may be the > > > > > > > > > > > > > > > same as `c2`'s begin, so the compiler will be > > > > > > > > > > > > > > > able to eliminate these reduction computation, > > > > > > > > > > > > > > > especially when we inline all nested mapper > > > > > > > > > > > > > > > functions together. If we move these computation > > > > > > > > > > > > > > > into the runtime, the compiler will not be able > > > > > > > > > > > > > > > to do such optimization. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 3) In terms of the number of `push` function > > > > > > > > > > > > > > > calls, this scheme has the exact same number of > > > > > > > > > > > > > > > calls as the current scheme, so I don't think > > > > > > > > > > > > > > > this scheme can bring performance benefits. The > > > > > > > > > > > > > > > scheme should perform worse than the current > > > > > > > > > > > > > > > scheme, because it reduces the opportunities of > > > > > > > > > > > > > > > compiler optimization as mentioned above. > > > > > > > > > > > > > > Hi Lingda, I'm trying to simplify the code > > > > > > > > > > > > > > generated by clang and avoid some unnecessary code > > > > > > > > > > > > > > duplications. If the complexity of this scheme is > > > > > > > > > > > > > > the same as proposed by you, I would prefer to use > > > > > > > > > > > > > > this scheme unless there are some other opinions. > > > > > > > > > > > > > > 1. It is not a problem. This code is unique and is > > > > > > > > > > > > > > not duplicated in the different mappers. > > > > > > > > > > > > > > 2. Inlining is no solution here. We still generate > > > > > > > > > > > > > > to much code, which is almost the same in many > > > > > > > > > > > > > > cases and it will lead to very ineffective codegen > > > > > > > > > > > > > > because we still end up with a lot of almost the > > > > > > > > > > > > > > same code. This also might lead to poor performance. > > > > > > > > > > > > > > 3. Yes, the number of pushes is always the same, in > > > > > > > > > > > > > > all possible schemes. It would be good to compare > > > > > > > > > > > > > > somehow the performance of both schemes, at least > > > > > > > > > > > > > > preliminary. > > > > > > > > > > > > > > > > > > > > > > > > > > > > Also, this solution reduces the number of required > > > > > > > > > > > > > > runtime functions, instead of 2 we need just 1 and, > > > > > > > > > > > > > > thus, we need to make fewer runtime functions calls. > > > > > > > > > > > > > > > > > > > > > > > > > > > > I think it would better to propose this scheme as > > > > > > > > > > > > > > an alternate design and discuss it in the OpenMP > > > > > > > > > > > > > > telecon. What do you think? Or we can try to > > > > > > > > > > > > > > discuss it in the offline mode via the e-mail with > > > > > > > > > > > > > > other members. > > > > > > > > > > > > > > I'm not trying to convince you to implement this > > > > > > > > > > > > > > scheme right now, but it would be good to discuss > > > > > > > > > > > > > > it. Maybe it will lead to some better ideas from > > > > > > > > > > > > > > others? > > > > > > > > > > > > > Hi Alexey, > > > > > > > > > > > > > > > > > > > > > > > > > > I still prefer the current scheme, because: > > > > > > > > > > > > > 1) I don't like recursive mapper calls, which goes > > > > > > > > > > > > > back to my original scheme a little bit. I really > > > > > > > > > > > > > think inlining can make a big difference when we have > > > > > > > > > > > > > nested mappers. These compiler optimizations are the > > > > > > > > > > > > > keys to have better performance for mappers. > > > > > > > > > > > > > 2) I don't think the codegen here is inefficient. Yes > > > > > > > > > > > > > there is duplicated code across different mapper > > > > > > > > > > > > > functions, but why that will lead to poor performance? > > > > > > > > > > > > > 3) Although we have 2 runtime functions now, the > > > > > > > > > > > > > `__tgt_mapper_num_components` is called only once per > > > > > > > > > > > > > mapper. It should have very negligible performance > > > > > > > > > > > > > impact. > > > > > > > > > > > > > > > > > > > > > > > > > > But if you have a different option, we can discuss it > > > > > > > > > > > > > next time in the meeting. I do have a time constraint > > > > > > > > > > > > > to work on the mapper implementation. I'll no longer > > > > > > > > > > > > > work in this project starting this September, and I > > > > > > > > > > > > > have about 30% of my time working on it until then. > > > > > > > > > > > > Lingda, > > > > > > > > > > > > 1. We have recursive (actually, not recursive, because > > > > > > > > > > > > you cannot use types recursively) mappers calls anyway, > > > > > > > > > > > > it is nature of struсtures/classes. > > > > > > > > > > > > 2. We have a lot of similar code. And I'm not sure that > > > > > > > > > > > > it can be optimized out. > > > > > > > > > > > > 3. Yes, but it means that we have n extra runtime > > > > > > > > > > > > calls, where n is the number of branches in the > > > > > > > > > > > > structure/class tree. > > > > > > > > > > > > > > > > > > > > > > > > I see :(. I understand your concern. In this case, we > > > > > > > > > > > > could try to discuss it offline, in the mailing list, > > > > > > > > > > > > to make it a little bit faster. We just need to hear > > > > > > > > > > > > other opinions on this matter, maybe there are some > > > > > > > > > > > > other pros and cons for these schemes. > > > > > > > > > > > Hi Alexey, > > > > > > > > > > > > > > > > > > > > > > Sure, let's discuss this in the mailing list. I'll > > > > > > > > > > > summarize it and send it to the mailing list later. > > > > > > > > > > > > > > > > > > > > > > > We have recursive (actually, not recursive, because you > > > > > > > > > > > > cannot use types recursively) mappers calls anyway, it > > > > > > > > > > > > is nature of struсtures/classes. > > > > > > > > > > > We won't have recursive calls with inlining. > > > > > > > > > > > > > > > > > > > > > > > We have a lot of similar code. And I'm not sure that it > > > > > > > > > > > > can be optimized out. > > > > > > > > > > > I think it's even harder to optimized these code out when > > > > > > > > > > > we move them into the runtime. > > > > > > > > > > > > > > > > > > > > > > > Yes, but it means that we have n extra runtime calls, > > > > > > > > > > > > where n is the number of branches in the > > > > > > > > > > > > structure/class tree. > > > > > > > > > > > I don't quite understand. It's still equal to the number > > > > > > > > > > > of mappers in any case. > > > > > > > > > > > Sure, let's discuss this in the mailing list. I'll > > > > > > > > > > > summarize it and send it to the mailing list later. > > > > > > > > > > > > > > > > > > > > Good, thanks! > > > > > > > > > > > > > > > > > > > > > We won't have recursive calls with inlining. > > > > > > > > > > > > > > > > > > > > We won't have recursive calls anyway (recursive types are > > > > > > > > > > not allowed). Plus, I'm not sure that inlining is the best > > > > > > > > > > option here. We have a lot of code for each mapper and I'm > > > > > > > > > > not sure that the optimizer will be able to squash it > > > > > > > > > > effectively. > > > > > > > > > > > > > > > > > > > > > I think it's even harder to optimized these code out when > > > > > > > > > > > we move them into the runtime. > > > > > > > > > > > > > > > > > > > > Definitely not, unless we use LTO or inlined runtime. > > > > > > > > > > > > > > > > > > > > We won't have recursive calls anyway (recursive types are > > > > > > > > > > not allowed). Plus, I'm not sure that inlining is the best > > > > > > > > > > option here. We have a lot of code for each mapper and I'm > > > > > > > > > > not sure that the optimizer will be able to squash it > > > > > > > > > > effectively. > > > > > > > > > Sorry I should not say recursive calls. Here it needs to > > > > > > > > > "recursively" call other mapper functions in case of nested > > > > > > > > > mappers, but we don't need it in case of inlining. > > > > > > > > > > > > > > > > > > > Definitely not, unless we use LTO or inlined runtime. > > > > > > > > > But you are proposing to move many code to the runtime here, > > > > > > > > > right? That doesn't make sense to me. > > > > > > > > > > > > > > > > > > But you are proposing to move much code to the runtime here, > > > > > > > > > right? That doesn't make sense to me. > > > > > > > > > > > > > > > > I'm just not sure that there going be significant problems with > > > > > > > > the performance because of that. And it significantly > > > > > > > > simplifies codegen in the compiler and moves the common part > > > > > > > > into a single function. > > > > > > > > > > > > > > > > Plus, if in future we'll need to modify this functionality for > > > > > > > > some reason, 2 different versions of the compiler will produce > > > > > > > > incompatible code. With my scheme, you still can use old > > > > > > > > runtime and have the same functionality as the old compiler and > > > > > > > > the new one. > > > > > > > Hi Alexey, > > > > > > > > > > > > > > I think more carefully about your scheme, and I don't think we > > > > > > > can solve the 2 problems below with this scheme: > > > > > > > > > > > > > > 1. In the example you gave before, the compiler needs to generate > > > > > > > all map types and pass them to `__tgt_mapper` through > > > > > > > `sub_components`. But in this case, the compiler won't be able to > > > > > > > generate the correct `MEMBER_OF` field in map type. As a result, > > > > > > > the runtime has to fix it using the mechanism we already have > > > > > > > here: `__tgt_mapper_num_components`. This not only increases > > > > > > > complexity, but also, it means the runtime needs further > > > > > > > manipulation of the map type, which creates locality issues. > > > > > > > While in the current scheme, the map type is generated by > > > > > > > compiler once, so the data locality will be very good in this > > > > > > > case. > > > > > > > > > > > > > > 2. `sub_components` includes all components that should be > > > > > > > mapped. If we are mapping an array, this means we need to map > > > > > > > many components, which will need to allocate memory for > > > > > > > `sub_components` in the heap. This creates further memory > > > > > > > management burden and is not an efficient way to use memory. > > > > > > > > > > > > > > Based on these reasons, I think the current scheme is still more > > > > > > > preferable. > > > > > > Hi Lingda, > > > > > > 1. Actually, I thought that the runtime function `__tgt_mapper` > > > > > > will do this, not the compiler. > > > > > > 2. Why do we need to allocate it on the heap? We can allocate it on > > > > > > the stack. > > > > > 1. In your scheme, both compiler and `__tgt_mapper` need to do this: > > > > > the compiler will generate other parts in the type, e.g., `TO` `FROM` > > > > > bits and basic `MEMBER_OF` bits. Then `__tgt_mapper` needs to modify > > > > > the `MEMBER_OF` bits later. Since there are a lot of other memory > > > > > accesses between the compiler and `__tgt_mapper` operations to the > > > > > same map type, it's very likely the map type will not stay in the > > > > > cache, which causes locality problem. > > > > > > > > > > 2. Assume we are mapping an array with 1000000 elements, and each > > > > > elements have 5 components. For each component, we need `base, > > > > > begin_ptr, size, type, mapper`, which are 40 bytes. Together, we will > > > > > need 1000000 * 5 * 40 = 200MB of space for this array, which stack > > > > > cannnot handle. > > > > 1. I don't think it is a big problem, this part of the code is executed > > > > on the CPU and I don't think it will lead to significant overhead. > > > > 2. When we map an array, we do not map it element-by-element, so we > > > > don't need 10000 records. Moreover, we try to merge contiguous parts > > > > into single one, reducing the total number of elements. > > > 1. I think it is a problem. Beside, doing so is not elegant: why having a > > > single thing (setting the map type) done in 2 places while we can do it > > > in one place? > > > > > > 2. We need to map them element by element because it is not always > > > possible to merge contiguous parts together (there may be no contiguous > > > parts based on the mapper). And merging parts together will be a complex > > > procedure: I don't think it can be done in the runtime because many code > > > is moved into the runtime now. In contrast, the compiler will have better > > > opportunities to merge things. > > > > > > Besides, I don't think there is a valid reason that the current scheme is > > > not good. You mentioned it's complex codegen. But it only has less 200 > > > loc here, I don't see why it is complex. > > 2. I rather doubt that we will need to map a record with 100000 fields > > element-by-element. > > > > I think it would be better to share it with others and listen to their > > opinions. It is better to spend some extra time to provide good design. > > You can include your doubts in the description of the new scheme, of course. > 2. The implementation should work for any case anyway. Besides, I think > mapping a large array should be actually a common case. The users don't want > to map 1000000 elements by themselves, so they will want to use mapper to let > the system do it automatically. > > Sure, I can release the discussion to the mailing list. I don't see a reason > to use the new scheme now. Lingda, I meant to send the message to the OpenMP telecon list :) Could you forward the same e-mail to Ravi and others, please? CHANGES SINCE LAST ACTION https://reviews.llvm.org/D59474/new/ https://reviews.llvm.org/D59474 _______________________________________________ cfe-commits mailing list cfe-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits