Hi!

On Wed, 10 Feb 2016 16:27:40 +0100, Bernd Schmidt <bschm...@redhat.com> wrote:
> On 02/10/2016 03:39 PM, Thomas Schwinge wrote:
>
> > Yes, we need a hammer that big: we have to ensure consistency between
> > data regions on the device and code offloading to the device, as
> > otherwise we'll very easily run into inconsistencies, because of the
> > non-shared memory.  In the general case, it's "all or nothing": you
> > either have to offload all kernels or none of them.
>
> That's unfortunately not the impression I got from the earlier
> discussion

:-(

> and this seems to imply that one unprofitable kernel would
> disable all the others

Correct.

> - IMO this is not acceptable.

Why?

A user of GCC has no intrinsic interest in getting OpenACC kernels
constructs' code offloaded; the user wants his code to execute as fast
as possible.

If you consider the whole of OpenACC kernels code offloading as a
compiler optimization, then it's fine for GCC to abort this
"optimization" if it's reasonably clear that this transformation (code
offloading) will not be profitable -- just like what GCC does with other
possible code optimizations/transformations.

As I've said before, profiling the execution times of several real-world
codes has shown that under the assumption that parloops fails to
parallelize one kernel (one out of possibly many), this one kernel has
always been a "hot spot", and avoiding offloading in this case has always
helped prevent performance degradation below host-fallback performance.

It's of course unfortunate that we have to disable our offloading
machinery for a lot of codes using OpenACC kernels, but given the current
state of OpenACC kernels parallelization analysis (parloops), doing so is
still profitable for a user, compared to regressed performance with
single-threaded offloaded execution.

Of course...

> There need to be
> more compiler smarts to figure out whether a kernel is a valid candidate
> for skipping the offloading.

... that would be better, obviously.  But, I suggest we work on that
incrementally, after fixing the performance regression with my "avoid
offloading" patch.

I have difficulties coming up with an algorithm/parametrization to have
the compiler/runtime decide whether offloading will be profitable given
input parameters such as a ratio of parallelized/single-threaded kernels.
So I'm all ears to suggestions in that regard.  Consider: if we encounter
a single-threaded kernel, the compiler (parloops) has just given up
"understanding" the user's code.

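To make the "all or nothing" rule concrete, here is a minimal sketch in
C.  It is only an illustration, not the actual "avoid offloading" patch:
`struct kernel_info` and `should_offload` are hypothetical names, and the
real decision would be made by the compiler/runtime, not by user code.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-kernel summary; NOT an actual GCC data structure.  */
struct kernel_info
{
  const char *name;
  bool parallelized;  /* Assumed flag: did parloops parallelize it?  */
};

/* "All or nothing": because of the non-shared memory, either all
   kernels are offloaded or none of them.  A single kernel that could
   not be parallelized disables offloading for the whole set.  */
static bool
should_offload (const struct kernel_info *kernels, size_t n)
{
  for (size_t i = 0; i < n; i++)
    if (!kernels[i].parallelized)
      return false;  /* Host fallback for every kernel.  */
  return true;
}
```

With three kernels of which one has `parallelized == false`,
`should_offload` returns false and all three would run on the host.
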
And again, implementing such heuristics to me sounds like incremental
follow-up projects, quite possibly in combination with generally
improving OpenACC kernels handling/parloops.


Grüße
 Thomas