Hi!

On Wed, 10 Feb 2016 16:27:40 +0100, Bernd Schmidt <bschm...@redhat.com> wrote:
> On 02/10/2016 03:39 PM, Thomas Schwinge wrote:
> 
> > Yes, we need a hammer that big: we have to ensure consistency between
> > data regions on the device and code offloading to the device, as
> > otherwise we'll very easily run into inconsistencies, because of the
> > non-shared memory.  In the general case, it's "all or nothing": you
> > either have to offload all kernels or none of them.
> 
> That's unfortunately not the impression I got from the earlier 
> discussion

:-(

> and this seems to imply that one unprofitable kernel would 
> disable all the others

Correct.

> - IMO this is not acceptable.

Why?  A user of GCC has no intrinsic interest in getting OpenACC kernels
constructs' code offloaded; the user wants his code to execute as fast as
possible.

If you consider the whole of OpenACC kernels code offloading as a
compiler optimization, then it's fine for GCC to abort this
"optimization" if it's reasonably clear that this transformation (code
offloading) will not be profitable -- just like what GCC does with other
possible code optimizations/transformations.  As I've said before,
profiling the execution times of several real-world codes has shown that
under the assumption that parloops fails to parallelize one kernel (one
out of possibly many), this one kernel has always been a "hot spot", and
avoiding offloading in this case has always helped prevent performance
degradation below host-fallback performance.

It's of course unfortunate that we have to disable our offloading
machinery for a lot of codes using OpenACC kernels, but given the current
state of OpenACC kernels parallelization analysis (parloops), doing so is
still profitable for a user, compared to regressed performance with
single-threaded offloaded execution.
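
To illustrate the kind of code in question, here is a minimal sketch.  It
is a made-up example (the function name scale_then_scan is hypothetical),
not taken from any of the profiled codes: two kernels constructs share one
data region, the first loop has independent iterations, and the second has
a loop-carried dependence, so I'd assume the latter ends up single-threaded
on the device.  Because the data region keeps the array on the device,
running only the second kernel on the host would make it see stale data.

```c
#include <stddef.h>

/* Hypothetical example: two kernels constructs inside one data region.
   Without OpenACC support enabled, the pragmas are ignored and everything
   runs sequentially on the host, with the same result.  */
float scale_then_scan(float *a, size_t n)
{
#pragma acc data copy(a[0:n])
  {
    /* Independent iterations: a candidate for parallelization.  */
#pragma acc kernels
    for (size_t i = 0; i < n; i++)
      a[i] = 2.0f * a[i];

    /* Loop-carried dependence (running sum): assumed here to be a loop
       that does not get parallelized.  It reads what the first kernel
       wrote, which inside the data region lives in device memory.  */
#pragma acc kernels
    for (size_t i = 1; i < n; i++)
      a[i] += a[i - 1];
  }
  return n ? a[n - 1] : 0.0f;
}
```

With non-shared memory, the choice for this function is therefore between
offloading both kernels and offloading neither of them.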

Of course...

> There need to be 
> more compiler smarts to figure out whether a kernel is a valid candidate 
> for skipping the offloading.

... that would be better, obviously.  But, I suggest we work on that
incrementally, after fixing the performance regression with my "avoid
offloading" patch.

I have difficulty coming up with an algorithm/parametrization to have
the compiler/runtime decide whether offloading will be profitable given
input parameters such as a ratio of parallelized/single-threaded kernels.
So I'm all ears for suggestions in that regard.  Consider: if we encounter
a single-threaded kernel, the compiler (parloops) has just given up
"understanding" the user's code.  And again, implementing such heuristics
sounds to me like an incremental follow-up project, quite possibly in
combination with generally improving OpenACC kernels handling/parloops.
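
For comparison, the "all or nothing" policy discussed above boils down to
something like the following sketch.  This is not the patch's actual code:
the function name avoid_offloading and the parallelized[] array are
hypothetical, standing in for whatever record there is of whether parloops
succeeded for each kernel.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper, for illustration only: parallelized[i] says whether
   kernel i was parallelized.  Returns true if offloading should be avoided
   altogether, that is, if at least one kernel would run single-threaded
   on the device.  */
bool avoid_offloading(const bool *parallelized, size_t n_kernels)
{
  for (size_t i = 0; i < n_kernels; i++)
    if (!parallelized[i])
      return true;
  return false;
}
```

Any smarter heuristic would have to replace the "at least one" test with
something that actually estimates profitability, and that is the part I
don't have a good answer for.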


Grüße
 Thomas
