Hi all,
On 12/01/16 11:32, James Greenhalgh wrote:
On Mon, Jan 11, 2016 at 04:57:56PM -0600, Evandro Menezes wrote:
On 01/11/2016 05:53 AM, James Greenhalgh wrote:
I'd like to switch the logic around in aarch64.c such that
-mlow-precision-recip-sqrt causes us to always emit the low-precision
software expansion for reciprocal square root. I have two reasons to do
this; first is consistency across -mcpu targets, second is enabling more
-mcpu targets to use the flag for peak tuning.
I don't much like that the precision we use for -mlow-precision-recip-sqrt
differs between cores (and possibly compiler revisions). Yes, we're
under -ffast-math but I take this flag to mean the user explicitly wants the
low-precision expansion, and we should not diverge from that based on an
internal decision as to what is optimal for performance in the
high-precision case. I'd prefer to keep things as predictable as possible,
and here that means always emitting the low-precision expansion when asked.
Judging by the comments in the thread proposing the reciprocal square
root optimisation, this will benefit all cores currently supported by GCC.
To be clear, we would still not expand in the high-precision case for any
cores which do not explicitly ask for it. Currently that is Cortex-A57
and xgene, though I will be proposing a patch to remove Cortex-A57 from
that list shortly.
Which gives my second motivation for this patch. -mlow-precision-recip-sqrt
is intended as a tuning flag for situations where performance is more
important than precision, but the current logic requires setting an
internal flag which also changes the performance characteristics where
high-precision is needed. This conflates two decisions the target might
want to make, and reduces the applicability of an option targets might
want to enable for performance. In particular, I'd still like to see
-mlow-precision-recip-sqrt continue to emit the cheaper, low-precision
sequence for floats under Cortex-A57.
Based on that reasoning, this patch makes the appropriate change to the
logic. I've checked with the current -mcpu values to ensure that behaviour
without -mlow-precision-recip-sqrt does not change, and that behaviour
with -mlow-precision-recip-sqrt is to emit the low precision sequences.
I've also put this through bootstrap and test on aarch64-none-linux-gnu
with no issues.
OK?
Yes, it LGTM.
Thanks.
I appreciate the idea of uniformity whne an option is specified,
which led me to think if it wouldn't be a good ide to add an option
that would have the effect of focring the emission of the reciprocal
square root, effectively forcing the flag
AARCH64_EXTRA_TUNE_RECIP_SQRT on, regardless of the tuning flags for
the given core. I think that this flag would be particularly useful
when specifying flags for specific functions, irrespective of the
core.
Thoughts?
Currently you can do this using the (mostly unsupported) -moverride
mechanism as -moverride=tune=recip_sqrt from the command line.
I'm not sure how reliable using this from
__attribute__((target("override=tune=recip_sqrt"))) would be, I wrote a small
testcase that didn't work as intended, but whether that is a bug or a
design decision I'm not yet sure. I think the logic for parsing the
target attribute is set up to reapply the command-line override string
over whichever tuning options you apply through the attribute, rather than
to allow you to apply a per-function override.
As a clarification: we don't support an "override" target attribute on aarch64.
I had a patch earlier in the year to hook up the override string parsing
machinery
into the attributes parsing code, but didn't end up proposing it.
IIRC the syntax of the override string (using '=' multiple times) would
needlessly
complicate the parsing code for something that's not intended to be used by
regular
users but rather by power users that are exploring gcc internals.
Thanks,
Kyrill
As to whether we'd want to expose this as a fully supported,
user-visible setting, I'd rather not. Our claim is that for the
higher-precision sequences the results are close enough that we can
consider this like reassociation width or other core-specific tuning
parameters that we don't expose. What I'm hoping to avoid is a
proliferation of supported options which are not in anybody's regular
testing matrix. This one would not be so bad as it is automatically
enabled by some cores. For now I'd rather not add the option.
Thanks,
James