On Fri, 7 Aug 2020, Richard Biener wrote:
I was mostly thinking of storing information like:
* don't care about the rounding mode for this operation
* may drop exceptions produced by this operation
* may produce extra exceptions
* don't care about signed zero
* may contract into FMA
* don't care about errno (for sqrt?)
etc
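For illustration only, this could be a bitmask carried as the extra
argument of the IFNs; a rough sketch with made-up names (nothing like
this exists in GCC today):

/* Hypothetical per-operation FP flags, one bit each.  */
enum fenv_op_flags
{
  FENV_IGNORE_ROUNDING    = 1 << 0, /* don't care about the rounding mode */
  FENV_MAY_DROP_EXCEPT    = 1 << 1, /* may drop exceptions this op produces */
  FENV_MAY_ADD_EXCEPT     = 1 << 2, /* may produce extra exceptions */
  FENV_IGNORE_SIGNED_ZERO = 1 << 3, /* don't care about signed zero */
  FENV_ALLOW_CONTRACT     = 1 << 4, /* may contract into FMA */
  FENV_IGNORE_ERRNO       = 1 << 5  /* don't care about errno (sqrt etc.) */
};
/* The constant last argument of the IFNs in the examples further down
   could carry such a mask.  */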
So we could leverage the same mechanism for inlining a non-ffast-math
function into a -ffast-math function, rewriting operations to IFNs?
Yes.
Though the resulting loss of optimization might offset any benefit we get
from the inlining...
I was hoping enough optimizations would still be possible. With the right
flags, the function could be marked pure or const, could be vectorized,
etc. We could go through the transformations in match.pd and copy each one
for the IFN, checking the relevant set of flags (although the
transformations might need to be done more manually in forwprop if
match.pd cannot handle them).
At least the above list somewhat suggests it wants to capture the
various -f*-math options.
Originally I only wanted rounding and exceptions, but this looked like a
sensible generalization after a previous discussion.
One complication with tracking data flow is "unknown" stuff; I'd suggest
inventing a mediator between memory state and FP state, which would
semantically be load and store operations of the FP state from/to memory.
All I can think of is making the FP state a particular variable in memory
and teaching alias analysis that those functions only read/write this
variable. What do you have in mind, splitting operations as:
fenv0 = read_fenv()
(res, fenv1) = oper(arg0, arg1, fenv0)
store_fenv(fenv1)
so that "oper" itself is const? (and hopefully simplify consecutive
read_fenv/store_fenv so there are fewer of them) I wonder if lying about
the constness of the operation may be problematic.
Kind-of. I thought to do this around "unknown" operations like function
calls only:
store_fenv(fenv0);
foo ();
fenv0 = read_fenv();
In what I described a few lines above, that's roughly what would remain
after simplification, but instead you would generate it directly, saving
some compile time if there are more floating-point operations than
unknown ones. It may help to add them also at branches/joins. And even
then it may not be sufficient. If two branches start with read_fenv or
end with store_fenv, we don't want an optimizer to merge them into a
single call outside of the branches, because then the operation itself,
being const, could move outside of the branch. ISTR that there are ways
to avoid this kind of transformation (mostly meant to avoid duplicating
an inline asm containing a hardcoded label).
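As a rough C analogue of the problem (oper stands for the would-be const
operation, names made up):

/* Stand-in for the const .IFN operation.  */
double oper (double, double);

double g (int c, double x)
{
  double r = 0.0;
  if (c)
    r = oper (x, x);   /* may set FE_OVERFLOW etc. */
  return r;
}

/* If oper() is plain "const", nothing in the IL semantics prevents
   hoisting it above the branch and executing it unconditionally,
   raising exceptions even when !c; the surrounding read_fenv/store_fenv
   are what anchor it inside the branch, so they must not be merged or
   moved out either.  */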
At expansion, I guess read_fenv/store_fenv would expand to nothing; they
were mostly there to protect the true operation, and we could still expand
(res, fenv1) = oper(arg0, arg1, fenv0)
to
res = asm_hide (oper (asm_hide (arg0), asm_hide (arg1)))
if we don't want to also model things in RTL for every target (at least to
begin with).
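Here asm_hide would be the usual empty-asm optimization barrier, along
the lines of glibc's math_opt_barrier (GNU C sketch):

#define asm_hide(x) \
  ({ __typeof (x) __x = (x); __asm ("" : "+m" (__x)); __x; })

/* The empty asm pretends to read and modify its operand, so the
   compiler can neither fold the protected operation nor move it past
   the barrier, without emitting any real instruction (apart from a
   possible spill due to the "m" constraint).  */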
I guess there's nothing else but to try ...
Suppose for example you have
_3 = .IFN_PLUS (_1, _2, 0);
_4 = .IFN_PLUS (_1, _2, 0);
the first plus may alter FP state (set inexact) but since the second plus
computes the same value we'd want to elide it(?).
Assuming there is nothing in between, I think so, yes.
Now if there's a feclearexcept() in between we can't elide it - and that
works as proposed because the memory state is inspected by
feclearexcept().
The exact effect of feclearexcept depends on how we model things. It could
be considered write-only. If the argument is FE_ALL_EXCEPT, things may
also be easier.
In some cases, with
_3 = .IFN_PLUS (_1, _2, 0);
feclearexcept (...);
_4 = .IFN_PLUS (_1, _2, 0);
we may want to elide the first IFN...
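A source-level picture of that case (assuming a compiler that honors the
pragma, which is the point of this whole exercise):

#include <fenv.h>
#pragma STDC FENV_ACCESS ON

double f (double a, double b)
{
  double unused = a + b;           /* its only observable effect is to
                                      possibly set exception flags */
  feclearexcept (FE_ALL_EXCEPT);   /* ... which are wiped right here */
  return a + b;                    /* so the first addition is dead and
                                      could be removed entirely */
}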
But I can't see how we can convince FRE that we can elide the second
plus when both are modifying memory.
Yes, that's certainly harder.
Actually, for optimization purposes, I would distinguish the case where we
care about exceptions and the case where we don't. The few times I've used
exceptions, it was only for a single operation, and I didn't expect any
optimization. On the other hand, I often use hundreds of rounded
operations where I don't care about exceptions. Those can be marked as
pure (I expect querying whether .FENV_PLUS is pure to involve looking at a
bit in its last argument), and would fit much more easily with the current
optimizations. I can't claim that my uses are representative of all uses,
though; some people may do long, regular computations and trap on
FE_INVALID...
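To make that second usage pattern concrete, here is the kind of
rounding-only code I have in mind (it needs something like
-frounding-math today, since GCC does not implement the FENV_ACCESS
pragma yet):

#include <fenv.h>
#pragma STDC FENV_ACCESS ON

/* Directed rounding to get a guaranteed upper bound on a sum; the
   exception flags are never looked at.  */
double sum_upper_bound (const double *a, int n)
{
  int save = fegetround ();
  fesetround (FE_UPWARD);
  double s = 0.0;
  for (int i = 0; i < n; i++)
    s += a[i];
  fesetround (save);
  return s;
}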
I am not that interested in exceptions, but since just rounding does not
match a standard feature, it seemed more sensible to handle both together.
I did wonder about making two sets of functions: the ones with exceptions
(much harder for optimization, although not completely hopeless if people
are really motivated) and the pure ones without exceptions, so that the
first wouldn't hinder the second too much. But starting with the strictest
version looked reasonable.
There's currently no such thing as "the effects on memory state depend
only on the arguments".
This reminds me of the initialization of static/thread_local variables in
functions, when Jason tried to add an attribute, but I don't think it was
ever committed, and the semantics were likely too different.
I _think_ we don't have to say the memory output state depends on the
memory input state (the FP ENV) - well, it does, but the difference only
depends on the actual arguments.
A different rounding mode could cause different exceptions I believe.
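For example, something like this raises FE_OVERFLOW under FE_TONEAREST
but only FE_INEXACT under FE_TOWARDZERO (a sketch; the volatiles are
just there to keep the additions from being folded):

#include <fenv.h>
#include <float.h>
#include <stdio.h>
#pragma STDC FENV_ACCESS ON

int main (void)
{
  volatile double x = DBL_MAX, y = 0x1.8p970;  /* 3/4 ulp of DBL_MAX */

  fesetround (FE_TONEAREST);
  feclearexcept (FE_ALL_EXCEPT);
  volatile double r = x + y;                   /* rounds up to +inf */
  printf ("%d\n", fetestexcept (FE_OVERFLOW) != 0);  /* 1 */

  fesetround (FE_TOWARDZERO);
  feclearexcept (FE_ALL_EXCEPT);
  r = x + y;                                   /* rounds down to DBL_MAX */
  printf ("%d\n", fetestexcept (FE_OVERFLOW) != 0);  /* 0, only inexact */

  return 0;
}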
That said, tracking the FENV together with memory will complicate things,
but explicitly tracking one (or multiple?) extra FP ENV register
input/output doesn't make the problem go away (the second plus still has
the mutated FP ENV from the first plus as input). Instead we'd have to
separately track the effect of a single operation and the overall FP
state, like
(_3, flags_5) = .IFN_PLUS (_1, _2, 0);
fpexstate = merge (flags_5, fpexstate);
(_4, flags_6) = .IFN_PLUS (_1, _2, 0);
fpexstate = merge (flags_6, fpexstate);
We would have to be careful that the second and third statements above
(the first merge and the second .IFN_PLUS) cannot be swapped (unless we
keep all the merges and key expansion on those and not on the IFN? But we
may end up with a use of the sum before the merge).
or so and there we can CSE.
And I guess we would have a transformation
merge(f, merge(f, state)) --> merge(f, state)
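As a C-level sketch of why this decomposition enables CSE (all names
hypothetical): the per-operation flags depend only on the operands, and
merging them into the global state is a simple, idempotent accumulation:

/* Stand-in for the (_res, flags) = .IFN_PLUS form: apart from writing
   *flags it behaves like a const function.  */
double plus_with_flags (double a, double b, unsigned *flags);

unsigned fpexstate;   /* stands for the tracked exception state */

double h (double a, double b)
{
  unsigned f5, f6;
  double r3 = plus_with_flags (a, b, &f5);
  fpexstate |= f5;                    /* merge (f5, fpexstate) */
  double r4 = plus_with_flags (a, b, &f6);
  fpexstate |= f6;                    /* merge (f6, fpexstate) */
  /* r4/f6 duplicate r3/f5, so the second call can be CSEd, and the
     second merge then becomes merge (f5, merge (f5, state)), which
     simplifies to a single merge.  */
  return r3 + r4;
}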
We have to track the exception state separately from the FP control word
for the rounding mode for this to work. Thus when we're not interested in
the exception state, .IFN_PLUS would be "pure" (only dependent on the FP
control word)?
So I guess we should think of somehow separating rounding mode tracking
and exception state? If we make the functions affect memory anyway, we can
have the FP state reg(s) modeled explicitly with fake decl(s) and pass
those by reference to the IFNs? Then we can make use of the "fn spec"
attribute to tell which function reads/writes which reg. Across unknown
functions we'd then have to use the store/load "trick" to merge them with
the global memory state though.
Splitting the rounding mode from the exceptions certainly makes sense,
since they are used quite differently.
_3 = .FENV_PLUS (_1, _2, 0, &fenv_round, &fenv_except)
or just
_3 = .FENV_PLUS (_1, _2, 1, &fenv_round, 0)
or
_3 = .FENV_PLUS (_1, _2, 2, 0, &fenv_except)
when we are not interested in everything.
with fake global decls for fenv_round and fenv_except (so "unknown"
already possibly reads/writes them) and fn specs to say it doesn't look at
other memory? I was more thinking of making that implicit, through magic
in a couple of relevant functions (the value in flags says whether the
global fenv_round or fenv_except is accessed), as a refinement of just
"memory". But IIUC, we would need something that does not use memory at
all (not even one variable) if we wanted to avoid the big penalty in alias
analysis, etc.
If we consider the case without exceptions:
round = get_fenv_round()
_3 = .FENV_PLUS (_1, _2, opts, round)
with .FENV_PLUS "const" and get_fenv_round "pure" (or even reading round
from a fake global variable instead of a function call), that would be
tempting, but it doesn't work, since now .FENV_PLUS can be moved past a
later call to fesetround. Even without exceptions we need some protection
afterwards, so it may be easier to keep the memory (fenv) read as part of
.FENV_PLUS.
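Spelling the migration problem out (the function names are just
hypothetical stand-ins for the IFNs discussed above):

#include <fenv.h>

extern int get_fenv_round (void);                    /* "pure" */
extern double fenv_plus_const (double, double, int); /* "const" */

double k (double a, double b)
{
  int round_mode = get_fenv_round ();
  double x = fenv_plus_const (a, b, round_mode);
  fesetround (FE_TONEAREST);   /* later rounding-mode change */
  return x;
}

/* fenv_plus_const being const, only data dependences constrain it, so
   it may legally be scheduled after the fesetround call; its round_mode
   argument would still hold the old mode, but the hardware addition
   would then execute under the new one.  */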
Also, caring only about rounding doesn't match any standard #pragma, so
such an option may see very little use in practice...
Sorry for the incoherent brain-dump above ;)
It is great to have someone to discuss this with!
--
Marc Glisse