On Fri, 7 Aug 2020, Richard Biener wrote:

I was mostly thinking of storing information like:
* don't care about the rounding mode for this operation
* may drop exceptions produced by this operation
* may produce extra exceptions
* don't care about signed zero
* may contract into FMA
* don't care about errno (for sqrt?)
etc.
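
For concreteness, one way such per-operation flags could be encoded is a
bitmask passed as an extra constant argument to the IFN. This is only a
sketch; all the names below are made up for illustration:

  /* Hypothetical per-operation relaxation flags, one bit each.  */
  enum fenv_op_flags
  {
    FENV_OP_NO_ROUNDING    = 1 << 0,  /* don't care about the rounding mode */
    FENV_OP_DROP_EXCEPT    = 1 << 1,  /* may drop exceptions it produces */
    FENV_OP_EXTRA_EXCEPT   = 1 << 2,  /* may produce extra exceptions */
    FENV_OP_NO_SIGNED_ZERO = 1 << 3,  /* don't care about signed zero */
    FENV_OP_ALLOW_FMA      = 1 << 4,  /* may contract into FMA */
    FENV_OP_NO_ERRNO       = 1 << 5   /* don't care about errno (sqrt etc.) */
  };

  /* A -ffast-math caller inlined into strict code might then emit something
     like .FENV_PLUS (a, b, FENV_OP_NO_ROUNDING | FENV_OP_DROP_EXCEPT).  */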

So we could leverage the same mechanism for inlining a non-ffast-math
function into a -ffast-math function, rewriting operations to IFNs?

Yes.

Though the resulting loss of optimization might offset any benefit we get from the inlining...

I was hoping enough optimizations would still be possible. With the right flags, the function could be marked pure or const, could be vectorized, etc. We could go through the transformations in match.pd and copy each one for the IFN, checking the relevant set of flags (although they might need to be more manual in forwprop if match.pd cannot handle them).

At least the above list somewhat suggests it wants to capture the various -f*-math options.

Originally I only wanted rounding and exceptions, but this looked like a sensible generalization after a previous discussion.

One complication with tracking data-flow is "unknown" stuff; I'd suggest
inventing a mediator between memory state and FP state which would
semantically be load and store operations of the FP state from/to memory.

All I can think of is to make the FP state a particular variable in memory, and
teach alias analysis that those functions only read/write this
variable. What do you have in mind, splitting operations as:

fenv0 = read_fenv()
(res, fenv1) = oper(arg0, arg1, fenv0)
store_fenv(fenv1)

so that "oper" itself is const? (and hopefully simplify consecutive
read_fenv/store_fenv so there are fewer of them) I wonder if lying about
the constness of the operation may be problematic.

Kind-of.  I thought to do this around "unknown" operations like function
calls only:

store_fenv(fenv0);
foo ();
fenv0 = read_fenv();

In what I described a few lines above, that's roughly what would remain after simplification, but instead you would generate it directly, saving some compile time when there are more floating-point operations than unknown ones. It may help to add them also at branches/joins, and even then it may not be sufficient. If two branches both start with a read_fenv or both end with a store_fenv, we don't want an optimizer to merge them into a single call outside the branches, because then the operation itself, being const, could move out of its branch. ISTR that there are ways to avoid this kind of transformation (mostly meant to avoid duplicating an inline asm containing a hardcoded label).

At expansion time, I guess read_fenv/store_fenv would expand to nothing; they were mostly there to protect the true operation, and we could still expand
(res, fenv1) = oper(arg0, arg1, fenv0)
to
res=asm_hide(oper(asm_hide(arg0),asm_hide(arg1)))
if we don't want to also model things in RTL for every target (at least to begin with).
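
To make the asm_hide idea concrete, here is one way it could look at the
source level, using an empty asm to hide each operand and the result from the
optimizer. The macro name and constraint choice are only placeholders:

  /* Hypothetical barrier: force the optimizer to treat the value as unknown
     by passing it through an empty asm (GNU C statement expression).  */
  #define ASM_HIDE(x) \
    ({ __typeof__ (x) _v = (x); __asm__ ("" : "+m" (_v)); _v; })

  double fenv_plus (double a, double b)
  {
    /* The addition itself stays a plain PLUS_EXPR, but nothing can be
       constant-folded, CSEd or moved across the hidden operands/result.  */
    return ASM_HIDE (ASM_HIDE (a) + ASM_HIDE (b));
  }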

I guess there's nothing else but to try ...

Suppose for example you have

_3 = .IFN_PLUS (_1, _2, 0);
_4 = .IFN_PLUS (_1, _2, 0);

the first plus may alter FP state (set inexact) but since the second plus
computes the same value we'd want to elide it(?).

Assuming there is nothing in between, I think so, yes.

Now if there's a feclearexcept() in between we can't elide it - and that works as proposed because the memory state is inspected by feclearexcept().

The exact effect of feclearexcept depends on how we model things. It could be considered write-only. If the argument is FE_ALL_EXCEPT, things may also be easier.
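
For example, in the common pattern below (just an illustration), a
feclearexcept (FE_ALL_EXCEPT) overwrites the whole exception state regardless
of which flags were already raised, so modeling it as a pure write of the
exception "variable" would be enough:

  #include <fenv.h>
  #pragma STDC FENV_ACCESS ON

  int overflows (double a, double b)
  {
    feclearexcept (FE_ALL_EXCEPT);          /* kills any pending flags */
    volatile double r = a * b;              /* may raise FE_OVERFLOW */
    (void) r;
    return fetestexcept (FE_OVERFLOW) != 0;
  }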

In some cases, with

_3 = .IFN_PLUS (_1, _2, 0);
feclearexcept (...);
_4 = .IFN_PLUS (_1, _2, 0);

we may want to elide the first IFN...

But I can't see how we can convince FRE that we can elide the second plus when both are modifying memory.

Yes, that's certainly harder.

Actually, for optimization purposes, I would distinguish the case where we care about exceptions and the case where we don't. The few times I've used exceptions, it was only for a single operation, and I didn't expect any optimization. On the other hand, I often use hundreds of rounded operations where I don't care about exceptions. Those can be marked as pure (I expect querying whether .FENV_PLUS is pure to involve looking at a bit in its last argument), and would fit much more easily with the current optimizations. I can't claim that my uses are representative of all uses, though; some people may do long, regular computations and trap on FE_INVALID...

I am not that interested in exceptions, but since just rounding does not match a standard feature, it seemed more sensible to handle both together. I did wonder about making two sets of functions: the ones with exceptions (much harder for optimization, although not completely hopeless if people are really motivated) and the pure ones without exceptions, so the first wouldn't hinder the second too much. But starting with the strictest version looked reasonable.

There's currently no way to say that the effects on the memory state only depend on the arguments.

This reminds me of the initialization of static/thread_local variables in functions, when Jason tried to add an attribute, but I don't think it was ever committed, and the semantics were likely too different.

I _think_ we don't have to say the mem out state depends on the mem in state (FP ENV) - well, it does, but the difference only depends on the actual arguments.

A different rounding mode could cause different exceptions I believe.
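
A concrete instance (assuming I'm reading IEEE 754 correctly): DBL_MAX + 1.0
raises only inexact when rounding to nearest, but also overflow when rounding
upward, because the sum is then rounded away from DBL_MAX to infinity:

  #include <fenv.h>
  #include <float.h>
  #pragma STDC FENV_ACCESS ON

  int overflow_depends_on_rounding (void)
  {
    volatile double x = DBL_MAX;
    int ovf_nearest, ovf_upward;

    fesetround (FE_TONEAREST);
    feclearexcept (FE_ALL_EXCEPT);
    volatile double a = x + 1.0;   /* rounds down to DBL_MAX: inexact only */
    ovf_nearest = fetestexcept (FE_OVERFLOW) != 0;

    fesetround (FE_UPWARD);
    feclearexcept (FE_ALL_EXCEPT);
    volatile double b = x + 1.0;   /* rounds up to +inf: overflow as well */
    ovf_upward = fetestexcept (FE_OVERFLOW) != 0;

    (void) a; (void) b;
    fesetround (FE_TONEAREST);
    return ovf_upward && !ovf_nearest;   /* expected to be 1 */
  }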

That said, tracking FENV together with memory will complicate things,
but explicitly tracking an (or multiple?) extra FP ENV register input/output
does not make the problem go away (the second plus still has the mutated
FP ENV from the first plus as input).  Instead we'd have to separately
track the effect of a single operation and the overall FP state, like

(_3, flags_5) = .IFN_PLUS (_1, _2, 0);
fpexstate = merge (flags_5, fpexstate);
(_4, flags_6) = .IFN_PLUS (_1, _2, 0);
fpexstate = merge (flags_6, fpexstate);

We would have to be careful that lines 2 and 3 cannot be swapped (unless we keep all the merges and key expansion on those and not on the IFN? But we may end up with a use of the sum before the merge).

or so and there we can CSE.

And I guess we would have a transformation
merge(f, merge(f, state)) --> merge(f, state)
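
If the per-operation flags and the accumulated state are both bitmasks of
raised exceptions (sticky flags), merge would presumably just be a bitwise OR,
which makes that simplification easy to see. A minimal sketch under that
assumption:

  /* Sketch, assuming the exception state is an FE_*-style bitmask.  */
  static inline unsigned
  merge (unsigned flags, unsigned state)
  {
    return state | flags;   /* exception flags are sticky */
  }

  /* merge (f, merge (f, state)) == (state | f) | f == state | f
     == merge (f, state).  */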

We have to track the exception state separately
from the FP control word (rounding mode) for this to work.  Thus when
we're not interested in the exception state, .IFN_PLUS would be 'pure'
(only dependent on the FP CW)?

So I guess we should think of somehow separating rounding mode tracking
and exception state?  If we make the functions affect memory anyway,
we can have the FP state reg(s) modeled explicitly with fake decl(s) and pass
them by reference to the IFNs?  Then we can make use of the "fn spec" attribute
to tell which function reads/writes which reg.  Across unknown functions we'd
then have to use the store/load "trick" to merge them with the global
memory state though.

Splitting the rounding mode from the exceptions certainly makes sense, since they are used quite differently.


_3 = .FENV_PLUS (_1, _2, 0, &fenv_round, &fenv_except)
or just
_3 = .FENV_PLUS (_1, _2, 1, &fenv_round, 0)
or
_3 = .FENV_PLUS (_1, _2, 2, 0, &fenv_except)
when we are not interested in everything.

with fake global decls for fenv_round and fenv_except (so "unknown" already possibly reads/writes them) and fn specs to say the IFN doesn't look at other memory? I was more thinking of making that implicit, through magic in a couple of relevant functions (the value in the flags argument says whether the global fenv_round or fenv_except is accessed), as a refinement of just "memory".

But IIUC, we would need something that does not use memory at all (not even one variable) if we wanted to avoid the big penalty in alias analysis, etc.

If we consider the case without exceptions:

round = get_fenv_round()
_3 = .FENV_PLUS (_1, _2, opts, round)

with .FENV_PLUS "const" and get_fenv_round "pure" (or even reading round from a fake global variable instead of a function call) would be tempting, but it doesn't work, since now .FENV_PLUS can migrate after a later call to fesetround. Even without exceptions we need some protection after, so it may be easier to keep the memory (fenv) read as part of .FENV_PLUS.

Also, caring only about rounding doesn't match any standard #pragma, so such an option may see very little use in practice...

Sorry for the incoherent brain-dump above ;)

It is great to have someone to discuss this with!

--
Marc Glisse
