Hi,
Thanks for the reply!
On 16/10/24 20:25, Martin Jambor wrote:
Hello,
first and foremost, sorry for a late reply. I needed to take a larger
leave of absence for family reasons.
Comments inline:
On Thu, Aug 22 2024, Dhruv Chawla via Gcc wrote:
* Table Of Contents *
- Introduction
- Motivating Test Cases
- Proposed Solution
- Other Options
- Existing Solutions
- Primary Optimizations To Be Improved
- Front-End Attribute For Recording Return-Value Information
- Future Work
- References
- Questions
* Introduction *
Return jump functions offer a way of modeling return values of a function in
terms of its formal parameters, similar to how regular jump functions can be
used to model the actual parameters of a callee at a callsite in terms of the
formal parameters of the caller.
By discovering this information and propagating it across the program callgraph
in LTO mode, various transformations like value-range propagation, sibling-call
and chaining-call optimization can be augmented to take advantage of it and
potentially perform better optimizations.
Currently, ipa-cp models parameters using jump functions, propagates the values
of those jump functions by iterating to a fixed-point using lattices, and
clones functions by taking decisions from those lattice values.
We propose extending ipa-cp with the analysis and propagation of return jump
functions, starting with the basic case of a direct return of the form
"return x;" where "x" is any formal parameter of the function.
* Motivating Test Cases *
* Optimizing tail calls:
- PR92867
char buf[128] = { 1 };

char *foo (__SIZE_TYPE__ n)
{
  return __builtin_memset (buf, ' ', n);
}

char *bar (__SIZE_TYPE__ n)
{
  __builtin_memset (buf, ' ', n);
  return buf;
}
Here, the call to __builtin_memset in foo is successfully converted to a
tail call; however, the call in bar cannot be optimized in the same way.
This is an optimization that LLVM is able to perform.
Link to compiler explorer: https://godbolt.org/z/E81axqWTb
- PR67797
#include <string.h>

void *my_func (void *s, size_t n)
{
  memset (s, 0, n);
  return s;
}
This is similar to the previous example; LLVM is able to optimize the
call to memset into a tail call as well.
Link to compiler explorer: https://godbolt.org/z/YzjGc59f8
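In both PRs the blocker is the same: the trailing return of the destination
pointer. A hand-written sketch of the intended transform (illustrative only;
the optimization itself would work on GIMPLE):

```c
#include <stddef.h>
#include <string.h>

static char buf[128];

/* Before: the explicit "return buf;" blocks the tail-call conversion. */
char *bar (size_t n)
{
  memset (buf, ' ', n);
  return buf;
}

/* After: once the compiler knows memset returns its first argument,
   the call itself can become the returned (tail) expression. */
char *bar_opt (size_t n)
{
  return memset (buf, ' ', n);
}
```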
* Optimizing chaining calls:
#include <iostream>

void f (int x, int y)
{
  std::cout << x;
  std::cout << y;
}
This function can be optimized to the following:
#include <iostream>

void f (int x, int y)
{
  std::cout << x << y;
}
LLVM is able to perform this optimization as well:
https://godbolt.org/z/MW75on1o5
* Proposed Solution *
* Extending IPA-CP:
1. Add return jump function data structures
- This involves updating ipa_node_params to contain information regarding
  the return statements of the function, namely the lattice and the jump
  function describing the return value, where both use existing data
  structures.
- The ipa_node_params class is reused to avoid introducing a new class and a
  corresponding function_summary type, though it is trivial to add one if
  deemed a better solution. The ipa_return_value_summary class may also be a
  good place for this information; however, using it would involve moving it
  from ipa-prop.cc to ipa-prop.h.
I believe that ipa_return_value_summary should become a member of
ipa_node_params (perhaps we could think of a better name) and
ipa_return_value_sum should be removed. All infrastructure there is to
create ipa_return_value_summary should be re-used or extended to fulfill
also the new needs.
As per the reply at
https://gcc.gnu.org/pipermail/gcc/2024-September/244856.html, we ended up
splitting this out into a new structure. Having both tree and IPA passes
accessing the same structure made memory management hard.
- Additionally, ipa_edge_args is updated to track whether or not it is a
callgraph edge originating from a return statement. This enables the
propagation of information in the WPA phase.
I don't understand. I have read your reply to Honza to his question
about this and I still don't understand what you mean by this.
I believe we want to have a return function for each call graph edge,
certainly for each one that calls a non-void function.
Yeah, I believe that could have been phrased better. It just means that we track
whether or not an SSA_NAME created as a result of a gcall is used in the retval
of a return statement, either directly or through a PHI node. This is achieved
by tracking this information in ipa_edge_args (as there is one instance of this
structure for each cgraph_edge), so each gcall will have an ipa_edge_args
instance associated with it.
The return jump function is stored per-node, not per-edge. The per-edge tracking
is only done to enable propagation of the "returns_arg" attribute from callee to
caller in WPA, to potentially annotate callers with the attribute. This shows up
for example in the md5 functions in libiberty, where md5_finish_ctx calls
md5_read_ctx in a return statement, so the "returns_arg" attribute can be
transitively inferred for md5_finish_ctx from md5_read_ctx.
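A reduced sketch of that libiberty case (names and bodies are simplified
here; the real signatures live in libiberty's md5.h):

```c
#include <string.h>

struct md5_ctx { unsigned char state[16]; };

/* Directly returns its second formal parameter, so a return jump
   function for parameter 1 can be recorded locally. */
static void *
md5_read_ctx_sketch (const struct md5_ctx *ctx, void *resbuf)
{
  memcpy (resbuf, ctx->state, sizeof ctx->state);
  return resbuf;
}

/* Returns its callee's return value with the same resbuf forwarded,
   so "returns_arg" can be inferred transitively for this function. */
void *
md5_finish_ctx_sketch (struct md5_ctx *ctx, void *resbuf)
{
  /* ... final padding and rounds would go here ... */
  return md5_read_ctx_sketch (ctx, resbuf);
}
```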
2. LGEN
- Set up return jump functions for each function, similar to the parameter
  jump functions. When it cannot be conclusively determined that one formal
  parameter is always returned (such as when either of two may be returned
  conditionally), the jump function is marked as invalid.
I think we should have it for each function. And it should not be
similar but rather exactly the same thing, capable of recording the fact
that we always return a constant or holding "arithmetic" jump functions.
Since I assume that at least initially we do not want to create
aggregate jump functions, we may want to split ipa_jump_func to a scalar
part and the aggregate part. But perhaps not really, the memory
overhead is unlikely to be worth it.
Yup, I agree with that.
- This involves looking through phi nodes when return statements of
conditional branches are merged into a single exit block. Note that
returning a call expression does not count towards invalidating the
information for that function.
By "returning a call expression" you mean an SSA_NAME defined in a
gimple call statement? Sometimes your email sounds like you do some
part of your analysis pre-gimple, which I think is not necessary and
actually makes things more difficult?
There is no pre-gimple analysis. I do mean what you said, I was just not
familiar with the specific terms when writing out this RFC :)
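To make the classification concrete, small illustrative functions (not from
the patch) for each case:

```c
/* Stand-in for a callee whose body IPA may not see. */
int opaque (int v)
{
  return v + 1;
}

/* Valid: always returns formal parameter 0. */
int ret_first (int x, int y)
{
  return x;
}

/* Invalid: after the branches merge, the returned SSA_NAME is a PHI
   of two different parameters, so no single parameter is always
   returned. */
int ret_either (int x, int y, int c)
{
  return c ? x : y;
}

/* Not invalidated: the return value is an SSA_NAME defined by a call,
   which leaves room for callee-to-caller propagation in WPA. */
int ret_call (int x)
{
  return opaque (x);
}
```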
3. WPA
- Implement return value information propagation. It is not possible to do
  this in the existing propagate_constants_topo function, because it
  iterates in reverse postorder, i.e. from caller to callee. However,
  return value propagation needs to be done from callee to caller, so
  there is a need to iterate in postorder instead.
I think that adding another sweep in the opposite direction to
propagate_constants_topo is the right thing for an initial
implementation.
In the long run, we might want to take advantage of the fact that
ipcp_value structures (stored in ipcp_lattice) know their "sources",
and so we can encode that one parameter gets a value returned by a
function where we know what it is (a constant, its parameter which we
know, or the result of an arithmetic jump function).
This is very good to know! Definitely an interesting avenue to explore.
However, that would mean rewiring the main loop in
propagate_constants_topo and perhaps elsewhere to iterate over a
dependency sort of these values rather than over functions (we do
already compute that when we compute SCCs of values, though), which
might have all sorts of interesting consequences... so that is
probably a follow-up project.
I understand that the need to encode "param n plus four" means the
lattices as they are won't do; initially I'd just extend ipcp_value, and
only if it would make sense to save memory would I move these extra bits
to a separate descendant class.
In any case I'd strive to keep as close to the existing lattice data
structures as possible, even if we initially only make use of those
which hold exactly one value.
- The lattice values are initialized using the jump functions computed in
  the LGEN phase. These values are then propagated over the callgraph.
- The existing lattices are reused, with three possible values as usual:

    return_lattice -> { TOP, param_decl, BOTTOM }

  where TOP and BOTTOM are defined as usual. param_decl refers to the tree
  of either the parameter declaration or its type, whichever is available.
  The meet operator is defined as usual for TOP and BOTTOM. When both
  operands are param_decl, the following meet operation is defined:

    meet(x, y) = x if x == y, BOTTOM otherwise
- Finally, nodes to which no information could be propagated are marked as
  BOTTOM and the BOTTOM value is propagated to a fixed-point. This is
  required when information about a caller has been inferred but not about
  one of its callees. For example:
extern int foo (int);

__attribute__((noinline))
int bar (int z)
{
  return foo (z);
}

__attribute__((noinline))
int baz (int z)
{
  return z;
}

int quux (int z)
{
  if (z & 1)
    return bar (z);
  else
    return baz (z);
}
In this case, after propagation, the lattice values look like the following:
- bar/ret -> TOP
Presumably BOTTOM?
Ah, I should've been more clear. It is TOP as no information was propagated to
it, because its callee foo is not visible to the compiler. It gets set to BOTTOM
afterwards. There are three phases of the propagation:
1. Initialization of lattice values from LGEN information
2. Propagation of lattice values across the call graph from callees to callers
3. Setting nodes that had nothing propagated to them (i.e. those left as TOP) to
BOTTOM and propagating the BOTTOM to callers
The example is meant to show what the lattices look like after phase 2 and
just before phase 3.
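The three phases can be sketched with plain integers standing in for the
lattices (entirely illustrative; the real implementation works on cgraph
nodes and the existing ipcp lattice classes):

```c
#include <assert.h>

enum { TOP = -2, BOTTOM = -1 };  /* values >= 0 are parameter indices */

static int
meet (int a, int b)
{
  if (a == TOP)
    return b;
  if (b == TOP)
    return a;
  return a == b ? a : BOTTOM;
}

/* Runs the three phases on the example above and returns quux's final
   lattice value. */
int
propagate_example (void)
{
  /* Phase 1: initialize from LGEN. bar returns foo (z), so nothing is
     known locally; baz returns its parameter 0. */
  int bar = TOP, baz = 0, quux = TOP;

  /* Phase 2: propagate callee -> caller. quux returns bar (z) or
     baz (z); z is forwarded directly, so parameter indices line up. */
  quux = meet (quux, bar);      /* still TOP */
  quux = meet (quux, baz);      /* now 0: partially inferred from baz */
  assert (quux == 0);

  /* Phase 3: nodes left as TOP become BOTTOM, and BOTTOM is propagated
     to callers; otherwise quux would wrongly claim to return z. */
  if (bar == TOP)
    bar = BOTTOM;
  quux = meet (quux, bar);

  return quux;
}
```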
- baz/ret -> baz/0 (first parameter being returned)
- quux/ret -> quux/0 (lattice partially inferred from baz)
Here, we are not able to infer any information about bar because it calls a
function foo that is not visible to the compiler. If we only set bar to
BOTTOM and do not propagate this BOTTOM value to quux, we will incorrectly
infer that quux always returns its 1st parameter.
Above you meant that TOP would lead to miscompilation?
I meant that not propagating BOTTOM values for nodes that had no information
propagated to them could lead to incompletely inferred values in callers, which
could eventually lead to a miscompilation. This is the justification for the
fixed-point propagation in phase 3.
Note that our ipa-cp lattices can hold multiple values (which we then
potentially use for function cloning) and only become BOTTOM if the
situation happens to be totally hopeless. Plus we can have a situation
where we store a bunch of known values from known sources but also have
a "contains_variable" bit set which means that in other contexts the
values are unknown.
I assume that you have not thought about cloning for values deduced
through return jump functions, so initially only lattices with the bit
cleared and only one stored value would be used, but see the bit above
about rewiring propagate_constants_topo.
Yes, that is exactly what is being done.
4. LTRANS
- The computed information is used to set a field denoting the returned
  formal parameter in the ipcp_transformation class. This is streamed out
  to LTO bytecode at the end of WPA and read back in the LTRANS phase.
- This information is currently used to annotate functions with the
  "fn spec" internal annotation in a prototype patch. This is only a
  stop-gap measure given the limitations that exist on "fn spec"
  (detailed in the next section).
Creating fnspec seems like a weird choice, but if it works for you...
fnspec is (was) just a placeholder. The eventual goal is to use the
"returns_arg" attribute, as fnspec is indeed too limited for our uses.
I'd rather see the result of the analysis to be recorded in
ipa_param_adjustments - remember we want to also optimize if we know a
return returns a compile time constant - and make
ipa_param_adjustments::modify_call simply change the definition of the
SSA_NAME defined in such call to whatever we know is the value (a
constant or an SSA_NAME of a parameter or a new statement performing an
encoded arithmetic jump function).
Consider the following example:
__attribute__((noinline))
static int f3 (int x, int y)
{
  use (y);
  return x;
}

int f2 (int x, int y)
{
  return f3 (x, y);
}
After LGEN, the jump functions will look like the following:
- f3/ret -> f3/0 (f3 returns the 0th parameter, i.e. x)
- f2/ret -> unknown (but not invalid, because it returns a call)
Therefore, the lattices are initialized at the start of WPA in the following
manner:
- f3/lat -> 0
- f2/lat -> TOP
Simple propagation gives f2/lat the final value of 0. After LTRANS, with the
current patch, the code would end up like the following:
__attribute__((fn spec ("1 "), noinline))
static int f3 (int x, int y)
{
  use (y);
  return x;
}

__attribute__((fn spec ("1 ")))
int f2 (int x, int y)
{
  return f3 (x, y);
}
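For comparison, the ipa_param_adjustments::modify_call approach suggested
above would instead rewrite the call site itself; at the C source level the
effect would roughly be the following (illustrative sketch; the real
transform edits the GIMPLE SSA form):

```c
static int sink;

static void use (int v) { sink = v; }

__attribute__((noinline))
static int f3 (int x, int y)
{
  use (y);
  return x;
}

/* Before: the caller consumes the call's return value. */
int f2_before (int x, int y)
{
  int r = f3 (x, y);
  return r;
}

/* After the substitution: the call is kept for its side effects, but
   the use of its result is replaced by the parameter that is known to
   be returned. */
int f2_after (int x, int y)
{
  f3 (x, y);
  return x;
}
```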
* Other Options *
* ipa-pure-const pass:
The "ipa-pure-const" pass could also host this analysis, performing the
same local analysis and storing the inferred information in the function
declaration using the "fn spec" annotation. However, this is quite
restrictive, because it does not allow handling cases such as
"return x + 1;", where the formal parameter has an additional binary
operation with a constant operand. The "fn spec" annotation is also quite
limited in that it only allows tracking returns of up to the 4th formal
parameter of the function.
* Existing Solutions *
Commit 53ba8d669550d3a1f809048428b97ca607f95cf5 introduces a new class,
ipa_return_value_summary, to store range information of the return value and
extends the tree-vrp pass to store inferred information in this class. This is
then used in fold_using_range. However, this only works within a single
translation unit.
We propose extending this solution by integrating it into the ipa-cp
infrastructure and propagating this information across the whole-program
callgraph.
* Primary Optimizations To Be Improved *
1. Value Range Propagation
2. Sibling Call Optimization
3. Chaining Call Optimization
* Front-End Attribute For Recording Return-Value Information *
A new front-end attribute, "returns_arg", is proposed to explicitly annotate
functions that directly return one of their formal parameters. This is useful
when only function declarations are visible, as is the case when function
definitions are present in a separate shared object file.
A previous patch implementing returns_arg can be used as a base for this
work: https://gcc.gnu.org/legacy-ml/gcc-patches/2020-01/msg01199.html
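A sketch of how the attribute might be spelled, following that patch
(treating the argument as the 1-based index of the returned parameter is an
assumption here):

```c
#include <stddef.h>

/* Hypothetical spelling: promises that the function always returns its
   first argument. Compilers without the attribute ignore it (with a
   warning), so the function still behaves normally. */
__attribute__((returns_arg (1)))
void *my_memset (void *s, int c, size_t n)
{
  unsigned char *p = s;
  for (size_t i = 0; i < n; i++)
    p[i] = (unsigned char) c;
  return s;
}
```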
* Future Work *
1. Handling polynomial cases, the most basic being a return statement of the
form "x + c", where "x" is a formal parameter and "c" is a constant value.
This is already representable in ipa_jump_func.
We generally refer to these as "arithmetic" jump functions and I think
that at least when IPA-CP can deduce a return value is a constant, they
should be supported straight away. All the infrastructure is there, we
may just need to make it a tiny bit more general.
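For instance, the basic polynomial case looks like this (illustrative):

```c
__attribute__((noinline))
static int add_four (int x)
{
  return x + 4;   /* pass-through of parameter 0 plus the constant 4 */
}

/* With an arithmetic return jump function, IPA-CP could deduce that
   this call's result is the constant 14. */
int caller (void)
{
  return add_four (10);
}
```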
2. Handling pass-by-reference cases, where a formal parameter of the function
is used as an out-value.
* References *
- Interprocedural Constant Propagation: A Study of Jump Function
Implementations: https://dl.acm.org/doi/pdf/10.1145/155090.155099
- Interprocedural Constant Propagation:
https://courses.cs.washington.edu/courses/cse501/06wi/reading/callahan-pldi86.pdf
* Questions *
- Does this sound reasonable?
Yes, I'll be happy to help to get any such work into shape and review
it.
- Would it be better to reuse ipa_node_params, ipa_return_value_summary, or
introduce a new class entirely?
See above, ipa_return_value_summary should become part of
ipa_node_params and be used for this purpose.
- What are other optimizations that could be extended with this
information?
See my comment about ipa_param_adjustments, that would make the
information available straight away to all subsequent passes.
We may then want to extend IPA-VR to do a similar propagation of value
ranges and store the information into an appropriate SSA_NAME range
info.
- Could the return value range information be improved in any other way?
- Should build_int_cst be used to create the value stored in the lattice,
instead of the tree corresponding to the parameter?
Probably not, what if the propagated constant is a pointer (see
is_gimple_ip_invariant_address) or a floating point number?
- Should a new class be created to correspond to return jump functions,
something like ipa_return_jump_func?
No, but perhaps ipa_jump_func needs to be separated into ipa_jump_func
and ipa_jump_scalar_func which would not have the agg component. But
that may be a premature optimization, first I'd just use it as it is.
Plus we may want to be able to handle aggregate bits one day too.
- Should a constant jump function (instead of a pass-through one) be used to
represent cases where the returned parameter is known?
Yes.
Thank you very much for the email; as I wrote, I'll be happy to
collaborate on and review such additions. And sorry again for the long
delay in my reply, although I really was not in a position to avoid it.
Thank you so much for responding!
Martin
--
Regards,
Dhruv