Hello, first and foremost, sorry for the late reply. I needed to take a
longer leave of absence for family reasons.
Comments inline:

On Thu, Aug 22 2024, Dhruv Chawla via Gcc wrote:

> * Table Of Contents *
>
> - Introduction
> - Motivating Test Cases
> - Proposed Solution
> - Other Options
> - Existing Solutions
> - Primary Optimizations To Be Improved
> - Front-End Attribute For Recording Return-Value Information
> - Future Work
> - References
> - Questions
>
> * Introduction *
>
> Return jump functions offer a way of modeling return values of a function
> in terms of its formal parameters, similar to how regular jump functions
> can be used to model the actual parameters of a callee at a callsite in
> terms of the formal parameters of the caller.
>
> By discovering this information and propagating it across the program
> callgraph in LTO mode, various transformations like value-range
> propagation, sibling-call and chaining-call optimization can be augmented
> to take advantage of it and potentially perform better optimizations.
>
> Currently, ipa-cp models parameters using jump functions, propagates the
> values of those jump functions by iterating to a fixed-point using
> lattices, and clones functions by taking decisions from those lattice
> values.
>
> We propose extending ipa-cp with the analysis and propagation of return
> jump functions, starting with the basic case of a direct return of the
> form "return x;" where "x" is any formal parameter of the function.
>
> * Motivating Test Cases *
>
> * Optimizing tail calls:
>
> - PR92867
>
> char buf[128] = { 1 };
>
> char *foo (__SIZE_TYPE__ n)
> {
>   return __builtin_memset (buf, ' ', n);
> }
>
> char *bar (__SIZE_TYPE__ n)
> {
>   __builtin_memset (buf, ' ', n);
>   return buf;
> }
>
> Here, the call to __builtin_memset in foo gets successfully converted to a
> tail call; however, the call in bar cannot be optimized to a tail call.
> This is an optimization that LLVM is able to perform.
>
> Link to compiler explorer: https://godbolt.org/z/E81axqWTb
>
> - PR67797
>
> #include <string.h>
>
> void *my_func(void *s, size_t n)
> {
>   memset(s, 0, n);
>   return s;
> }
>
> This is a similar case to the above example. LLVM is able to optimize the
> call to memset to be a tail call as well.
>
> Link to compiler explorer: https://godbolt.org/z/YzjGc59f8
>
> * Optimizing chaining calls:
>
> #include <iostream>
>
> void f (int x, int y)
> {
>   std::cout << x;
>   std::cout << y;
> }
>
> This function can be optimized to the following:
>
> #include <iostream>
>
> void f (int x, int y)
> {
>   std::cout << x << y;
> }
>
> LLVM is able to perform this optimization as well:
> https://godbolt.org/z/MW75on1o5
>
> * Proposed Solution *
>
> * Extending IPA-CP:
>
> 1. Add return jump function data structures
>    - This involves updating ipa_node_params to contain information
>      regarding the return statements of the function, namely the lattice
>      and the jump function describing the return value, where both use
>      existing data structures.
>    - The ipa_node_params class is reused to avoid introducing a new class
>      and a corresponding function_summary type, though it is trivial to
>      add one if that is deemed a better solution. The
>      ipa_return_value_summary class may also be a good place for this
>      information; however, using it would involve moving it from
>      ipa-prop.cc to ipa-prop.h.

I believe that ipa_return_value_summary should become a member of
ipa_node_params (perhaps we could think of a better name for it) and
ipa_return_value_sum should be removed. All the infrastructure that exists
to create ipa_return_value_summary should be reused or extended to also
cover the new needs.
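To illustrate the direction I have in mind (a rough sketch only, with
made-up member names rather than the actual layout in ipa-prop.h), the
per-function return-value data would simply live in ipa_node_params and
reuse ipa_jump_func, so that the pass-through, constant and "arithmetic"
forms all come for free:

  /* Sketch only: roughly how the return-value information could be folded
     into ipa_node_params instead of living in the separate
     ipa_return_value_sum summary.  Member names are illustrative.  */
  class ipa_node_params
  {
  public:
    /* ... existing members describing the formal parameters ...  */

    /* What is today ipa_return_value_summary: the value range already
       computed for the return value, plus the new return jump function
       describing how the returned value relates to the parameters.  */
    ipa_vr *return_vr;
    ipa_jump_func *return_jfunc;
  };

The code that currently creates and looks up ipa_return_value_summary would
then be reused, or extended, as part of the ipa_node_params machinery.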
> - Additionally, ipa_edge_args is updated to track whether or not it is a
>   callgraph edge originating from a return statement. This enables the
>   propagation of information in the WPA phase.

I don't understand. I have read your reply to Honza's question about this
and I still don't understand what you mean here. I believe we want to have
a return jump function for each call graph edge, certainly for each one
that calls a non-void function.

> 2. LGEN
>    - Set up return jump functions for each function similar to the
>      parameter jump functions. When it cannot be conclusively determined
>      that one formal parameter is always returned (such as conditionally
>      returning either of two), the jump function is marked as invalid.

I think we should have one for each function. And it should not be similar
but rather exactly the same thing, capable of recording the fact that we
always return a constant or of holding "arithmetic" jump functions. Since I
assume that at least initially we do not want to create aggregate jump
functions, we may want to split ipa_jump_func into a scalar part and an
aggregate part. But perhaps not really; the memory overhead is unlikely to
be worth it.

>    - This involves looking through phi nodes when return statements of
>      conditional branches are merged into a single exit block. Note that
>      returning a call expression does not count towards invalidating the
>      information for that function.

By "returning a call expression" you mean an SSA_NAME defined in a gimple
call statement? Sometimes your email sounds like you do some part of your
analysis pre-gimple, which I think is not necessary and actually makes
things more difficult.
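On GIMPLE the whole local analysis is in fact quite small. Just to make the
simple case concrete, a conservative helper could look roughly like the
sketch below (untested, the function names are made up, and the
statement-iterator details differ a bit between releases); anything beyond
the plain "return x;" case would fill in the corresponding ipa_jump_func
variant instead of a bare parameter index:

  /* Sketch only: return the index of the formal parameter that FUN always
     returns, or -1 if that cannot be established.  */

  static bool
  default_def_of_parm_p (tree t, tree fndecl, int *idx)
  {
    if (TREE_CODE (t) != SSA_NAME
        || !SSA_NAME_IS_DEFAULT_DEF (t)
        || !SSA_NAME_VAR (t)
        || TREE_CODE (SSA_NAME_VAR (t)) != PARM_DECL)
      return false;
    int i = 0;
    for (tree p = DECL_ARGUMENTS (fndecl); p; p = DECL_CHAIN (p), i++)
      if (p == SSA_NAME_VAR (t))
        {
          *idx = i;
          return true;
        }
    return false;
  }

  static int
  returned_param_index (function *fun)
  {
    int result = -1;
    edge e;
    edge_iterator ei;

    FOR_EACH_EDGE (e, ei, EXIT_BLOCK_PTR_FOR_FN (fun)->preds)
      {
        gimple *last = gsi_stmt (gsi_last_bb (e->src));
        greturn *ret = safe_dyn_cast <greturn *> (last);
        if (!ret)
          continue;
        tree retval = gimple_return_retval (ret);
        if (!retval || TREE_CODE (retval) != SSA_NAME)
          return -1;

        int idx;
        if (default_def_of_parm_p (retval, fun->decl, &idx))
          ;
        else if (gphi *phi = dyn_cast <gphi *> (SSA_NAME_DEF_STMT (retval)))
          {
            /* Returns from several branches merged into the exit block:
               all PHI arguments must be the same parameter.  Arguments
               defined by call statements are skipped, matching the idea
               that returning a call result does not invalidate the
               information.  */
            idx = -1;
            for (unsigned i = 0; i < gimple_phi_num_args (phi); i++)
              {
                tree arg = gimple_phi_arg_def (phi, i);
                if (TREE_CODE (arg) == SSA_NAME
                    && is_gimple_call (SSA_NAME_DEF_STMT (arg)))
                  continue;
                int aidx;
                if (!default_def_of_parm_p (arg, fun->decl, &aidx)
                    || (idx != -1 && aidx != idx))
                  return -1;
                idx = aidx;
              }
            if (idx == -1)
              return -1;
          }
        else
          return -1;

        /* All return statements must agree on the same parameter.  */
        if (result == -1)
          result = idx;
        else if (result != idx)
          return -1;
      }
    return result;
  }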
> 3. WPA
>    - Implement return value information propagation. It is not possible
>      to do this in the existing propagate_constants_topo function, because
>      it iterates in reverse postorder, i.e. from caller to callee.
>      However, return value propagation needs to be done from callee to
>      caller, thus there is a need to iterate in a postorder fashion.

I think that adding another sweep in the opposite direction to
propagate_constants_topo is the right thing for an initial implementation.

In the long run, we might want to take advantage of the fact that
ipcp_value structures (stored in ipcp_lattice) know their "sources", and so
we could encode that one parameter gets a value returned by a function
where we know what that value is (a constant, a parameter which we know, or
the result of an arithmetic jump function). However, that would mean
rewiring the main loop in propagate_constants_topo, and perhaps elsewhere,
to iterate over a dependency sort of these values rather than over
functions (we already compute that when we compute SCCs of values, though),
which might have all sorts of interesting consequences... so that is
probably a follow-up project.

I understand that the need to encode "param n plus four" means the lattices
as they are won't do; initially I'd just extend ipcp_value, and only if it
made sense to save memory would I move these extra bits to a separate
descendant class. In any case I'd strive to stay as close to the existing
lattice data structures as possible, even if we initially only make use of
those which hold exactly one value.

>    - The lattice values are initialized using the jump functions computed
>      in the LGEN phase. These values are then propagated over the
>      callgraph.
>    - The existing lattices are reused, with three possible values like
>      usual. The possible values are:
>
>      return_lattice -> { TOP, param_decl, BOTTOM }
>
>      where TOP and BOTTOM are defined as usual. param_decl refers to the
>      tree of either the parameter declaration or its type, whichever is
>      available. The meet operator is defined as usual for TOP and BOTTOM.
>      When both are param_decl, the following meet operation is defined:
>
>      meet(x, y) = x if x == y, BOTTOM otherwise
>
>    - Finally, nodes to which no information could be propagated are marked
>      as BOTTOM and the BOTTOM value is propagated to a fixed-point. This
>      is required when information about a caller has been inferred but not
>      about one of its callees. For example:
>
>      extern int foo(int);
>
>      __attribute__((noinline))
>      int bar (int z)
>      {
>        return foo (z);
>      }
>
>      __attribute__((noinline))
>      int baz (int z)
>      {
>        return z;
>      }
>
>      int quux (int z)
>      {
>        if (z & 1)
>          return bar (z);
>        else
>          return baz (z);
>      }
>
>      In this case, after propagation, the lattice values look like the
>      following:
>      - bar/ret -> TOP

Presumably BOTTOM?

>      - baz/ret -> baz/0 (first parameter being returned)
>      - quux/ret -> quux/0 (lattice partially inferred from baz)
>
>      Here, we are not able to infer any information about bar because it
>      calls a function foo that is not visible to the compiler. If we only
>      set bar to BOTTOM and do not propagate this BOTTOM value to quux, we
>      will incorrectly infer that quux always returns its 1st parameter.

I assume you mean that leaving it at TOP would lead to miscompilation?

Note that our ipa-cp lattices can hold multiple values (which we then
potentially use for function cloning) and only become BOTTOM if the
situation happens to be totally hopeless. Plus we can have a situation
where we store a bunch of known values from known sources but also have a
"contains_variable" bit set, which means that in other contexts the values
are unknown. I assume that you have not thought about cloning for values
deduced through return jump functions, so initially only lattices with the
bit cleared and only one stored value would be used, but see the bit above
about rewiring propagate_constants_topo.

> 4. LTRANS
>    - The computed information is used to set a field denoting the returned
>      formal parameter in the ipcp_transformation class. This is streamed
>      out to LTO bytecode at the end of WPA and read back in the LTRANS
>      phase.
>    - This information is currently used to annotate functions with the
>      "fn spec" internal annotation in a prototype patch. This is only a
>      stop-gap measure given the limitations that exist on "fn spec"
>      (detailed in the next section).

Creating fnspec seems like a weird choice, but if it works for you... I'd
rather see the result of the analysis recorded in ipa_param_adjustments -
remember that we also want to optimize when we know a function returns a
compile-time constant - and make ipa_param_adjustments::modify_call simply
change the definition of the SSA_NAME defined by such a call to whatever we
know the value is (a constant, an SSA_NAME of a parameter, or a new
statement performing an encoded arithmetic jump function).
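Just to show what I mean, at a single call site the transform boils down to
something like the following sketch (untested, name made up; in reality it
would live in ipa_param_adjustments::modify_call or in the IPA-CP transform
phase):

  /* Sketch only: CALL is known to always return its IDX-th argument.  Let
     all users of the call's result use that argument directly; the call
     itself stays for its side effects.  If the analysis determined that a
     compile-time constant is returned, REPL would simply be that constant
     instead.  */

  static void
  use_known_return_value (gcall *call, unsigned idx)
  {
    tree lhs = gimple_call_lhs (call);
    if (!lhs
        || TREE_CODE (lhs) != SSA_NAME
        || idx >= gimple_call_num_args (call))
      return;

    tree repl = gimple_call_arg (call, idx);
    if (!useless_type_conversion_p (TREE_TYPE (lhs), TREE_TYPE (repl)))
      return;

    replace_uses_by (lhs, repl);
  }

Doing it there, rather than through fnspec, makes the knowledge available
to every later pass without teaching each of them about a new annotation.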
> Consider the following example:
>
> __attribute__((noinline))
> static int f3 (int x, int y)
> {
>   use (y);
>   return x;
> }
>
> int f2 (int x, int y)
> {
>   return f3 (x, y);
> }
>
> After LGEN, the jump functions will look like the following:
> - f3/ret -> f3/0 (f3 returns the 0th parameter, i.e. x)
> - f2/ret -> unknown (but not invalid, because it returns a call)
>
> Therefore, the lattices are initialized at the start of WPA in the
> following manner:
> - f3/lat -> 0
> - f2/lat -> TOP
>
> Simple propagation gives f2/lat the final value of 0. After LTRANS, with
> the current patch, the code would end up like the following:
>
> __attribute__((fn spec ("1 "), noinline))
> static int f3 (int x, int y)
> {
>   use (y);
>   return x;
> }
>
> __attribute__((fn spec ("1 ")))
> int f2 (int x, int y)
> {
>   return f3 (x, y);
> }
>
> * Other Options *
>
> * ipa-pure-const pass:
>
> The "ipa-pure-const" pass can also be used to implement this analysis by
> using the same local analysis and storing the inferred information in the
> function declaration using the "fn spec" annotation. However, this is
> quite restrictive because it does not allow handling cases such as
> "return x + 1;", where the formal parameter has an additional binary
> operation with a constant operand.
>
> The "fn spec" annotation is also quite limited in that it only allows
> tracking returns of up to the 4th formal parameter of the function.
>
> * Existing Solutions *
>
> Patch 53ba8d669550d3a1f809048428b97ca607f95cf5 introduces a new class,
> ipa_return_value_summary, to store range information about the return
> value and extends the tree-vrp pass to store inferred information in this
> class. This is then used in fold_using_range. However, this only works
> within a single translation unit.
>
> We propose extending this solution by integrating it into the ipa-cp
> infrastructure and propagating this information across the whole-program
> callgraph.
>
> * Primary Optimizations To Be Improved *
>
> 1. Value Range Propagation
> 2. Sibling Call Optimization
> 3. Chaining Call Optimization
>
> * Front-End Attribute For Recording Return-Value Information *
>
> A new front-end attribute, "returns_arg", is proposed to explicitly
> annotate functions that directly return one of their formal parameters.
> This is useful when only function declarations are visible, as is the case
> when function definitions are present in a separate shared object file.
>
> There is a previous patch implementing returns_arg, which can be used as a
> base for implementing this:
>
> https://gcc.gnu.org/legacy-ml/gcc-patches/2020-01/msg01199.html
>
> * Future Work *
>
> 1. Handling polynomial cases, the most basic being a return statement of
>    the form "x + c", where "x" is a formal parameter and "c" is a constant
>    value. This is already representable in ipa_jump_func.

We generally refer to these as "arithmetic" jump functions, and I think
that at least in the cases where IPA-CP can deduce that a return value is a
constant, they should be supported straight away. All the infrastructure is
there; we may just need to make it a tiny bit more general.
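For illustration, resolving such an "arithmetic" return jump function
against a known constant argument is just a fold. A minimal sketch
(made-up name, using the existing convention that NOP_EXPR marks a plain
pass-through without any operation):

  /* Sketch only: the callee's return jump function says "returns PARAM
     <OP> OPERAND" (or plainly "returns PARAM" when OP is NOP_EXPR).  Given
     that PARAM is the constant INPUT at a call site, compute the returned
     value, or NULL_TREE if it does not fold to an interprocedural
     invariant.  */

  static tree
  known_returned_value (enum tree_code op, tree operand, tree input,
                        tree type)
  {
    if (!is_gimple_ip_invariant (input))
      return NULL_TREE;

    if (op == NOP_EXPR)
      return input;

    tree res = fold_binary (op, type, input, operand);
    if (res && !is_gimple_ip_invariant (res))
      res = NULL_TREE;
    return res;
  }

IPA-CP already does essentially this when it evaluates arithmetic
pass-through jump functions of parameters, which is why I think the
return-value case can reuse that infrastructure almost directly.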
> 2. Handling pass-by-reference cases, where a formal parameter of the
>    function is used as an out-value.
>
> * References *
>
> - Interprocedural Constant Propagation: A Study of Jump Function
>   Implementations: https://dl.acm.org/doi/pdf/10.1145/155090.155099
> - Interprocedural Constant Propagation:
>   https://courses.cs.washington.edu/courses/cse501/06wi/reading/callahan-pldi86.pdf
>
> * Questions *
>
> - Does this sound reasonable?

Yes. I'll be happy to help get any such work into shape and to review it.

> - Would it be better to reuse ipa_node_params, ipa_return_value_summary,
>   or introduce a new class entirely?

See above, ipa_return_value_summary should become part of ipa_node_params
and be used for this purpose.

> - What are other optimizations that could be extended with this
>   information?

See my comment about ipa_param_adjustments; that would make the information
available straight away to all subsequent passes. We may then want to
extend IPA-VR to do a similar propagation of value ranges and store the
information in the appropriate SSA_NAME range info.

> - Could the return value range information be improved in any other way?
> - Should build_int_cst be used to create the value stored in the lattice,
>   instead of the tree corresponding to the parameter?

Probably not - what if the propagated constant is a pointer (see
is_gimple_ip_invariant_address) or a floating-point number?

> - Should a new class be created to correspond to return jump functions,
>   something like ipa_return_jump_func?

No, but perhaps ipa_jump_func needs to be separated into ipa_jump_func and
ipa_jump_scalar_func, the latter without the agg component. But that may be
a premature optimization; at first I'd just use it as it is. Plus we may
want to be able to handle the aggregate bits one day too.

> - Should a constant jump function (instead of a pass-through one) be used
>   to represent cases where the returned parameter is known?

Yes.

Thank you very much for the email; as I wrote, I'll be happy to collaborate
on and review such additions. And sorry again for the long delay in my
reply, although I really was not in a position to avoid it.

Martin