Hi Bernd,

On Fri, Aug 26, 2016 at 03:03:43PM +0200, Bernd Schmidt wrote:
> Not really, I'm afraid. There still seems to be no detailed explanation
> of the algorithm. Instead, there is a vague outline
Did you read the description of 8/9?  If you think any of that needs to
be in the code, please just say so.

> and inside the code there are still meaningless comments of the form
>
> /* Deselect those epilogue components that should not be inserted
>    for this edge. */

You asked for a comment here, see
https://gcc.gnu.org/ml/gcc-patches/2016-07/msg00932.html ("what's the
purpose of this", etc.)

> Worse, I'm still not entirely certain what it is intended to achieve: I
> asked for a motivating example or two, but haven't seen one in the
> comments anywhere.

"Make faster code", like all other optimisation passes do as well.  I
commented repeatedly about (micro-)benchmark results.

The head comment starts with

+/* Separate shrink-wrapping
+
+   Instead of putting all of the prologue and epilogue in one spot, we
+   can put parts of it in places where those components are executed less
+   frequently.

and that is the long and short of it.

> My most general question would be why the algorithm
> for placing individual prologue components would be different from the
> regular shrink-wrapping algorithm for full prologues. Examples and/or
> arguments relating to how this new code acts with regard to loops are
> also particularly needed.

The full-prologue algorithm makes as many blocks run without prologue
as possible, by duplicating blocks where that helps.  If you do this
for every component you can end up with 2**40 blocks for just 40
components, not such a hot plan.  The "separate" algorithm does not
duplicate blocks at all (its only code growth is the prologue/epilogue
components it inserts, plus an occasional forwarder block).

The full-prologue algorithm only ever inserts exactly one prologue,
which is far from optimal.  But it *cannot* duplicate the prologue
with the semantics we have.  The separate components can of course be
duplicated; that is a new abstraction, and part of the requirements
for it.

> So, I think improvement is necessary in these points, but I also think
> that there's room for experimental verification by way of self-tests.

Since it is used everywhere I think that is a big waste of time (it is
tested a lot already).  Testing specific problem cases can of course
help; but we have "make check" for that already anyway.

Also, I think "self-tests" looking at the internals are just wrong (the
only sane way to update such tests when changing the code is to delete
the tests, since they should be written independently; otherwise you
just have two copies of the same thing).

And the useless abstractions are just useless.  The algorithm is
procedural; writing it in procedural style is much clearer than hiding
everything.

> If
> done thoroughly (testing the transformation on several sufficiently
> random CFGs and maybe some manually constructed ones) it would go a long
> way towards showing that at least we don't have to worry too much about
> miscompilations.

If you hadn't noticed, the algorithm is constructed in such a way as to
minimise possible miscompilations: it first determines what blocks need
what components "active", and then makes it so, in a separate (much more
trivial) phase (a toy sketch of that second phase is in the P.S. below).

Wrt your patch...  Making each component needed half of the time is not
very realistic: you'll end up with all components active in most blocks.

Using bools as parameters where they do not mean "yes/no" is confusing.

It seems you no longer do the "insert at head" and "insert at tail"
before the "insert on edge"?  This cannot work AFAICS.

Some utility funcs print dump info; the caller should, instead.
You make "components" global (as "m_components"). Segher