On Tue, Jan 9, 2018 at 6:54 AM, Segher Boessenkool
<seg...@kernel.crashing.org> wrote:
> On Tue, Jan 09, 2018 at 12:23:42PM +0000, Wilco Dijkstra wrote:
>> Segher Boessenkool wrote:
>> > On Mon, Jan 08, 2018 at 0:25:47PM +0000, Wilco Dijkstra wrote:
>> >> > Always pairing two registers together *also* degrades code quality.
>> >>
>> >> No, while it's not optimal, it means smaller code and fewer memory 
>> >> accesses.
>> >
>> > It means you execute *more* memory accesses.  Always.  This may be
>> > sometimes hidden, sure.  I'm not saying you do not want more ldp's;
>> > I'm saying this particular strategy is very far from ideal.
>>
>> No it means less since the number of memory accesses reduces (memory
>> bandwidth may increase but that's not an issue).
>
> The problem is *more* memory accesses are executed at runtime.  Which is
> why separate shrink-wrapping does what it does: to have *fewer* executed.
> (It's not just the direct execution cost why that helps: more important
> are latencies to dependent ops, microarchitectural traps, etc.).

On most AArch64 microarchitectures, one LDP/STP takes just as long as
one LDR/STR, provided the access stays within a single cache line.
So one LDP/STP is much better than two LDR/STR.  An LDP/STP really
counts as one memory access, and that is where the confusion is coming
from.  On that path we are reducing the overall number of memory
accesses, or keeping it the same.
I hope this explains why pairing does not degrade code quality but
improves it overall.
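
To make the counting concrete, here is a minimal Python sketch.  It
assumes, as described above, that an LDP/STP which stays within one
cache line counts as a single memory access.  The function name and the
register counts are made up for illustration; nothing here is taken
from GCC.

```python
import math


def access_instructions(num_regs: int, paired: bool) -> int:
    """Count the load/store instructions needed to save num_regs registers.

    Assumption: one LDP/STP moves two registers and counts as one access;
    one LDR/STR moves one register and also counts as one access.
    """
    if paired:
        # Two registers per instruction; an odd leftover needs one more.
        return math.ceil(num_regs / 2)
    return num_regs


unpaired = access_instructions(4, paired=False)  # four STRs
paired = access_instructions(4, paired=True)     # two STPs
```

Under that assumption, saving four callee-saved registers costs four
accesses unpaired and two when paired.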

Thanks,
Andrew

>
> If you make A always stored whenever B is, and the other way around, the
> optimal place to do it will always store at least as often as either A
> or B, _but can also store more often than either_.
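
A toy model of the point quoted above, as a rough Python sketch.  The
path names, frequencies, and the `executed_saves` helper are all
hypothetical.  It assumes each save sits exactly on the paths that need
it, which is much simpler than what separate shrink-wrapping actually
does in GCC.

```python
def executed_saves(path_freq, needs):
    """Return dynamic save counts as (a_alone, b_alone, pair).

    path_freq: mapping of path name -> how often that path executes.
    needs:     mapping of path name -> set of components ("A", "B")
               that must be saved on that path.
    """
    # Separate components: each one is saved only where it is needed.
    a_alone = sum(f for p, f in path_freq.items() if "A" in needs[p])
    b_alone = sum(f for p, f in path_freq.items() if "B" in needs[p])
    # Forced pairing: the pair is saved wherever either one is needed.
    pair = sum(f for p, f in path_freq.items() if needs[p])
    return a_alone, b_alone, pair


# Hypothetical execution frequencies for four paths through a function.
path_freq = {"p1": 30, "p2": 50, "p3": 20, "p4": 40}
needs = {"p1": {"A"}, "p2": {"B"}, "p3": {"A", "B"}, "p4": set()}

a_alone, b_alone, pair = executed_saves(path_freq, needs)
```

With these made-up numbers the pair is saved 100 times, more often than
A (50) or B (70) alone, while the separate scheme executes 120 single
stores in total.  Which of those counts matters more is exactly what
the thread is arguing about.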
>
>> >> That may well be the problem. So if there are N predecessors, of which N-1
>> >> need to restore the same set of callee saves, but one was shrinkwrapped,
>> >> N-1 copies of the same restores might be emitted. N could be the number
>> >> of blocks in a function - I really hope it doesn't work out like that...
>> >
>> > In the worst case it would.  OTOH, joining every combo into blocks costs
>> > O(2**C) (where C is the # components) bb's worst case.
>> >
>> > It isn't a simple problem.  The current tuning works pretty well for us,
>> > but no doubt it can be improved!
>>
>> Well if there are C components, we could limit the total number of
>> saves/restores inserted to say 4C. Similarly common cases could easily
>> share the restores without increasing the number of branches.
>
> It is common to see many saves/restores generated for the exceptional cases.
>
>
> Segher
