On Mon, Oct 10, 2016 at 03:21:31PM -0600, Jeff Law wrote:
> On 09/30/2016 04:34 AM, Segher Boessenkool wrote:
> >[ whoops, message too big, resending with the attachment compressed ]
> >
> >On Tue, Sep 27, 2016 at 03:14:51PM -0600, Jeff Law wrote:
> >>With transposition issue addressed, the only blocker I see are some
> >>simple testcases we can add to the suite. They don't have to be real
> >>extensive. And one motivating example for the list archives, ideally
> >>the glibc malloc case.
> >
> >And here is the malloc testcase.
> >
> >A very important (for performance) function is _int_malloc, which starts
> >with
> [ ... ]
> Thanks. What I think is important to note with this example is the bits
> that were pushed into the path with the sysmalloc/alloc_perturb calls.
> That's an unlikely path.

alloc_perturb is a no-op, and inlined as such: as nothing :-)

> We have to extrapolate a bit from the assembly provided. In the not
> separately shrink-wrapped version, we have a full prologue of stores and
> two instances of a full epilogue (though only one ever executes) provided.
>
> With separate shrink wrapping the (presumably) very cold path where we
> error has virtually no prologue/epilogue. That's probably a nop from a
> performance standpoint.
>
> More interesting is the path where we call sysmalloc/alloc_perturb, it's
> a cold path, but not as cold as the error path. We save/restore 4 regs
> in that case. Rather than a full prologue/epilogue. So there's clearly
> a savings there, though again, via the expect it's a cold path.
>
> Where we have to extrapolate is the hot path. Presumably on the hot
> path we're saving/restoring ~4 fewer registers. I haven't verified
> that, but that is kind of the whole point here.

We save/restore just four registers total on the hot path. And yes, that
is the point :-)

The hot exit is

.L683:
        ld 14,144(1)
        ld 15,152(1)
        ld 25,232(1)
        ld 30,272(1)
        addi 3,4,16
.L673:
        addi 1,1,288
        blr

so four GPR restores and no LR restore. Without separate shrink-wrapping
this was

.L641:
        addi 3,21,16
        b .L631
[ ... ]
.L631:
        addi 1,1,288
        ld 29,16(1)
        ld 14,-144(1)
        ld 15,-136(1)
        ld 16,-128(1)
        ld 17,-120(1)
        ld 18,-112(1)
        ld 19,-104(1)
        ld 20,-96(1)
        ld 21,-88(1)
        ld 22,-80(1)
        ld 23,-72(1)
        ld 24,-64(1)
        mtlr 29
        ld 25,-56(1)
        ld 26,-48(1)
        ld 27,-40(1)
        ld 28,-32(1)
        ld 29,-24(1)
        ld 30,-16(1)
        ld 31,-8(1)
        blr

(18 GPRs as well as LR). I didn't show this path because there is a whole
bunch of branches with inline asm in the way.

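For readers without the glibc source at hand, the shape of function being
discussed can be sketched in C roughly as below. This is only an
illustration under stated assumptions: fast_alloc, slow_alloc, struct cache
and slow_calls are hypothetical names made up for the sketch, not anything
from glibc's _int_malloc. The hot path makes no calls, and the unlikely path
keeps a value live across a call, so the link register and any callee-saved
registers are only needed there. Whether a given compiler actually saves and
restores them per path depends on the target, the GCC version and the
options used.

```c
#include <stddef.h>

/* Hypothetical cache with one ready chunk and a miss counter.  */
struct cache
{
  char *chunk;
  unsigned long misses;
};

/* Counts entries into the slow path, for illustration only.  */
static int slow_calls;

/* Hypothetical stand-in for a slow path such as sysmalloc.  Marked
   noinline so that the call below stays a real call.  */
__attribute__ ((noinline)) static void *
slow_alloc (size_t n)
{
  static char pool[256];
  slow_calls++;
  return n <= sizeof pool ? pool : NULL;
}

/* Hypothetical allocator front end, not glibc code.  */
void *
fast_alloc (struct cache *c, size_t n)
{
  char *p = c->chunk;

  /* Hot path: no calls, so LR does not need to be saved here.  */
  if (__builtin_expect (p != NULL, 1))
    {
      c->chunk = NULL;
      return p + 16;
    }

  /* Unlikely path: `c' is live across the call, so a callee-saved
     register (and a save of LR) is wanted only on this path.  */
  void *q = slow_alloc (n);
  c->misses++;
  return q;
}
```

Compiling something of this shape to assembly with and without separate
shrink-wrapping, and comparing the exits of the two paths, is the same kind
of comparison as the .L683 and .L631 listings above.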
The sysmalloc path was

.L635:
        li 4,0
.L761:
        addi 1,1,288
        mr 3,14
        ld 14,16(1)
        ld 15,-136(1)
        ld 16,-128(1)
        ld 17,-120(1)
        ld 18,-112(1)
        ld 19,-104(1)
        ld 20,-96(1)
        ld 21,-88(1)
        ld 22,-80(1)
        ld 23,-72(1)
        ld 24,-64(1)
        ld 25,-56(1)
        mtlr 14
        ld 26,-48(1)
        ld 14,-144(1)
        ld 27,-40(1)
        ld 28,-32(1)
        ld 29,-24(1)
        ld 30,-16(1)
        ld 31,-8(1)
        b sysmalloc

and now is

.L677:
        mr 3,14
        ld 15,152(1)
        ld 14,144(1)
        ld 25,232(1)
        ld 30,272(1)
        li 4,0
        addi 1,1,288
        b sysmalloc

I attach malloc.s.{no,yes}, I hope you can stomach that. Well you can
read HP-PA, heh.


Segher

malloc.s.no.gz
Description: GNU Zip compressed data
malloc.s.yes.gz
Description: GNU Zip compressed data