Dale Johannesen wrote:
> No, you should not turn on partitioning in situations where code size is important to you.
>> You are missing the point. In my example, with perfect profiling data, you still end up with
>> more code in the hot section,
> Yes.
>> i.e. more pages are actually swapped in.
> Unless the cross-section branch is actually executed, there's no reason the unconditional
> jumps should get paged in, so this doesn't follow.

If you separate the unconditional jumps from the rest of the function, you just have created a
per-function cold section. Except for corner cases, there would have to be a lot of them to
save a page of working set. And if you have that many, it will mean that the condjump can't
reach. And it is still utterly pointless to put blocks into the inter-function cold section
if that only makes the intra-function cold section larger.
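
A rough back-of-the-envelope sketch of the argument above, in Python. The 4096-byte page size and the 12-byte stub size are assumptions chosen for illustration; the 8-bit signed displacement of the SH bt/bf instructions, scaled by the 2-byte instruction size, is the only architectural fact used.

```python
import math

PAGE_SIZE = 4096      # assumed page size in bytes (illustrative)
STUB_BYTES = 12       # assumed size of one far-jump stub, literal included
BT_BF_DISP_BITS = 8   # SH bt/bf encode a signed 8-bit displacement
INSN_BYTES = 2        # SH instructions are 16 bits wide


def stubs_to_fill_page(page_size=PAGE_SIZE, stub_bytes=STUB_BYTES):
    """Number of stubs needed before they occupy a whole page."""
    return math.ceil(page_size / stub_bytes)


def max_forward_displacement(disp_bits=BT_BF_DISP_BITS, insn_bytes=INSN_BYTES):
    """Largest positive bt/bf displacement in bytes."""
    return (2 ** (disp_bits - 1) - 1) * insn_bytes


if __name__ == "__main__":
    stubs = stubs_to_fill_page()
    print("stubs needed to fill a page:", stubs)
    print("bytes occupied by those stubs:", stubs * STUB_BYTES)
    print("bt/bf forward displacement limit:", max_forward_displacement())
```

Under these assumptions the stubs that would save a page take up far more space than a conditional branch can span, which is the corner the paragraph above describes.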
So we've come from 4 bytes, one cycle:

        bf 0f
        mov #0,rn

over 6 bytes, BR issue slot during one cycle:

        bt L2
L1:
        ..
L2:     bra L1
        mov #0,rn

to 10 bytes in hot part of the hot section, 12 bytes in cold part of the hot
section, and another 10 to 12 bytes in the cold section, while the execution
time in the hot path is now two cycles (if we manage to get a good
schedule, we might execute two other instructions in these cycles, but still,
this is no better than we started out with):

.hotsection:
        bf L2
        mov.w 0f,rn
        braf @rn
        nop
0:      .word L2-0b
L1:
        ...
L2:     mov.l 0f,rn
        jmp @rn
        nop
        .balign 4
0:      .long L3
.coldsection
L3:     mov.l 0f,rn
        jmp @rn
        mov #0,rn
        .balign 4
0:      .long L1
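
The byte counts quoted above can be checked with a small tally. This is only a sketch in Python: the `size` helper and the list encoding of each sequence are hypothetical names introduced for illustration, and it assumes every SH instruction is 2 bytes, `.word` is 2 bytes, `.long` is 4 bytes, and `.balign 4` pads to the next multiple of 4.

```python
INSN = 2  # every SH instruction is 2 bytes wide


def size(items, start=0):
    """Bytes occupied by a sequence laid out from address `start`.

    Each item is "insn", ("data", nbytes) or ("balign", alignment).
    """
    addr = start
    for item in items:
        if item == "insn":
            addr += INSN
        elif item[0] == "data":
            addr += item[1]
        elif item[0] == "balign":
            addr += -addr % item[1]  # padding up to the next boundary
    return addr - start


# bf; mov
original = ["insn", "insn"]
# bt; bra; mov (delay slot)
out_of_line = ["insn", "insn", "insn"]
# bf; mov.w; braf; nop; .word
hot_part = ["insn"] * 4 + [("data", 2)]
# mov.l; jmp; delay slot; .balign 4; .long
far_stub = ["insn"] * 3 + [("balign", 4), ("data", 4)]

if __name__ == "__main__":
    print("original:", size(original))
    print("out of line:", size(out_of_line))
    print("hot part:", size(hot_part))
    print("far stub, 4-aligned start:", size(far_stub, start=0))
    print("far stub, 2 mod 4 start:", size(far_stub, start=2))
```

The far-jump stub comes out at 10 or 12 bytes depending on whether its start address is 2 or 0 modulo 4, which is where the "10 to 12 bytes" range comes from.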