Either cachegrind is wrong, or gcc has gotten much better since then? Or am
I interpreting the data cachegrind provides the wrong way? What do you think
about it?
Or you're comparing x86 to Power, and noticing that x86 has to execute far more data-movement instructions for silly little things, and that it hits the cache on most of those silly extra instructions?
Only collecting data side by side for the same workload on both architectures and comparing the numbers will probably yield the truth.
If cachegrind works on PPC Yellow Dog Linux... one could compare those numbers.
If I run across any arch to arch numbers, I'll post them.
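To make the x86-versus-Power point concrete, here is a rough sketch of the arithmetic. The counts are made up for illustration, not measurements, and the names `miss_rate` and `compare` are just helpers I invented. It assumes both runs do the same work and take the same number of D1 misses, but the x86 run issues twice as many data references because of the extra data movement:

```python
def miss_rate(misses, refs):
    """Fraction of data references that missed the cache."""
    return misses / refs


def compare(runs):
    """Map each run name to (miss rate, absolute misses)."""
    return {name: (miss_rate(m, r), m) for name, (r, m) in runs.items()}


# Hypothetical (D refs, D1 misses) pairs, NOT real cachegrind output.
runs = {
    "x86": (2_000_000, 10_000),  # extra data-movement refs, mostly hits
    "ppc": (1_000_000, 10_000),  # fewer refs, same number of misses
}

result = compare(runs)
for name, (rate, misses) in result.items():
    print(f"{name}: {misses} misses, miss rate {rate:.2%}")
# x86: 10000 misses, miss rate 0.50%
# ppc: 10000 misses, miss rate 1.00%
```

Under those assumptions the x86 miss rate looks twice as good even though the absolute miss count is identical, so when comparing across architectures it is probably safer to look at misses for the same workload than at the percentages alone.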