>>>>> "DS" == Dan Sugalski <[EMAIL PROTECTED]> writes:
DS> The only difference between the aligned and unaligned runs is the
DS> pointer to the aligned data is on an 8-byte boundary, and the
DS> unaligned data is the aligned pointer plus 1.
i am assuming this is an alpha (which i got to know too intimately for 9
months).
DS> The results from multiple runs varied a bit, but the time differences
DS> between aligned and unaligned access was pretty much the same.
DS> Aligned access
DS> int8 took 96 (96 elapsed) for 100000000 elements
DS> int16 took 175 (189 elapsed) for 100000000 elements
DS> int32 took 177 (194 elapsed) for 100000000 elements
DS> int64 took 192 (211 elapsed) for 100000000 elements
DS> Unaligned access
DS> int8 took 93 (92 elapsed) for 100000000 elements
DS> int16 took 218 (218 elapsed) for 100000000 elements
DS> int32 took 216 (216 elapsed) for 100000000 elements
DS> int64 took 3123 (3157 elapsed) for 100000000 elements
alphas have very severe penalties on unaligned 64 bit accesses. instead
of the cpu doing 2 fetches and a shift/mask itself, the access
actually triggers a fault and the rom code (PALcode) handles it. this
rom code is set up differently for VMS and Unix, with different sets of
builtin operations. even though they are really assembler subs, they
are atomic at the CPU level.
i am surprised the unaligned 16/32 bit accesses are only slightly
slower. that tells me the compiler is smart and is doing the unaligned
access for you. more on that below.
DS> The moral? Align your 64-bit data. :) And don't tune for your
DS> host, because when I told the compiler to generate host-specific
DS> code, the 16 and 32 bit numbers got worse by a factor of 10. For
DS> those that want numbers, the penalties generally are:
DS> (No, I don't know why unaligned access to 8-bit data is faster,
DS> but there you go)
alphas can't even grab a single byte at a time (at least before the
later byte/word extension). the smallest thing they can grab is 32
bits. bytes are accessed with a shift/mask after the fetch, which makes
getting a single byte a 2 instruction operation. a common optimization
then is to grab an aligned 64 bits, use that as a shift buffer, and
pull out a byte at a time. this is done in the c string libraries that
DEC (remember them?) wrote.
not sure why the unaligned access is faster, other than the compiler is
tricked/forced into grabbing 8 bytes and caching them, while the
aligned case may not get that treatment.
try doing random accesses and really screw up the compiler. pick a
random aligned address and add 1 (or 7) to it and see what happens. the
compiler won't see a sequential access and will have to do more work or
let the rom handle it. with an offset of 7, you can force all 16/32 bit
accesses to cross word boundaries and wreak havoc on your timings.
DS> What does this mean for perl? Probably not a whole lot, since we
DS> deal mostly with 8-bit character data. It does illustrate that it
DS> really *is* worth keeping alignment issues in mind when designing
DS> data structures. (While the compiler will, presumably, generate
DS> aligned structure members, that doesn't mean that dynamically
DS> generated arrays of them will be properly aligned...)
what about all the UTF stuff? i think allocating string buffers on 64
bit boundaries makes sense if you can stuff 16/32 bit char codes in
them. since we will be doing our own memory management, we can control
this as well. it definitely matters for structures, but i think it will
matter for dynamic data too.
uri
--
Uri Guttman --------- [EMAIL PROTECTED] ---------- http://www.sysarch.com
SYStems ARCHitecture and Stem Development ------ http://www.stemsystems.com
Learn Advanced Object Oriented Perl from Damian Conway - Boston, July 10-11
Class and Registration info: http://www.sysarch.com/perl/OOP_class.html