On Sun, Dec 10, 2000 at 02:53:43PM -0500, Tom Lane wrote:
> > On my Celeron, the timing for those six opcodes is almost whopping 13
> > cycles per byte.  Obviously there's some major performance hit to do the
> > memory instructions, because there's no more than 4 cycles worth of
> > dependant instructions in that snippet.
> Yes.  It looks like we're looking at pipeline stalls for the memory
> reads.

In particular, for the single-byte memory read.  By loading in 32-bit
words at a time, the cost drops to about 7 cycles per byte.  I
imagine on a 64-bit CPU, loading 64-bit words at a time would drop the
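cost even further.

  /* Assumes a 32-bit unsigned long, little-endian byte order, and that
     IUPDC32 only looks at the low byte of its first argument. */
  word1 = *(unsigned long*)z;
  while (c > 4)
    {
      z += 4;
      ick = IUPDC32 (word1, ick); word1 >>= 8;
      c -= 4;
      ick = IUPDC32 (word1, ick); word1 >>= 8;
      ick = IUPDC32 (word1, ick); word1 >>= 8;
      ick = IUPDC32 (word1, ick);
      /* Load the next word only after all four bytes are consumed. */
      word1 = *(unsigned long*)z;
    }

I tried loading two words at a time, starting to load the second word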
well before it's used, but that didn't actually reduce the time taken.
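
For anyone who wants to try the word-at-a-time idea outside the tree,
here is a rough standalone sketch. It is not the code I timed: the
names crc_table, UPDC32, load_le32, crc32_bytes and crc32_words are
made up for illustration, and the table is the ordinary reflected
CRC-32 one (polynomial 0xEDB88320), which I'm assuming matches what
IUPDC32 indexes. The word is assembled from bytes so the result does
not depend on host byte order or alignment; an optimizing compiler may
turn that into a single 32-bit load.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the real table and the IUPDC32 macro. */
static uint32_t crc_table[256];
static int crc_table_ready = 0;

static void crc_table_init(void)
{
    for (uint32_t n = 0; n < 256; n++) {
        uint32_t r = n;
        for (int k = 0; k < 8; k++)
            r = (r & 1) ? (r >> 1) ^ 0xEDB88320u : r >> 1;
        crc_table[n] = r;
    }
    crc_table_ready = 1;
}

/* Fold the low byte of `octet` into `crc`; higher bits are masked off. */
#define UPDC32(octet, crc) \
    (crc_table[((crc) ^ (octet)) & 0xffu] ^ ((crc) >> 8))

/* Reference version: one memory read per input byte. */
uint32_t crc32_bytes(const unsigned char *z, size_t c)
{
    uint32_t crc = 0xFFFFFFFFu;
    assert(z != NULL || c == 0);
    if (!crc_table_ready) crc_table_init();
    while (c--) crc = UPDC32(*z++, crc);
    return crc ^ 0xFFFFFFFFu;
}

/* Build a 32-bit word with the first byte in the low 8 bits. */
static uint32_t load_le32(const unsigned char *p)
{
    return (uint32_t)p[0] | (uint32_t)p[1] << 8 |
           (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}

/* Word-at-a-time version: one load feeds four table lookups. */
uint32_t crc32_words(const unsigned char *z, size_t c)
{
    uint32_t crc = 0xFFFFFFFFu;
    assert(z != NULL || c == 0);
    if (!crc_table_ready) crc_table_init();
    while (c >= 4) {
        uint32_t word = load_le32(z);
        crc = UPDC32(word, crc); word >>= 8;
        crc = UPDC32(word, crc); word >>= 8;
        crc = UPDC32(word, crc); word >>= 8;
        crc = UPDC32(word, crc);
        z += 4; c -= 4;
    }
    while (c--) crc = UPDC32(*z++, crc);   /* 0 to 3 leftover bytes */
    return crc ^ 0xFFFFFFFFu;
}
```

Both functions should agree on any input; with this polynomial the
usual check string "123456789" gives 0xCBF43926. Unlike the loop
above, this version never reads past the end of the buffer, at the
cost of a separate tail loop.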

> As Nathan remarks nearby, this is just minutiae, but I'm interested
> anyway...

Yup.
-- 
Bruce Guenter <[EMAIL PROTECTED]>                       http://em.ca/~bruceg/
