Eric Blake wrote: > Looks good to me Applied, thanks.
> except for: > > | > | ! /* Compute auxiliary longword values: > | ! repeated_one is a value which has a 1 in every byte. > | ! repeated_c1 has a c1 in every byte. > | ! repeated_c2 has a c2 in every byte. */ > | ! repeated_one = 0x01010101; > | ! repeated_c1 = c1 | (c1 << 8); > | ! repeated_c2 = c2 | (c2 << 8); > | ! repeated_c1 |= repeated_c1 << 16; > | ! repeated_c2 |= repeated_c2 << 16; > | if (0xffffffffU < TYPE_MAXIMUM (longword)) > | { > | ! repeated_one |= repeated_one << 31 << 1; > | ! repeated_c1 |= repeated_c1 << 31 << 1; > | ! repeated_c2 |= repeated_c2 << 31 << 1; > | ! if (8 < sizeof (longword)) > | ! { > | ! int i; > | ! > | ! for (i = 64; i < sizeof (longword) * 8; i *= 2) > > Since we are already computing repeated_one, I'm wondering if it would be > more efficient to compute repeated_c1 = repeated_one * c1 once at the end, > rather than computing repeated_c1 in parallel with the repeated > modifications to repeated_one. You would be trading (on a 32-bit platform) 2 shifts and 2 OR against 1 32-bit multiplication, or (on a 64-bit platform) 3 shifts and 3 OR against 1 64-bit multiplication. On most platforms the multiplication is too expensive. While there are some CPUs with enough die surface on the multiplier, - some older SPARCs need 29 cycles per multiplication IIRC, - some newer CPUs need a time that depends on the number of bits in one of the two operands - either 4 or up to 8 in this case. So, if you don't precisely know the CPU type and its timings, I would not do this change. Bruno