On Mon, Jun 13, 2016 at 11:32:11PM +0200, Mattias Andrée wrote: > Proofreading and suggestions is greatly appreciated! >
Page 5: "Branches are incredibly cheap on modern CPUs." Not so! I wrote a variety of CRC32 algorithms to benchmark (which will just measure my personal CPU, I know, but it's better than guesswork), and had two different bitwise versions: for (i = 0; i < 8; i++) { c = crc & 1; crc >>= 1; if (c) crc ^= 0xedb88320ul; } And: for (i = 0; i < 8; i++) { crc = (crc >> 1) ^ (-(crc & 1) & 0xedb88320ul); } The second version had twice the throughput of the first. (Of course that 4-bytes-at-a-time version had the highest throughput, after that the times got bigger again, but that's beside the point.) That remained even when writing the loop in assembly by hand (the compiler refused to just write this:) shr eax jnc 1f xor eax, edb88320ul 1: And that even though the second version is longer in assembly. But eliminating that branch really did increase the throughput immensly. Although, I don't know what would happen on a different architecture. Ciao, Markus