On 12/13/2013 11:01 AM, David Laight wrote:
My thoughts exactly.
Given this is a hash it could crc alternate words into separate
accumulators and then combine the values at the end.
That way you are still doing sequential accesses to the data.
(The crc instruction might be better than an xor for the combine.)
If the cpu has 3 execution units that can do crc, use them all.

It might be that the hash function is now an insignificant cost.
Looking at how much hashing the data twice (discarding the first
result - assign to global volatile data) slows things down can
help determine this.
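A minimal sketch of that experiment, assuming a placeholder hash: toy_hash() (FNV-1a), hash_twice() and hash_sink are hypothetical names used only for illustration, not anything from the patch under discussion. The first result is stored to a global volatile so the compiler cannot drop the extra call.

```c
#include <stddef.h>
#include <stdint.h>

/* Global volatile sink (hypothetical name).  Storing the discarded
 * result here keeps the compiler from eliminating the first hash. */
volatile uint32_t hash_sink;

/* Placeholder hash (32-bit FNV-1a over bytes), standing in for
 * whatever hash function is actually being evaluated. */
static uint32_t
toy_hash(const void *data, size_t n)
{
    const unsigned char *p = data;
    uint32_t h = 2166136261u;          /* FNV offset basis */

    for (size_t i = 0; i < n; i++) {
        h = (h ^ p[i]) * 16777619u;    /* FNV prime */
    }
    return h;
}

/* Drop-in replacement that does the hashing work twice and returns
 * the same value as a single call. */
uint32_t
hash_twice(const void *data, size_t n)
{
    hash_sink = toy_hash(data, n);     /* first result is thrown away */
    return toy_hash(data, n);
}
```

Swapping hash_twice() in for the real hash call and rerunning the usual benchmark should show roughly how much of the total time the hash accounts for: if throughput barely moves, the hash is probably not the bottleneck.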

On i7 CPUs the crc32 instruction (with either a 32-bit or a 64-bit operand) has a throughput of one instruction per cycle and a latency of 3 cycles [1]. This means that 1) with this code, where each crc32 depends on the result of the previous one, we pay 3 clocks per crc32 instruction, and 2) we could keep three CRCs in flight at once, each processing 1/3 of the data. This could in theory provide 3x the performance.
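A rough sketch of the three-chain idea, under some loud assumptions: crc32c_step() is a software stand-in that models one crc32 instruction on a 32-bit word (on SSE4.2 hardware it would be the _mm_crc32_u32() intrinsic from <nmmintrin.h>), and hash_words_3way() is a hypothetical name, not a function in the tree. The result is a hash, not the CRC of the whole buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Software model of one crc32 step on a 32-bit word (CRC-32C,
 * reflected polynomial 0x82F63B78).  Stand-in for the hardware
 * instruction so that the sketch builds anywhere. */
static uint32_t
crc32c_step(uint32_t crc, uint32_t word)
{
    crc ^= word;
    for (int i = 0; i < 32; i++) {
        crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
    return crc;
}

/* Hypothetical hash: split the input into three contiguous chunks and
 * run one independent CRC chain per chunk.  The three steps in the
 * loop body do not depend on each other, so a CPU with 3-cycle crc32
 * latency and one-per-cycle throughput can overlap them. */
uint32_t
hash_words_3way(const uint32_t *p, size_t n_words, uint32_t basis)
{
    assert(p != NULL);

    size_t chunk = n_words / 3;
    const uint32_t *a = p;
    const uint32_t *b = p + chunk;
    const uint32_t *c = p + 2 * chunk;
    uint32_t h0 = basis, h1 = basis, h2 = basis;

    for (size_t i = 0; i < chunk; i++) {
        h0 = crc32c_step(h0, a[i]);
        h1 = crc32c_step(h1, b[i]);
        h2 = crc32c_step(h2, c[i]);
    }

    /* The 0, 1 or 2 leftover words go through the first chain. */
    for (size_t i = 3 * chunk; i < n_words; i++) {
        h0 = crc32c_step(h0, p[i]);
    }

    /* Combine with crc steps rather than xor, so that equal chains
     * do not cancel each other out. */
    return crc32c_step(crc32c_step(h0, h1), h2);
}
```

The boundary computation and the two combining steps at the end are exactly the fixed overhead that could eat the gain for short keys, so this would need measuring before drawing conclusions.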

For short keys (~100 bytes or less) there is a chance that the 3x theoretical speedup will be destroyed by the additional code required to compute boundaries, xor the results, etc. But as I already mentioned, this is something to try.

[1] http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/fast-crc-computation-paper.pdf