On 06/27/11 11:32 PM, Bill Sommerfeld wrote:
> On 06/27/11 15:24, David Magda wrote:
>> Given the amount of transistors that are available nowadays I think
>> it'd be simpler to just create a series of SIMD instructions right
>> in/on general CPUs, and skip the whole co-processor angle.
> see: http://en.wikipedia.org/wiki/AES_instruction_set
> Present in many current Intel CPUs; also expected to be present in AMD's
> "Bulldozer" based CPUs.
I recall seeing a blog post comparing the existing Solaris hand-tuned AES
assembler performance with the (then) new AES instruction version, where
the Intel AES instructions only got you about a 30% performance
increase. I've seen reports of bigger improvements, but usually from
comparisons against older processors, which are going to be slower for
reasons other than just missing the AES instructions. Also, you could
claim a better performance improvement if you compared against a less
efficient original implementation of AES.
What this means is that a faster CPU may buy you more crypto performance
than the AES instructions alone will.
My understanding from reading the Intel AES instruction set documentation
(which, I warn, might not be completely correct) is that the AES
encryption/decryption round instruction is executed 10, 12, or 14 times
(depending on key length) for each 128 bits (16 bytes) of data being
encrypted/decrypted, so it's very much part of the regular instruction
pipeline. The code has to loop through this process multiple times to
process a data block bigger than 16 bytes, i.e. a double nested loop,
although I expect it's normally loop-unrolled to a fair degree for
optimisation purposes.
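
To make the double nested loop concrete, here is a minimal Python sketch that only counts how many AES round instructions a straightforward block-at-a-time loop would issue. It does no encryption at all. The function name and the loop structure are my own illustration, not taken from any real implementation, and real code would unroll the inner loop and probably interleave several blocks. The round counts (10, 12, 14) are the ones fixed by the AES standard.

```python
# Rounds per 16-byte block, as fixed by the AES standard (key bits -> rounds).
ROUNDS_BY_KEY_BITS = {128: 10, 192: 12, 256: 14}
BLOCK_BYTES = 16  # AES always works on 128-bit blocks


def count_round_instructions(data_len: int, key_bits: int) -> int:
    """Count the AES round instructions a naive loop would issue.

    Illustrative only: this models the loop shape, not the cipher itself.
    """
    if data_len % BLOCK_BYTES != 0:
        raise ValueError("data length must be a multiple of 16 bytes")
    rounds = ROUNDS_BY_KEY_BITS[key_bits]

    issued = 0
    for _block in range(data_len // BLOCK_BYTES):  # outer loop: each 16-byte block
        for _round in range(rounds):               # inner loop: one round instruction
            issued += 1                            # (the final round uses the "last" variant)
    return issued
```

For a 1 kbyte buffer with a 128-bit key this gives 64 blocks x 10 rounds = 640 round instructions, and 896 with a 256-bit key.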
Conversely, the crypto units in the T-series processors are separate
from the CPU, and do the encryption/decryption whilst the CPU is getting
on with something else, and they do it much faster than it could be done
on the CPU. Small blocks are normally a problem for crypto offload
engines, because the overhead of farming off the work to the engine and
getting the result back often means that you can do the crypto on the
CPU in less time than it takes to get the crypto engine started and
stopped. However, T-series crypto is particularly good at handling small
blocks efficiently, such as the roughly 1 kbyte you are likely to find
in a network packet, as it is much more closely coupled to the CPU than
a PCI crypto card can be, and performance with small packets was key for
the crypto networking support T-series was designed for. Of course, it
handles crypto of large blocks just fine too.
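
The small-block trade-off can be sketched as a simple break-even model: doing the work on the CPU costs time proportional to the size, while offloading costs a fixed start/stop overhead plus a (smaller) per-byte cost. The Python below is only a toy model under that assumption. Every function name is hypothetical and the throughput and overhead figures in the usage note are made up for illustration; they are not measurements of any T-series part or PCI card.

```python
def cpu_time_us(nbytes: float, cpu_bytes_per_us: float) -> float:
    """Time to do the crypto on the CPU itself."""
    return nbytes / cpu_bytes_per_us


def offload_time_us(nbytes: float, engine_bytes_per_us: float,
                    overhead_us: float) -> float:
    """Time to hand the work to an engine: fixed overhead plus transfer/compute."""
    return overhead_us + nbytes / engine_bytes_per_us


def break_even_bytes(cpu_bytes_per_us: float, engine_bytes_per_us: float,
                     overhead_us: float):
    """Block size above which offloading wins, or None if it never does.

    Solves n / cpu = overhead + n / engine for n.
    """
    if engine_bytes_per_us <= cpu_bytes_per_us:
        return None  # the engine is no faster per byte, so overhead is never repaid
    return (overhead_us * cpu_bytes_per_us * engine_bytes_per_us
            / (engine_bytes_per_us - cpu_bytes_per_us))
```

With invented figures of 100 bytes/us on the CPU and 400 bytes/us on the engine, a 30 us overhead puts the break-even at 4000 bytes, so a 1 kbyte packet is better done on the CPU; cut the overhead to 1 us and the break-even drops to about 133 bytes, so the same packet is worth offloading. The model ignores the separate benefit that the CPU is free to do other work while the engine runs.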
--
Andrew
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss