On 5/31/23 09:47, Ard Biesheuvel wrote:
> On Wed, 31 May 2023 at 18:33, Richard Henderson <richard.hender...@linaro.org> wrote:
>> On 5/31/23 04:22, Ard Biesheuvel wrote:
>>> Use the host native instructions to implement the AES instructions exposed by the emulated target. The mapping is not 1:1, so it requires a bit of fiddling to get the right result.
>>>
>>> This is still RFC material - the current approach feels too ad-hoc, but given the non-1:1 correspondence, doing a proper abstraction is rather difficult.
>>>
>>> Changes since v1/RFC:
>>> - add second patch to implement x86 AES instructions on ARM hosts - this helps illustrate what an abstraction should cover.
>>> - use cpuinfo framework to detect host support for AES instructions.
>>> - implement ARM aesimc using x86 aesimc directly
>>>
>>> Patch #1 produces a 1.5-2x speedup in tests using the Linux kernel's tcrypt benchmark (mode=500).
>>>
>>> Patch #2 produces a 2-3x speedup. The discrepancy is most likely due to the fact that ARM uses two instructions to implement a single AES round, whereas x86 only uses one.
>>
>> Thanks. I spent some time yesterday looking at this, with an encrypted disk test case and could only measure 0.6% and 0.5% for total overhead of decrypt and encrypt respectively.
>
> I don't understand what 'overhead' means in this context. Are you saying you saw barely any improvement?

I saw, without changes, just over 1% of total system emulation time was devoted to aes, which gives an upper limit to the runtime improvement possible there. But I'll have a look at tcrypt.
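
The upper-limit argument above is Amdahl's law. As a rough sketch (the function names are hypothetical, and the AES fraction of 0.01 used in the note below is only a rounding of "just over 1%"), the bound can be computed as follows:

```c
#include <assert.h>

/*
 * Fraction of total runtime saved when a portion `fraction` of the
 * run is made `local_speedup` times faster (Amdahl's law).
 */
static double runtime_saved(double fraction, double local_speedup)
{
    assert(fraction >= 0.0 && fraction <= 1.0 && local_speedup > 0.0);
    return fraction * (1.0 - 1.0 / local_speedup);
}

/* Overall speedup factor for the whole run under the same assumptions. */
static double overall_speedup(double fraction, double local_speedup)
{
    return 1.0 / (1.0 - runtime_saved(fraction, local_speedup));
}
```

With fraction = 0.01, a 2x speedup of the AES helpers saves 0.5% of total runtime, and no speedup, however large, can save more than 1%.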
> aesenc_MC() can be implemented on x86 the way I did in patch #1, using aesdeclast+aesenc
Oh, nice. I have not read the actual patches yet.
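
For readers following along, here is why aesdeclast followed by aesenc (both with an all-zero round key) leaves only MixColumns: aesdeclast applies InvShiftRows and InvSubBytes, and aesenc then applies ShiftRows, SubBytes and MixColumns, so the first four steps cancel. The sketch below is a portable C model of that identity, not the code from the patch; all function names are hypothetical, and the state is assumed to be 16 bytes in column-major order.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

static uint8_t sbox[256], inv_sbox[256];

static uint8_t rotl8(uint8_t x, int s)
{
    return (uint8_t)((x << s) | (x >> (8 - s)));
}

/* Build the AES S-box and its inverse from the GF(2^8) definition. */
static void init_sboxes(void)
{
    uint8_t p = 1, q = 1;            /* invariant: p * q == 1 in GF(2^8) */
    do {
        p = (uint8_t)(p ^ (p << 1) ^ ((p & 0x80) ? 0x1b : 0)); /* p *= 3 */
        q ^= (uint8_t)(q << 1);                                 /* q /= 3 */
        q ^= (uint8_t)(q << 2);
        q ^= (uint8_t)(q << 4);
        if (q & 0x80)
            q ^= 0x09;
        sbox[p] = (uint8_t)(q ^ rotl8(q, 1) ^ rotl8(q, 2) ^
                            rotl8(q, 3) ^ rotl8(q, 4) ^ 0x63);
    } while (p != 1);
    sbox[0] = 0x63;                  /* 0 has no multiplicative inverse */
    for (int i = 0; i < 256; i++)
        inv_sbox[sbox[i]] = (uint8_t)i;
    assert(sbox[0x53] == 0xed);      /* spot check against the standard */
}

static void sub_bytes(uint8_t st[16], const uint8_t box[256])
{
    for (int i = 0; i < 16; i++)
        st[i] = box[st[i]];
}

/* ShiftRows (inverse == 0) or InvShiftRows (inverse != 0). */
static void shift_rows(uint8_t st[16], int inverse)
{
    uint8_t t[16];
    for (int c = 0; c < 4; c++) {
        for (int r = 0; r < 4; r++) {
            int a = 4 * c + r, b = 4 * ((c + r) % 4) + r;
            if (inverse)
                t[b] = st[a];
            else
                t[a] = st[b];
        }
    }
    memcpy(st, t, 16);
}

static uint8_t xtime(uint8_t x)      /* multiply by 2 in GF(2^8) */
{
    return (uint8_t)((x << 1) ^ ((x & 0x80) ? 0x1b : 0));
}

static void mix_columns(uint8_t st[16])
{
    for (int c = 0; c < 4; c++) {
        uint8_t *a = st + 4 * c;
        uint8_t a0 = a[0], a1 = a[1], a2 = a[2], a3 = a[3];
        a[0] = (uint8_t)(xtime(a0) ^ xtime(a1) ^ a1 ^ a2 ^ a3);
        a[1] = (uint8_t)(a0 ^ xtime(a1) ^ xtime(a2) ^ a2 ^ a3);
        a[2] = (uint8_t)(a0 ^ a1 ^ xtime(a2) ^ xtime(a3) ^ a3);
        a[3] = (uint8_t)(xtime(a0) ^ a0 ^ a1 ^ a2 ^ xtime(a3));
    }
}

/* Model of x86 aesdeclast then aesenc, both with a zero round key. */
static void mix_columns_via_declast_enc(uint8_t st[16])
{
    shift_rows(st, 1);               /* aesdeclast: InvShiftRows */
    sub_bytes(st, inv_sbox);         /* aesdeclast: InvSubBytes  */
    shift_rows(st, 0);               /* aesenc: ShiftRows        */
    sub_bytes(st, sbox);             /* aesenc: SubBytes         */
    mix_columns(st);                 /* aesenc: MixColumns       */
}
```

Call init_sboxes() once before use. Because SubBytes acts on each byte independently and ShiftRows only moves bytes around, the two commute, which is what lets the inverse pair cancel regardless of the order in which each instruction applies them.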
>> ppc64:
>>
>>     asm("lxvd2x 32,0,%1;"
>>         "lxvd2x 33,0,%2;"
>>         "vcipher 0,0,1;"
>>         "stxvd2x 32,0,%0"
>>         : : "r"(o), "r"(i), "r"(k) : "memory", "v0", "v1", "v2");
>>
>> ppc64le:
>>
>>     unsigned char le[16] = {8,9,10,11,12,13,14,15,0,1,2,3,4,5,6,7};
>>     asm("lxvd2x 32,0,%1;"
>>         "lxvd2x 33,0,%2;"
>>         "lxvd2x 34,0,%3;"
>>         "vperm 0,0,0,2;"
>>         "vperm 1,1,1,2;"
>>         "vcipher 0,0,1;"
>>         "vperm 0,0,0,2;"
>>         "stxvd2x 32,0,%0"
>>         : : "r"(o), "r"(i), "r"(k), "r"(le) : "memory", "v0", "v1", "v2");
>>
>> There are also differences in their AES_Te* based C routines as well, which made me wonder if we are handling host endianness differences correctly in emulation right now. I think I should most definitely add some generic-ish tests for this...
>
> The above kind of sums it up, no? Or isn't this working code?

It sums up the problem. It works to produce the same output as the x86 instructions, with input bytes in the same order. It shows that we have to be extra careful emulating vcipher etc., and should have unit tests.

r~
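
As a starting point for the kind of endian-aware unit test mentioned above, here is a small portable sketch. It compares fixed-order 64-bit loads with a host-order load, and applies the le[] index table from the ppc64le snippet as a plain byte permutation, out[i] = in[idx[i]]. All helper names are hypothetical, and permute16() is a simplified model only: it does not claim to reproduce the register-level semantics of vperm or lxvd2x.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assemble a 64-bit value from bytes in a fixed order, host-independent. */
static uint64_t load_le64(const uint8_t *p)
{
    uint64_t v = 0;
    for (int i = 7; i >= 0; i--)
        v = (v << 8) | p[i];
    return v;
}

static uint64_t load_be64(const uint8_t *p)
{
    uint64_t v = 0;
    for (int i = 0; i < 8; i++)
        v = (v << 8) | p[i];
    return v;
}

/* Host-order load: what a plain 64-bit memory access would produce. */
static uint64_t load_host64(const uint8_t *p)
{
    uint64_t v;
    memcpy(&v, p, sizeof(v));
    return v;
}

static int host_is_little_endian(void)
{
    const uint16_t probe = 1;
    uint8_t first;
    memcpy(&first, &probe, 1);
    return first == 1;
}

/* Plain index permutation; in and out must not overlap. */
static void permute16(uint8_t out[16], const uint8_t in[16],
                      const uint8_t idx[16])
{
    for (int i = 0; i < 16; i++) {
        assert(idx[i] < 16);
        out[i] = in[idx[i]];
    }
}
```

A test built on helpers like these can feed the same byte array to the emulated instruction on both big- and little-endian hosts and compare against a byte-wise reference, so any dependence on host word order shows up as a mismatch.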