On 5/31/23 09:47, Ard Biesheuvel wrote:
On Wed, 31 May 2023 at 18:33, Richard Henderson
<richard.hender...@linaro.org> wrote:

On 5/31/23 04:22, Ard Biesheuvel wrote:
Use the host native instructions to implement the AES instructions
exposed by the emulated target. The mapping is not 1:1, so it requires a
bit of fiddling to get the right result.

This is still RFC material - the current approach feels too ad-hoc, but
given the non-1:1 correspondence, doing a proper abstraction is rather
difficult.

Changes since v1/RFC:
- add second patch to implement x86 AES instructions on ARM hosts - this
  helps illustrate what an abstraction should cover.
- use cpuinfo framework to detect host support for AES instructions.
- implement ARM aesimc using x86 aesimc directly.

Patch #1 produces a 1.5-2x speedup in tests using the Linux kernel's
tcrypt benchmark (mode=500).

Patch #2 produces a 2-3x speedup. The discrepancy is most likely due to
the fact that ARM uses two instructions to implement a single AES round,
whereas x86 only uses one.

Thanks.  I spent some time yesterday looking at this, with an encrypted
disk test case, and could only measure 0.6% and 0.5% for total overhead
of decrypt and encrypt respectively.


I don't understand what 'overhead' means in this context. Are you
saying you saw barely any improvement?

Without changes, I saw that just over 1% of total system emulation time was devoted to AES, which gives an upper limit to the runtime improvement possible there. But I'll have a look at tcrypt.
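
To put a number on that upper limit, here is a back-of-the-envelope sketch using Amdahl's law. It assumes the AES share is exactly 1% (the measurement was "just over 1%") and a 2x helper speedup; overall_speedup(), time_saved() and amdahl_example() are made-up names, not anything in QEMU:

```c
#include <assert.h>

/*
 * Amdahl's law: overall speedup when a fraction p of the run time
 * is made s times faster and the rest is unchanged.
 */
static double overall_speedup(double p, double s)
{
    return 1.0 / ((1.0 - p) + p / s);
}

/* Fraction of total run time saved, e.g. 0.005 means 0.5% faster. */
static double time_saved(double p, double s)
{
    return p - p / s;
}

static void amdahl_example(void)
{
    /* Assumed inputs: AES is 1% of emulation time, helpers get 2x faster. */
    assert(time_saved(0.01, 2.0) < 0.01);
    /* No helper speedup can beat the limit 1 / (1 - p). */
    assert(overall_speedup(0.01, 2.0) < 1.0 / 0.99);
}
```

Under those assumptions a 2x faster helper saves about 0.5% of the total run time, and even an infinitely fast one cannot save more than about 1%.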

aesenc_MC() can be implemented on x86 the way I did in patch #1, using
aesdeclast+aesenc.
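
For anyone following along, a minimal software sketch of why that composition works, assuming aesenc_MC() stands for a bare MixColumns step. With a zero round key, aesdeclast performs InvShiftRows and InvSubBytes, and the ShiftRows and SubBytes inside aesenc undo them, leaving only MixColumns. The model_* functions and the block type are hypothetical names that model the documented x86 instruction semantics in plain C; this is not the code from the patch:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* 16 state bytes in memory order: b[4*c + r] is row r, column c. */
typedef struct { uint8_t b[16]; } block;

static uint8_t sbox[256], inv_sbox[256];

static uint8_t rotl8(uint8_t x, int n)
{
    return (uint8_t)((x << n) | (x >> (8 - n)));
}

/* Build the AES S-box and its inverse by walking powers of 3 in GF(2^8). */
static void init_sbox(void)
{
    uint8_t p = 1, q = 1;   /* invariant: p * q == 1 in GF(2^8) */
    do {
        p = (uint8_t)(p ^ (p << 1) ^ ((p & 0x80) ? 0x1b : 0)); /* p *= 3 */
        q ^= (uint8_t)(q << 1);                                 /* q /= 3 */
        q ^= (uint8_t)(q << 2);
        q ^= (uint8_t)(q << 4);
        if (q & 0x80) q ^= 0x09;
        uint8_t s = (uint8_t)(q ^ rotl8(q, 1) ^ rotl8(q, 2) ^
                              rotl8(q, 3) ^ rotl8(q, 4) ^ 0x63);
        sbox[p] = s;
        inv_sbox[s] = p;
    } while (p != 1);
    sbox[0] = 0x63;         /* 0 has no inverse, so it is set by hand */
    inv_sbox[0x63] = 0;
}

/* Multiply by 2 in GF(2^8) modulo the AES polynomial. */
static uint8_t xtime(uint8_t v)
{
    return (uint8_t)((v << 1) ^ ((v & 0x80) ? 0x1b : 0));
}

static block mix_columns(block s)
{
    block o;
    for (int c = 0; c < 4; c++) {
        const uint8_t *a = &s.b[4 * c];
        uint8_t t = (uint8_t)(a[0] ^ a[1] ^ a[2] ^ a[3]);
        for (int r = 0; r < 4; r++)
            o.b[4 * c + r] = (uint8_t)(a[r] ^ t ^ xtime(a[r] ^ a[(r + 1) % 4]));
    }
    return o;
}

/* Model of x86 AESENC: ShiftRows, SubBytes, MixColumns, XOR round key. */
static block model_aesenc(block s, block rk)
{
    block t;
    for (int c = 0; c < 4; c++)
        for (int r = 0; r < 4; r++)
            t.b[4 * c + r] = sbox[s.b[4 * ((c + r) % 4) + r]];
    t = mix_columns(t);
    for (int i = 0; i < 16; i++) t.b[i] ^= rk.b[i];
    return t;
}

/* Model of x86 AESDECLAST: InvShiftRows, InvSubBytes, XOR round key. */
static block model_aesdeclast(block s, block rk)
{
    block t;
    for (int c = 0; c < 4; c++)
        for (int r = 0; r < 4; r++)
            t.b[4 * ((c + r) % 4) + r] = inv_sbox[s.b[4 * c + r]];
    for (int i = 0; i < 16; i++) t.b[i] ^= rk.b[i];
    return t;
}

/* MixColumns alone, built from the two instruction models and a zero key. */
static block mc_via_aesenc(block s)
{
    block zero = {{0}};
    return model_aesenc(model_aesdeclast(s, zero), zero);
}

/* Check the composition against the direct MixColumns on sample data. */
static void self_test(void)
{
    block s;
    init_sbox();
    for (int i = 0; i < 16; i++) s.b[i] = (uint8_t)(17 * i + 3);
    block a = mix_columns(s), b = mc_via_aesenc(s);
    assert(memcmp(a.b, b.b, sizeof(a.b)) == 0);
}
```

Calling self_test() from a main() is enough to exercise it. The identity relies only on SubBytes and ShiftRows commuting, so it holds for every input block.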

Oh, nice.  I have not read the actual patches yet.

ppc64:

      asm("lxvd2x 32,0,%1;"
          "lxvd2x 33,0,%2;"
          "vcipher 0,0,1;"
          "stxvd2x 32,0,%0"
          : : "r"(o), "r"(i), "r"(k) : "memory", "v0", "v1", "v2");

ppc64le:

      unsigned char le[16] = {8,9,10,11,12,13,14,15,0,1,2,3,4,5,6,7};
      asm("lxvd2x 32,0,%1;"
          "lxvd2x 33,0,%2;"
          "lxvd2x 34,0,%3;"
          "vperm 0,0,0,2;"
          "vperm 1,1,1,2;"
          "vcipher 0,0,1;"
          "vperm 0,0,0,2;"
          "stxvd2x 32,0,%0"
          : : "r"(o), "r"(i), "r"(k), "r"(le) : "memory", "v0", "v1", "v2");

There are also differences in their AES_Te* based C routines, which made
me wonder if we are handling host endianness differences correctly in
emulation right now.  I think I should most definitely add some
generic-ish tests for this...


The above kind of sums it up, no? Or isn't this working code?

It sums up the problem. It works to produce the same output as the x86 instructions, with input bytes in the same order. It shows that we have to be extra careful emulating vcipher etc., and should have unit tests.
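
As a rough idea of what such a unit test could look like, here is a sketch of a known-answer check for a word-based MixColumns column. It feeds bytes in memory order and shows that the result is only right when the byte-to-lane mapping matches what the word routine expects. All names are hypothetical, it does not model vcipher or any QEMU helper, and the test vector is the widely published MixColumns column db 13 53 45 -> 8e 4d a1 bc:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Rotate a 32-bit word right by n bits (n is 8, 16 or 24 here). */
static uint32_t ror32(uint32_t w, unsigned n)
{
    return (w >> n) | (w << (32 - n));
}

/* Multiply each of the four byte lanes by 2 in GF(2^8). */
static uint32_t xtime4(uint32_t w)
{
    return ((w & 0x7f7f7f7fu) << 1) ^ (((w >> 7) & 0x01010101u) * 0x1bu);
}

/* MixColumns on one column; state byte r must sit in bits [8r, 8r+8). */
static uint32_t mix_column_word(uint32_t w)
{
    uint32_t r8 = ror32(w, 8);
    return xtime4(w ^ r8) ^ r8 ^ ror32(w, 16) ^ ror32(w, 24);
}

/* Explicit-shift loads and stores: same result on any host endianness. */
static uint32_t ld_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

static void st_le32(uint8_t *p, uint32_t v)
{
    for (int i = 0; i < 4; i++) p[i] = (uint8_t)(v >> (8 * i));
}

static uint32_t ld_be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

static void st_be32(uint8_t *p, uint32_t v)
{
    for (int i = 0; i < 4; i++) p[i] = (uint8_t)(v >> (8 * (3 - i)));
}

/* Returns 1 if the column matches the expected bytes, 0 otherwise. */
static int kat_mix_column(int big_endian_load,
                          const uint8_t in[4], const uint8_t want[4])
{
    uint8_t got[4];
    if (big_endian_load)
        st_be32(got, mix_column_word(ld_be32(in)));
    else
        st_le32(got, mix_column_word(ld_le32(in)));
    return memcmp(got, want, 4) == 0;
}

static void run_kat(void)
{
    const uint8_t in[4]   = { 0xdb, 0x13, 0x53, 0x45 };
    const uint8_t want[4] = { 0x8e, 0x4d, 0xa1, 0xbc };
    assert(kat_mix_column(0, in, want));   /* matching byte order passes */
    assert(!kat_mix_column(1, in, want));  /* swapped byte order is caught */
}
```

The second assertion is the interesting one: a helper that loads the column with the wrong byte order still runs, but produces different bytes, which is the kind of mistake a byte-oriented known-answer test would catch on either host endianness.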


r~

