The goal of this patch series is to set up infrastructure for emulating
guest vector operations with host vector operations. Preliminary
experiments show that simply translating loads and stores increases
performance of the x264 video codec by 10%. The performance of a
gcc-vectorized for loop increased 2x.

To be able to emulate guest vector operations using host vector
operations, several things need to be done.

1. Corresponding vector types should be added to TCG. This series adds
   TCG_v128 and TCG_v64. I've made TCG_v64 a different type than TCG_i64
   because it usually needs to be allocated to different registers and
   supports different operations.

2. Load/store operations for these new types need to be implemented.

3. For a seamless transition from the current model to the new one, we
   need to handle cases where memory occupied by a global variable can
   be accessed via a pointer to the CPUArchState structure. A very
   simple, conservative alias analysis has been added to do this. The
   analysis tracks memory loads and stores that overlap with fields of
   CPUArchState and provides this information to the register
   allocator. The allocator then spills and reloads affected globals
   when needed.

4. Allow overlapping globals. For scalar registers this is a rare case,
   and overlapping registers can be handled as a single one (ah, al, ax,
   eax, rax). In ARM, however, every Q-register consists of two
   D-registers, each consisting of two S-registers. Handling 4
   S-registers as one because they are parts of the same Q-register is
   way too inefficient.

5. Add a new memory addressing mode to the MMU code for large accesses
   and create the needed helpers. Only 128-bit vectors have been handled
   for now.

6. Create TCG opcodes for vector operations. Only addition has been
   handled in this series. Each operation has a wrapper that checks
   whether the backend supports the corresponding operation. If it
   does, the vector opcode is generated; otherwise the operation is
   emulated with scalar operations. The emulation code is generated
   inline for performance reasons (there is a huge performance
   difference between inline generation and calling a helper). As a
   positive side effect, this will eventually allow merging similar
   emulation code for vector instructions from different frontends into
   a target-independent implementation.

7. Use the new operations in the frontend (ARM was used in this series).

8. Support the new operations in the backend (x86_64 was used in this
   series).

For experiments I used an ARM guest on an x86_64 host. I wanted a pair
of different architectures that both have vector extensions, and the
ARM/x86_64 pair fits well.

v1 -> v2:
 - represent v128 type with smaller types when it is not supported by
   the host
 - detect AVX support and use AVX instructions when available
 - tcg/README updated
 - generate two v64 adds instead of one v128 when applicable
 - rebased to newer master
 - overlap detection for temps added (it needs to be explicitly called
   from <arch>_translate_init)
 - the stack is used to temporarily store 128 bit variables to memory
   (instead of the TCGContext field)

v2 -> v2.1:
 - automatic build failure fixed

Outstanding issues:
 - qemu_ld_v128 and qemu_st_v128 do not generate fallback code if the
   host does not support 128 bit registers. The reason is that I do not
   know how to handle differing host/guest endianness (do we swap only
   the bytes within elements, or whole vectors?). Different targets
   seem to have different ideas on how this should be done.

Kirill Batuzov (21):
  tcg: add support for 128bit vector type
  tcg: add support for 64bit vector type
  tcg: support representing vector type with smaller vector or scalar
    types
  tcg: add ld_v128, ld_v64, st_v128 and st_v64 opcodes
  tcg: add simple alias analysis
  tcg: use results of alias analysis in liveness analysis
  tcg: allow globals to overlap
  tcg: add vector addition operations
  target/arm: support access to vector guest registers as globals
  target/arm: use vector opcode to handle vadd.<size> instruction
  tcg/i386: add support for vector opcodes
  tcg/i386: support 64-bit vector operations
  tcg/i386: support remaining vector addition operations
  tcg: do not rely on exact values of MO_BSWAP or MO_SIGN in backend
  target/aarch64: do not check for non-existent TCGMemOp
  tcg: introduce new TCGMemOp - MO_128
  tcg: introduce qemu_ld_v128 and qemu_st_v128 opcodes
  softmmu: create helpers for vector loads
  tcg/i386: add support for qemu_ld_v128/qemu_st_v128 ops
  target/arm: load two consecutive 64-bits vector regs as a 128-bit
    vector reg
  tcg/README: update README to include information about vector opcodes

 cputlb.c                     |   4 +
 softmmu_template_vector.h    | 266 +++++++++++++++++++++++++++++++
 target/arm/translate-a64.c   |   1 -
 target/arm/translate.c       |  76 ++++++++-
 tcg/README                   |  47 +++++-
 tcg/aarch64/tcg-target.inc.c |   4 +-
 tcg/arm/tcg-target.inc.c     |   4 +-
 tcg/i386/tcg-target.h        |  45 +++++-
 tcg/i386/tcg-target.inc.c    | 260 +++++++++++++++++++++++++++++--
 tcg/mips/tcg-target.inc.c    |   4 +-
 tcg/optimize.c               | 165 +++++++++++++++++++-
 tcg/ppc/tcg-target.inc.c     |   4 +-
 tcg/s390/tcg-target.inc.c    |   4 +-
 tcg/sparc/tcg-target.inc.c   |  12 +-
 tcg/tcg-op.c                 |  92 ++++++++++-
 tcg/tcg-op.h                 | 267 +++++++++++++++++++++++++++++++
 tcg/tcg-opc.h                |  34 ++++
 tcg/tcg.c                    | 363 +++++++++++++++++++++++++++++++++++++------
 tcg/tcg.h                    | 163 ++++++++++++++++++-
 19 files changed, 1722 insertions(+), 93 deletions(-)
 create mode 100644 softmmu_template_vector.h

-- 
2.1.4