On Tue, 22 Mar 2022 23:07:30 GMT, Jamil Nimeh <jni...@openjdk.org> wrote:
>> src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 5842:
>>
>>> 5840:         __ evpscatterdd(Address(result, zmm_scratch, Address::times_4, 32), writeMask, zmm_cVec, Assembler::AVX_512bit);
>>> 5841:         __ knotwl(writeMask, writeMask);
>>> 5842:         __ evpscatterdd(Address(result, zmm_scratch, Address::times_4, 48), writeMask, zmm_dVec, Assembler::AVX_512bit);
>>
>> Using the vextracti32x4 instead of evpscatterdd would give better
>> performance:
>>         __ vextracti32x4(Address(result, 0), zmm_aVec, 0);
>>         __ vextracti32x4(Address(result, 64), zmm_aVec, 1);
>>         __ vextracti32x4(Address(result, 128), zmm_aVec, 2);
>>         __ vextracti32x4(Address(result, 192), zmm_aVec, 3);
>>         __ vextracti32x4(Address(result, 16), zmm_bVec, 0);
>>         __ vextracti32x4(Address(result, 80), zmm_bVec, 1);
>>         __ vextracti32x4(Address(result, 144), zmm_bVec, 2);
>>         __ vextracti32x4(Address(result, 208), zmm_bVec, 3);
>>         __ vextracti32x4(Address(result, 32), zmm_cVec, 0);
>>         __ vextracti32x4(Address(result, 96), zmm_cVec, 1);
>>         __ vextracti32x4(Address(result, 160), zmm_cVec, 2);
>>         __ vextracti32x4(Address(result, 224), zmm_cVec, 3);
>>         __ vextracti32x4(Address(result, 48), zmm_dVec, 0);
>>         __ vextracti32x4(Address(result, 112), zmm_dVec, 1);
>>         __ vextracti32x4(Address(result, 176), zmm_dVec, 2);
>>         __ vextracti32x4(Address(result, 240), zmm_dVec, 3);
>
> I have been wondering about this approach for a while now, since I did
> something similar for the AVX2 version. I had assumed that using
> evpscatterdd used less instructions and therefore would be more efficient,
> but I'm more than happy to move to the vextracti32x4 approach. I'll be eager
> to see how it impacts performance along with the increased storage of
> intermediate data on additional XMMRegister objects.

The changes you recommended yielded about a 10-15% performance improvement on the system I was using for benchmarks. Thanks for the suggestions!

-------------

PR: https://git.openjdk.org/jdk/pull/7702
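
For anyone reading the thread later, the store pattern in the quoted suggestion can be modeled in portable C++ to make the byte offsets easier to check. This is only a rough sketch, not the stub code: `Zmm`, `extract_lane_to_mem`, and `store_blocks` are hypothetical names invented here, and the `memcpy` merely stands in for what a `vextracti32x4` with a memory destination does (copy one 128-bit lane out of a 512-bit register). It says nothing about performance.

```cpp
#include <array>
#include <cstdint>
#include <cstring>

// Hypothetical model of a 512-bit register as sixteen 32-bit words.
using Zmm = std::array<uint32_t, 16>;

// Stand-in for vextracti32x4 with a memory destination:
// copy 128-bit lane `lane` (four 32-bit words) of `src` to `dst`.
static void extract_lane_to_mem(uint8_t* dst, const Zmm& src, int lane) {
    std::memcpy(dst, src.data() + 4 * lane, 16);
}

// Writes 256 bytes to `result` as four 64-byte blocks. Block i is
// lane i of a, b, c, d in that order, at byte offsets
// 64*i + 0, 64*i + 16, 64*i + 32, 64*i + 48, matching the offsets
// in the sixteen quoted vextracti32x4 calls.
static void store_blocks(uint8_t* result, const Zmm& a, const Zmm& b,
                         const Zmm& c, const Zmm& d) {
    for (int lane = 0; lane < 4; ++lane) {
        uint8_t* block = result + 64 * lane;
        extract_lane_to_mem(block + 0,  a, lane);
        extract_lane_to_mem(block + 16, b, lane);
        extract_lane_to_mem(block + 32, c, lane);
        extract_lane_to_mem(block + 48, d, lane);
    }
}
```

The loop issues the same sixteen stores as the quoted list, only grouped by lane rather than by source register; each store is a contiguous 16-byte write.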