| Issue | 132666 |
| --- | --- |
| Summary | [AVX-512] Consider that summing every odd element of an array can be done without gathers/perms |
| Labels | new issue |
| Assignees | |
| Reporter | Validark |

In my real code, I have a bunch of slices (each is a `struct { ptr, len }`) and I want to sum together the `len` values to determine the total size. [Zig Godbolt](https://zig.godbo.lt/#z:OYLghAFBqd5QCxAYwPYBMCmBRdBLAF1QCcAaPECAMzwBtMA7AQwFtMQByARg9KtQYEAysib0QXAGx8BBAKoBnTAAUAHpwAMvAFYTStJg1AAvPMFJL6yAngGVG6AMKpaAVxYM9DgDJ4GmADl3ACNMYhAAZmkAB1QFQlsGZzcPPVj4mwFffyCWUPCoi0wrTIYhAiZiAmT3Ty4iksTyyoJswJCwyOkFCqqa1Pqelrbc/K6ASgtUV2Jkdg40Bh6Aah70ZYBSCIARZYABPBZYqogNgCYztfOz8a2AIQ2NAEFHp8xVY4JlqgZlhnc7gBPAiYBQASQYADE6KDqDCFCBNgBWO4AKg2SO2GIemMWK1cAA5SH93AB9Gj0BHLVzxYyYcbU2mYTYAdgez2WnOWeCoyyg/xY5PhmzOkmWES4DLAYC2uw0yxIJMFFNBmx2suWGgZrgYxEwTGQCCYwXo91eXOWADdKkrScFgaDETS8HS1XKzc9zVz%2BMQ%2BSqFNiNAA6IMCoWUjHbBkbFmOFWk9BMCoxxxei2csP2kEKEUPHbfGEJpNMIP0Bgel4crl6ggzX6Zh0BiLsl4srGe54/JVA7MQ6GUyFMHphOGUxHY9GYsRmBgQM5IyS3FGRvFfQnEsP%2Bp1M7VM1kti08vkQTfC85iiVSmX5%2BWK0%2BUt0arXU3X6w3G03NtNWm0N7OIvYADVMGsEgIAiM5iWdOlo3zPYFGiAwCAgLUKwtVcrRAqlsUjICQKIYhwMgxkXXpN19miAhiEcIdkP9QMQ3vUFI1uL8q05H0%2BUtLDo1jeNE2TWNvwtP9VXOPNdn44sK2/DD4iMehSWMFgWEfdj0zWIN4hYdAg3eKiDWQ0SFGJDRiQJaMzjuTZ1ItTTtN0/TiEMk8ySzUFzPMyyD3TTl7MORzVAM6xXMFdyTOWKQvNzYSuX8nS9KC5yQuM4kzgAFi8mT1LkvxgEUwEVLUp5fL8ghdIcxLguQ%2BT8swJSVNM4l0u82KyoqgKquSmq8sU5SWGa5rWJ8zlZIEFZasU1QitlNrVnKrTOqclzJvqwqBs1NLWtsuKFsq5aQtW0l1rSrbspK6tMFrYhfiO6aWGxLhIxkttXg4SZaE4JFeE8DgtFIVBOAALTMVZplmZlzglXgCE0d7JgAaxAJECSDLgAE50ZZSQWRZLgMckM4CSRfROHS3gWAkDQzN%2B/7AY4XgETM2G/ve0g4FgJA0COGEyAoCBueiXmQGMDQuHSsyKRBYgEQgYI4dIYI/EqQFOB4RXleIQEAHlgm0fC1d4bm2EEbWGFoVXWdILBglcYAaNoWgEW4XgsBYQw6oV/A9WsPBuOd/73hA1wQUN8hBGKBXaDwYJnK15wsAVqjDjD7jiGCOJMG2TB3YUvK4cmKgDGABRALwTAAHdteiRgw/4QQRDEdguBZGRBEUFR1Ct3R6gMIwRbMfQY4RSBJlQSjEmdgBabXlgAJWKfUlEHFYp96YAruWVQCUkUlJHSqfo/%2BVRlinlhkGiVxZWMBg0%2B%2BgG0%2BIPAsBHiBJksfDEnsBgnBcWovB/iMDo4R6jpASAIfodRSBgNKEAvInRBiL19gIZofQ/4DAaJ/FBvRWh%2BHaPAkBFgcGQL0EMKocCxiSimDMOYEgPpfR%2BgremyxTDAE1GjdKQZ5QQFwIQRUUNJQwwLpMBA%2BosDhDfqQJG6UuBBgJNTM4EEibyIiHjIkn0ODk1IJTLg1NSC014PTRmIBmbCNJhwM4FMqY0yYZwIRrNxiTDTvEOw6UgA%3D%3D)
```zig
export fn numBytesInFiles(files: [*][]const u8, num_files: usize) usize {
    if ((num_files & 31) != 0 or num_files == 0) unreachable; // make emit simpler
    var num_bytes: usize = 0;
    for (files[0..num_files]) |file_data|
        num_bytes += file_data.len;
    return num_bytes;
}
```
By default, this is what I get:
```asm
.LCPI0_1:
.quad 32
.LCPI0_2:
.byte 0
.byte 1
.byte 2
.byte 3
.byte 4
.byte 5
.byte 6
.byte 7
numBytesInFiles:
push rbp
mov rbp, rsp
vpmovsxbq zmm1, qword ptr [rip + .LCPI0_2]
vpbroadcastq zmm2, qword ptr [rip + .LCPI0_1]
vpxor xmm0, xmm0, xmm0
vpxor xmm3, xmm3, xmm3
vpxor xmm4, xmm4, xmm4
vpxor xmm5, xmm5, xmm5
.LBB0_1:
vpsllq zmm6, zmm1, 4
kxnorw k1, k0, k0
vpxor xmm7, xmm7, xmm7
vpaddq zmm1, zmm1, zmm2
add rsi, -32
vpgatherqq zmm7 {k1}, zmmword ptr [rdi + zmm6 + 8]
kxnorw k1, k0, k0
vpaddq zmm0, zmm7, zmm0
vpxor xmm7, xmm7, xmm7
vpgatherqq zmm7 {k1}, zmmword ptr [rdi + zmm6 + 136]
kxnorw k1, k0, k0
vpaddq zmm3, zmm7, zmm3
vpxor xmm7, xmm7, xmm7
vpgatherqq zmm7 {k1}, zmmword ptr [rdi + zmm6 + 264]
kxnorw k1, k0, k0
vpaddq zmm4, zmm7, zmm4
vpxor xmm7, xmm7, xmm7
vpgatherqq zmm7 {k1}, zmmword ptr [rdi + zmm6 + 392]
vpaddq zmm5, zmm7, zmm5
jne .LBB0_1
vpaddq zmm0, zmm3, zmm0
vpaddq zmm0, zmm4, zmm0
vpaddq zmm0, zmm5, zmm0
vextracti64x4 ymm1, zmm0, 1
vpaddq zmm0, zmm0, zmm1 ; why is this not a ymm operation?
vextracti128 xmm1, ymm0, 1
vpaddq xmm0, xmm0, xmm1
vpshufd xmm1, xmm0, 238
vpaddq xmm0, xmm0, xmm1
vmovq rax, xmm0
pop rbp
vzeroupper
ret
```
There are a number of issues with this emit. There are serial dependency chains that could have been parallel and we are using `vpgatherqq` at a high cost for little benefit. On LLVM's Godbolt (I assume a newer version of LLVM) I got this emit: [LLVM Godbolt](https://llvm.godbolt.org/#z:OYLghAFBqd5QCxAYwPYBMCmBRdBLAF1QCcAaPECAMzwBtMA7AQwFtMQByARg9KtQYEAysib0QXACx8BBAKoBnTAAUAHpwAMvAFYTStJg1C1aANxakl9ZATwDKjdAGFUtAK4sGexwBk8DTAA5DwAjTGIQAHYATlIAB1QFQjsGF3dPPQSk2wE/AOCWMIiYy0xrHIYhAiZiAjSPLy5S8pSqmoI8oNDwqNiFatr6jKb%2B9s6Cot6ASktUN2Jkdg4AUgBmACEAagBZDDd6AEkAEU21k7AOYlRUAgvljQBBBTmFzAB9GnpmNlPVk%2BWAEwAq43QEA%2B4PAbATAETboJjVAwATzmsLOpyBmAAtCwQNi4gDIhoQKsASSAViCZEuOTyZTCWSAGySEDMrF4ZmsyTsrgAgAcEn5WKofOJvL5WIYAq4jNpbKE4rBEKhMM2BGIeDi9F%2B/yBqj5jLebLcDAA1gxUAB3BhY2j%2BNyqSQAOi40SdGhxbgUtCVjwhAAEQm46LYGE6AF54YBvEJMZCmxw6zb%2BAjhZi0TYm76YdBvJjodDETZoBijQTJ5mbAGkTZiKMMTZ8gOjWpOpIsLV4Gg5t60VDAPDIJMptNiTMMbO5/OF4sCMuwvBcTZUMRKGt14ANrgBoMh/xO1FxNwEN4sDCYYeCUcZrOsHvTosl%2BfJgGbLHV2t2zebbd%2Bv0bTYADETRsFJNkhdUFBATYLRNS1/HQTNLWqEJtQtBRDGScN3jQUxwiYaFNgUU1NTiK5PkwCEsBoAI4WeXtUFEDMOUkTZ/QYDx1iRVMFAOBhALoTAFAgOJ1U/etGxggQOJMDEAFYNBrFj5K4KYMVWDRTkidYITJCFNnk1YkzEPtRFOOT1j5TZVGTJs5KOdcvwbJtHgMwE5NY9ETMYphzMs6zbOWezHIklyHjcgE5Lk4yTB8vyrJsvA7Ic8TvzCgz%2BhIC9RKLdzFM2HL5LkkK0v0%2BTGWMhhEOU9ymk2VZf3C%2BTImHZAOxgi8asixkaw0MqQiLRdmprAwwgzWqes2UayhUyIIQhAUysy4hOsrfLCvc6RUucpaiBWitWPWsT3NWEqdtc%2BToiTPt8wK9Uaw2yLiu2xsytqzT0WhVN6DYQRCv8EI5iqhQtK2HKlMrZZIhSx65NiA7Nj6i7aqXdF/CUWpTDENwLyhsH7oRqGTgSPBngYB7jsirh8qRpratfNHS3CAgsfcXHtLushCehlS6q6uS6saiKBaM9FMFUdU4xZ7H2fxrmap5%2BmayFzYBqmpgxvkgFVnmx4pBAN6qei9E4gQPBk1Riy1WIHGawm8yUqCrZar5O2qZaoKjn6wbUaNkaNZm2r4em8bIoBWmIRlA2LrVkOVMkXWHi4SJo7pqmrLRtq4k2TAAEdCap3rvfVzXaoTv8k%2BiVODJNFa4wQDX6ET8Pq9Vos4/csLdJpQ25KBa7UFuliKdyyKto3c6mpWhdIbD8EK%2B11v3IBTy/imwfquZEfDLO17kbD4218Hdr9hnzyw622qdZjn2teejuw9dkvA8i1Y5Ob2Ve8JYyCxgtxLQLn3LaKtlqrXPn3SI28Tq73Sm3Z%2Boc%2B7XweLpRa%2B8%2B5XTFhLYgUtWY41BpzCGnlFZUw/LTYWGkkxfTKJgX6BB/oMEBiadAIM8YEO5jDSm6DCFa3LmnOSDUB63VYeDdhUD
X75QnnvPhpJBGby2rDceTkpHkNFmvcWksbC4NlmwhWup%2BGC17qsVeJxpx/wAfzUkbt%2BFIIyntMBhl5GcMUaFYuD8%2B6MkTm/VuscA4IIBI1ImicETqjwEGHi6kPpr1YbBBg8EqpIRQmhRImFbDYTeLhfChFiKkXIoJDEAIqDYLYJSVAI5iBKj%2BGCEyYJ8kqgIFiZAR4KleyBOGBgeFiDv37mCOpwpMAInmEJZpSoATrGZCEQgpBATrHzKoKZoymBCXmTM2KlpbRRgQPU7qWIJkEGWUwUwczpkHNUNWY5hyBYAhCFQGU%2ByLm8l2WIYAdzVCXJCJaF5lzkDoE%2BbydAudfkFMBV2FgTBAWmBCCwPA4LIV4DOQs%2B5AJTC0HBROaF5zXm8lMASMpSgbDgoSHEZAgh/kvNMGi5ZsLKVQvhesAahhkAICoF6FIyzkC0CoO4BQCBUCiTZbQS0IR%2BXYSuGys8pg2ULEsdM5Aqhbkyv1MszAeFMW0puYyZAyyVz9CxFwOSIRuKMF5VqpgOqwhYJNTq9AcRLQKAUD86Z2r6l4BYCweVoynW2nDMSvZjrTX1PFWES19SFBMRqMKBQudajBqxKGsQ4buVdgDaa00CgY1Yw1I3fE4QDABFjQgNwVBOWYHTZgGwJAI1Rt9R60FWqFDAAULGNcjqFDEAsI61QrblnACoAwdFoy8D4FMLnLERAsRDpRdM/w2LkCDuWbQb1ghlmguQORFlAhl0sCOaMwNJbpnivwMQcZy7UCmEPf29YLBLRMEINu9YFotTLKJbQFg%2Bwn2mjcE%2B3lPqn0rSoDCRlF6yJUEZR86ZxBrVzvA9am2yyIPECqnB5hmAczLIwkytDIQQjjswHEHZdcSJGDQw3Yj/RTRob7Gs7ltAHWjLtXuujShaX0dOtM%2BjzomhsaUM6Zj3GwVcaUKx0ZWMlnTJnS%2B/YudSViZFXMOIcRwjLMFRaadtH1j6kgdMztByGPqYwnhTVWn9OYF5dWvTOm026sZLsnE55SBYlJDZs8WB7OsFUDs9Vrmt0NNQB2egcysRueFHEW5gXvMpldmF9zth6CucOdTF0745I9UC/FjQiXLlxdVVQELKWTkgrBal1QAQSwdL2UV8lfaUyhZOZVvAEX7P0oYIyrEZsl0NKzvZ5AVAus0ZoTcEtDSOzIFldOezmBgBkXGwwXObUflYnCCwRt9mVy0CUDhuICgsRfVBe11b62sCba9T60di7yv7exIdrbr7aCauFKuS7uHrv7FoAC%2B7a3HtHYU22lbD2NtbcJT%2B97B2ntYmwUYQbnquBcANamB9v2dWRFh0auICP6nAARAgRT92dU8s22j/NNGwc3ARJD/1WIM14CzQ0q4drc3Ym5YW4tBODN7XzUmnEKbLNOvCOzqg9SCANxtIzotsXhSgukK1moWBhdrfZNoDs9mG4QcYLGuX08lcrSUOVwdeBkWjtQOOvXQr2QMDtHm01LAsRemxJN4glisSmhRbafpsbUcu6YNbpQW2CL2bMHgBpNA/emADzddAUuVd9qI7aS07uqrzYYD1yUhuDVxFNVtrAyJ7MWh2UiNPdqcPIh2V8BPKe8/p8L0wJEtmJXJ9z/njPZQq/5qZ2LtP4fuUkHqcy5rFRLNEoEDQZ5lI4gBbIpgf9xBdVCl2fZ8fk%2BkuMh2ZMykf7eeguImDibpNUzEH72vqf4qlumixKe3npgeLlfn7znPGPBfY%2Bv1PnPcaCC77nwQS0GpUxz69AgAEhLqwwcmBSl2tp4EhzdBswDXB/BsR1FrxR0C0zR7MoCIDTdD0y16kmtGUlkt86FoC80EI8AVobAGkTJLNagLBY1wgqc7RsJ7NvccD60AtuUmBMs1crQHNeUtt6BCtvQOD/Bvky16DKMPdhCOC4hQUCxLR5s%2BC1kJCXsZCRCP9Dc2ArcuCxC1kTQJ4cwcQaErMNDrd0x6wdDVCHNACFAWBTpY0WBJ
dnh%2BdhQbpyt6NDDtDw9VD7NqhgBoRw9gA%2BxYw1sPDIUPCFBVAaN1RyNR0TR/BgAcdnVXUKcMCK1E1%2Bd7NgxBAyAvdbcX0jdTBY0q0fN%2BhLMbc1crcahCAEACiCAiiMYlsMjr1CA4hTRh94IsAnd7NVBjUsRww3AalAkK4wAwATBzAnRnN9hMAnROUCIWE14Bi8Y%2BiHgOAZhaBOA5JeAvAOAtBSBUBOAfAfAAA1bYTYA4AAJSIheEWHUh4FIAIE0EWJmFNBAEkDdGTlFD5H8WiBeI0DeOkGWI4EkDWNuK2M4F4CgkUhuI2MWNIDgFgCQDQA7EEjIAoAgDhLiARJACIPZUMGAHVBNHI0%2BF3ygggBCEBImWYGICRE4CuNJJqCRAAHkQhtAMDKTeA4TaFaSzcKSITSAsAgxgAnAyDmTuSaEsTxAuT0DQI8IoIuTxYy1jwlgriRxfjNi7R6VySXAsBASQlcRuBeAOlAYlAjhhSjBzdQAISZhJiG09i8BMBLRaSFN1irj%2BBBARAxB2ApAZBBBFAVB1AuTdAmhc1jAzALAVSoJIAZhTMUgpSsRaSjJOcFgEAzg2kOlVitiOkNQsBQyIAZgrAMCUgHAqohhGhqxfAYCJgeg3j4hEhkgBBCzBRKzsgUhxhugIgKyczQIBA2hBhXAGg9A2yKhOyOhSzmzeyBg6huzhhLBRymzCgehVJZh5hFgJAliViASuTtiOBiwAzNgIAcSzQ1IIBcBCASBLipheBwStAph7iQAFInQ3jXRPi7yPjXZfj/jSBcRqZFJ1jNj1yQSQAwTbjLz9BOAARVzvzgTriAKZgOkkh7BJAgA%3D)
```asm
.LCPI0_1:
.byte 0
.byte 2
.byte 4
.byte 6
.byte 8
.byte 10
.byte 12
.byte 14
numBytesInFiles:
.LnumBytesInFiles$local:
push rbp
mov rbp, rsp
cmp rsi, 33
jae .LBB0_2
xor eax, eax
xor edx, edx
jmp .LBB0_5
.LBB0_2:
vpmovsxbq zmm1, qword ptr [rip + .LCPI0_1]
lea rax, [rsi - 32]
lea rcx, [rdi + 392]
vpxor xmm0, xmm0, xmm0
vpxor xmm2, xmm2, xmm2
vpxor xmm3, xmm3, xmm3
vpxor xmm4, xmm4, xmm4
mov rdx, rax
.LBB0_3:
vmovdqu64 zmm5, zmmword ptr [rcx - 384]
vmovdqu64 zmm6, zmmword ptr [rcx - 256]
vmovdqu64 zmm7, zmmword ptr [rcx - 128]
vmovdqu64 zmm8, zmmword ptr [rcx]
vpermt2q zmm5, zmm1, zmmword ptr [rcx - 320]
vpermt2q zmm6, zmm1, zmmword ptr [rcx - 192]
vpermt2q zmm7, zmm1, zmmword ptr [rcx - 64]
vpermt2q zmm8, zmm1, zmmword ptr [rcx + 64]
add rcx, 512
add rdx, -32
vpaddq zmm0, zmm5, zmm0
vpaddq zmm2, zmm6, zmm2
vpaddq zmm3, zmm7, zmm3
vpaddq zmm4, zmm8, zmm4
jne .LBB0_3
vpaddq zmm0, zmm2, zmm0
vpaddq zmm0, zmm3, zmm0
vpaddq zmm0, zmm4, zmm0
vextracti64x4 ymm1, zmm0, 1
vpaddq zmm0, zmm0, zmm1; why is this not a ymm operation?
vextracti128 xmm1, ymm0, 1
vpaddq xmm0, xmm0, xmm1
vpshufd xmm1, xmm0, 238
vpaddq xmm0, xmm0, xmm1
vmovq rdx, xmm0; Why do we need all the stuff below???
.LBB0_5:
lea rcx, [rsi - 4]
vmovq xmm0, rdx
mov rdx, rcx
sub rdx, rax
shl rax, 4
lea rax, [rax + rdi + 8]
.LBB0_6:
vmovdqu ymm1, ymmword ptr [rax]
vpunpcklqdq ymm1, ymm1, ymmword ptr [rax + 32]
add rax, 64
add rdx, -4
vpermq ymm1, ymm1, 216
vpaddq ymm0, ymm1, ymm0
jne .LBB0_6
vextracti128 xmm1, ymm0, 1
shl rcx, 4
shl rsi, 4
vpaddq xmm0, xmm0, xmm1
vpshufd xmm1, xmm0, 238
vpaddq xmm0, xmm0, xmm1
vmovq rax, xmm0
add rax, qword ptr [rdi + rcx + 8]
add rax, qword ptr [rsi + rdi - 40]
add rax, qword ptr [rsi + rdi - 24]
add rax, qword ptr [rsi + rdi - 8]
pop rbp
vzeroupper
ret
```
I can see what the compiler was going for up to `.LBB0_5`. After that, I am at a loss as to what it is doing, and I did not take the time to walk through each step. I would prefer an emit similar to this, which I was able to coerce the compiler into producing for a size-optimized build:
```asm
numBytesInFilesFaster:
push rbp
mov rbp, rsp
shr rsi, 4
vpxor xmm0, xmm0, xmm0
vpxor xmm1, xmm1, xmm1
vpxor xmm2, xmm2, xmm2
vpxor xmm3, xmm3, xmm3
.LBB0_1:
sub rsi, 1
jb .LBB0_3
vpaddq zmm0, zmm0, zmmword ptr [rdi]
vpaddq zmm1, zmm1, zmmword ptr [rdi + 64]
vpaddq zmm2, zmm2, zmmword ptr [rdi + 128]
vpaddq zmm3, zmm3, zmmword ptr [rdi + 192]
add rdi, 256
jmp .LBB0_1
.LBB0_3:
vpaddq zmm0, zmm0, zmm1
vpaddq zmm1, zmm2, zmm3
vpaddq zmm0, zmm0, zmm1
vextracti64x4 ymm1, zmm0, 1
vpaddq ymm0, ymm0, ymm1; Yay! Not a zmm operation!
vextracti128 xmm1, ymm0, 1
vpaddq xmm0, xmm0, xmm1
vpextrq rax, xmm0, 1; just forget about xmm0[0]
pop rbp
vzeroupper
ret
```
I feel like this strategy makes a lot more sense: add every element with plain contiguous loads, and only pick out the odd lanes in the final reduction. I have not benchmarked it against the new LLVM version, but if I remember correctly, my idea was 2-3x faster than the `vpgatherqq` strategy. The `vpermt2q` strategy is definitely a lot smarter than the `vpgatherqq` strategy, so I am not sure how it compares with my version. It would be easiest for me to compare once Zig uses a newer version of LLVM, and I am not sure what the timeframe for that looks like. Since I am a bit pressed for time at the moment, I have not set up a benchmark, but I wanted to get this idea out there in case someone wants to dig into it sooner rather than later.
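
For anyone who wants to experiment, here is a minimal Zig sketch of the idea behind the last listing. It is not necessarily the exact source behind the Godbolt link, and `sumLensContiguous` is a hypothetical name. It assumes Zig 0.11 or newer, that a slice is laid out in memory as `ptr` followed by `len` (which matches the `+ 8` offsets in the emit above but is not something Zig guarantees), and that the number of slices is a multiple of 4.

```zig
const std = @import("std");

/// Hypothetical sketch: sums the `len` of every slice by adding whole
/// blocks of 8 words (ptr and len lanes alike) with contiguous loads,
/// then keeping only the odd (len) lanes at the very end.
fn sumLensContiguous(files: []const []const u8) usize {
    // Assumption: a slice occupies exactly two words, ptr first, len second.
    comptime std.debug.assert(@sizeOf([]const u8) == 2 * @sizeOf(usize));
    // Assumption: the caller passes a multiple of 4 slices (8 words per block).
    std.debug.assert(files.len % 4 == 0);

    const Block = @Vector(8, usize);
    // View the slice array as a flat array of words.
    const words: [*]const usize = @ptrCast(files.ptr);

    var acc: Block = @splat(0);
    var i: usize = 0;
    while (i < files.len * 2) : (i += 8) {
        const block: Block = words[i..][0..8].*;
        // Wrapping add: the even lanes accumulate pointer values, which
        // may overflow and are thrown away anyway.
        acc +%= block;
    }

    // Only the odd lanes hold `len` values.
    return acc[1] +% acc[3] +% acc[5] +% acc[7];
}
```

Unlike the original `num_bytes += file_data.len`, this version uses wrapping adds throughout, so an overflowing total is not caught in safe build modes. The listing above additionally uses four independent accumulators; a single one is used here to keep the sketch short.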