Issue 132666
Summary [AVX-512] Consider that summing every odd element of an array can be done without gathers/perms
Labels new issue
Assignees
Reporter Validark
    In my real code, I have a bunch of slices (each is a `struct { ptr, len }`) and I want to sum together the `len` values to determine the total size. [Zig Godbolt](https://zig.godbo.lt/#z:OYLghAFBqd5QCxAYwPYBMCmBRdBLAF1QCcAaPECAMzwBtMA7AQwFtMQByARg9KtQYEAysib0QXAGx8BBAKoBnTAAUAHpwAMvAFYTStJg1AAvPMFJL6yAngGVG6AMKpaAVxYM9DgDJ4GmADl3ACNMYhAAZmkAB1QFQlsGZzcPPVj4mwFffyCWUPCoi0wrTIYhAiZiAmT3Ty4iksTyyoJswJCwyOkFCqqa1Pqelrbc/K6ASgtUV2Jkdg40Bh6Aah70ZYBSCIARZYABPBZYqogNgCYztfOz8a2AIQ2NAEFHp8xVY4JlqgZlhnc7gBPAiYBQASQYADE6KDqDCFCBNgBWO4AKg2SO2GIemMWK1cAA5SH93AB9Gj0BHLVzxYyYcbU2mYTYAdgez2WnOWeCoyyg/xY5PhmzOkmWES4DLAYC2uw0yxIJMFFNBmx2suWGgZrgYxEwTGQCCYwXo91eXOWADdKkrScFgaDETS8HS1XKzc9zVz%2BMQ%2BSqFNiNAA6IMCoWUjHbBkbFmOFWk9BMCoxxxei2csP2kEKEUPHbfGEJpNMIP0Bgel4crl6ggzX6Zh0BiLsl4srGe54/JVA7MQ6GUyFMHphOGUxHY9GYsRmBgQM5IyS3FGRvFfQnEsP%2Bp1M7VM1kti08vkQTfC85iiVSmX5%2BWK0%2BUt0arXU3X6w3G03NtNWm0N7OIvYADVMGsEgIAiM5iWdOlo3zPYFGiAwCAgLUKwtVcrRAqlsUjICQKIYhwMgxkXXpN19miAhiEcIdkP9QMQ3vUFI1uL8q05H0%2BUtLDo1jeNE2TWNvwtP9VXOPNdn44sK2/DD4iMehSWMFgWEfdj0zWIN4hYdAg3eKiDWQ0SFGJDRiQJaMzjuTZ1ItTTtN0/TiEMk8ySzUFzPMyyD3TTl7MORzVAM6xXMFdyTOWKQvNzYSuX8nS9KC5yQuM4kzgAFi8mT1LkvxgEUwEVLUp5fL8ghdIcxLguQ%2BT8swJSVNM4l0u82KyoqgKquSmq8sU5SWGa5rWJ8zlZIEFZasU1QitlNrVnKrTOqclzJvqwqBs1NLWtsuKFsq5aQtW0l1rSrbspK6tMFrYhfiO6aWGxLhIxkttXg4SZaE4JFeE8DgtFIVBOAALTMVZplmZlzglXgCE0d7JgAaxAJECSDLgAE50ZZSQWRZLgMckM4CSRfROHS3gWAkDQzN%2B/7AY4XgETM2G/ve0g4FgJA0COGEyAoCBueiXmQGMDQuHSsyKRBYgEQgYI4dIYI/EqQFOB4RXleIQEAHlgm0fC1d4bm2EEbWGFoVXWdILBglcYAaNoWgEW4XgsBYQw6oV/A9WsPBuOd/73hA1wQUN8hBGKBXaDwYJnK15wsAVqjDjD7jiGCOJMG2TB3YUvK4cmKgDGABRALwTAAHdteiRgw/4QQRDEdguBZGRBEUFR1Ct3R6gMIwRbMfQY4RSBJlQSjEmdgBabXlgAJWKfUlEHFYp96YAruWVQCUkUlJHSqfo/%2BVRlinlhkGiVxZWMBg0%2B%2BgG0%2BIPAsBHiBJksfDEnsBgnBcWovB/iMDo4R6jpASAIfodRSBgNKEAvInRBiL19gIZofQ/4DAaJ/FBvRWh%2BHaPAkBFgcGQL0EMKocCxiSimDMOYEgPpfR%2BgremyxTDAE1GjdKQZ5QQFwIQRUUNJQwwLpMBA%2BosDhDfqQJG6UuBBgJNTM4EEibyIiHjIkn0ODk1IJTLg1NSC014PTRmIBmbCNJhwM4FMqY0yYZwIRrNxiTDTvEOw6UgA%3D%3D)

```zig
export fn numBytesInFiles(files: [*][]const u8, num_files: usize) usize {
    if ((num_files & 31) != 0 or num_files == 0) unreachable; // make emit simpler
    var num_bytes: usize = 0;

    for (files[0..num_files]) |file_data|
        num_bytes += file_data.len;

    return num_bytes;
}
```

By default, this is what I get:

```asm
.LCPI0_1:
        .quad 32
.LCPI0_2:
        .byte   0
        .byte   1
        .byte   2
        .byte   3
        .byte   4
        .byte   5
        .byte   6
        .byte   7
numBytesInFiles:
        push    rbp
        mov     rbp, rsp
        vpmovsxbq       zmm1, qword ptr [rip + .LCPI0_2]
        vpbroadcastq    zmm2, qword ptr [rip + .LCPI0_1]
        vpxor   xmm0, xmm0, xmm0
        vpxor   xmm3, xmm3, xmm3
        vpxor   xmm4, xmm4, xmm4
        vpxor   xmm5, xmm5, xmm5
.LBB0_1:
        vpsllq  zmm6, zmm1, 4
        kxnorw  k1, k0, k0
        vpxor   xmm7, xmm7, xmm7
        vpaddq  zmm1, zmm1, zmm2
        add     rsi, -32
        vpgatherqq      zmm7 {k1}, zmmword ptr [rdi + zmm6 + 8]
        kxnorw  k1, k0, k0
        vpaddq  zmm0, zmm7, zmm0
        vpxor   xmm7, xmm7, xmm7
        vpgatherqq      zmm7 {k1}, zmmword ptr [rdi + zmm6 + 136]
        kxnorw  k1, k0, k0
        vpaddq  zmm3, zmm7, zmm3
        vpxor   xmm7, xmm7, xmm7
        vpgatherqq      zmm7 {k1}, zmmword ptr [rdi + zmm6 + 264]
        kxnorw  k1, k0, k0
        vpaddq  zmm4, zmm7, zmm4
        vpxor   xmm7, xmm7, xmm7
        vpgatherqq      zmm7 {k1}, zmmword ptr [rdi + zmm6 + 392]
        vpaddq  zmm5, zmm7, zmm5
        jne     .LBB0_1
        vpaddq  zmm0, zmm3, zmm0
        vpaddq  zmm0, zmm4, zmm0
        vpaddq  zmm0, zmm5, zmm0
        vextracti64x4   ymm1, zmm0, 1
        vpaddq  zmm0, zmm0, zmm1 ; why is this not a ymm operation?
        vextracti128    xmm1, ymm0, 1
        vpaddq  xmm0, xmm0, xmm1
        vpshufd xmm1, xmm0, 238
        vpaddq  xmm0, xmm0, xmm1
        vmovq   rax, xmm0
        pop     rbp
        vzeroupper
        ret
```

There are a number of issues with this emit. There are serial dependency chains that could have been parallel and we are using `vpgatherqq` at a high cost for little benefit. On LLVM's Godbolt (I assume a newer version of LLVM) I got this emit: [LLVM Godbolt](https://llvm.godbolt.org/#z:OYLghAFBqd5QCxAYwPYBMCmBRdBLAF1QCcAaPECAMzwBtMA7AQwFtMQByARg9KtQYEAysib0QXACx8BBAKoBnTAAUAHpwAMvAFYTStJg1C1aANxakl9ZATwDKjdAGFUtAK4sGexwBk8DTAA5DwAjTGIQAHYATlIAB1QFQjsGF3dPPQSk2wE/AOCWMIiYy0xrHIYhAiZiAjSPLy5S8pSqmoI8oNDwqNiFatr6jKb%2B9s6Cot6ASktUN2Jkdg4AUgBmACEAagBZDDd6AEkAEU21k7AOYlRUAgvljQBBBTmFzAB9GnpmNlPVk%2BWAEwAq43QEA%2B4PAbATAETboJjVAwATzmsLOpyBmAAtCwQNi4gDIhoQKsASSAViCZEuOTyZTCWSAGySEDMrF4ZmsyTsrgAgAcEn5WKofOJvL5WIYAq4jNpbKE4rBEKhMM2BGIeDi9F%2B/yBqj5jLebLcDAA1gxUAB3BhY2j%2BNyqSQAOi40SdGhxbgUtCVjwhAAEQm46LYGE6AF54YBvEJMZCmxw6zb%2BAjhZi0TYm76YdBvJjodDETZoBijQTJ5mbAGkTZiKMMTZ8gOjWpOpIsLV4Gg5t60VDAPDIJMptNiTMMbO5/OF4sCMuwvBcTZUMRKGt14ANrgBoMh/xO1FxNwEN4sDCYYeCUcZrOsHvTosl%2BfJgGbLHV2t2zebbd%2Bv0bTYADETRsFJNkhdUFBATYLRNS1/HQTNLWqEJtQtBRDGScN3jQUxwiYaFNgUU1NTiK5PkwCEsBoAI4WeXtUFEDMOUkTZ/QYDx1iRVMFAOBhALoTAFAgOJ1U/etGxggQOJMDEAFYNBrFj5K4KYMVWDRTkidYITJCFNnk1YkzEPtRFOOT1j5TZVGTJs5KOdcvwbJtHgMwE5NY9ETMYphzMs6zbOWezHIklyHjcgE5Lk4yTB8vyrJsvA7Ic8TvzCgz%2BhIC9RKLdzFM2HL5LkkK0v0%2BTGWMhhEOU9ymk2VZf3C%2BTImHZAOxgi8asixkaw0MqQiLRdmprAwwgzWqes2UayhUyIIQhAUysy4hOsrfLCvc6RUucpaiBWitWPWsT3NWEqdtc%2BToiTPt8wK9Uaw2yLiu2xsytqzT0WhVN6DYQRCv8EI5iqhQtK2HKlMrZZIhSx65NiA7Nj6i7aqXdF/CUWpTDENwLyhsH7oRqGTgSPBngYB7jsirh8qRpratfNHS3CAgsfcXHtLushCehlS6q6uS6saiKBaM9FMFUdU4xZ7H2fxrmap5%2BmayFzYBqmpgxvkgFVnmx4pBAN6qei9E4gQPBk1Riy1WIHGawm8yUqCrZar5O2qZaoKjn6wbUaNkaNZm2r4em8bIoBWmIRlA2LrVkOVMkXWHi4SJo7pqmrLRtq4k2TAAEdCap3rvfVzXaoTv8k%2BiVODJNFa4wQDX6ET8Pq9Vos4/csLdJpQ25KBa7UFuliKdyyKto3c6mpWhdIbD8EK%2B11v3IBTy/imwfquZEfDLO17kbD4218Hdr9hnzyw622qdZjn2teejuw9dkvA8i1Y5Ob2Ve8JYyCxgtxLQLn3LaKtlqrXPn3SI28Tq73Sm3Z%2Boc%2B7XweLpRa%2B8%2B5XTFhLYgUtWY41BpzCGnlFZUw/LTYWGkkxfTKJgX6BB/oMEBiadAIM8YEO5jDSm6DCFa3LmnOSDUB63VYeDdhUD
X75QnnvPhpJBGby2rDceTkpHkNFmvcWksbC4NlmwhWup%2BGC17qsVeJxpx/wAfzUkbt%2BFIIyntMBhl5GcMUaFYuD8%2B6MkTm/VuscA4IIBI1ImicETqjwEGHi6kPpr1YbBBg8EqpIRQmhRImFbDYTeLhfChFiKkXIoJDEAIqDYLYJSVAI5iBKj%2BGCEyYJ8kqgIFiZAR4KleyBOGBgeFiDv37mCOpwpMAInmEJZpSoATrGZCEQgpBATrHzKoKZoymBCXmTM2KlpbRRgQPU7qWIJkEGWUwUwczpkHNUNWY5hyBYAhCFQGU%2ByLm8l2WIYAdzVCXJCJaF5lzkDoE%2BbydAudfkFMBV2FgTBAWmBCCwPA4LIV4DOQs%2B5AJTC0HBROaF5zXm8lMASMpSgbDgoSHEZAgh/kvNMGi5ZsLKVQvhesAahhkAICoF6FIyzkC0CoO4BQCBUCiTZbQS0IR%2BXYSuGys8pg2ULEsdM5Aqhbkyv1MszAeFMW0puYyZAyyVz9CxFwOSIRuKMF5VqpgOqwhYJNTq9AcRLQKAUD86Z2r6l4BYCweVoynW2nDMSvZjrTX1PFWES19SFBMRqMKBQudajBqxKGsQ4buVdgDaa00CgY1Yw1I3fE4QDABFjQgNwVBOWYHTZgGwJAI1Rt9R60FWqFDAAULGNcjqFDEAsI61QrblnACoAwdFoy8D4FMLnLERAsRDpRdM/w2LkCDuWbQb1ghlmguQORFlAhl0sCOaMwNJbpnivwMQcZy7UCmEPf29YLBLRMEINu9YFotTLKJbQFg%2Bwn2mjcE%2B3lPqn0rSoDCRlF6yJUEZR86ZxBrVzvA9am2yyIPECqnB5hmAczLIwkytDIQQjjswHEHZdcSJGDQw3Yj/RTRob7Gs7ltAHWjLtXuujShaX0dOtM%2BjzomhsaUM6Zj3GwVcaUKx0ZWMlnTJnS%2B/YudSViZFXMOIcRwjLMFRaadtH1j6kgdMztByGPqYwnhTVWn9OYF5dWvTOm026sZLsnE55SBYlJDZs8WB7OsFUDs9Vrmt0NNQB2egcysRueFHEW5gXvMpldmF9zth6CucOdTF0745I9UC/FjQiXLlxdVVQELKWTkgrBal1QAQSwdL2UV8lfaUyhZOZVvAEX7P0oYIyrEZsl0NKzvZ5AVAus0ZoTcEtDSOzIFldOezmBgBkXGwwXObUflYnCCwRt9mVy0CUDhuICgsRfVBe11b62sCba9T60di7yv7exIdrbr7aCauFKuS7uHrv7FoAC%2B7a3HtHYU22lbD2NtbcJT%2B97B2ntYmwUYQbnquBcANamB9v2dWRFh0auICP6nAARAgRT92dU8s22j/NNGwc3ARJD/1WIM14CzQ0q4drc3Ym5YW4tBODN7XzUmnEKbLNOvCOzqg9SCANxtIzotsXhSgukK1moWBhdrfZNoDs9mG4QcYLGuX08lcrSUOVwdeBkWjtQOOvXQr2QMDtHm01LAsRemxJN4glisSmhRbafpsbUcu6YNbpQW2CL2bMHgBpNA/emADzddAUuVd9qI7aS07uqrzYYD1yUhuDVxFNVtrAyJ7MWh2UiNPdqcPIh2V8BPKe8/p8L0wJEtmJXJ9z/njPZQq/5qZ2LtP4fuUkHqcy5rFRLNEoEDQZ5lI4gBbIpgf9xBdVCl2fZ8fk%2BkuMh2ZMykf7eeguImDibpNUzEH72vqf4qlumixKe3npgeLlfn7znPGPBfY%2Bv1PnPcaCC77nwQS0GpUxz69AgAEhLqwwcmBSl2tp4EhzdBswDXB/BsR1FrxR0C0zR7MoCIDTdD0y16kmtGUlkt86FoC80EI8AVobAGkTJLNagLBY1wgqc7RsJ7NvccD60AtuUmBMs1crQHNeUtt6BCtvQOD/Bvky16DKMPdhCOC4hQUCxLR5s%2BC1kJCXsZCRCP9Dc2ArcuCxC1kTQJ4cwcQaErMNDrd0x6wdDVCHNACFAWBTpY0WBJ
dnh%2BdhQbpyt6NDDtDw9VD7NqhgBoRw9gA%2BxYw1sPDIUPCFBVAaN1RyNR0TR/BgAcdnVXUKcMCK1E1%2Bd7NgxBAyAvdbcX0jdTBY0q0fN%2BhLMbc1crcahCAEACiCAiiMYlsMjr1CA4hTRh94IsAnd7NVBjUsRww3AalAkK4wAwATBzAnRnN9hMAnROUCIWE14Bi8Y%2BiHgOAZhaBOA5JeAvAOAtBSBUBOAfAfAAA1bYTYA4AAJSIheEWHUh4FIAIE0EWJmFNBAEkDdGTlFD5H8WiBeI0DeOkGWI4EkDWNuK2M4F4CgkUhuI2MWNIDgFgCQDQA7EEjIAoAgDhLiARJACIPZUMGAHVBNHI0%2BF3ygggBCEBImWYGICRE4CuNJJqCRAAHkQhtAMDKTeA4TaFaSzcKSITSAsAgxgAnAyDmTuSaEsTxAuT0DQI8IoIuTxYy1jwlgriRxfjNi7R6VySXAsBASQlcRuBeAOlAYlAjhhSjBzdQAISZhJiG09i8BMBLRaSFN1irj%2BBBARAxB2ApAZBBBFAVB1AuTdAmhc1jAzALAVSoJIAZhTMUgpSsRaSjJOcFgEAzg2kOlVitiOkNQsBQyIAZgrAMCUgHAqohhGhqxfAYCJgeg3j4hEhkgBBCzBRKzsgUhxhugIgKyczQIBA2hBhXAGg9A2yKhOyOhSzmzeyBg6huzhhLBRymzCgehVJZh5hFgJAliViASuTtiOBiwAzNgIAcSzQ1IIBcBCASBLipheBwStAph7iQAFInQ3jXRPi7yPjXZfj/jSBcRqZFJ1jNj1yQSQAwTbjLz9BOAARVzvzgTriAKZgOkkh7BJAgA%3D)

```asm
.LCPI0_1:
        .byte   0
        .byte   2
        .byte   4
        .byte   6
        .byte   8
        .byte   10
        .byte   12
        .byte   14
numBytesInFiles:
.LnumBytesInFiles$local:
        push    rbp
        mov     rbp, rsp
        cmp     rsi, 33
        jae     .LBB0_2
        xor     eax, eax
        xor     edx, edx
        jmp     .LBB0_5
.LBB0_2:
        vpmovsxbq       zmm1, qword ptr [rip + .LCPI0_1]
        lea     rax, [rsi - 32]
        lea     rcx, [rdi + 392]
        vpxor   xmm0, xmm0, xmm0
        vpxor   xmm2, xmm2, xmm2
        vpxor   xmm3, xmm3, xmm3
        vpxor   xmm4, xmm4, xmm4
        mov     rdx, rax
.LBB0_3:
        vmovdqu64       zmm5, zmmword ptr [rcx - 384]
        vmovdqu64       zmm6, zmmword ptr [rcx - 256]
        vmovdqu64       zmm7, zmmword ptr [rcx - 128]
        vmovdqu64       zmm8, zmmword ptr [rcx]
        vpermt2q        zmm5, zmm1, zmmword ptr [rcx - 320]
        vpermt2q        zmm6, zmm1, zmmword ptr [rcx - 192]
        vpermt2q        zmm7, zmm1, zmmword ptr [rcx - 64]
        vpermt2q        zmm8, zmm1, zmmword ptr [rcx + 64]
        add     rcx, 512
        add     rdx, -32
        vpaddq  zmm0, zmm5, zmm0
        vpaddq  zmm2, zmm6, zmm2
        vpaddq  zmm3, zmm7, zmm3
        vpaddq  zmm4, zmm8, zmm4
        jne     .LBB0_3
        vpaddq  zmm0, zmm2, zmm0
        vpaddq  zmm0, zmm3, zmm0
        vpaddq  zmm0, zmm4, zmm0
        vextracti64x4   ymm1, zmm0, 1
        vpaddq  zmm0, zmm0, zmm1 ; why is this not a ymm operation?
        vextracti128    xmm1, ymm0, 1
        vpaddq  xmm0, xmm0, xmm1
        vpshufd xmm1, xmm0, 238
        vpaddq  xmm0, xmm0, xmm1
        vmovq   rdx, xmm0; Why do we need all the stuff below???
.LBB0_5:
        lea     rcx, [rsi - 4]
        vmovq   xmm0, rdx
        mov     rdx, rcx
        sub     rdx, rax
        shl     rax, 4
        lea     rax, [rax + rdi + 8]
.LBB0_6:
        vmovdqu ymm1, ymmword ptr [rax]
        vpunpcklqdq     ymm1, ymm1, ymmword ptr [rax + 32]
        add     rax, 64
        add     rdx, -4
        vpermq  ymm1, ymm1, 216
        vpaddq  ymm0, ymm1, ymm0
        jne     .LBB0_6
        vextracti128    xmm1, ymm0, 1
        shl     rcx, 4
        shl     rsi, 4
        vpaddq  xmm0, xmm0, xmm1
        vpshufd xmm1, xmm0, 238
        vpaddq  xmm0, xmm0, xmm1
        vmovq   rax, xmm0
        add     rax, qword ptr [rdi + rcx + 8]
        add     rax, qword ptr [rsi + rdi - 40]
        add     rax, qword ptr [rsi + rdi - 24]
        add     rax, qword ptr [rsi + rdi - 8]
        pop     rbp
        vzeroupper
        ret
```

I can see what the compiler was going for up to `.LBB0_5`. After that, I am at a loss as to what it is doing, and I did not spend enough time walking through each step to work it out. Either way, I would prefer an emit similar to the following, which I was able to coerce the compiler into producing for a size-optimized build:

```asm
numBytesInFilesFaster:
        push    rbp
        mov     rbp, rsp
        shr     rsi, 4
        vpxor   xmm0, xmm0, xmm0
        vpxor   xmm1, xmm1, xmm1
        vpxor   xmm2, xmm2, xmm2
        vpxor   xmm3, xmm3, xmm3
.LBB0_1:
        sub     rsi, 1
        jb      .LBB0_3
        vpaddq  zmm0, zmm0, zmmword ptr [rdi]
        vpaddq  zmm1, zmm1, zmmword ptr [rdi + 64]
        vpaddq  zmm2, zmm2, zmmword ptr [rdi + 128]
        vpaddq  zmm3, zmm3, zmmword ptr [rdi + 192]
        add     rdi, 256
        jmp     .LBB0_1
.LBB0_3:
        vpaddq  zmm0, zmm0, zmm1
        vpaddq  zmm1, zmm2, zmm3
        vpaddq  zmm0, zmm0, zmm1
        vextracti64x4   ymm1, zmm0, 1
        vpaddq  ymm0, ymm0, ymm1 ; Yay! Not a zmm operation!
        vextracti128    xmm1, ymm0, 1
        vpaddq  xmm0, xmm0, xmm1
        vpextrq rax, xmm0, 1; just forget about xmm0[0]
        pop     rbp
        vzeroupper
        ret
```

I feel like this strategy makes a lot more sense: add every 64-bit lane with plain full-width loads, pointers included, and only pick out the sum of the odd (`len`) lanes at the very end. I have not benchmarked against the newer LLVM version, but if I remember correctly, my approach was 2-3x faster than the `vpgatherqq` strategy. The `vpermt2q` strategy is definitely a lot smarter than `vpgatherqq`, so I am not sure how it compares with mine. It would be easiest for me to compare using a Zig build based on a newer version of LLVM, and I am not sure what the timeframe for that looks like. Since I am a bit pressed for time at the moment, I did not want to commit to setting up a benchmark right now, but I wanted to get this idea out there in case someone wants to dig into it sooner rather than later.
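
For anyone who wants to experiment, here is a minimal sketch of the "sum everything, keep only the odd lanes" idea in source form. It is not the code behind the Godbolt links above. The function name is hypothetical, and it assumes a 64-bit target where each slice is stored as two consecutive `usize` words in `{ ptr, len }` order, which Zig does not guarantee. It uses a single vector accumulator rather than the four seen in the assembly, and wrapping adds because the pointer lanes may overflow.

```zig
const std = @import("std");

// Hypothetical sketch: assumes every slice occupies two consecutive usize
// words laid out as { ptr, len }. Zig does not guarantee this layout.
fn sumLensBySummingEverything(files: []const []const u8) usize {
    const lanes = 8; // eight u64 lanes, i.e. one 512-bit register
    const Vec = @Vector(lanes, usize);

    // Reinterpret the slice array as a flat run of words: ptr0, len0, ptr1, len1, ...
    const words: [*]const usize = @ptrCast(files.ptr);
    const num_words = files.len * 2;
    std.debug.assert(num_words % lanes == 0); // same kind of simplification as the original

    // Add every lane, pointers included. Wrapping adds keep the lanes
    // independent, so garbage in the even lanes cannot affect the odd ones.
    var acc: Vec = @splat(0);
    var i: usize = 0;
    while (i < num_words) : (i += lanes) {
        const chunk: Vec = words[i..][0..lanes].*;
        acc +%= chunk;
    }

    // Only the odd lanes hold sums of `len` values; ignore the even lanes.
    const lane_sums: [lanes]usize = acc;
    var total: usize = 0;
    var lane: usize = 1;
    while (lane < lanes) : (lane += 2) total +%= lane_sums[lane];
    return total;
}
```

Whether LLVM lowers the final odd-lane reduction as compactly as the `vextracti64x4` / `vextracti128` / `vpextrq` sequence above would need to be checked on Godbolt.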