https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117978

--- Comment #7 from GCC Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Jennifer Schmitz <jschm...@gcc.gnu.org>:

https://gcc.gnu.org/g:210d06502f22964c7214586c54f8eb54a6965bfd

commit r16-446-g210d06502f22964c7214586c54f8eb54a6965bfd
Author: Jennifer Schmitz <jschm...@nvidia.com>
Date:   Fri Feb 14 00:46:13 2025 -0800

    AArch64: Fold SVE load/store with certain ptrue patterns to LDR/STR.

    SVE loads/stores using predicates that select the bottom 8, 16, 32, 64,
    or 128 bits of a register can be folded to ASIMD LDR/STR, thus avoiding the
    predicate.
    For example,
    svuint8_t foo (uint8_t *x) {
      return svld1 (svwhilelt_b8 (0, 16), x);
    }
    was previously compiled to:
    foo:
            ptrue   p3.b, vl16
            ld1b    z0.b, p3/z, [x0]
            ret

    and is now compiled to:
    foo:
            ldr     q0, [x0]
            ret
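
    The store direction is handled as well. As an illustrative sketch (this
    function and its expected output are not taken from the commit), an ST1
    through the same 128-bit predicate:
    void bar (uint8_t *x, svuint8_t v) {
      svst1 (svptrue_pat_b8 (SV_VL16), x, v);
    }
    should now compile to a single unpredicated ASIMD store:
    bar:
            str     q0, [x0]
            ret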

    The optimization is applied during the expand pass and was implemented
    by making the following changes to maskload<mode><vpred> and
    maskstore<mode><vpred>:
    - the existing define_insns were renamed and new define_expands for
      maskloads and maskstores were added with nonmemory_operand as the
      predicate, so that the SVE predicate matches both register operands
      and constant-vector operands (see the example after this list).
    - if the SVE predicate is a constant vector and contains a pattern as
      described above, an ASIMD load/store is emitted instead of the SVE
      load/store.
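
    As a hedged illustration of that distinction (these functions are not
    part of the commit), a predicate built from a constant pattern is
    visible at expand time and can be folded, while a run-time predicate
    keeps the predicated SVE form:
    svuint8_t f_const (uint8_t *x) {
      /* Predicate is a constant vector at expand time: eligible to fold.  */
      return svld1 (svptrue_pat_b8 (SV_VL16), x);
    }
    svuint8_t f_var (svbool_t pg, uint8_t *x) {
      /* Predicate is a register operand: remains a predicated LD1B.  */
      return svld1 (pg, x);
    }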

    The patch implements the optimization for LD1 and ST1, for 8-bit, 16-bit,
    32-bit, 64-bit, and 128-bit moves, for all full SVE data vector modes.
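
    For instance, a predicate selecting only the bottom 32 bits should fold
    to a 32-bit ASIMD load (a sketch; the exact assembly is an assumption
    based on the description above):
    svuint8_t baz (uint8_t *x) {
      return svld1 (svwhilelt_b8 (0, 4), x);
    }
    baz:
            ldr     s0, [x0]
            ret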

    Follow-up patches for LD2/3/4 and ST2/3/4 and potentially partial SVE
    vector modes are planned.

    The patch was bootstrapped and tested on aarch64-linux-gnu, no regression.

    Signed-off-by: Jennifer Schmitz <jschm...@nvidia.com>

    gcc/
            PR target/117978
            * config/aarch64/aarch64-protos.h: Declare
            aarch64_emit_load_store_through_mode and aarch64_sve_maskloadstore.
            * config/aarch64/aarch64-sve.md
            (maskload<mode><vpred>): New define_expand folding maskloads with
            certain predicate patterns to ASIMD loads.
            (*aarch64_maskload<mode><vpred>): Renamed from
            maskload<mode><vpred>.
            (maskstore<mode><vpred>): New define_expand folding maskstores with
            certain predicate patterns to ASIMD stores.
            (*aarch64_maskstore<mode><vpred>): Renamed from
            maskstore<mode><vpred>.
            * config/aarch64/aarch64.cc
            (aarch64_emit_load_store_through_mode): New function emitting a
            load/store through subregs of a given mode.
            (aarch64_emit_sve_pred_move): Refactor to use
            aarch64_emit_load_store_through_mode.
            (aarch64_expand_maskloadstore): New function to emit ASIMD
            loads/stores for maskloads/stores with SVE predicates with VL1,
            VL2, VL4, VL8, or VL16 patterns.
            (aarch64_partial_ptrue_length): New function returning number of
            leading set bits in a predicate.

    gcc/testsuite/
            PR target/117978
            * gcc.target/aarch64/sve/acle/general/whilelt_5.c: Adjust expected
            outcome.
            * gcc.target/aarch64/sve/ldst_ptrue_pat_128_to_neon.c: New test.
            * gcc.target/aarch64/sve/while_7.c: Adjust expected outcome.
            * gcc.target/aarch64/sve/while_9.c: Adjust expected outcome.