> > > On 16 Sep 2024, at 16:32, Richard Sandiford <richard.sandif...@arm.com> wrote:
> > >
> > > "Pengxuan Zheng (QUIC)" <quic_pzh...@quicinc.com> writes:
> > >>> On Thu, Sep 12, 2024 at 2:53 AM Pengxuan Zheng
> > >>> <quic_pzh...@quicinc.com> wrote:
> > >>>>
> > >>>> SVE's INDEX instruction can be used to populate vectors by values
> > >>>> starting from "base" and incremented by "step" for each
> > >>>> subsequent value. We can take advantage of it to generate vector
> > >>>> constants if TARGET_SVE is available and the base and step values
> > >>>> are within [-16, 15].
> > >>>
> > >>> Are there multiplication by or addition of scalar immediate
> > >>> instructions to enhance this with two-instruction sequences?
> > >>
> > >> No, Richard, I can't think of any equivalent two-instruction sequences.
> > >
> > > There are some. E.g.:
> > >
> > >   { 16, 17, 18, 19, ... }
> > >
> > > could be:
> > >
> > >   index z0.b, #0, #1
> > >   add z0.b, z0.b, #16
> > >
> > > or, alternatively:
> > >
> > >   mov w0, #16
> > >   index z0.b, w0, #1
>
> I guess even step between [16, 31] could be handled with index with half step
> and then adding the result to itself (multiply by immediate #2), even if there's
> no direct vector-by-immediate instruction available. Likewise of course some
> { A0 + n * B1 + n * B2, ... } can be handled by adding two index compute
> results.

Thanks for the example, Richard! It does seem to be something worth looking
into.

Thanks,
Pengxuan

>
> > > But these cases are less obviously a win, so I think it's ok to
> > > handle single instructions only for now.
> >
> > (Not related to this patch, this work is great, thanks Pengxuan!)
> > Looking at some SWOGs like for Neoverse V2 it looks like the first sequence
> > is preferable.
> > On that core the INDEX-immediates-only operation has latency 4 and
> > throughput 2 and the SVE ADD is as cheap as SIMD operations can be on that
> > core.
> > But in the second sequence the INDEX-reg-operand has latency 7 and
> > throughput 1 as it seems to treat it as a GP <-> SIMD transfer of some sort.
>
> So what's the latency/throughput of a vector load from constant pool (can we
> even have a "SVE" constant pool? I assume entries would have to be of the
> architecturally largest vector size?), assuming it's in L1 (where it would
> occupy quite some space eventually).
>
> Richard.
>
> > We may encounter a situation in the future where we’ll want to optimize the
> > second sequence (if it comes from intrinsics code for example) into the first.
> > Thanks,
> > Kyrill
> >
> > >
> > > The patch is ok for trunk, thanks, but:
> > >
> > >>>> @@ -22991,7 +22991,7 @@ aarch64_simd_valid_immediate (rtx op, simd_immediate_info *info,
> > >>>>    if (CONST_VECTOR_P (op)
> > >>>>        && CONST_VECTOR_DUPLICATE_P (op))
> > >>>>      n_elts = CONST_VECTOR_NPATTERNS (op);
> > >>>> -  else if ((vec_flags & VEC_SVE_DATA)
> > >>>> +  else if (which == AARCH64_CHECK_MOV && TARGET_SVE
> > >>>>        && const_vec_series_p (op, &base, &step))
> > >
> > > ...the convention is to have one && condition per line if the whole
> > > expression doesn't fit on a single line:
> > >
> > >   else if (which == AARCH64_CHECK_MOV
> > >            && TARGET_SVE
> > >            && const_vec_series_p (op, &base, &step))
> > >
> > > Richard
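
For reference, the choice between the sequences discussed above could be
sketched roughly as follows. This is standalone C, not GCC code: the enum,
the helper names and plan_series itself are hypothetical, and the 0..255
range assumed for the ADD immediate is my assumption rather than something
stated in this thread. The only range taken from the patch description is
[-16, 15] for the INDEX immediates. The ordering prefers INDEX + ADD over
MOV + INDEX, following the Neoverse V2 numbers quoted above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical classification of how a series { base, base + step, ... }
   might be materialised.  Illustrative only, not part of the patch.  */
enum series_plan
{
  PLAN_INDEX_IMM,      /* index z0.b, #base, #step                  */
  PLAN_INDEX_THEN_ADD, /* index z0.b, #0, #step; add z0.b, z0.b, #base */
  PLAN_INDEX_REG,      /* mov w0, #base; index z0.b, w0, #step      */
  PLAN_NONE            /* not handled by this sketch                */
};

/* INDEX immediates are limited to [-16, 15], per the patch description.  */
static bool
fits_index_imm (int64_t v)
{
  return v >= -16 && v <= 15;
}

/* Assumption: the ADD immediate is an unshifted unsigned value 0..255.  */
static bool
fits_add_imm (int64_t v)
{
  return v >= 0 && v <= 255;
}

/* Pick a plan, preferring the single instruction, then INDEX + ADD,
   then MOV + INDEX.  Steps outside [-16, 15] are left unhandled.  */
static enum series_plan
plan_series (int64_t base, int64_t step)
{
  if (!fits_index_imm (step))
    return PLAN_NONE;
  if (fits_index_imm (base))
    return PLAN_INDEX_IMM;
  if (fits_add_imm (base))
    return PLAN_INDEX_THEN_ADD;
  return PLAN_INDEX_REG;
}
```

With these assumptions, plan_series (16, 1) selects the INDEX + ADD form for
the { 16, 17, 18, 19, ... } example, and the half-step-and-double idea for
larger steps is deliberately left out.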