On Tue, Sep 17, 2024 at 9:57 AM Kyrylo Tkachov <ktkac...@nvidia.com> wrote:
>
>
>
> On 16 Sep 2024, at 16:32, Richard Sandiford <richard.sandif...@arm.com>
> wrote:
> >
> > External email: Use caution opening links or attachments
> >
> >
> > "Pengxuan Zheng (QUIC)" <quic_pzh...@quicinc.com> writes:
> >>> On Thu, Sep 12, 2024 at 2:53 AM Pengxuan Zheng
> >>> <quic_pzh...@quicinc.com> wrote:
> >>>>
> >>>> SVE's INDEX instruction can be used to populate vectors by values
> >>>> starting from "base" and incremented by "step" for each subsequent
> >>>> value. We can take advantage of it to generate vector constants if
> >>>> TARGET_SVE is available and the base and step values are within
> >>>> [-16, 15].
> >>>
> >>> Are there multiplication by or addition of scalar immediate
> >>> instructions to enhance this with two-instruction sequences?
> >>
> >> No, Richard, I can't think of any equivalent two-instruction sequences.
> >
> > There are some.  E.g.:
> >
> >   { 16, 17, 18, 19, ... }
> >
> > could be:
> >
> >   index z0.b, #0, #1
> >   add z0.b, z0.b, #16
> >
> > or, alternatively:
> >
> >   mov w0, #16
> >   index z0.b, w0, #1
I guess even a step between [16, 31] could be handled with index using half
the step and then adding the result to itself (a multiply by immediate #2),
even if there's no direct vector-by-immediate instruction available.
Likewise of course some { A0 + n * B1 + n * B2, ... } can be handled by
adding two index compute results.

> > But these cases are less obviously a win, so I think it's ok to handle
> > single instructions only for now.
>
> (Not related to this patch, this work is great, thanks Pengxuan!)
> Looking at some SWOGs like for Neoverse V2 it looks like the first
> sequence is preferable.
> On that core the INDEX-immediates-only operation has latency 4 and
> throughput 2 and the SVE ADD is as cheap as SIMD operations can be on
> that core.
> But in the second sequence the INDEX-reg-operand has latency 7 and
> throughput 1 as it seems to treat it as a GP <-> SIMD transfer of some
> sort.

So what's the latency/throughput of a vector load from the constant pool
(can we even have an "SVE" constant pool?  I assume entries would have to
be of the architecturally largest vector size?), assuming it's in L1
(where it would occupy quite some space eventually).

Richard.

> We may encounter a situation in the future where we'll want to optimize
> the second sequence (if it comes from intrinsics code for example) into
> the first.
> Thanks,
> Kyrill
>
> >
> > The patch is ok for trunk, thanks, but:
> >
> >>>> @@ -22991,7 +22991,7 @@ aarch64_simd_valid_immediate (rtx op,
> >>> simd_immediate_info *info,
> >>>>     if (CONST_VECTOR_P (op)
> >>>>         && CONST_VECTOR_DUPLICATE_P (op))
> >>>>       n_elts = CONST_VECTOR_NPATTERNS (op);
> >>>> -  else if ((vec_flags & VEC_SVE_DATA)
> >>>> +  else if (which == AARCH64_CHECK_MOV && TARGET_SVE
> >>>>            && const_vec_series_p (op, &base, &step))
> >
> > ...the convention is to have one && condition per line if the whole
> > expression doesn't fit on a single line:
> >
> >    else if (which == AARCH64_CHECK_MOV
> >             && TARGET_SVE
> >             && const_vec_series_p (op, &base, &step))
> >
> > Richard
>