> > > On 16 Sep 2024, at 16:32, Richard Sandiford
> > > <richard.sandif...@arm.com> wrote:
> > >
> > >
> > > "Pengxuan Zheng (QUIC)" <quic_pzh...@quicinc.com> writes:
> > >>> On Thu, Sep 12, 2024 at 2:53 AM Pengxuan Zheng
> > >>> <quic_pzh...@quicinc.com> wrote:
> > >>>>
> > >>>> SVE's INDEX instruction can be used to populate vectors by values
> > >>>> starting from "base" and incremented by "step" for each
> > >>>> subsequent value. We can take advantage of it to generate vector
> > >>>> constants if TARGET_SVE is available and the base and step values are
> > >>>> within [-16, 15].
> > >>>
> > >>> Are there multiplication by or addition of scalar immediate
> > >>> instructions to enhance this with two-instruction sequences?
> > >>
> > >> No, Richard, I can't think of any equivalent two-instruction sequences.
> > >
> > > There are some.  E.g.:
> > >
> > >     { 16, 17, 18, 19, ... }
> > >
> > > could be:
> > >
> > >        index   z0.b, #0, #1
> > >        add     z0.b, z0.b, #16
> > >
> > > or, alternatively:
> > >
> > >        mov     w0, #16
> > >        index   z0.b, w0, #1
> 
> I guess even a step within [16, 31] could be handled with index using half
> the step and then adding the result to itself (multiply by immediate #2),
> even if there's no direct vector-by-immediate instruction available.
> Likewise, of course, some { A0 + n * B1 + n * B2, ... } can be handled by
> adding two index compute results.

Thanks for the example, Richard! It does seem to be something worth looking
into.
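
To make the cases discussed above concrete, here is a rough sketch in C of how
a linear series { base, base + step, ... } might be sorted into the single
INDEX case and the two two-instruction cases mentioned in this thread. It is
not part of the patch and not GCC code: the names index_plan, fits_index_imm
and classify_series are hypothetical, element-width wrap-around is ignored,
and only the unshifted 0..255 form of the SVE ADD immediate is modelled.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical classification for illustration only; not GCC's logic.  */
enum index_plan
{
  PLAN_NONE,         /* none of the sequences below applies */
  PLAN_INDEX,        /* index z0, #base, #step */
  PLAN_INDEX_ADD,    /* index z0, #0, #step ; add z0, z0, #base */
  PLAN_INDEX_DOUBLE  /* index z0, #base/2, #step/2 ; add z0, z0, z0 */
};

/* INDEX (immediates) accepts base and step values within [-16, 15].  */
static bool
fits_index_imm (int64_t v)
{
  return v >= -16 && v <= 15;
}

/* Pick a sequence for the series { base, base + step, ... }.
   Assumption: only the unshifted unsigned 8-bit ADD immediate is used.  */
static enum index_plan
classify_series (int64_t base, int64_t step)
{
  if (fits_index_imm (base) && fits_index_imm (step))
    return PLAN_INDEX;

  /* Base out of INDEX range: start from 0 and add the base afterwards.  */
  if (fits_index_imm (step) && base >= 0 && base <= 255)
    return PLAN_INDEX_ADD;

  /* Even base and step: generate half of each, then add the vector
     to itself to double every element.  */
  if (base % 2 == 0 && step % 2 == 0
      && fits_index_imm (base / 2) && fits_index_imm (step / 2))
    return PLAN_INDEX_DOUBLE;

  return PLAN_NONE;
}
```

With these assumptions, { 16, 17, 18, ... } (base 16, step 1) falls into the
INDEX + ADD case, and a series with base 0 and step 20 falls into the
half-step case. Whether either is actually a win over a constant-pool load is
the open question in the rest of the thread.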

Thanks,
Pengxuan
> 
> > > But these cases are less obviously a win, so I think it's ok to
> > > handle single instructions only for now.
> >
> > (Not related to this patch, this work is great, thanks Pengxuan!)
> > Looking at some SWOGs like for Neoverse V2 it looks like the first sequence
> > is preferable.
> > On that core the INDEX-immediates-only operation has latency 4 and
> > throughput 2 and the SVE ADD is as cheap as SIMD operations can be on that
> > core.
> > But in the second sequence the INDEX-reg-operand has latency 7 and
> > throughput 1 as it seems to treat it as a GP <-> SIMD transfer of some sort.
> 
> So what's the latency/throughput of a vector load from the constant pool (can
> we even have an "SVE" constant pool?  I assume entries would have to be of the
> architecturally largest vector size?), assuming it's in L1 (where it would
> occupy quite some space eventually)?
> 
> Richard.
> 
> > We may encounter a situation in the future where we’ll want to optimize the
> > second sequence (if it comes from intrinsics code for example) into the first.
> > Thanks,
> > Kyrill
> >
> >
> > >
> > > The patch is ok for trunk, thanks, but:
> > >
> > >>>> @@ -22991,7 +22991,7 @@ aarch64_simd_valid_immediate (rtx op,
> > >>> simd_immediate_info *info,
> > >>>>   if (CONST_VECTOR_P (op)
> > >>>>       && CONST_VECTOR_DUPLICATE_P (op))
> > >>>>     n_elts = CONST_VECTOR_NPATTERNS (op);
> > >>>> -  else if ((vec_flags & VEC_SVE_DATA)
> > >>>> +  else if (which == AARCH64_CHECK_MOV && TARGET_SVE
> > >>>>           && const_vec_series_p (op, &base, &step))
> > >
> > > ...the convention is to have one && condition per line if the whole
> > > expression doesn't fit on a single line:
> > >
> > >  else if (which == AARCH64_CHECK_MOV
> > >           && TARGET_SVE
> > >           && const_vec_series_p (op, &base, &step))
> > >
> > > Richard
> >
