On Tue, Sep 17, 2024 at 9:57 AM Kyrylo Tkachov <ktkac...@nvidia.com> wrote:
>
>
>
> > On 16 Sep 2024, at 16:32, Richard Sandiford <richard.sandif...@arm.com> 
> > wrote:
> >
> > "Pengxuan Zheng (QUIC)" <quic_pzh...@quicinc.com> writes:
> >>> On Thu, Sep 12, 2024 at 2:53 AM Pengxuan Zheng
> >>> <quic_pzh...@quicinc.com> wrote:
> >>>>
> >>>> SVE's INDEX instruction can be used to populate vectors with values
> >>>> starting from "base" and incremented by "step" for each subsequent
> >>>> element.  We can take advantage of it to generate vector constants
> >>>> if TARGET_SVE is available and the base and step values are within
> >>>> [-16, 15].
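> >>>>
> >>>> For example, { 3, 5, 7, 9, ... } (base 3, step 2) can be materialized
> >>>> in a single instruction (a minimal sketch, assuming byte elements; the
> >>>> register choice is illustrative):
> >>>>
> >>>>        index   z0.b, #3, #2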
> >>>
> >>> Are there multiply-by or add-scalar-immediate instructions that could
> >>> enhance this with two-instruction sequences?
> >>
> >> No, Richard, I can't think of any equivalent two-instruction sequences.
> >
> > There are some.  E.g.:
> >
> >     { 16, 17, 18, 19, ... }
> >
> > could be:
> >
> >        index   z0.b, #0, #1
> >        add     z0.b, z0.b, #16
> >
> > or, alternatively:
> >
> >        mov     w0, #16
> >        index   z0.b, w0, #1

I guess even a step in [16, 31] could be handled with INDEX using half
the step and then adding the result to itself (a multiply by immediate
#2), even if there's no direct vector multiply-by-immediate instruction
available.  Likewise some { A0 + n * B1 + n * B2, ... } can of course be
handled by adding two INDEX results.
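
For example, step 20 with base 0 (a minimal sketch; the register names
are illustrative, and a non-zero base would have to be added afterwards,
since the doubling doubles the base as well):

        index   z0.b, #0, #10
        add     z0.b, z0.b, z0.b

and { A0 + n * (B1 + B2), ... } schematically (A0, B1, B2 standing for
immediates in range):

        index   z0.b, #A0, #B1
        index   z1.b, #0, #B2
        add     z0.b, z0.b, z1.b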

> > But these cases are less obviously a win, so I think it's ok to handle
> > single instructions only for now.
>
> (Not related to this patch, this work is great, thanks Pengxuan!)
> Looking at some SWOGs (software optimization guides), like the one for
> Neoverse V2, it looks like the first sequence is preferable.
> On that core the immediates-only form of INDEX has latency 4 and
> throughput 2, and the SVE ADD is as cheap as SIMD operations can be on
> that core.
> But in the second sequence the INDEX with a GP register operand has
> latency 7 and throughput 1, as it seems to be treated as a GP <-> SIMD
> transfer of some sort.

So what's the latency/throughput of a vector load from the constant pool
(can we even have an "SVE" constant pool?  I assume entries would have to
be of the architecturally largest vector size?), assuming it's in L1
(where it would eventually occupy quite some space)?

Richard.

> We may encounter a situation in the future where we'll want to optimize
> the second sequence into the first (if it comes from intrinsics code,
> for example).
> Thanks,
> Kyrill
>
>
> >
> > The patch is ok for trunk, thanks, but:
> >
> >>>> @@ -22991,7 +22991,7 @@ aarch64_simd_valid_immediate (rtx op, simd_immediate_info *info,
> >>>>   if (CONST_VECTOR_P (op)
> >>>>       && CONST_VECTOR_DUPLICATE_P (op))
> >>>>     n_elts = CONST_VECTOR_NPATTERNS (op);
> >>>> -  else if ((vec_flags & VEC_SVE_DATA)
> >>>> +  else if (which == AARCH64_CHECK_MOV && TARGET_SVE
> >>>>           && const_vec_series_p (op, &base, &step))
> >
> > ...the convention is to have one && condition per line if the whole
> > expression doesn't fit on a single line:
> >
> >  else if (which == AARCH64_CHECK_MOV
> >           && TARGET_SVE
> >           && const_vec_series_p (op, &base, &step))
> >
> > Richard
>
