On Tue, 12 Jul 2016, Uros Bizjak wrote:

> On Tue, Jul 12, 2016 at 10:58 AM, Richard Biener <rguent...@suse.de> wrote:
> > On Sun, 10 Jul 2016, Uros Bizjak wrote:
> >
> >> On Wed, Jul 6, 2016 at 3:18 PM, Richard Biener <rguent...@suse.de> wrote:
> >>
> >> >> > 2016-07-04  Richard Biener  <rguent...@suse.de>
> >> >> >
> >> >> >     PR rtl-optimization/68961
> >> >> >     * fwprop.c (propagate_rtx): Allow SUBREGs of VEC_CONCAT and CONCAT
> >> >> >     to simplify to a non-constant.
> >> >> >
> >> >> >     * gcc.target/i386/pr68961.c: New testcase.
> >> >>
> >> >> Thanks, LGTM.
> >> >
> >> > Bootstrapped and tested on x86_64-unknown-linux-gnu, it causes
> >> >
> >> > FAIL: gcc.target/i386/sse2-load-multi.c scan-assembler-times movup 2
> >> >
> >> > as the peephole created for that testcase no longer applies as fwprop
> >> > does
> >> >
> >> > In insn 10, replacing
> >> >  (vec_concat:V2DF (vec_select:DF (reg:V2DF 91)
> >> >             (parallel [
> >> >                     (const_int 0 [0])
> >> >                 ]))
> >> >         (mem:DF (reg/f:DI 95) [0  S8 A128]))
> >> >  with (vec_concat:V2DF (reg:DF 93 [ MEM[(const double *)&a + 8B] ])
> >> >         (mem:DF (reg/f:DI 95) [0  S8 A128]))
> >> > Changed insn 10
> >> >
> >> > resulting in
> >> >
> >> >         movsd   a+8(%rip), %xmm0
> >> >         movhpd  a+16(%rip), %xmm0
> >> >
> >> > again rather than movupd.
> >> >
> >> > Uros, there is probably a missing peephole for the new form - can you
> >> > fix this as a followup or should I hold on this patch for a bit longer?
> >>
> >> No, please proceed with the patch, I'll fix this fallout with a
> >> followup patch in a couple of days.
> >
> > Applied as r238238.  Is the following x86 change ok then which
> > adjusts the vectorizer vector construction cost to sth more sensible?
> > I have adjusted the generic implementation in targhooks.c this way
> > a few weeks ago already.
> >
> > Thanks,
> > Richard.
> >
> > 2016-07-12  Richard Biener  <rguent...@suse.de>
> >
> >         * targhooks.c (default_builtin_vectorization_cost): Adjust
> >         vec_construct cost.
> >         * config/i386/i386.c (ix86_builtin_vectorization_cost): Likewise.
> 
> Looks OK to me, but let's also give Intel chance to comment.

Btw, the motivation is that the cost of large initializers, such as
those for v16qi or v32qi, is currently underestimated.  You end up with
15 or 31 vinsert instructions (or similar with other ISAs), and you
can't do better than elements - 1 operations.  It doesn't really matter
for smaller vectors, of course (seen for CPU v6 x264).
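
To make the elements - 1 bound concrete, here is a minimal standalone
sketch.  The helper name vec_construct_min_ops is made up for
illustration and is not the code in targhooks.c or i386.c; it only
assumes the vector is built from scalars one element at a time.

```c
#include <assert.h>

/* Hypothetical helper, not GCC code: lower bound on the number of
   insert-style operations needed to build a vector of NELTS elements
   from scalars.  The first element is assumed to initialize the
   vector; each remaining element needs one insert.  */
static unsigned
vec_construct_min_ops (unsigned nelts)
{
  assert (nelts > 0);   /* a vector has at least one element */
  return nelts - 1;
}
```

With this model v16qi needs 15 operations and v32qi needs 31, while
v2df needs only one, which is why the adjustment hardly matters for
small vectors.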

Richard.
