"Bingfeng Mei" <b...@broadcom.com> wrote on 01/11/2011 01:25:14 PM:
> Ira,
>
> Thank you very much for the quick answer. I will check 4.7 on x86-64
> to see the difference from our port. Is there a significant change
> between 4.5 and 4.7 regarding SLP?

Yes, I think so. 4.5 can't SLP-vectorize data accesses with unknown
alignment, which is what you have here.

Ira

>
> Cheers,
> Bingfeng
>
> > -----Original Message-----
> > From: Ira Rosen [mailto:i...@il.ibm.com]
> > Sent: 01 November 2011 11:13
> > To: Bingfeng Mei
> > Cc: gcc@gcc.gnu.org
> > Subject: Re: SLP vectorizer on non-loop?
> >
> > gcc-ow...@gcc.gnu.org wrote on 01/11/2011 12:41:32 PM:
> >
> > > Hello,
> > > I have one example with two very similar loops. The cunrolli pass
> > > unrolls one loop completely but not the other, based on slightly
> > > different cost estimates. The loop that is not unrolled gets
> > > SLP-vectorized and is then unrolled by the "cunroll" pass, whereas
> > > the other, already-unrolled loop cannot be vectorized because it is
> > > no longer a loop. In the end, there is a big difference in
> > > performance between the two loops.
> > >
> >
> > Here is what I see with the current trunk on x86_64 with -O3 (with the
> > two loops split into different functions):
> >
> > The first loop, the one that doesn't get unrolled by cunrolli, gets
> > loop-vectorized with -fno-vect-cost-model. With the cost model,
> > vectorization fails because the number of iterations is not sufficient
> > (the vectorizer tries to apply loop peeling in order to align the
> > accesses); the loop is later unrolled by cunroll, and the resulting
> > basic block gets vectorized by SLP.
> >
> > The second loop, unrolled by cunrolli, also gets vectorized by SLP.
> >
> > The *.optimized dumps look similar:
> >
> > <bb 2>:
> >   vect_var_.14_48 = MEM[(int *)p_hist_buff_9(D)];
> >   MEM[(int *)temp_hist_buffer_5(D)] = vect_var_.14_48;
> >   return;
> >
> > <bb 2>:
> >   vect_var_.7_57 = MEM[(int *)p_input_10(D)];
> >   MEM[(int *)temp_hist_buffer_6(D) + 16B] = vect_var_.7_57;
> >   return;
> >
> > > My question is why SLP vectorization has to be performed on loops
> > > (it is a sub-pass under pass_tree_loop). Conceptually, couldn't it
> > > be done on any basic block? Our port is still stuck at 4.5, but I
> > > checked 4.7 and it seems to be the same. I also checked the
> > > functions in tree-vect-slp.c. They use a lot of loop_vinfo
> > > structures, but in some places they check whether loop_vinfo exists
> > > and use either it or an alternative. I tried to add an extra SLP
> > > pass after pass_tree_loop, but it didn't work. I wonder how easy it
> > > would be to make SLP work for non-loop code.
> >
> > SLP vectorization works both on loops (in the vectorize pass) and on
> > basic blocks (in the slp-vectorize pass).
> >
> > Ira
> >
> > >
> > > Thanks,
> > > Bingfeng Mei
> > >
> > > Broadcom UK
> > >
> > > void foo (int *__restrict__ temp_hist_buffer,
> > >           int *__restrict__ p_hist_buff,
> > >           int *__restrict__ p_input)
> > > {
> > >   int i;
> > >   for (i = 0; i < 4; i++)
> > >     temp_hist_buffer[i] = p_hist_buff[i];
> > >
> > >   for (i = 0; i < 4; i++)
> > >     temp_hist_buffer[i + 4] = p_input[i];
> > > }
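For illustration, here is a minimal C sketch of the two code shapes discussed in this thread. It assumes GCC or Clang (for the `vector_size` extension) and a 4-byte `int`, as on x86_64. `foo_unrolled` and `foo_slp` are hypothetical hand-written functions, not compiler output. The first mimics the single basic block left after complete unrolling, which is the kind of input the basic-block SLP pass works on. The second mimics the *.optimized dumps: one 16-byte load and one 16-byte store per group of four isomorphic copies. Copying through a vector temporary with `memcpy` stands in for a vector access whose alignment is unknown.

```c
#include <string.h>

/* GCC/Clang vector extension: four ints in one 16-byte vector.
   Assumes a 4-byte int, as on x86_64. */
typedef int v4si __attribute__ ((vector_size (16)));
_Static_assert (sizeof (v4si) == 4 * sizeof (int), "expects 4-byte int");

/* The original function from the thread. */
void foo (int *__restrict__ temp_hist_buffer,
          int *__restrict__ p_hist_buff,
          int *__restrict__ p_input)
{
  int i;
  for (i = 0; i < 4; i++)
    temp_hist_buffer[i] = p_hist_buff[i];

  for (i = 0; i < 4; i++)
    temp_hist_buffer[i + 4] = p_input[i];
}

/* Hypothetical: straight-line code after complete unrolling.
   Eight independent scalar copies in one basic block, with no loop left. */
void foo_unrolled (int *__restrict__ temp_hist_buffer,
                   int *__restrict__ p_hist_buff,
                   int *__restrict__ p_input)
{
  temp_hist_buffer[0] = p_hist_buff[0];
  temp_hist_buffer[1] = p_hist_buff[1];
  temp_hist_buffer[2] = p_hist_buff[2];
  temp_hist_buffer[3] = p_hist_buff[3];
  temp_hist_buffer[4] = p_input[0];
  temp_hist_buffer[5] = p_input[1];
  temp_hist_buffer[6] = p_input[2];
  temp_hist_buffer[7] = p_input[3];
}

/* Hypothetical: hand-written analogue of the SLP result in the dumps.
   Each group of four adjacent copies becomes one vector load and one
   vector store; memcpy is used because the pointers' alignment is
   unknown. */
void foo_slp (int *__restrict__ temp_hist_buffer,
              int *__restrict__ p_hist_buff,
              int *__restrict__ p_input)
{
  v4si v;

  memcpy (&v, p_hist_buff, sizeof v);            /* load  p_hist_buff[0..3]  */
  memcpy (temp_hist_buffer, &v, sizeof v);       /* store at offset 0        */

  memcpy (&v, p_input, sizeof v);                /* load  p_input[0..3]      */
  memcpy (temp_hist_buffer + 4, &v, sizeof v);   /* store at offset 16 bytes */
}
```

For non-overlapping buffers, all three functions should produce the same eight output values. Whether GCC actually emits the `foo_slp` shape for `foo` depends on the compiler version and cost model, as discussed above.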