I agree that the example is just a case where the cost model makes a correct estimate. But how can we be sure that it won't make mistakes in the future?
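The example under discussion is not reproduced in this thread. Judging from Richard's AVX2 listing quoted below, it appears to be a sum over every 8th float: the vector loop advances %rsi by 256 bytes per 8-lane iteration, and the scalar .L3 loop advances by 32 bytes per element. The following is a minimal C sketch of such a loop, assuming that shape; `strided_sum` is a hypothetical name, and this is not claimed to produce exactly the quoted code.

```c
#include <assert.h>

/* Hypothetical reconstruction of the testcase: the scalar loop (.L3)
   advances by 32 bytes per iteration, i.e. it reads one float out of
   every 8, and accumulates into a single register (%xmm0).  */
float
strided_sum (const float *a, int n)
{
  assert (n >= 0);
  float sum = 0.0f;

  /* Request vectorization regardless of the stride; with VF = 8 the
     vectorizer has to assemble a[0], a[8], ..., a[56] into one vector
     for the first vector iteration.  */
#pragma omp simd reduction(+:sum)
  for (int i = 0; i < n; i++)
    sum += a[i * 8];

  return sum;
}
```

On a recent GCC, compiling with something like `-O3 -mavx2 -fopenmp-simd -fopt-info-vec-missed` should show whether the vectorizer rejects the loop on profitability grounds; adding `-fvect-cost-model=unlimited`, as Richard suggests in the quoted thread, disables the cost model for the whole file.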
I believe it would be OK to introduce an extra flag, as Jakub proposed, so that dedicated simd-forced vectorization uses the unlimited cost model. This could be the default for -fopenmp; otherwise the compiler should issue a warning when it overrides the user's request for vectorization. The user could then enforce vectorization (even with results like the ones mentioned :) by using this unlimited cost model for simd.

On Fri, Nov 15, 2013 at 6:24 PM, Richard Biener <rguent...@suse.de> wrote:
> On Fri, 15 Nov 2013, Sergey Ostanevich wrote:
>
>> Richard,
>>
>> here's an example that triggers the cost model.
>
> I hardly believe that (AVX2)
>
> .L9:
>         vmovups (%rsi), %xmm3
>         addl $1, %r8d
>         addq $256, %rsi
>         vinsertf128 $0x1, -240(%rsi), %ymm3, %ymm1
>         vmovups -224(%rsi), %xmm3
>         vinsertf128 $0x1, -208(%rsi), %ymm3, %ymm3
>         vshufps $136, %ymm3, %ymm1, %ymm3
>         vperm2f128 $3, %ymm3, %ymm3, %ymm2
>         vshufps $68, %ymm2, %ymm3, %ymm1
>         vshufps $238, %ymm2, %ymm3, %ymm2
>         vmovups -192(%rsi), %xmm3
>         vinsertf128 $1, %xmm2, %ymm1, %ymm2
>         vinsertf128 $0x1, -176(%rsi), %ymm3, %ymm1
>         vmovups -160(%rsi), %xmm3
>         vinsertf128 $0x1, -144(%rsi), %ymm3, %ymm3
>         vshufps $136, %ymm3, %ymm1, %ymm3
>         vperm2f128 $3, %ymm3, %ymm3, %ymm1
>         vshufps $68, %ymm1, %ymm3, %ymm4
>         vshufps $238, %ymm1, %ymm3, %ymm1
>         vmovups -128(%rsi), %xmm3
>         vinsertf128 $1, %xmm1, %ymm4, %ymm1
>         vshufps $136, %ymm1, %ymm2, %ymm1
>         vperm2f128 $3, %ymm1, %ymm1, %ymm2
>         vshufps $68, %ymm2, %ymm1, %ymm4
>         vshufps $238, %ymm2, %ymm1, %ymm2
>         vinsertf128 $0x1, -112(%rsi), %ymm3, %ymm1
>         vmovups -96(%rsi), %xmm3
>         vinsertf128 $1, %xmm2, %ymm4, %ymm4
>         vinsertf128 $0x1, -80(%rsi), %ymm3, %ymm3
>         vshufps $136, %ymm3, %ymm1, %ymm3
>         vperm2f128 $3, %ymm3, %ymm3, %ymm2
>         vshufps $68, %ymm2, %ymm3, %ymm1
>         vshufps $238, %ymm2, %ymm3, %ymm2
>         vmovups -64(%rsi), %xmm3
>         vinsertf128 $1, %xmm2, %ymm1, %ymm2
>         vinsertf128 $0x1, -48(%rsi), %ymm3, %ymm1
>         vmovups -32(%rsi), %xmm3
>         vinsertf128 $0x1, -16(%rsi), %ymm3, %ymm3
>         cmpl %r8d, %edi
>         vshufps $136, %ymm3, %ymm1, %ymm3
>         vperm2f128 $3, %ymm3, %ymm3, %ymm1
>         vshufps $68, %ymm1, %ymm3, %ymm5
>         vshufps $238, %ymm1, %ymm3, %ymm1
>         vinsertf128 $1, %xmm1, %ymm5, %ymm1
>         vshufps $136, %ymm1, %ymm2, %ymm1
>         vperm2f128 $3, %ymm1, %ymm1, %ymm2
>         vshufps $68, %ymm2, %ymm1, %ymm3
>         vshufps $238, %ymm2, %ymm1, %ymm2
>         vinsertf128 $1, %xmm2, %ymm3, %ymm1
>         vshufps $136, %ymm1, %ymm4, %ymm1
>         vperm2f128 $3, %ymm1, %ymm1, %ymm2
>         vshufps $68, %ymm2, %ymm1, %ymm3
>         vshufps $238, %ymm2, %ymm1, %ymm2
>         vinsertf128 $1, %xmm2, %ymm3, %ymm2
>         vaddps %ymm2, %ymm0, %ymm0
>         ja .L9
>
> is more efficient than
>
> .L3:
>         vaddss (%rcx,%rax), %xmm0, %xmm0
>         addq $32, %rax
>         cmpq %rdx, %rax
>         jne .L3
>
> ;)
>
>> As soon as elemental functions appear and we update the vectorizer so it
>> can accept an elemental function inside the loop, we will have the same
>> situation as we have now: the cost model will bail out based on its
>> profitability estimate.
>
> Yes.
>
>> Still, we have no way to get information on how efficient the bar()
>> function is in its vector form.
>
> Well I assume you mean that the speedup when vectorizing the elemental
> will offset whatever wreckage we cause with vectorizing the rest of the
> statements. I'd say you can at least compare to unrolling by
> the vectorization factor, building the vector inputs to the elemental
> from scalars, distributing the vector result from the elemental to
> scalars.
>
>> Let me repeat: #pragma omp simd is intended to introduce an
>> instruction-level parallel region at the developer's request, and hence
>> should be treated in the same manner as #pragma omp parallel. The
>> vectorizer cost model is an obstacle here, not a help.
>
> Surely not if there isn't an elemental call in it. With it the
> cost model of course will have not enough information to decide.
>
> But still, what's the difference to the case where we cannot vectorize
> the function? What happens if we cannot vectorize the elemental?
> Do we have to build scalar versions for all possible vector sizes?
>
> Richard.
>
>> Regards,
>> Sergos
>>
>>
>> On Fri, Nov 15, 2013 at 1:08 AM, Richard Biener <rguent...@suse.de> wrote:
>> > Sergey Ostanevich <sergos....@gmail.com> wrote:
>> >>Is this only for the whole file? I mean having a particular loop in a
>> >>file vectorized while all the others are left to the compiler's cost
>> >>model. Is there such machinery?
>> >
>> > No, there is not.
>> >
>> > Richard.
>> >
>> >>Sergos
>> >>
>> >>On Thu, Nov 14, 2013 at 12:39 PM, Richard Biener <rguent...@suse.de> wrote:
>> >>> On Wed, 13 Nov 2013, Sergey Ostanevich wrote:
>> >>>
>> >>>> I will get some tests.
>> >>>> As for cost analysis, simply consider the pragma as a request to
>> >>>> vectorize. How can I, as a developer, enforce it beyond the pragma?
>> >>>
>> >>> You can disable the cost model via -fvect-cost-model=unlimited
>> >>>
>> >>> Richard.
>> >>>
>> >>>> On Wed, Nov 13, 2013 at 12:55 PM, Richard Biener <rguent...@suse.de> wrote:
>> >>>> > On Tue, 12 Nov 2013, Sergey Ostanevich wrote:
>> >>>> >
>> >>>> >> The reason the patch was in its original state is that we want
>> >>>> >> to notify the user that his assumption of profitability may be wrong.
>> >>>> >> This is not part of any spec, and as far as I know ICC does not
>> >>>> >> notify the user in this case. Still, it can be a good hint for
>> >>>> >> users who try to get as much performance as possible.
>> >>>> >>
>> >>>> >> Richard's comment on the vectorization problems is about the same:
>> >>>> >> to inform the user that his attempt to force vectorization has failed.
>> >>>> >>
>> >>>> >> As for profitable or not, sometimes I believe it's impossible to be
>> >>>> >> precise. For OMP we have the case of a vector version of a function,
>> >>>> >> and we have no way to figure out whether it is profitable to use
>> >>>> >> it or to lose it. If we can't map the loop to any vector length
>> >>>> >> other than 1, I believe we have to bail out and report.
>> >>>> >> Is that what 'never profitable' is about?
>> >>>> >
>> >>>> > For example. I think we should report non-vectorized loops
>> >>>> > that are marked with force_vect anyway, with -Wdisabled-optimization.
>> >>>> > Another case is that a loop may be profitable to vectorize if
>> >>>> > the ISA supports a gather instruction but otherwise not. Or if the
>> >>>> > ISA supports efficient vector construction from N not loop
>> >>>> > invariant scalars (for vectorization of strided loads).
>> >>>> >
>> >>>> > Simply disregarding all of the cost analysis sounds completely
>> >>>> > bogus to me.
>> >>>> >
>> >>>> > I'd simply go for the diagnostic for now, not changing anything else.
>> >>>> > We want to have a good understanding about why the cost model is
>> >>>> > so bad that we have to force to ignore it for #pragma simd - thus we
>> >>>> > want testcases.
>> >>>> >
>> >>>> > Richard.
>> >>>> >
>> >>>> >>
>> >>>> >> On Tue, Nov 12, 2013 at 6:35 PM, Richard Biener <rguent...@suse.de> wrote:
>> >>>> >> > On 11/12/13 3:16 PM, Jakub Jelinek wrote:
>> >>>> >> >> On Tue, Nov 12, 2013 at 05:46:14PM +0400, Sergey Ostanevich wrote:
>> >>>> >> >>> ivdep just substitutes for all cross-iteration data dependence
>> >>>> >> >>> analysis; it has nothing to do with the cost model. ICC does not
>> >>>> >> >>> cancel its cost model in the case of #pragma ivdep.
>> >>>> >> >>>
>> >>>> >> >>> As for safelen, the OMP standard treats it as a limit on
>> >>>> >> >>> the vector length. This means that if no safelen is present,
>> >>>> >> >>> an arbitrary vector length can be used.
>> >>>> >> >>
>> >>>> >> >> I was talking about GCC loop->safelen, which is INT_MAX for #pragma omp simd
>> >>>> >> >> without safelen clause or #pragma simd without vectorlength clause.
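The difference between safelen as a limit on the vector length and a bare `#pragma omp simd` can be illustrated with a loop whose dependence distance bounds the usable vector length. This is a minimal C sketch, not taken from the thread; `propagate` and `propagate_demo` are hypothetical names.

```c
#include <assert.h>

/* Iteration i reads the value written by iteration i - 8, so at most 8
   consecutive iterations may run concurrently.  safelen(8) states exactly
   that.  With a bare "#pragma omp simd" (loop->safelen == INT_MAX in GCC,
   per Jakub) the programmer would be claiming that any vector length is
   safe, which is false for this loop.  */
void
propagate (float *a, int n)
{
  assert (n >= 0);
#pragma omp simd safelen(8)
  for (int i = 8; i < n; i++)
    a[i] = a[i - 8] + 1.0f;
}

/* Usage demo: with a[0..7] = 0, a[8..15] become 1 and a[16..23] become 2.  */
float
propagate_demo (void)
{
  float a[24] = { 0 };
  propagate (a, 24);
  return a[23];
}
```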
>> >>>> >> >>
>> >>>> >> >>> so I believe loop->force_vect is the only trigger to disregard
>> >>>> >> >>> the cost model
>> >>>> >> >>
>> >>>> >> >> Anyway, in that case I think the originally posted patch is wrong,
>> >>>> >> >> if we want to treat force_vect as disregard all the cost model and
>> >>>> >> >> force vectorization (well, the name of the field already kind of suggest
>> >>>> >> >> that), then IMHO we should treat it the same as -fvect-cost-model=unlimited
>> >>>> >> >> for those loops.
>> >>>> >> >
>> >>>> >> > Err - the user may have a specific sub-architecture in mind when using
>> >>>> >> > #pragma simd, if you say we should completely ignore the cost model
>> >>>> >> > then should we also sorry () if we cannot vectorize the loop (either
>> >>>> >> > because of GCC deficiencies or lack of sub-target support)?
>> >>>> >> >
>> >>>> >> > That said, at least in the cases that the cost model says the loop
>> >>>> >> > is never profitable to vectorize we should follow its advice.
>> >>>> >> >
>> >>>> >> > Richard.
>> >>>> >> >
>> >>>> >> >> Thus (untested):
>> >>>> >> >>
>> >>>> >> >> 2013-11-12  Jakub Jelinek  <ja...@redhat.com>
>> >>>> >> >>
>> >>>> >> >>         * tree-vect-loop.c (vect_estimate_min_profitable_iters): Use
>> >>>> >> >>         unlimited cost model also for force_vect loops.
>> >>>> >> >>
>> >>>> >> >> --- gcc/tree-vect-loop.c.jj 2013-11-12 12:09:40.000000000 +0100
>> >>>> >> >> +++ gcc/tree-vect-loop.c 2013-11-12 15:11:43.821404330 +0100
>> >>>> >> >> @@ -2702,7 +2702,7 @@ vect_estimate_min_profitable_iters (loop
>> >>>> >> >>    void *target_cost_data = LOOP_VINFO_TARGET_COST_DATA (loop_vinfo);
>> >>>> >> >>
>> >>>> >> >>    /* Cost model disabled.  */
>> >>>> >> >> -  if (unlimited_cost_model ())
>> >>>> >> >> +  if (unlimited_cost_model () || LOOP_VINFO_LOOP (loop_vinfo)->force_vect)
>> >>>> >> >>      {
>> >>>> >> >>        dump_printf_loc (MSG_NOTE, vect_location, "cost model disabled.\n");
>> >>>> >> >>        *ret_min_profitable_niters = 0;
>> >>>> >> >>
>> >>>> >> >>      Jakub
>> >>>> >> >>
>> >>>> >> >
>> >>>> >>
>> >>>> >>
>> >>>> >
>> >>>> > --
>> >>>> > Richard Biener <rguent...@suse.de>
>> >>>> > SUSE / SUSE Labs
>> >>>> > SUSE LINUX Products GmbH - Nuernberg - AG Nuernberg - HRB 16746
>> >>>> > GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer
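The positions in this thread can be compared with a toy profitability model. This is a simplified sketch, not GCC's `vect_estimate_min_profitable_iters`: the struct, the policy names and the cost numbers are hypothetical, and the third policy is only one reading of Richard's point that a "never profitable" verdict should be honored even for force_vect loops.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified cost inputs; these are not GCC's data
   structures.  Costs are abstract units.  */
struct toy_loop
{
  bool force_vect;          /* loop is marked with #pragma omp simd */
  int scalar_iter_cost;     /* cost of one scalar iteration */
  int vector_iter_cost;     /* cost of one vector iteration (vf elements) */
  int vector_outside_cost;  /* one-time prologue/epilogue cost */
  int vf;                   /* vectorization factor */
};

/* The three positions discussed in the thread (names are hypothetical).  */
enum toy_policy
{
  POLICY_COST_MODEL,            /* force_vect does not affect the estimate */
  POLICY_JAKUB_PATCH,           /* force_vect acts like
                                   -fvect-cost-model=unlimited */
  POLICY_NEVER_PROFITABLE_WINS  /* force_vect skips the trip-count threshold,
                                   but "never profitable" still stands */
};

/* Returns the minimum trip count for which vectorizing is estimated to
   pay off, 0 if the estimate is bypassed, or -1 for "never profitable".  */
int
toy_min_profitable_iters (const struct toy_loop *l, bool unlimited,
                          enum toy_policy policy)
{
  assert (l->vf > 0);

  /* Cost model disabled.  */
  if (unlimited || (policy == POLICY_JAKUB_PATCH && l->force_vect))
    return 0;

  /* Cost saved by one vector iteration versus vf scalar iterations.  */
  int gain = l->scalar_iter_cost * l->vf - l->vector_iter_cost;
  if (gain <= 0)
    return -1;  /* A force_vect loop rejected here is what Richard suggests
                   reporting with -Wdisabled-optimization.  */

  if (policy == POLICY_NEVER_PROFITABLE_WINS && l->force_vect)
    return 0;

  /* Smallest n with n * scalar_iter_cost
     > vector_outside_cost + n * vector_iter_cost / vf.  */
  return l->vector_outside_cost * l->vf / gain + 1;
}
```

Counting instructions in Richard's two listings as a crude stand-in for cost (4 per scalar iteration, 56 per vector iteration of 8 elements) gives a negative gain, so the plain model and the third policy return -1, while the policy corresponding to Jakub's patch returns 0 and vectorizes anyway.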