On Mon, 5 Oct 2020, Jakub Jelinek wrote:

> Hi!
> 
> Compiling the following testcase with -O2 -fopenmp:
> int a[10000][128];
> 
> __attribute__((noipa)) void
> foo (void)
> {
>   #pragma omp for simd schedule (simd: dynamic, 32) collapse(2)
>   for (int i = 0; i < 10000; i++)
>     for (int j = 0; j < 128; j++)
>       a[i][j] += 3;
> }
> 
> int
> main ()
> {
>   for (int i = 0; i < 10000; i++)
>     for (int j = 0; j < 128; j++)
>       {
>         asm volatile ("" : : "r" (&a[0][0]) : "memory");
>         a[i][j] = i + j;
>       }
>   foo ();
>   for (int i = 0; i < 10000; i++)
>     for (int j = 0; j < 128; j++)
>       if (a[i][j] != i + j + 3)
>         __builtin_abort ();
>   return 0;
> }
> doesn't seem to result in the vectorization I was hoping to see.
> As has been changed recently, I'm now only trying to vectorize the
> innermost loop of the collapse, with the outer loops around it being
> normal scalar loops like those written in the source.  With only omp simd
> it works fine, but for the combined constructs the current thread gets
> assigned some range of logical iterations, so I get a pair of starting
> values, in this case for i and j.
> 
> At the end of ompexp I have:
> ...
>   D.2106 = (unsigned int) D.2105;
>   D.2107 = MIN_EXPR <D.2104, D.2106>;
>   D.2103 = D.2107 + .iter.4;
>   goto <bb 5>; [INV]
> ;;    succ:       5
> 
> ;;   basic block 4, loop depth 2
> ;;    pred:       5
>   i = i.0;
>   j = j.1;
>   _1 = a[i][j];
>   _2 = _1 + 3;
>   a[i][j] = _2;
>   .iter.4 = .iter.4 + 1;
>   j.1 = j.1 + 1;
> ;;    succ:       5
> 
> ;;   basic block 5, loop depth 2
> ;;    pred:       4
> ;;                3
> ;;                7
>   if (.iter.4 < D.2103)
>     goto <bb 4>; [87.50%]
>   else
>     goto <bb 6>; [12.50%]
> ;;    succ:       4
> ;;                6
> 
> ;;   basic block 6, loop depth 2
> ;;    pred:       5
>   i.0 = i.0 + 1;
>   if (i.0 < 10000)
>     goto <bb 7>; [87.50%]
>   else
>     goto <bb 8>; [12.50%]
> ;;    succ:       8
> ;;                7
> 
> ;;   basic block 7, loop depth 2
> ;;    pred:       6
>   j.1 = 0;
>   D.2108 = D.2099 - .iter.4;
>   D.2109 = MIN_EXPR <D.2108, 128>;
>   D.2103 = D.2109 + .iter.4;
>   goto <bb 5>; [INV]
> 
> I was really hoping bbs 4 and 5 would form one loop (the one I set safelen
> and force_vectorize etc. for) and that bbs 6 and 7, together with
> that inner loop, would form another loop, but apparently loop discovery
> thinks it is just one loop.
> Any ideas what I'm doing wrong, or is there any way to make it two loops
> (that would also survive all the cfg cleanups until vectorization)?

In the early CFG it looks like we have a common header with two latches,
so it boils down to how we disambiguate those in the end (we seem
to unify the latches via a forwarder).  IIRC OMP lowering builds
loops itself; could it not do the appropriate disambiguation itself?
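
To illustrate the shape I mean by "disambiguated", here is a rough C sketch
of the same chunk of logical iterations written as a proper two-level nest,
where the inner loop has its own header and a single latch.  The names
add3_range, ROWS and COLS are made up for the example (ROWS is deliberately
smaller than the 10000 in the testcase), and the caller is assumed to pass
0 <= start <= end <= ROWS * COLS.  I have not checked what the vectorizer
does with this; it only shows the loop structure, not what ompexp should
literally emit.

```c
#include <assert.h>

enum { ROWS = 100, COLS = 128 };   /* hypothetical sizes for the sketch */
static int a[ROWS][COLS];

/* Hypothetical helper: add 3 to the logical iterations [start, end) of the
   collapsed ROWS x COLS iteration space.  */
static void
add3_range (int start, int end)
{
  int i = start / COLS;
  int j = start % COLS;

  /* Outer loop: one trip per (partial) row touched by the chunk.  */
  while (start < end)
    {
      /* Length of this row segment: stop at the end of the row or at the
         end of the chunk, whichever comes first.  */
      int n = COLS - j;
      if (n > end - start)
        n = end - start;

      int *row = &a[i][j];

      /* Inner loop: its own header, one latch, one induction variable.  */
      for (int k = 0; k < n; k++)
        row[k] += 3;

      start += n;
      i++;
      j = 0;
    }
}
```

The point is only that the j = 0 / new-bound computation sits in the outer
loop body ahead of the inner loop's header, rather than on a second back
edge into a header shared by both loops.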

Richard.

> Essentially, in C I'm trying to have:
> int a[10000][128];
> void get_me_start_end (int *, int *);
> void
> foo (void)
> {
>   int start, end, curend, i, j;
>   get_me_start_end (&start, &end);
>   i = start / 128;
>   j = start % 128;
>   curend = start + (end - start > 128 - j ? 128 - j : end - start);
>   goto doit;
>   for (i = 0; i < 10000; i++)
>     {
>       j = 0;
>       curend = start + (end - start > 128 ? 128 : end - start);
>       doit:;
>       /* I'd use start < curend && j < 128 as condition here, but
>        the vectorizer doesn't like that either.  So I went to
>        using a single IV.  */
>       for (; start < curend; start++, j++)
>         a[i][j] += 3;
>     }
> }
> 
> This isn't vectorized with -O3 either for the same reason.
> 
>       Jakub
> 
> 

-- 
Richard Biener <rguent...@suse.de>
SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 Nuernberg,
Germany; GF: Felix Imend