Re: Calls to auto-vectorized AVX512 functions

2021-02-08 Thread Richard Biener via Gcc
On Mon, Feb 8, 2021 at 4:26 AM Naoki Shibata via Gcc  wrote:
>
>
> Hello,
>
> I have a question as to the auto-vectorizer in GCC.
>
> When AVX512 instructions are available, the auto-vectorizer in gcc
> sometimes generates calls to AVX2 functions instead of AVX512 functions.
>
>
> $ cat vabitest.c
> #include 
> #include 
>
> _Pragma ("omp declare simd simdlen(8) notinbranch")
> __attribute__((const)) double myfunc(double x);
>
> #define N 1024
> __attribute__ ((__aligned__(256))) double a[N], b[N], c[N];
>
> int main(void) {
>   for (int i = 0; i < N; i++) a[i] = myfunc(b[i]);
>   for (int i = 0; i < N; i++) c[i] = sin(b[i]);
> }
>
> $ gcc-10 -ffast-math -O3 -mavx512f -fopenmp vabitest.c -S -o- | grep _ZGV
>  call    _ZGVdN8v_myfunc@PLT
>  call    _ZGVeN8v_sin@PLT
>
>
> Is there a way to force gcc to generate calls to AVX512 functions in
> cases like this?

Not sure. The sin function gets called with a %zmm argument, whereas
myfunc is called with two %ymm arguments yet returns a %zmm.
Something seems messed up somewhere.

Can you open a bugreport?

Richard.

> Regards,
>
> Naoki Shibata
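
For reference, the two symbols in the grep output differ only in the ISA letter that follows _ZGV. Under the x86 vector function ABI that GCC uses for "omp declare simd" clones, 'b' denotes SSE, 'c' AVX, 'd' AVX2 and 'e' AVX-512; 'N' means notinbranch (unmasked) and 'M' inbranch (masked); the number that follows is the simdlen. A minimal sketch of a decoder is given below. The struct and function names are made up for illustration and are not part of GCC or glibc.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical record for the leading fields of a vector-ABI symbol. */
struct simd_clone_name {
    char isa;          /* 'b' SSE, 'c' AVX, 'd' AVX2, 'e' AVX-512 */
    int masked;        /* 1 for 'M' (inbranch), 0 for 'N' (notinbranch) */
    unsigned simdlen;  /* number of lanes handled per call */
};

/* Parse the "_ZGV<isa><mask><simdlen>" prefix of SYM into OUT.
   Returns 0 on success and -1 if SYM does not have that shape. */
int parse_simd_clone_name(const char *sym, struct simd_clone_name *out)
{
    assert(sym != NULL && out != NULL);

    if (strncmp(sym, "_ZGV", 4) != 0)
        return -1;

    char isa = sym[4];
    if (isa == '\0' || strchr("bcde", isa) == NULL)
        return -1;

    char mask = sym[5];
    if (mask != 'M' && mask != 'N')
        return -1;

    char *end;
    unsigned long len = strtoul(sym + 6, &end, 10);
    if (end == sym + 6 || len == 0)
        return -1;

    out->isa = isa;
    out->masked = (mask == 'M');
    out->simdlen = (unsigned)len;
    return 0;
}

/* Human-readable name for the ISA letter. */
const char *simd_clone_isa_name(char isa)
{
    switch (isa) {
    case 'b': return "SSE";
    case 'c': return "AVX";
    case 'd': return "AVX2";
    case 'e': return "AVX-512";
    default:  return "unknown";
    }
}
```

Applied to the output above, _ZGVdN8v_myfunc decodes to an unmasked AVX2 clone with 8 lanes, and _ZGVeN8v_sin to an unmasked AVX-512 clone with 8 lanes.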


Re: Calls to auto-vectorized AVX512 functions

2021-02-08 Thread Jakub Jelinek via Gcc
On Mon, Feb 08, 2021 at 11:42:13AM +0100, Richard Biener via Gcc wrote:
> > I have a question as to the auto-vectorizer in GCC.
> >
> > When AVX512 instructions are available, the auto-vectorizer in gcc
> > sometimes generates calls to AVX2 functions instead of AVX512 functions.
> >
> >
> > $ cat vabitest.c
> > #include 
> > #include 
> >
> > _Pragma ("omp declare simd simdlen(8) notinbranch")
> > __attribute__((const)) double myfunc(double x);
> >
> > #define N 1024
> > __attribute__ ((__aligned__(256))) double a[N], b[N], c[N];
> >
> > int main(void) {
> >   for (int i = 0; i < N; i++) a[i] = myfunc(b[i]);
> >   for (int i = 0; i < N; i++) c[i] = sin(b[i]);
> > }
> >
> > $ gcc-10 -ffast-math -O3 -mavx512f -fopenmp vabitest.c -S -o- | grep _ZGV
> >  call    _ZGVdN8v_myfunc@PLT
> >  call    _ZGVeN8v_sin@PLT
> >
> >
> > Is there a way to force gcc to generate calls to AVX512 function in
> > cases like this?
> 
> Not sure. The sin function gets called with a %zmm argument, whereas
> myfunc is called with two %ymm arguments yet returns a %zmm.
> Something seems messed up somewhere.
> 
> Can you open a bugreport?

I'd guess -mprefer-vector-width=256 is in effect by default, though it
surprises me that %zmm is used for sin.
In any case, I'd retry with -mprefer-vector-width=512.

Jakub
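
The "two %ymm arguments" observation is consistent with simple lane arithmetic: simdlen(8) on double is 8 x 64 = 512 bits per vector argument, which needs two 256-bit registers or one 512-bit register. A small sketch of that calculation is shown below; the function name is hypothetical and this is only an illustration of the arithmetic, not of how GCC selects a clone.

```c
#include <assert.h>

/* Number of vector registers of VECTOR_BITS bits needed to hold one
   vector argument of SIMDLEN elements, each ELEM_BITS bits wide.
   Rounds up, so a partially filled register still counts as one. */
unsigned regs_per_vector_arg(unsigned simdlen, unsigned elem_bits,
                             unsigned vector_bits)
{
    assert(vector_bits != 0);
    unsigned total_bits = simdlen * elem_bits;
    return (total_bits + vector_bits - 1) / vector_bits;
}
```

With simdlen 8 and 64-bit elements this gives 2 for a 256-bit vector width and 1 for a 512-bit vector width.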



IA64 control speculation of loads

2021-02-08 Thread Benoît De Dinechin
Hello, 

Is there a way to activate control speculation of loads in GCC, starting with
the ia64 target? For a loop as simple as the one below, I could not get any
with GCC 7.5:

double
list_sum(list_cell list)
{
    double result = 0.0;
    while (list->next) {
        list = list->next;
        result += list->payload;
        if (!list->next) break;
        list = list->next;
        result += list->payload;
    }
    return result;
}

Kalray has developed a 64-bit Fisher-style VLIW architecture ('KVX') for use in 
a manycore processor it produces. These VLIW cores run Linux, and Kalray 
develops GCC and LLVM code generators for them (see kvx compilers on 
https://godbolt.org/z/ZJGzje ). VLIW performance on non-numerical code is 
critically dependent on the control speculation of loads. Being a Fisher-style
VLIW, the kvx architecture has dismissable loads instead of control speculative
loads, so there is no need to create speculation checks with recovery code.

I first tried in prepass scheduling with SCHED_RGN, hoping from various 
comments in the source file that it could move loads across blocks 
(sched-rgn.c:26 The first run performs interblock scheduling, moving insns 
between different blocks in the same "region"). SCHED_EBB is not available in 
prepass and SEL_SCHED does not work with control speculation: not only from 
experience with the kvx retargeting where it breaks dataflow invariants, but 
also as hinted by logic in ia64.c:ia64_set_sched_flags(). 

My question is whether GCC can do any control speculation of loads
during prepass scheduling. From what I observed, enabling control speculation
in region scheduling only lets the load instructions become ready earlier in
their home basic block; they are not scheduled in a dominator basic block, as
would be needed to improve performance in the above example.

Thanks for any advice. 

Benoît Dinechin 



list_sum-ia64.s
Description: Binary data


list_sum-kvx.s
Description: Binary data


Re: IA64 control speculation of loads

2021-02-08 Thread Alexander Monakov via Gcc


On Mon, 8 Feb 2021, Benoît De Dinechin wrote:

> Hello, 
> 
> Is there a way to activate control speculation of loads in GCC, starting with
> the ia64 target? For a loop as simple as the one below, I could not get any
> with GCC 7.5:

I think in that loop cost modeling in sel-sched estimates that load speculation
would not be profitable. With a long-latency operation after the load, I do get
a speculative load at -O3 (for the 'payload' field, but not 'next'):

struct list {
  struct list *next;
  double payload;
};

double f(struct list *l)
{
  double result = 0;
  for (; l; l = l->next)
    result += 1 / l->payload;
  return result;
}

> Kalray has developed a 64-bit Fisher-style VLIW architecture ('KVX') for use
> in a manycore processor it produces. These VLIW cores run Linux, and Kalray
> develops GCC and LLVM code generators for them (see kvx compilers on
> https://godbolt.org/z/ZJGzje ). VLIW performance on non-numerical code is
> critically dependent on the control speculation of loads. Being a
> Fisher-style VLIW, the kvx architecture has dismissable loads instead of
> control speculative loads, so there is no need to create speculation checks
> with recovery code.
> 
> I first tried in prepass scheduling with SCHED_RGN, hoping from various
> comments in the source file that it could move loads across blocks
> (sched-rgn.c:26 The first run performs interblock scheduling, moving insns
> between different blocks in the same "region"). SCHED_EBB is not available in
> prepass and SEL_SCHED does not work with control speculation: not only from
> experience with the kvx retargeting where it breaks dataflow invariants, but
> also as hinted by logic in ia64.c:ia64_set_sched_flags(). 

Can you elaborate on the dataflow issues you've encountered? I don't recall the
specific reason why control speculation before register allocation cannot be
enabled with sel-sched, but I'd expect it has to do with the interval between
the speculative load and the check, in which the register may not be stored to
memory normally (needs dedicated spill/fill instructions), and interaction with
uninitialized variables assigned the same register.

If on KVX you don't need speculation checks, those concerns would not apply.
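
As a rough model of the distinction: an ia64-style speculative load (ld.s) defers a fault by marking its destination register with a NaT bit, and a later check (chk.s) branches to recovery code that redoes the load non-speculatively. The sketch below models that in plain C. The struct and helper names are hypothetical, and an invalid address is modelled as NULL only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a register carrying a deferred-fault (NaT) flag. */
struct spec_reg {
    double value;
    bool nat;   /* set when the speculative load could not be performed */
};

/* Models ld.s: never faults, records failure in the NaT flag instead. */
static struct spec_reg speculative_load(const double *addr)
{
    struct spec_reg r = { 0.0, addr == NULL };
    if (!r.nat)
        r.value = *addr;
    return r;
}

/* Models chk.s plus recovery: if the speculation failed, redo the load
   non-speculatively (a real fault would be raised at this point). */
static double check_or_recover(struct spec_reg r, const double *addr)
{
    if (r.nat) {
        assert(addr != NULL);
        return *addr;
    }
    return r.value;
}

/* The load is hoisted above the NULL test that originally guarded it. */
double reciprocal_or_zero(const double *p)
{
    struct spec_reg r = speculative_load(p);   /* speculative, above the branch */
    if (p == NULL)
        return 0.0;                            /* r is never consumed on this path */
    return 1.0 / check_or_recover(r, p);       /* check sits where the load used to be */
}
```

The window between speculative_load and check_or_recover is where the register carries the extra flag; with a dismissable load there is no flag and no check, so that window does not exist.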

Why are you looking for pre-RA (prepass) scheduling specifically? To avoid
anti-dependencies created by register allocation?

> My question is whether GCC can do any control speculation of loads
> during prepass scheduling. From what I observed, enabling control speculation
> in region scheduling only lets the load instructions become ready earlier in
> their home basic block; they are not scheduled in a dominator basic block, as
> would be needed to improve performance in the above example.

But there's no control flow inside a basic block, so the load can appear earlier
due to data speculation (or normal scheduling), not control speculation.

I think GCC may have correctness issues with ia64-style control speculation
before register allocation, but I can't think of a reason why check-free loads
would pose a problem.

Alexander


Re: IA64 control speculation of loads

2021-02-08 Thread Benoît De Dinechin
Hi Alexander,

Indeed there is control speculation on this example, but it only happens in
sched2, which is delayed to mach and uses SEL_SCHED:

  ../gcc/ia64/gcc/cc1 -fpreprocessed ../list_sum.i  -quiet -dumpbase list_sum.c 
-auxbase list_sum -O3  -o list_sum.s -da -dp -dA -fsched-verbose=6

  grep movdf_speculative list_sum.c.*
  list_sum.c.298r.mach:] UNSPEC_LDS)) 25 {movdf_speculative}
  list_sum.c.298r.mach:] UNSPEC_LDS)) 25 {movdf_speculative}
  list_sum.c.298r.mach:] UNSPEC_LDS)) 25 {movdf_speculative}
  list_sum.c.298r.mach:] UNSPEC_LDS)) 25 {movdf_speculative}
  list_sum.c.299r.barriers:] UNSPEC_LDS)) 25 {movdf_speculative}
  list_sum.c.299r.barriers:] UNSPEC_LDS)) 25 {movdf_speculative}
  list_sum.c.303r.shorten:] UNSPEC_LDS)) 25 {movdf_speculative}
  list_sum.c.303r.shorten:] UNSPEC_LDS)) 25 {movdf_speculative}
  list_sum.c.304r.nothrow:] UNSPEC_LDS)) 25 {movdf_speculative}
  list_sum.c.304r.nothrow:] UNSPEC_LDS)) 25 {movdf_speculative}
  list_sum.c.306r.final:] UNSPEC_LDS)) 25 {movdf_speculative}
  list_sum.c.306r.final:] UNSPEC_LDS)) 25 {movdf_speculative}
  list_sum.c.307r.dfinish:] UNSPEC_LDS)) 25 {movdf_speculative}
  list_sum.c.307r.dfinish:] UNSPEC_LDS)) 25 {movdf_speculative}

The kvx back-end uses SCHED_EBB in sched2, which is also delayed to mach. On 
kvx, SCHED_EBB performs better than SEL_SCHED in most cases.

I suspect that control speculation does not work in prepass scheduling
with SEL_SCHED because of the following in ia64_set_sched_flags():

  if (mflag_sched_control_spec
      && (!sel_sched_p ()
          || reload_completed))
    {
      mask |= BEGIN_CONTROL;

      if (!sel_sched_p () && mflag_sched_in_control_spec)
        mask |= BE_IN_CONTROL;
    }
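
Read as a truth table, the condition above never sets BEGIN_CONTROL when the selective scheduler runs before reload. The standalone restatement below makes that easy to check. The function name and parameters are hypothetical, and the bit values are placeholders rather than the ones GCC uses.

```c
#include <stdbool.h>

/* Placeholder bit values for illustration only. */
enum { BEGIN_CONTROL = 1u << 0, BE_IN_CONTROL = 1u << 1 };

/* Restates the quoted condition with the globals turned into parameters:
   control_spec     ~ mflag_sched_control_spec
   in_control_spec  ~ mflag_sched_in_control_spec
   sel_sched        ~ sel_sched_p ()
   reload_completed ~ reload_completed */
unsigned control_spec_mask(bool control_spec, bool in_control_spec,
                           bool sel_sched, bool reload_completed)
{
    unsigned mask = 0;

    if (control_spec && (!sel_sched || reload_completed)) {
        mask |= BEGIN_CONTROL;

        if (!sel_sched && in_control_spec)
            mask |= BE_IN_CONTROL;
    }
    return mask;
}
```

With sel_sched true and reload_completed false (prepass with SEL_SCHED) the result is 0 regardless of the two flags; with sel_sched true and reload_completed true only BEGIN_CONTROL is set.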

On the kvx, I tried the following kvx_sched_set_sched_flags():


- Original Message -
From: "Alexander Monakov" 
To: "Benoît De Dinechin" 
Cc: "gcc" , "Andrey Belevantsev" 
Sent: Monday, February 8, 2021 3:44:08 PM
Subject: Re: IA64 control speculation of loads

On Mon, 8 Feb 2021, Benoît De Dinechin wrote:

> Hello, 
> 
> Is there a way to activate control speculation of loads in GCC, starting with
> the ia64 target? For a loop as simple as the one below, I could not get any
> with GCC 7.5:

I think in that loop cost modeling in sel-sched estimates that load speculation
would not be profitable. With a long-latency operation after the load, I do get
a speculative load at -O3 (for the 'payload' field, but not 'next'):

struct list {
  struct list *next;
  double payload;
};

double f(struct list *l)
{
  double result = 0;
  for (; l; l = l->next)
    result += 1 / l->payload;
  return result;
}

> Kalray has developed a 64-bit Fisher-style VLIW architecture ('KVX') for use
> in a manycore processor it produces. These VLIW cores run Linux, and Kalray
> develops GCC and LLVM code generators for them (see kvx compilers on
> https://godbolt.org/z/ZJGzje ). VLIW performance on non-numerical code is
> critically dependent on the control speculation of loads. Being a
> Fisher-style VLIW, the kvx architecture has dismissable loads instead of
> control speculative loads, so there is no need to create speculation checks
> with recovery code.
> 
> I first tried in prepass scheduling with SCHED_RGN, hoping from various
> comments in the source file that it could move loads across blocks
> (sched-rgn.c:26 The first run performs interblock scheduling, moving insns
> between different blocks in the same "region"). SCHED_EBB is not available in
> prepass and SEL_SCHED does not work with control speculation: not only from
> experience with the kvx retargeting where it breaks dataflow invariants, but
> also as hinted by logic in ia64.c:ia64_set_sched_flags(). 

Can you elaborate on the dataflow issues you've encountered? I don't recall the
specific reason why control speculation before register allocation cannot be
enabled with sel-sched, but I'd expect it has to do with the interval between
the speculative load and the check, in which the register may not be stored to
memory normally (needs dedicated spill/fill instructions), and interaction with
uninitialized variables assigned the same register.

If on KVX you don't need speculation checks, those concerns would not apply.

Why are you looking for pre-RA (prepass) scheduling specifically? To avoid
anti-dependencies created by register allocation?

> My question is whether GCC can do any control speculation of loads
> during prepass scheduling. From what I observed, enabling control speculation
> in region scheduling only lets the load instructions become ready earlier in
> their home basic block; they are not scheduled in a dominator basic block, as
> would be needed to improve performance in the above example.

But there's no control flow inside a basic block, so the load can appear earlier
due to data sp