Re: Calls to auto-vectorized AVX512 functions
On Mon, Feb 8, 2021 at 4:26 AM Naoki Shibata via Gcc wrote:
>
> Hello,
>
> I have a question as to the auto-vectorizer in GCC.
>
> When AVX512 instruction is available, the auto-vectorizer in gcc
> sometimes generates calls to AVX2 functions instead of AVX512 functions.
>
> $ cat vabitest.c
> #include
> #include
>
> _Pragma ("omp declare simd simdlen(8) notinbranch")
> __attribute__((const)) double myfunc(double x);
>
> #define N 1024
> __attribute__ ((__aligned__(256))) double a[N], b[N], c[N];
>
> int main(void) {
>   for (int i = 0; i < N; i++) a[i] = myfunc(b[i]);
>   for (int i = 0; i < N; i++) c[i] = sin(b[i]);
> }
>
> $ gcc-10 -ffast-math -O3 -mavx512f -fopenmp vabitest.c -S -o- | grep _ZGV
>         call    _ZGVdN8v_myfunc@PLT
>         call    _ZGVeN8v_sin@PLT
>
> Is there a way to force gcc to generate calls to AVX512 function in
> cases like this?

Not sure. The sin function gets called with a %zmm argument, but
myfunc is called with two %ymm arguments yet returns a %zmm.
Something seems messed up somewhere.

Can you open a bug report?

Richard.

> Regards,
>
> Naoki Shibata
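As an aid to reading the grep output above: the two symbols follow the x86 vector function ABI naming scheme `_ZGV<isa><mask><vlen><params>_<name>`, where the ISA letter is b (SSE), c (AVX), d (AVX2) or e (AVX-512), and the mask letter is N (notinbranch) or M (inbranch). The helper below is only an illustrative sketch; `decode_vabi`, `regs_per_double_arg` and `struct vabi_name` are made-up names, not part of GCC or glibc. It decodes the prefix and computes how many vector registers one `double` argument occupies, which is consistent with the two %ymm registers observed for `_ZGVdN8v_myfunc`.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper type, for illustration only. */
struct vabi_name {
    char isa;          /* 'b' SSE, 'c' AVX, 'd' AVX2, 'e' AVX-512 */
    int  masked;       /* 1 for 'M' (inbranch), 0 for 'N' (notinbranch) */
    int  vlen;         /* simdlen encoded in the symbol */
    int  vector_bits;  /* register width used for doubles with this ISA */
};

/* Decode the "_ZGV<isa><N|M><vlen>" prefix of a vector variant symbol.
   Returns 0 on success, -1 if the symbol does not have that shape. */
int decode_vabi(const char *sym, struct vabi_name *out)
{
    char *end;
    long vlen;

    if (strncmp(sym, "_ZGV", 4) != 0)
        return -1;

    switch (sym[4]) {
    case 'b': out->vector_bits = 128; break;
    case 'c':
    case 'd': out->vector_bits = 256; break;
    case 'e': out->vector_bits = 512; break;
    default:  return -1;
    }
    out->isa = sym[4];

    if (sym[5] != 'N' && sym[5] != 'M')
        return -1;
    out->masked = (sym[5] == 'M');

    vlen = strtol(sym + 6, &end, 10);
    if (end == sym + 6 || vlen <= 0)
        return -1;
    out->vlen = (int)vlen;
    return 0;
}

/* Number of vector registers needed to pass vlen doubles (64 bits each). */
int regs_per_double_arg(const struct vabi_name *v)
{
    int bits = v->vlen * 64;
    return (bits + v->vector_bits - 1) / v->vector_bits;
}
```

With simdlen(8), eight doubles are 512 bits, so the 'd' variant needs two 256-bit registers per argument while the 'e' variant needs one 512-bit register.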
Re: Calls to auto-vectorized AVX512 functions
On Mon, Feb 08, 2021 at 11:42:13AM +0100, Richard Biener via Gcc wrote:
> > I have a question as to the auto-vectorizer in GCC.
> >
> > When AVX512 instruction is available, the auto-vectorizer in gcc
> > sometimes generates calls to AVX2 functions instead of AVX512 functions.
> >
> > $ cat vabitest.c
> > #include
> > #include
> >
> > _Pragma ("omp declare simd simdlen(8) notinbranch")
> > __attribute__((const)) double myfunc(double x);
> >
> > #define N 1024
> > __attribute__ ((__aligned__(256))) double a[N], b[N], c[N];
> >
> > int main(void) {
> >   for (int i = 0; i < N; i++) a[i] = myfunc(b[i]);
> >   for (int i = 0; i < N; i++) c[i] = sin(b[i]);
> > }
> >
> > $ gcc-10 -ffast-math -O3 -mavx512f -fopenmp vabitest.c -S -o- | grep _ZGV
> >         call    _ZGVdN8v_myfunc@PLT
> >         call    _ZGVeN8v_sin@PLT
> >
> > Is there a way to force gcc to generate calls to AVX512 function in
> > cases like this?
>
> Not sure, the sin function get's called with a %zmm argument but
> myfunc is called with two %ymm arguments but returns a %zmm.
> Something seems messed up somewhere.
>
> Can you open a bugreport?

I'd guess it is -mprefer-vector-width=256 by default, though it surprises
me that %zmm is used for sin.
In any case, I'd retry with -mprefer-vector-width=512.

	Jakub
IA64 control speculation of loads
Hello,

Is there a way to activate control speculation of loads in GCC, starting with
the ia64 target? Even for a loop as simple as the following, I could not get
any on GCC 7.5:

double list_sum(list_cell list)
{
    double result = 0.0;
    while (list->next) {
        list = list->next;
        result += list->payload;
        if (!list->next)
            break;
        list = list->next;
        result += list->payload;
    }
    return result;
}

Kalray has developed a 64-bit Fisher-style VLIW architecture ('KVX') for use
in a manycore processor it produces. These VLIW cores run Linux, and Kalray
develops GCC and LLVM code generators for them (see kvx compilers on
https://godbolt.org/z/ZJGzje ). VLIW performance on non-numerical code is
critically dependent on the control speculation of loads. Being a
Fisher-style VLIW, the kvx architecture has dismissable loads instead of
control speculative loads, so there is no need to create speculation checks
with recovery code.

I first tried prepass scheduling with SCHED_RGN, hoping from various
comments in the source file that it could move loads across blocks
(sched-rgn.c:26: "The first run performs interblock scheduling, moving insns
between different blocks in the same 'region'"). SCHED_EBB is not available in
prepass, and SEL_SCHED does not work with control speculation: not only from
experience with the kvx retargeting, where it breaks dataflow invariants, but
also as hinted by the logic in ia64.c:ia64_set_sched_flags().

My question is whether GCC can or cannot do any control speculation of loads
during prepass scheduling. From what I observed, enabling control speculation
in region scheduling only lets the load instructions become ready earlier
in their home basic block; they are not scheduled in a dominator basic block,
which is what would be needed to improve performance in the above example.

Thanks for any advice.

Benoît Dinechin

Attachment: list_sum-ia64.s (Description: Binary data)
Attachment: list_sum-kvx.s (Description: Binary data)
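The loop in the message above relies on a definition of list_cell that is not shown. A minimal sketch for compiling it standalone follows, assuming (this is not stated in the message) that a cell is a struct holding a next pointer and a double payload, and that list_cell is a pointer to it. The comments mark the loads that could only be moved above a loop exit test if they were control speculative or dismissable, since the pointer may be NULL at that point.

```c
#include <stddef.h>

/* Assumed layout; the original message does not define list_cell. */
typedef struct list_cell_ {
    struct list_cell_ *next;
    double payload;
} *list_cell;

double list_sum(list_cell list)
{
    double result = 0.0;
    while (list->next) {            /* exit test 1 */
        list = list->next;
        result += list->payload;    /* load guarded by exit test 1 */
        if (!list->next)            /* exit test 2 */
            break;
        list = list->next;
        result += list->payload;    /* load guarded by exit test 2 */
    }
    return result;
}
```

Note that the payload of the first cell is never added, so the first cell acts as a list header in this sketch.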
Re: IA64 control speculation of loads
On Mon, 8 Feb 2021, Benoît De Dinechin wrote:

> Hello,
>
> Is there a way to activate control speculation of loads in GCC, starting with
> the ia64 target? For a loop as simple as on GCC 7.5, I could not get any:

I think in that loop cost modeling in sel-sched estimates that load speculation
would not be profitable. With a long-latency operation after the load, I do get
a speculative load at -O3 (for the 'payload' field, but not 'next'):

struct list {
    struct list *next;
    double payload;
};
double f(struct list *l)
{
    double result = 0;
    for (; l; l = l->next)
        result += 1 / l->payload;
    return result;
}

> Kalray has developed a 64-bit Fisher-style VLIW architecture ('KVX') for use
> in a manycore processor it produces. These VLIW cores run Linux, and Kalray
> develops GCC and LLVM code generators for them (see kvx compilers on
> https://godbolt.org/z/ZJGzje ). VLIW performance on non-numerical code is
> critically dependent on the control speculation of loads. Being a
> Fischer-style VLIW, the kvx architecture has dismissable loads instead of
> control speculative loads, so there is no need to create speculation check
> with recovery code.
>
> I first tried in prepass scheduling with SCHED_RGN, hoping from various
> comments in the source file that it could move loads across blocks
> (sched-rgn.c:26 The first run performs interblock scheduling, moving insns
> between different blocks in the same "region"). SCHED_EBB is not available in
> prepass and SEL_SCHED does not work with control speculation: not only from
> experience with the kvx retargeting where it breaks dataflow invariants, but
> also as hinted by logic in ia64.c:ia64_set_sched_flags().

Can you elaborate on the dataflow issues you've encountered?
I don't recall the specific reason why control speculation before register
allocation cannot be enabled with sel-sched, but I'd expect it has to do with
the interval between the speculative load and the check, in which the register
may not be stored to memory normally (needs dedicated spill/fill instructions),
and interaction with uninitialized variables assigned the same register.

If on KVX you don't need speculation checks, those concerns would not apply.

Why are you looking for pre-RA (prepass) scheduling specifically? To avoid
anti-dependencies created by register allocation?

> My question is whether GCC can or cannot do any control speculation of loads
> during prepass scheduling. From what I observed, enabling control speculation
> in region scheduling only enables the load instructions to get ready earlier
> in their home basic block, not being scheduled in a dominator basic block like
> expected to happen for improving performance in the above example.

But there's no control flow inside a basic block, so the load can appear
earlier due to data speculation (or normal scheduling), not control speculation.

I think GCC may have correctness issues with ia64-style control speculation
before register allocation, but I can't think of a reason why check-free loads
would pose a problem.

Alexander
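The distinction drawn above between control speculation and data speculation can be illustrated at the source level. The following is only a hand-written sketch with made-up function names; a scheduler performs these motions on RTL, and on ia64 the speculative forms correspond to ld.s with chk.s (control) and ld.a with chk.a (data). In plain C the hoisted control-speculative load is only safe if p can always be dereferenced; a non-trapping load is what would lift that requirement in hardware.

```c
/* Control dependence: the load executes only when cond is true. */
double load_guarded(const double *p, int cond)
{
    if (cond)
        return *p;
    return 0.0;
}

/* Control speculation by hand: the load is moved above the branch.
   In C this is only valid if p is always dereferenceable. */
double load_control_speculated(const double *p, int cond)
{
    double t = *p;          /* speculative load; result may go unused */
    return cond ? t : 0.0;
}

/* Data dependence: the load may alias the preceding store. */
double load_after_store(double *q, const double *p)
{
    *q = 1.0;
    return *p;
}

/* Data speculation by hand: the load is moved above the store, followed
   by a check and a recovery reload when the addresses collide. */
double load_data_speculated(double *q, const double *p)
{
    double t = *p;          /* advanced load */
    *q = 1.0;
    if (p == q)             /* check: did the store hit the loaded address? */
        t = *p;             /* recovery: redo the load */
    return t;
}
```

Both hand-speculated variants return the same values as their guarded counterparts; the first moves a load across a branch, the second across a store, which is why a load appearing earlier within a single basic block cannot be an instance of control speculation.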
Re: IA64 control speculation of loads
Hi Alexander,

Indeed there is control speculation on this example, but it only happens in
sched2, which is delayed to mach and uses SEL_SCHED:

../gcc/ia64/gcc/cc1 -fpreprocessed ../list_sum.i -quiet -dumpbase list_sum.c -auxbase list_sum -O3 -o list_sum.s -da -dp -dA -fsched-verbose=6
grep movdf_speculative list_sum.c.*
list_sum.c.298r.mach:] UNSPEC_LDS)) 25 {movdf_speculative}
list_sum.c.298r.mach:] UNSPEC_LDS)) 25 {movdf_speculative}
list_sum.c.298r.mach:] UNSPEC_LDS)) 25 {movdf_speculative}
list_sum.c.298r.mach:] UNSPEC_LDS)) 25 {movdf_speculative}
list_sum.c.299r.barriers:] UNSPEC_LDS)) 25 {movdf_speculative}
list_sum.c.299r.barriers:] UNSPEC_LDS)) 25 {movdf_speculative}
list_sum.c.303r.shorten:] UNSPEC_LDS)) 25 {movdf_speculative}
list_sum.c.303r.shorten:] UNSPEC_LDS)) 25 {movdf_speculative}
list_sum.c.304r.nothrow:] UNSPEC_LDS)) 25 {movdf_speculative}
list_sum.c.304r.nothrow:] UNSPEC_LDS)) 25 {movdf_speculative}
list_sum.c.306r.final:] UNSPEC_LDS)) 25 {movdf_speculative}
list_sum.c.306r.final:] UNSPEC_LDS)) 25 {movdf_speculative}
list_sum.c.307r.dfinish:] UNSPEC_LDS)) 25 {movdf_speculative}
list_sum.c.307r.dfinish:] UNSPEC_LDS)) 25 {movdf_speculative}

The kvx back-end uses SCHED_EBB in sched2, which is also delayed to mach. On
kvx, SCHED_EBB performs better than SEL_SCHED in most cases.
I suspect that control speculation does not work in prepass scheduling with
SEL_SCHED because of the following in ia64_set_sched_flags():

  if (mflag_sched_control_spec
      && (!sel_sched_p ()
          || reload_completed))
    {
      mask |= BEGIN_CONTROL;

      if (!sel_sched_p () && mflag_sched_in_control_spec)
        mask |= BE_IN_CONTROL;
    }

On the kvx, I tried the following kvx_sched_set_sched_flags():

----- Original Message -----
From: "Alexander Monakov"
To: "Benoît De Dinechin"
Cc: "gcc", "Andrey Belevantsev"
Sent: Monday, February 8, 2021 3:44:08 PM
Subject: Re: IA64 control speculation of loads

On Mon, 8 Feb 2021, Benoît De Dinechin wrote:

> Hello,
>
> Is there a way to activate control speculation of loads in GCC, starting with
> the ia64 target? For a loop as simple as on GCC 7.5, I could not get any:

I think in that loop cost modeling in sel-sched estimates that load speculation
would not be profitable. With a long-latency operation after the load, I do get
a speculative load at -O3 (for the 'payload' field, but not 'next'):

struct list {
    struct list *next;
    double payload;
};
double f(struct list *l)
{
    double result = 0;
    for (; l; l = l->next)
        result += 1 / l->payload;
    return result;
}

> Kalray has developed a 64-bit Fisher-style VLIW architecture ('KVX') for use
> in a manycore processor it produces. These VLIW cores run Linux, and Kalray
> develops GCC and LLVM code generators for them (see kvx compilers on
> https://godbolt.org/z/ZJGzje ). VLIW performance on non-numerical code is
> critically dependent on the control speculation of loads. Being a
> Fischer-style VLIW, the kvx architecture has dismissable loads instead of
> control speculative loads, so there is no need to create speculation check
> with recovery code.
>
> I first tried in prepass scheduling with SCHED_RGN, hoping from various
> comments in the source file that it could move loads across blocks
> (sched-rgn.c:26 The first run performs interblock scheduling, moving insns
> between different blocks in the same "region").
> SCHED_EBB is not available in
> prepass and SEL_SCHED does not work with control speculation: not only from
> experience with the kvx retargeting where it breaks dataflow invariants, but
> also as hinted by logic in ia64.c:ia64_set_sched_flags().

Can you elaborate on the dataflow issues you've encountered?

I don't recall the specific reason why control speculation before register
allocation cannot be enabled with sel-sched, but I'd expect it has to do with
the interval between the speculative load and the check, in which the register
may not be stored to memory normally (needs dedicated spill/fill instructions),
and interaction with uninitialized variables assigned the same register.

If on KVX you don't need speculation checks, those concerns would not apply.

Why are you looking for pre-RA (prepass) scheduling specifically? To avoid
anti-dependencies created by register allocation?

> My question is whether GCC can or cannot do any control speculation of loads
> during prepass scheduling. From what I observed, enabling control speculation
> in region scheduling only enables the load instructions to get ready earlier
> in their home basic block, not being scheduled in a dominator basic block like
> expected to happen for improving performance in the above example.

But there's no control flow inside a basic block, so the load can appear
earlier due to data sp