[Bug target/72784] AVX512: Assembler failure when compiling on OSX

2018-10-21 Thread wen...@mitsuba-renderer.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=72784

Wenzel Jakob  changed:

   What|Removed |Added

 Status|UNCONFIRMED |RESOLVED
 Resolution|--- |INVALID

--- Comment #1 from Wenzel Jakob  ---
Closing this due to lack of attention/relevance.

[Bug target/87674] New: AVX512: incorrect intrinsic signature

2018-10-21 Thread wen...@mitsuba-renderer.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87674

Bug ID: 87674
   Summary: AVX512: incorrect intrinsic signature
   Product: gcc
   Version: 8.2.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: wen...@mitsuba-renderer.org
  Target Milestone: ---

Hi,

I'm seeing a number of warnings related to the following three intrinsics,
which appear to have an incorrect signature. The fix is easy: simply change
__mmask16 to __mmask8 for those definitions (this is also what's correct
according to Intel's Intrinsics Explorer).

/home/wjakob/dist/lib/gcc/x86_64-pc-linux-gnu/8.2.0/include/avx512vlintrin.h:
In function ‘__m128i _mm_mask_mullo_epi32(__m128i, __mmask16, __m128i,
__m128i)’:
/home/wjakob/dist/lib/gcc/x86_64-pc-linux-gnu/8.2.0/include/avx512vlintrin.h:9055:23:
warning: conversion from ‘__mmask16’ {aka ‘short unsigned int’} to ‘unsigned
char’ may change value [-Wconversion]
 (__v4si) __W, __M);   ^~~
/home/wjakob/dist/lib/gcc/x86_64-pc-linux-gnu/8.2.0/include/avx512vlbwintrin.h:
In function ‘__m128i _mm_mask_packus_epi32(__m128i, __mmask16, __m128i,
__m128i)’:
/home/wjakob/dist/lib/gcc/x86_64-pc-linux-gnu/8.2.0/include/avx512vlbwintrin.h:4354:25:
warning: conversion from ‘__mmask16’ {aka ‘short unsigned int’} to ‘unsigned
char’ may change value [-Wconversion]
   (__v8hi) __W, __M); ^~~
/home/wjakob/dist/lib/gcc/x86_64-pc-linux-gnu/8.2.0/include/avx512vlbwintrin.h:
In function ‘__m128i _mm_mask_packs_epi32(__m128i, __mmask16, __m128i,
__m128i)’:
/home/wjakob/dist/lib/gcc/x86_64-pc-linux-gnu/8.2.0/include/avx512vlbwintrin.h:4397:25:
warning: conversion from ‘__mmask16’ {aka ‘short unsigned int’} to ‘unsigned
char’ may change value [-Wconversion]
   (__v8hi) __W, __M); ^~~

Best,
Wenzel

[Bug target/87674] AVX512: incorrect intrinsic signature

2018-10-22 Thread wen...@mitsuba-renderer.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87674

--- Comment #3 from Wenzel Jakob  ---
Thanks -- this patch works for me.

With regards to the signature difference: I had already stumbled upon the
(float *) vs (some value *) difference in some intrinsics.

In the best case differences cause warnings (ok, but still annoying :)), in the
worst case special casts are needed for GCC, making intrinsics code less
portable between compilers. So my vote would definitely be for matching ICC
behavior 1:1.

[Bug target/76731] [AVX512] _mm512_i32gather_epi32 and other scatter/gather routines have incorrect signature

2016-12-13 Thread wen...@mitsuba-renderer.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=76731

--- Comment #6 from Wenzel Jakob  ---
Is there any news here? This is clearly an issue, and it would be nice to fix
it. I currently can't compile my AVX512 project on GCC due to this bug.

[Bug target/76731] [AVX512] _mm512_i32gather_epi32 and other scatter/gather routines have incorrect signature

2017-01-10 Thread wen...@mitsuba-renderer.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=76731

--- Comment #9 from Wenzel Jakob  ---
Hi -- just a ping regarding this issue.
Thanks,
Wenzel

[Bug target/73350] AVX512: GCC optimizes away rounding flags

2017-01-10 Thread wen...@mitsuba-renderer.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=73350

--- Comment #4 from Wenzel Jakob  ---
This bug is still present in the latest GCC -- are there any plans to fix it?

[Bug target/76731] [AVX512] _mm512_i32gather_epi32 and other scatter/gather routines have incorrect signature

2017-01-13 Thread wen...@mitsuba-renderer.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=76731

--- Comment #11 from Wenzel Jakob  ---
Searching through the intrinsics guide (e.g.
https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=gather_ps),
I see "void *" for all scatter intrinsics and "const void *" for all
gather intrinsics consistently applied.

This is in contrast to the load intrinsics, where there are some
inconsistencies between argument conventions (e.g. _mm512_load_ps vs
_mm256_load_ps)

[Bug target/79481] New: AVX512PF: unmasked gather prefetch intrinsics missing

2017-02-12 Thread wen...@mitsuba-renderer.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79481

Bug ID: 79481
   Summary: AVX512PF: unmasked gather prefetch intrinsics missing
   Product: gcc
   Version: 7.0.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: wen...@mitsuba-renderer.org
  Target Milestone: ---

The latest trunk version (and all versions before, as far as I can tell) is
missing the following (unmasked) intrinsics for gather prefetches:

_mm512_prefetch_i32gather_pd
_mm512_prefetch_i64gather_pd
_mm512_prefetch_i32gather_ps
_mm512_prefetch_i64gather_ps

It would be great if these could be added.

Thanks,
Wenzel

[Bug target/79481] AVX512PF: unmasked gather prefetch intrinsics missing

2017-02-13 Thread wen...@mitsuba-renderer.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79481

--- Comment #2 from Wenzel Jakob  ---
I agree that the docs from Intel are not particularly consistent. In this case,
the hardware has dedicated instructions for these types of gathers, so it would
make sense for a matching intrinsic to be part of GCC.

[Bug target/79481] AVX512PF: unmasked gather prefetch intrinsics missing

2017-02-13 Thread wen...@mitsuba-renderer.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79481

--- Comment #4 from Wenzel Jakob  ---
I think that's right. Clang e.g. also does this:

#define _mm512_prefetch_i32gather_ps(index, addr, scale, hint) ({\
  __builtin_ia32_gatherpfdps((__mmask16) -1, \
 (__v16si)(__m512i)(index), (int const *)(addr), \
 (int)(scale), (int)(hint)); })

[Bug c++/77629] New: internal compiler error: same canonical type node for different types

2016-09-17 Thread wen...@mitsuba-renderer.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77629

Bug ID: 77629
   Summary: internal compiler error: same canonical type node for
different types
   Product: gcc
   Version: 7.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: wen...@mitsuba-renderer.org
  Target Milestone: ---

Created attachment 39640
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=39640&action=edit
Preprocessed source code

I am running into the following internal compiler error with GCC TRUNK. The
preprocessed file is attached.

$ /usr/local/bin/gcc7 out.cpp

In file included from include/simdarray/array.h:58:0,
 from tests/testsuite.cpp:21:
include/simdarray/array_recursive.h:124:93: internal compiler error: same
canonical type node for different types simd::ArrayBase::peel != -1),
void>::type>::Base and simd::ArrayOperations
 SIMD_INLINE ArrayBase(Scalar value) : a1(value), a2(value) { }
   
 ^
0x781d74 comptypes(tree_node*, tree_node*, int)
../../gcc/cp/typeck.c:1437
0x6bcd01 resolve_typename_type(tree_node*, bool)
../../gcc/cp/pt.c:23721
0x7802ec structural_comptypes
../../gcc/cp/typeck.c:1204
0x7848ad comptypes(tree_node*, tree_node*, int)
../../gcc/cp/typeck.c:1409
0x7848ad compparms(tree_node const*, tree_node const*)
../../gcc/cp/typeck.c:1539
0x703b5c add_method(tree_node*, tree_node*, tree_node*)
../../gcc/cp/class.c:1155
0x7d0c64 finish_member_declaration(tree_node*)
../../gcc/cp/semantics.c:2997
0x76d3d8 cp_parser_member_declaration
../../gcc/cp/parser.c:22770
0x747b7a cp_parser_member_specification_opt
../../gcc/cp/parser.c:22331
0x747b7a cp_parser_class_specifier_1
../../gcc/cp/parser.c:21496
0x74a069 cp_parser_class_specifier
../../gcc/cp/parser.c:21745
0x74a069 cp_parser_type_specifier
../../gcc/cp/parser.c:15971
0x75da97 cp_parser_decl_specifier_seq
../../gcc/cp/parser.c:12889
0x76b965 cp_parser_single_declaration
../../gcc/cp/parser.c:25975
0x76bd0c cp_parser_template_declaration_after_parameters
../../gcc/cp/parser.c:25667
0x76c68c cp_parser_explicit_template_declaration
../../gcc/cp/parser.c:25902
0x76c68c cp_parser_template_declaration_after_export
../../gcc/cp/parser.c:25920
0x7735a9 cp_parser_declaration
../../gcc/cp/parser.c:12209
0x771d7b cp_parser_declaration_seq_opt
../../gcc/cp/parser.c:12139
0x7724b2 cp_parser_namespace_body
../../gcc/cp/parser.c:17763
Please submit a full bug report,
with preprocessed source if appropriate.
Please include the complete backtrace with any bug report.
See <http://gcc.gnu.org/bugs.html> for instructions.

[Bug c++/69481] ICE with C++11 alias using with templates

2016-09-18 Thread wen...@mitsuba-renderer.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69481

--- Comment #4 from Wenzel Jakob  ---
I'm pretty sure this is a recent regression -- GCC was able to compile the code
on Bug 77629 a month ago.

[Bug c++/69481] ICE with C++11 alias using with templates

2016-09-18 Thread wen...@mitsuba-renderer.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69481

--- Comment #6 from Wenzel Jakob  ---
No -- I am experimenting with the AVX512F backend and thus need to use the
development branch.

[Bug target/76731] [AVX512] _mm512_i32gather_epi32 and other scatter/gather routines have incorrect signature

2016-09-18 Thread wen...@mitsuba-renderer.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=76731

--- Comment #3 from Wenzel Jakob  ---
Any updates here? Should this be closed?

[Bug target/76731] [AVX512] _mm512_i32gather_epi32 and other scatter/gather routines have incorrect signature

2016-09-18 Thread wen...@mitsuba-renderer.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=76731

--- Comment #4 from Wenzel Jakob  ---
Hmm, it looks like this is still an issue. Recompiling my codebase with the
latest trunk version of gcc still produces many errors caused by this, e.g.


include/simdarray/array_avx512.h:1059:53: error: invalid conversion from
‘simd::ArrayOperations >::Scalar* {aka
unsigned int*}’ to ‘const int*’ [-fpermissive]
 __m512i values = _mm512_mask_i32gather_epi32(
  ~~~^
 _mm512_undefined_epi32(), mask.k, index.m, f, sizeof(Scalar));
 ~
In file included from
/usr/local/lib/gcc/x86_64-pc-linux-gnu/7.0.0/include/immintrin.h:45:0,
 from include/simdarray/array.h:33,
 from tests/histogram.cpp:2:
/usr/local/lib/gcc/x86_64-pc-linux-gnu/7.0.0/include/avx512fintrin.h:9316:1:
note:   initializing argument 4 of ‘__m512i
_mm512_mask_i32gather_epi32(__m512i, __mmask16, __m512i, const int*, int)’
 _mm512_mask_i32gather_epi32 (__m512i __v1_old, __mmask16 __mask,
 ^~~
In file included from include/simdarray/array.h:73:0,
 from tests/histogram.cpp:2:
include/simdarray/array_avx512.h:1068:22: error: use of ‘main(int,
char**):: [with auto:1 = simd::Array]’ before deduction of ‘auto’
 values = func(Derived(values)).m;

etc...

[Bug tree-optimization/72824] [5/6 Regression] Signed floating point zero semantics broken at optimization level -O3 (tree-loop-distribute-patterns)

2016-09-18 Thread wen...@mitsuba-renderer.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=72824

--- Comment #13 from Wenzel Jakob  ---
The fix was merged, so I assume this bug should be closed as RESOLVED?

[Bug c++/69481] ICE with C++11 alias using with templates

2016-09-18 Thread wen...@mitsuba-renderer.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69481

--- Comment #7 from Wenzel Jakob  ---
Correction: this ICE indeed goes away when building with
--enable-checking=release (though that doesn't seem like a nice solution). I
assume I used this check level in my trunk builds before and forgot it this
time.

[Bug target/77633] New: AVX512: shuffle intrinsic has incorrect signature when optimizations are enabled

2016-09-18 Thread wen...@mitsuba-renderer.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77633

Bug ID: 77633
   Summary: AVX512: shuffle intrinsic has incorrect signature when
optimizations are enabled
   Product: gcc
   Version: 7.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: wen...@mitsuba-renderer.org
  Target Milestone: ---

The AVX512 shuffle intrinsic switches to a different implementation (&
different signature) when optimizations are turned on. This leads to the
following strange error message when compiling a snippet that passes the type
checker at -O0.

///


$ g++-7 test.c -march=knl -O3
In file included from
/usr/local/lib/gcc/x86_64-pc-linux-gnu/7.0.0/include/immintrin.h:29:0,
 from test.c:1:
test.c: In function ‘void test()’:
test.c:8:50: error: invalid conversion from ‘int’ to ‘_MM_PERM_ENUM’
[-fpermissive]
 _mm512_shuffle_epi32(_mm512_setzero_epi32(), _MM_SHUFFLE(0, 3, 0, 1));
  ^
In file included from
/usr/local/lib/gcc/x86_64-pc-linux-gnu/7.0.0/include/immintrin.h:45:0,
 from test.c:1:
/usr/local/lib/gcc/x86_64-pc-linux-gnu/7.0.0/include/avx512fintrin.h:3848:1:
note:   initializing argument 2 of ‘__m512i _mm512_shuffle_epi32(__m512i,
_MM_PERM_ENUM)’
 _mm512_shuffle_epi32 (__m512i __A, _MM_PERM_ENUM __mask)
 ^~~~


///


#include <immintrin.h>

void test() {
/* SSE shuffle: works */
_mm_shuffle_epi32(_mm_setzero_si128(), _MM_SHUFFLE(0, 3, 0, 1));

/* AVX512 shuffle: type checker error when optimizations are turned on! */
_mm512_shuffle_epi32(_mm512_setzero_epi32(), _MM_SHUFFLE(0, 3, 0, 1));
}

[Bug target/77633] AVX512: shuffle intrinsic has incorrect signature when optimizations are enabled

2016-09-19 Thread wen...@mitsuba-renderer.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77633

--- Comment #2 from Wenzel Jakob  ---
I just tried compiling this snippet with ICC 17.0.0. It accepts it without
warnings (-Wall -Wconversion -Wextra). So even if the signature is different,
ICC seems to be more relaxed about passing an integer value to an enum
parameter.

[Bug target/77633] AVX512: shuffle intrinsic has incorrect signature when optimizations are enabled

2016-09-19 Thread wen...@mitsuba-renderer.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77633

--- Comment #4 from Wenzel Jakob  ---
Aha, interesting -- that breaks it:

test.cpp(9): error: argument of type "int" is incompatible with parameter of
type "_MM_PERM_ENUM={_MM_PERM_ENUM}"
  _mm512_shuffle_epi32(_mm512_setzero_epi32(), _MM_SHUFFLE(0, 3, 0, 1));

Definitely not a very nice API design! I assume the right course of action then
will be to mark this issue INVALID and change my code to cast to _MM_PERM_ENUM?

[Bug target/72773] New: AVX512: Invalid operand for vcvttss2siq instruction

2016-08-02 Thread wen...@mitsuba-renderer.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=72773

Bug ID: 72773
   Summary: AVX512: Invalid operand for vcvttss2siq instruction
   Product: gcc
   Version: 7.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: wen...@mitsuba-renderer.org
  Target Milestone: ---

Created attachment 39047
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=39047&action=edit
Preprocessed file causing the issue

Hi,

I'm running into a pesky AVX512F code generation issue using the latest HEAD
version of gcc on my OSX development machine.
Compiling the attached preprocessed file yields the following error messages:

g++-7 test.i -xc++ -std=c++14 -O3 -DNDEBUG -fomit-frame-pointer -mavx2 -mfma
-mf16c -mavx512f -Wa,-mavx512f

/var/folders/94/rfzxfhbn3hjb4p402lg_yjgwgn/T//cci4IcB4.s:555:14: error:
invalid operand for instruction
vcvttss2siq %xmm18, %rax
^~
/var/folders/94/rfzxfhbn3hjb4p402lg_yjgwgn/T//cci4IcB4.s:559:14: error:
invalid operand for instruction
vcvttss2siq %xmm17, %rax
^~
/var/folders/94/rfzxfhbn3hjb4p402lg_yjgwgn/T//cci4IcB4.s:561:14: error:
invalid operand for instruction
vcvttss2siq %xmm16, %rax
^~

AFAIK on OSX, GCC uses the Clang assembler. There are thus two possibilities:

1. The vcvttss2siq instruction does not exist for new-style xmm register
arguments, and GCC should not have generated it

2. It is a valid instruction, and it's the Clang assembler's fault for not
recognizing it.

I am not familiar enough with the AVX512F assembly and will create a ticket in
both the GCC and LLVM bug trackers so that this problem can be addressed.

Details on my compiler version:

COLLECT_GCC=g++-7
COLLECT_LTO_WRAPPER=/usr/local/Cellar/gcc/HEAD-/libexec/gcc/x86_64-apple-darwin15.5.0/7.0.0/lto-wrapper
Target: x86_64-apple-darwin15.5.0
Configured with: ../configure --build=x86_64-apple-darwin15.5.0
--prefix=/usr/local/Cellar/gcc/HEAD-
--libdir=/usr/local/Cellar/gcc/HEAD-/lib/gcc/
--enable-languages=c,c++,objc,obj-c++,fortran --program-suffix=-
--with-gmp=/usr/local/opt/gmp --with-mpfr=/usr/local/opt/mpfr
--with-mpc=/usr/local/opt/libmpc --with-isl=/usr/local/opt/isl
--with-system-zlib --enable-libstdcxx-time=yes --enable-stage1-checking
--enable-checking=release --enable-lto --with-build-config=bootstrap-debug
--disable-werror --with-pkgversion='Homebrew gcc HEAD- --without-multilib'
--with-bugurl=https://github.com/Homebrew/homebrew/issues --enable-plugin
--disable-nls --disable-multilib
Thread model: posix
gcc version 7.0.0 20160801 (experimental) (Homebrew gcc HEAD-
--without-multilib)

[Bug target/72773] AVX512: Invalid operand for vcvttss2siq instruction

2016-08-02 Thread wen...@mitsuba-renderer.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=72773

--- Comment #1 from Wenzel Jakob  ---
The LLVM ticket is here: https://llvm.org/bugs/show_bug.cgi?id=28810

[Bug target/72773] AVX512: Invalid operand for vcvttss2siq instruction

2016-08-02 Thread wen...@mitsuba-renderer.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=72773

--- Comment #3 from Wenzel Jakob  ---
It looks like it is an LLVM issue (see
https://llvm.org/bugs/show_bug.cgi?id=28810)

[Bug target/72782] New: AVX512: No support for scalar broadcasts

2016-08-03 Thread wen...@mitsuba-renderer.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=72782

Bug ID: 72782
   Summary: AVX512: No support for scalar broadcasts
   Product: gcc
   Version: 7.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: wen...@mitsuba-renderer.org
  Target Milestone: ---

AVX512 introduces the ability to do scalar broadcasts, which significantly cuts
down on the number of explicit broadcast instructions in vectorized code. It
looks like the AVX512 code generation backend on GCC does not recognize/make
use of this instruction set feature:

Consider the following snippet:

__m512 addConstant(__m512 arg) {
return _mm512_add_ps(arg, _mm512_set1_ps(1.f));
}

This is the assembly generated by GCC (HEAD):

__Z11addConstantDv16_f:
LFB4589:
vbroadcastss  LC0(%rip), %zmm1
vaddps  %zmm1, %zmm0, %zmm0
ret

For reference, this is the output generated by Clang:

_Z11addConstantDv16_f: ## @_Z11addConstantDv16_f
vaddps  LCPI0_0(%rip){1to16}, %zmm0, %zmm0
retq

[Bug target/72784] New: AVX512: Assembler failure when compiling on OSX

2016-08-03 Thread wen...@mitsuba-renderer.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=72784

Bug ID: 72784
   Summary: AVX512: Assembler failure when compiling on OSX
   Product: gcc
   Version: 7.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: wen...@mitsuba-renderer.org
  Target Milestone: ---

GCC (HEAD) fails to compile basic AVX512 code on my machine (OSX 10.11.6) which
I'm using to develop for (and emulate) this architecture.

Consider the following small program:

#include <immintrin.h>

__m512 addConstant(__m512 arg) {
return _mm512_add_ps(arg, _mm512_set1_ps(1.f));
}

Compiling yields the following error message at the assembler stage:

$ g++-7 test.cpp -c -o test.s -O3 -mavx512f

/var/folders/lm/4mxv3gx901q6sympjjnzbrb4gp/T//cc0nXIBi.s:6:2: error:
instruction requires: AVX-512 ISA
vbroadcastss  LC0(%rip), %zmm1
^
/var/folders/lm/4mxv3gx901q6sympjjnzbrb4gp/T//cc0nXIBi.s:7:2: error:
instruction requires: AVX-512 ISA
vaddps  %zmm1, %zmm0, %zmm0


This is an interesting interaction between the GCC toolchain and the Clang
assembler which only occurs when developing on OSX. It is possible to work
around the error by specifying an additional option "-Wa,-mavx512f" to the
compiler. However, this is certainly non-ideal, since it is a nonstandard
parameter and in fact causes builds on other platforms to fail.

Ideally GCC (on OSX only) would transparently forward the -march=<...> and
-mavx* parameters to the LLVM assembler.

[Bug target/72805] New: AVX512: invalid code generation involving masks

2016-08-04 Thread wen...@mitsuba-renderer.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=72805

Bug ID: 72805
   Summary: AVX512: invalid code generation involving masks
   Product: gcc
   Version: 7.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: wen...@mitsuba-renderer.org
  Target Milestone: ---

Consider the following minimal program, which initializes a 16-int AVX512
vector with -1 entries, does a component-wise "< 0" comparison, and prints the
resulting mask.

Since there are 16 entries, the expected output is "65535". GCC trunk prints
"255" (compilation flags: g++-7 -S -mavx512f  test.c -o test.s
-fomit-frame-pointer -fno-asynchronous-unwind-tables -fno-exceptions). The
issue goes away when compiling at higher optimization levels, though that is
clearly not a good solution.

#include <immintrin.h>
#include <stdio.h>

__attribute__((noinline))
int test() { 
__m512i value = _mm512_set1_epi32(-1);
return (int) _mm512_cmp_epi32_mask(value, _mm512_setzero_si512(), 1 /*
_MM_CMPINT_LT */);
}

int main(int argc, char *argv[]) {
printf("%i\n", test());
return 0;
}

Looking at the assembly reveals the problem:

__Z4testv:
leaq    8(%rsp), %r10
andq    $-64, %rsp
pushq   -8(%r10)
pushq   %rbp
movq    %rsp, %rbp
pushq   %r10
subq    $112, %rsp
movl    $-1, -52(%rbp)
vmovdqa64   -176(%rbp), %zmm0
movl    $-1, %eax
kmovw   %eax, %k2
vpbroadcastd    -52(%rbp), %zmm0{%k2}
vmovdqa64   %zmm0, -240(%rbp)
vpxord  %zmm0, %zmm0, %zmm0
vmovdqa64   %zmm0, %zmm1
vmovdqa64   -240(%rbp), %zmm0
movl    $-1, %eax
kmovw   %eax, %k3
vpcmpd  $1, %zmm1, %zmm0, %k1{%k3}
kmovw   %k1, %eax
movzbl  %al, %eax        <- UH OH
addq    $112, %rsp
popq    %r10
popq    %rbp
leaq    -8(%r10), %rsp
ret

For some reason, GCC thinks that the mask is only eight bits wide and uses a
"movzbl" instruction.

At higher optimization levels, many of the moves are elided, and the mask is
directly copied to %eax. Very mysterious.

__Z4testv:
vpternlogd  $0xFF, %zmm0, %zmm0, %zmm0
vpxord  %zmm1, %zmm1, %zmm1
vpcmpd  $1, %zmm1, %zmm0, %k1
kmovw   %k1, %eax
movzwl  %ax, %eax
ret

[Bug target/72805] AVX512: invalid code generation involving masks

2016-08-04 Thread wen...@mitsuba-renderer.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=72805

--- Comment #6 from Wenzel Jakob  ---
awesome, thanks!

[Bug tree-optimization/72824] New: [7 Regression] Signed floating point zero semantics broken at optimization level -O3 (tree-loop-distribute-patterns)

2016-08-06 Thread wen...@mitsuba-renderer.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=72824

Bug ID: 72824
   Summary: [7 Regression] Signed floating point zero semantics
broken at optimization level -O3
(tree-loop-distribute-patterns)
   Product: gcc
   Version: 7.0
Status: UNCONFIRMED
  Severity: major
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: wen...@mitsuba-renderer.org
  Target Milestone: ---

The trunk version of GCC has a regression which optimizes away signed zeros at
optimization level -O3. This should never happen unless more aggressive
optimization flags are specified (like -ffast-math or -fno-signed-zeros,
neither of which is part of -O3). Having correct signed zero semantics is
important for many scientific computing applications.

$ g++-7 test.cpp -o test -O2
$ ./test
-0.00 # < Correct

$ g++-7 test.cpp -o test -O3
$ ./test
0.00 # < Signed zero gone

It's possible to fix the issue by adding -fno-tree-loop-distribute-patterns, so
I assume that it is somehow related to this optimization.

Program to reproduce:

#include <stdio.h>

template 
struct Array {
Array(float value) {
for (size_t i = 0; i array(-0.f);
printf("%f\n", array.x[0]);
return 0;
}


This is with the latest trunk version of GCC:
$ g++-7 -v
Using built-in specs.
COLLECT_GCC=g++-7
COLLECT_LTO_WRAPPER=/usr/local/Cellar/gcc/HEAD-/libexec/gcc/x86_64-apple-darwin15.6.0/7.0.0/lto-wrapper
Target: x86_64-apple-darwin15.6.0
Configured with: ../configure --build=x86_64-apple-darwin15.6.0
--prefix=/usr/local/Cellar/gcc/HEAD-
--libdir=/usr/local/Cellar/gcc/HEAD-/lib/gcc/
--enable-languages=c,c++,objc,obj-c++,fortran --program-suffix=-
--with-gmp=/usr/local/opt/gmp --with-mpfr=/usr/local/opt/mpfr
--with-mpc=/usr/local/opt/libmpc --with-isl=/usr/local/opt/isl
--with-system-zlib --enable-libstdcxx-time=yes --enable-stage1-checking
--enable-checking=release --enable-lto --with-build-config=bootstrap-debug
--disable-werror --with-pkgversion='Homebrew gcc HEAD- --without-multilib'
--with-bugurl=https://github.com/Homebrew/homebrew/issues --enable-plugin
--disable-nls --disable-multilib
Thread model: posix
gcc version 7.0.0 20160804 (experimental) (Homebrew gcc HEAD-
--without-multilib)

[Bug target/72867] New: SSE/AVX/AVX512: incorrect optimization of VMINPS/VMAXPS at compile time

2016-08-10 Thread wen...@mitsuba-renderer.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=72867

Bug ID: 72867
   Summary: SSE/AVX/AVX512: incorrect optimization of
VMINPS/VMAXPS at compile time
   Product: gcc
   Version: 7.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: wen...@mitsuba-renderer.org
  Target Milestone: ---

The Intel intrinsics provide a family of functions for computing the minimum and
maximum of two floating point vectors of different SIMD widths.

For the most part, these are symmetric. They are not, however, when given a NaN
argument: in particular,

min(1, nan) == nan
min(nan, 1) == 1

Whether that is pretty is arguable, but it's what the hardware implements (and
numerical libraries depend on this behavior).

The program below computes the expected output at optimization level 0.

$ g++ test.c -o test -msse4.2 -O0 && ./test
min(1, nan) = [nan nan nan nan]
min(nan, 1) = [1.00 1.00 1.00 1.00]

At optimization level 1, the minimum is computed at compile time, and the NaN
value is incorrectly propagated. This problem occurs both on GCC trunk and on
GCC 5.0 (I have not tested other versions).

$ g++ test.c -o test -msse4.2 -O1 && ./test
min(1, nan) = [nan nan nan nan]
min(nan, 1) = [nan nan nan nan]

///  Program to reproduce the issue 

#include <immintrin.h>
#include <stdio.h>
#include <math.h>


int main(int argc, char *argv[]) {
__m128 x = _mm_min_ps(_mm_set1_ps(1.f), _mm_set1_ps(NAN));
printf("min(1, nan) = [%f %f %f %f]\n", x[0], x[1], x[2], x[3]);

x = _mm_min_ps(_mm_set1_ps(NAN), _mm_set1_ps(1.f));
printf("min(nan, 1) = [%f %f %f %f]\n", x[0], x[1], x[2], x[3]);

return 0;
}

[Bug target/72782] AVX512: No support for scalar broadcasts

2016-08-10 Thread wen...@mitsuba-renderer.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=72782

--- Comment #1 from Wenzel Jakob  ---
Looks like this issue was first reported in 2014 but got stuck -- see Bug
63351.

[Bug target/63351] Optimization: contract broadcast intrinsics when AVX512 is enabled

2016-08-10 Thread wen...@mitsuba-renderer.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63351

Wenzel Jakob  changed:

   What|Removed |Added

 CC||wen...@mitsuba-renderer.org

--- Comment #5 from Wenzel Jakob  ---
Any news on this? I've also run into GCC's lack of broadcast support (Bug
72782).

[Bug tree-optimization/72824] [5/6 Regression] Signed floating point zero semantics broken at optimization level -O3 (tree-loop-distribute-patterns)

2016-08-10 Thread wen...@mitsuba-renderer.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=72824

--- Comment #8 from Wenzel Jakob  ---
Thank you, I can confirm that the issue is fixed on my end.

[Bug target/73350] New: AVX512: GCC optimizes away rounding flags

2016-08-10 Thread wen...@mitsuba-renderer.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=73350

Bug ID: 73350
   Summary: AVX512: GCC optimizes away rounding flags
   Product: gcc
   Version: 7.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: wen...@mitsuba-renderer.org
  Target Milestone: ---

The AVX512 instruction set introduced the ability to specify a rounding flag
for almost every arithmetic operation that is subject to rounding. This is
extremely useful because it eliminates the need to mess around with the MXCSR
control register when using tools like interval arithmetic that need control of
rounding.

Unfortunately, support for this is currently broken in GCC. Specifically, the
GCC optimizer does not seem to distinguish between function variants with
different rounding modes and ends up merging them during common subexpression
elimination.

Consider the simple program attached below, which computes "1 + pi" with +inf
and -inf rounding modes and then prints the difference of these values. The
expected output is:

$ g++ test.c -o test -mavx512f -O0 -fomit-frame-pointer -fomit-frame-pointer &&
./test
-4.76837e-07

At optimization level, -O1, this currently stops working (tested with GCC
trunk):

$ g++ test.c -o test -mavx512f -O0 -fomit-frame-pointer -fomit-frame-pointer &&
./test
-4.76837e-07

Looking at the assembly, there are two surprising things: first, common
subexpression elimination seems to have (partially) merged the two additions.
The second add is still generated but its result is never used.

The other weird thing is that GCC decides to fill a mask register with '-1' and
then use the masked versions of these operations instead of using the unmasked
versions, which use a "-1" mask by default.

_main:
leaq    8(%rsp), %r10
andq    $-64, %rsp
pushq   -8(%r10)
pushq   %rbp
movq    %rsp, %rbp
pushq   %r10
subq    $40, %rsp
movl    $-1, %eax
kmovw   %eax, %k1
vbroadcastss    LC0(%rip), %zmm1
vbroadcastss    LC1(%rip), %zmm2
vaddps  {rd-sae}, %zmm2, %zmm1, %zmm0{%k1}{z} <-- Why use mask?
vaddps  {ru-sae}, %zmm2, %zmm1, %zmm1{%k1}{z}
vsubss  %xmm0, %xmm0, %xmm0   <-- xmm0 ??
vcvtss2sd   %xmm0, %xmm0, %xmm0
leaq    LC2(%rip), %rdi
movl    $1, %eax
call    _printf
movl    $0, %eax
addq    $40, %rsp
popq    %r10
popq    %rbp
leaq    -8(%r10), %rsp
ret

// == Program to reproduce 

#include <immintrin.h>
#include <stdio.h>
#include <math.h>

int main(int argc, char *argv[]) {
__m512 a = _mm512_set1_ps((float) M_PI);
__m512 b = _mm512_set1_ps((float) 1.f);

__m512 result1 = _mm512_add_round_ps(a, b, (_MM_FROUND_TO_NEG_INF |
_MM_FROUND_NO_EXC));
__m512 result2 = _mm512_add_round_ps(a, b, (_MM_FROUND_TO_POS_INF |
_MM_FROUND_NO_EXC));

printf("%g\n", result1[0] - result2[0]);

return 0;
}

[Bug target/73350] AVX512: GCC optimizes away rounding flags

2016-08-11 Thread wen...@mitsuba-renderer.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=73350

--- Comment #1 from Wenzel Jakob  ---
Sorry, there was a stupid typo in my message below. The middle part should have
read

At optimization level, -O1, this currently stops working (tested with GCC
trunk):

$ g++ test.c -o test -mavx512f -O1 -fomit-frame-pointer -fomit-frame-pointer &&
./test
0.0

[Bug target/76342] New: AVX512: _mm512_undefined_epi32() intrinsic missing (incorrectly named _mm512_undefined_si512)

2016-08-13 Thread wen...@mitsuba-renderer.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=76342

Bug ID: 76342
   Summary: AVX512: _mm512_undefined_epi32() intrinsic missing
(incorrectly named _mm512_undefined_si512)
   Product: gcc
   Version: 7.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: wen...@mitsuba-renderer.org
  Target Milestone: ---

Consider the following snippet:

// --

#include <immintrin.h>
__m512 test() { return _mm512_undefined_epi32(); }

// --

When compiled with GCC trunk, this yields the following error message:

test.cpp: In function '__m512 test()':
test.cpp:3:24: error: '_mm512_undefined_epi32' was not declared in this scope
 __m512 test() { return _mm512_undefined_epi32(); }
^~
test.cpp:3:24: note: suggested alternative: '_mm512_undefined_si512'
 __m512 test() { return _mm512_undefined_epi32(); }
^~
_mm512_undefined_si512

However, there is no _mm512_undefined_si512 intrinsic. It is called
_mm512_undefined_epi32. See here for details:
https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm512_undefined_epi32&expand=5509

[Bug target/76342] AVX512: _mm512_undefined_epi32() intrinsic missing (incorrectly named _mm512_undefined_si512)

2016-08-14 Thread wen...@mitsuba-renderer.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=76342

Wenzel Jakob  changed:

   What|Removed |Added

 Status|UNCONFIRMED |RESOLVED
 Resolution|--- |FIXED

--- Comment #3 from Wenzel Jakob  ---
Great, thank you!

[Bug target/76731] New: [AVX512] _mm512_i32gather_epi32 and other scatter/gather routines have incorrect signature

2016-08-14 Thread wen...@mitsuba-renderer.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=76731

Bug ID: 76731
   Summary: [AVX512] _mm512_i32gather_epi32 and other
scatter/gather routines have incorrect signature
   Product: gcc
   Version: 7.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: wen...@mitsuba-renderer.org
  Target Milestone: ---

All of the scatter/gather intrinsics in avx512fintrin.h use int/float/double
pointers, which is incorrect.

For instance:

extern __inline __m512i
__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
_mm512_i32gather_epi32 (__m512i __index, int const *__addr, int __scale)

These should use void*/const void* pointers according to Intel (see e.g.
https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm512_i32gather_epi32&expand=2778,2777)

This is a departure from prior mask/gather intrinsics, where type information
turned out to be a bad idea for various reasons (e.g. aliasing analysis)

[Bug target/76731] [AVX512] _mm512_i32gather_epi32 and other scatter/gather routines have incorrect signature

2016-08-22 Thread wen...@mitsuba-renderer.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=76731

--- Comment #2 from Wenzel Jakob  ---
+1 this looks great!