[Bug target/88798] New: AVX512BW code does not use bit-operations that work on mask registers

2019-01-10 Thread wojciech_mula at poczta dot onet.pl
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88798

Bug ID: 88798
   Summary: AVX512BW code does not use bit-operations that work on
mask registers
   Product: gcc
   Version: 8.2.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: wojciech_mula at poczta dot onet.pl
  Target Milestone: ---

Hi!

AVX512BW-related issue: the C compiler generates superfluous moves from 64-bit
mask registers to 64-bit GPRs and then performs the basic bit-ops there, while
AVX512BW supports bit-ops directly on mask registers (instructions: korq, kandq,
kxorq).

I guess the main reason is that C does not define a bit-or for the type __mmask64,
so there's always an implicit conversion to uint64_t.

Below is a sample program compiled for Cannon Lake --- the CPU does have
(at least) AVX512BW, AVX512VBMI and AVX512VL.

---perf.c---
#include <immintrin.h>
#include <stdint.h>

uint64_t any_whitespace(__m512i string) {
return _mm512_cmpeq_epu8_mask(string, _mm512_set1_epi8(' '))
 | _mm512_cmpeq_epu8_mask(string, _mm512_set1_epi8('\n'))
 | _mm512_cmpeq_epu8_mask(string, _mm512_set1_epi8('\r'));
}
---eof--

$ gcc --version
gcc (Debian 8.2.0-13) 8.2.0

$ gcc perf.c -O3 -march=cannonlake -S
$ cat perf.s # redacted
any_whitespace:
vpcmpub $0, .LC0(%rip), %zmm0, %k1
vpcmpub $0, .LC1(%rip), %zmm0, %k2
vpcmpub $0, .LC2(%rip), %zmm0, %k3
kmovq   %k1, %rcx
kmovq   %k2, %rdx
orq %rcx, %rdx
kmovq   %k3, %rax
orq %rdx, %rax
vzeroupper
ret

I'd rather expect to get something like:

any_whitespace:
vpcmpub $0, .LC0(%rip), %zmm0, %k1
vpcmpub $0, .LC1(%rip), %zmm0, %k2
vpcmpub $0, .LC2(%rip), %zmm0, %k3
korq    %k1, %k2, %k1
korq    %k1, %k3, %k3
kmovq   %k3, %rax
vzeroupper
ret
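
For what it's worth, here is a sketch of a source-level workaround (my code; it assumes the AVX512BW mask intrinsic `_kor_mask64` is provided by the compiler) that keeps the OR in the mask domain explicitly:

#include <immintrin.h>
#include <stdint.h>

/* sketch: use the mask-register OR intrinsic instead of `|`, so no implicit
   conversion to uint64_t happens between the compares (assumes _kor_mask64
   is available) */
uint64_t any_whitespace_kor(__m512i string) {
    __mmask64 m = _kor_mask64(
        _mm512_cmpeq_epu8_mask(string, _mm512_set1_epi8(' ')),
        _mm512_cmpeq_epu8_mask(string, _mm512_set1_epi8('\n')));
    m = _kor_mask64(m, _mm512_cmpeq_epu8_mask(string, _mm512_set1_epi8('\r')));
    return m;  /* __mmask64 converts to uint64_t */
}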

best regards
Wojciech

[Bug target/88798] AVX512BW code does not use bit-operations that work on mask registers

2019-01-11 Thread wojciech_mula at poczta dot onet.pl
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88798

--- Comment #3 from Wojciech Mula  ---
Sorry, I didn't find that bug; I think you may close this one.

BTW, I had checked the code on godbolt.org before submitting. I also tested
with their "GCC (trunk)", but the generated code is the same as for 8.2. The
trunk's version is "g++ (GCC-Explorer-Build) 9.0.0 20190109 (experimental)" --
it seems to be a fresh build and should already include the fixes Andrew
mentioned.

[Bug tree-optimization/88868] New: [SSE] pshufb can be omitted for a specific pattern

2019-01-15 Thread wojciech_mula at poczta dot onet.pl
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88868

Bug ID: 88868
   Summary: [SSE] pshufb can be omitted for a specific pattern
   Product: gcc
   Version: 8.2.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: wojciech_mula at poczta dot onet.pl
  Target Milestone: ---

The SSSE3 instruction PSHUFB (and its AVX2 counterpart VPSHUFB) acts as a
no-operation when its shuffle argument is the sequence 0..15. Such an invocation
does not alter the shuffled register, thus the PSHUFB can be safely omitted.

BTW, clang does this optimization, but ICC doesn't.

---pshufb.c---
#include <immintrin.h>

__m128i shuffle(__m128i x) {
const __m128i noop = _mm_setr_epi8(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,
12, 13, 14, 15);
return _mm_shuffle_epi8(x, noop);
}
---eof---

$ gcc --version
gcc (Debian 8.2.0-13) 8.2.0

$ gcc -O3 -march=skylake -S pshufb.c 
$ cat pshufb.s
shuffle:
vpshufb .LC0(%rip), %xmm0, %xmm0
ret
.LC0:
.byte   0
.byte   1
.byte   2
.byte   3
.byte   4
.byte   5
.byte   6
.byte   7
.byte   8
.byte   9
.byte   10
.byte   11
.byte   12
.byte   13
.byte   14
.byte   15

An expected output:

shuffle:
ret

[Bug target/88916] New: [x86] suboptimal code generated for integer comparisons joined with boolean operators

2019-01-18 Thread wojciech_mula at poczta dot onet.pl
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88916

Bug ID: 88916
   Summary: [x86] suboptimal code generated for integer
comparisons joined with boolean operators
   Product: gcc
   Version: 8.2.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: wojciech_mula at poczta dot onet.pl
  Target Milestone: ---

Let's consider these two simple, yet pretty useful functions:

--test.c---
int both_nonnegative(long a, long b) {
return (a >= 0) && (b >= 0);
}

int both_nonzero(unsigned long a, unsigned long b) {
return (a > 0) && (b > 0);
}
---eof--

$ gcc --version
gcc (Debian 8.2.0-13) 8.2.0

$ gcc -O3 test.c -march=skylake -S
$ cat test.s
both_nonnegative:
notq    %rdi
movq    %rdi, %rax
notq    %rsi
shrq    $63, %rax
shrq    $63, %rsi
andl    %esi, %eax
ret

both_nonzero:
testq   %rdi, %rdi
setne   %al
xorl    %edx, %edx
testq   %rsi, %rsi
setne   %dl
andl    %edx, %eax
ret

I checked different target machines (haswell, broadwell and cannonlake);
the result remained the same. GCC trunk on godbolt.org also
produces the same assembly code.

The first function, `both_nonnegative`, can be rewritten as:

(((unsigned long)(a) | (unsigned long)(b)) >> 63) ^ 1

yielding something like this:

both_nonnegative:
orq %rsi, %rdi
movq    %rdi, %rax
shrq    $63, %rax
xorl    $1, %eax
ret

It's also possible to use the expression
`(long)((unsigned long)a | (unsigned long)b) < 0`,
but the assembly output is almost the same.
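
For reference, a minimal C sketch of that rewrite of the first function (the function name is mine):

int both_nonnegative2(long a, long b) {
    return (((unsigned long)a | (unsigned long)b) >> 63) ^ 1;
}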

The condition from `both_nonzero` can be expressed as:

((unsigned long)a | (unsigned long)b) != 0

GCC compiles it to:

both_nonzero:
xorl    %eax, %eax
orq %rsi, %rdi
setne   %al
retq

[Bug tree-optimization/88916] [x86] suboptimal code generated for integer comparisons joined with boolean operators

2019-01-21 Thread wojciech_mula at poczta dot onet.pl
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88916

--- Comment #2 from Wojciech Mula  ---
(In reply to Richard Biener from comment #1)
> Confirmed.

The first case is OK, but the second (for `both_nonzero`) is obviously wrong.
Sorry for that.

[Bug tree-optimization/88916] [x86] suboptimal code generated for integer comparisons joined with boolean operators

2019-01-22 Thread wojciech_mula at poczta dot onet.pl
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88916

--- Comment #3 from Wojciech Mula  ---
A similar case:

---sign.c---
int different_sign(long a, long b) {
return (a >= 0 && b < 0) || (a < 0 && b >= 0);
}
---eof--

This is compiled into:

different_sign:
notq    %rdi
movq    %rdi, %rax
notq    %rsi
shrq    $63, %rax
shrq    $63, %rsi
xorl    %esi, %eax
movzbl  %al, %eax
ret

When expressed as difference of the sign bits

((unsigned long)a ^ (unsigned long)b) >> 63

the code is way shorter:

different_sign:
xorq    %rsi, %rdi
movq    %rdi, %rax
shrq    $63, %rax
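
For reference, a C sketch of that source-level rewrite (the function name is mine):

int different_sign2(long a, long b) {
    return ((unsigned long)a ^ (unsigned long)b) >> 63;
}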

BTW, I looked at the ARM assembly, and GCC also emits
two shifts there, so the observed behaviour is not limited
to a single target.

[Bug tree-optimization/89018] New: common subexpression present in both branches of condition is not factored out

2019-01-23 Thread wojciech_mula at poczta dot onet.pl
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89018

Bug ID: 89018
   Summary: common subexpression present in both branches of
condition is not factored out
   Product: gcc
   Version: 9.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: wojciech_mula at poczta dot onet.pl
  Target Milestone: ---

A common subexpression shared by both branches of a C conditional expression is
not detected, and code is duplicated. Below are a few examples:

---condition.c---
long transform(long);

long negative_max(long a, long b) {
return (a >= b) ? -a : -b;
}

long fun_max(long a, long b) {
return (a >= b) ? 47*a : 47*b;
}

long transform_max(long a, long b) {
return (a >= b) ? transform(a) : transform(b);
}
---eof---

In both branches a scalar value is part of the same expression. So it
would be more profitable if, for instance, "(a >= b) ? -a : -b" were
compiled as "-((a >= b) ? a : b)". Of course, a programmer might factor it
out, but in the case of macros or auto-generated code such repetition
might occur.
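
For instance, here is a sketch of the manually factored source that avoids the duplication (the name is mine):

long negative_max_factored(long a, long b) {
    return -((a >= b) ? a : b);
}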

Below is the assembly code generated for x86 by a pretty fresh GCC 9.
BTW, the jump instruction in `fun_max` and `transform_max` could
be replaced with a conditional move.

$ gcc --version
gcc (GCC) 9.0.0 20190117 (experimental)

$ gcc -O3 -march=skylake -c -S condition.c && cat condition.s

negative_max:
movq    %rdi, %rdx
movq    %rsi, %rax
negq    %rdx
negq    %rax
cmpq    %rsi, %rdi
cmovge  %rdx, %rax
ret

fun_max:
cmpq    %rsi, %rdi
jl  .L6
imulq   $47, %rdi, %rax
ret
.L6:
imulq   $47, %rsi, %rax
ret

transform_max:
cmpq    %rsi, %rdi
jge .L11
movq    %rsi, %rdi
.L11:
jmp transform

[Bug target/89063] New: [x86] lack of support for BEXTR from BMI extension

2019-01-25 Thread wojciech_mula at poczta dot onet.pl
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89063

Bug ID: 89063
   Summary: [x86] lack of support for BEXTR from BMI extension
   Product: gcc
   Version: 9.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: wojciech_mula at poczta dot onet.pl
  Target Milestone: ---

The BEXTR instruction extracts an arbitrary unsigned bit field from a 32- or 64-bit
value. As far as I can see in `config/i386.md`, there's support only for the immediate
variant available in AMD's TBM (TARGET_TBM).

Intel's variant takes its parameters from a register. Although this variant
won't be profitable in all cases -- we need an extra move to set up
the bit-field parameters in a register -- I bet bit-field-intensive
code might benefit from BEXTR.
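
For reference, a rough C sketch of what BEXTR computes (my code, not GCC's pattern; the register operand packs the start position in bits 7:0 and the field length in bits 15:8, which matches the 0x60b/0x616 constants in the dump below):

/* rough sketch of BEXTR semantics, ignoring the start/len >= 64 corner cases */
static inline uint64_t bextr64_ref(uint64_t src, unsigned start, unsigned len) {
    return (src >> start) & ((len < 64) ? ((1ULL << len) - 1) : ~0ULL);
}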

---bextr.c---
#include <stdint.h>
#include <x86intrin.h>

uint64_t test(uint64_t x) {
const uint64_t a0 = (x & 0x3f);
const uint64_t a1 = (x >> 11) & 0x3f;
const uint64_t a2 = (x >> 22) & 0x3f;
return a0 + a1 + a2;
}

uint64_t test_intrinsics(uint64_t x) {
const uint64_t a0 = (x & 0x3f);
const uint64_t a1 = _bextr_u64(x, 11, 6);
const uint64_t a2 = _bextr_u64(x, 22, 6);
return a0 + a1 + a2;
}
---eof---

$ gcc --version
gcc (GCC) 9.0.0 20190117 (experimental)

$ gcc -O3 -mbmi -march=skylake bextr.c -c && objdump -d bextr.o

0000000000000000 <test>:
   0:   48 89 fa                mov    %rdi,%rdx
   3:   48 c1 ea 0b             shr    $0xb,%rdx
   7:   48 89 f8                mov    %rdi,%rax
   a:   48 89 d1                mov    %rdx,%rcx
   d:   48 c1 e8 16             shr    $0x16,%rax
  11:   83 e0 3f                and    $0x3f,%eax
  14:   83 e1 3f                and    $0x3f,%ecx
  17:   48 8d 14 01             lea    (%rcx,%rax,1),%rdx
  1b:   83 e7 3f                and    $0x3f,%edi
  1e:   48 8d 04 3a             lea    (%rdx,%rdi,1),%rax
  22:   c3                      retq

0000000000000030 <test_intrinsics>:
  30:   b8 0b 06 00 00          mov    $0x60b,%eax
  35:   c4 e2 f8 f7 d7          bextr  %rax,%rdi,%rdx
  3a:   b8 16 06 00 00          mov    $0x616,%eax
  3f:   c4 e2 f8 f7 c7          bextr  %rax,%rdi,%rax
  44:   83 e7 3f                and    $0x3f,%edi
  47:   48 01 d0                add    %rdx,%rax
  4a:   48 01 f8                add    %rdi,%rax
  4d:   c3                      retq

[Bug target/89081] New: [x86] suboptimal code generated for condition expression returning negation

2019-01-27 Thread wojciech_mula at poczta dot onet.pl
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89081

Bug ID: 89081
   Summary: [x86] suboptimal code generated for condition
expression returning negation
   Product: gcc
   Version: 9.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: wojciech_mula at poczta dot onet.pl
  Target Milestone: ---

Let's consider this trivial function:

---clamp.c---
#include <stdint.h>

uint64_t clamp1(int64_t x) {
return (x < 0) ? -x : 0;
}
---eof---

$ gcc --version
gcc (GCC) 9.0.0 20190117 (experimental)

$ gcc -O3 -march=skylake clamp.c -c -S && cat clamp.s
clamp1:
movq    %rdi, %rax
negq    %rax
movl    $0, %edx
testq   %rdi, %rdi
cmovns  %rdx, %rax
ret

This procedure can be way shorter, like this

clamp1:
xorq   %rax, %rax # res = 0
negq   %rdi   # -x, sets SF
cmovns %rdi, %rax
ret

One thing I observed recently when looking at assembly is that GCC
never modifies the input registers %rdi or %rsi, it always makes
copies of them -- thus the proposed shorter version is not possible.
However, clang does modify these registers, so it seems the ABI allows this.

[Bug target/85832] New: [AVX512] possible shorter code when comparing with vector of zeros

2018-05-18 Thread wojciech_mula at poczta dot onet.pl
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85832

Bug ID: 85832
   Summary: [AVX512] possible shorter code when comparing with
vector of zeros
   Product: gcc
   Version: 7.3.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: wojciech_mula at poczta dot onet.pl
  Target Milestone: ---

Consider this simple function, which yields a mask for non-zero elements:

---cat cmp.c---
#include <immintrin.h>

int fun(__m512i x) {
return _mm512_cmpeq_epi32_mask(x, _mm512_setzero_si512());
}
---eof

$ gcc --version
gcc (Debian 7.3.0-16) 7.3.0

$ gcc -O2 -S -mavx512f cmp.c && cat cmp.s
fun:
vpxord  %zmm1, %zmm1, %zmm1 # <<< HERE
vpcmpeqd    %zmm1, %zmm0, %k1   # <<<
kmovw   %k1, %eax
vzeroupper
ret

Also 8.1.0 generates the same code (as checked on godbolt.org).

The pair of instructions VPXORD/VPCMPEQD can be replaced with a single
VPTESTMD %zmm0, %zmm0.  VPTESTMD computes k1 := zmm0 AND zmm0, which is
sufficient to compare zmm0 with zero.

[Bug target/85833] New: [AVX512] use mask registers instructions instead of scalar code

2018-05-18 Thread wojciech_mula at poczta dot onet.pl
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85833

Bug ID: 85833
   Summary: [AVX512] use mask registers instructions instead of
scalar code
   Product: gcc
   Version: 7.3.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: wojciech_mula at poczta dot onet.pl
  Target Milestone: ---

Here is a simple function, which checks whether there is any non-zero element
in a vector:

---ktest.c---
#include <immintrin.h>

int anynonzero_epi32(__m512i x) {
const __m512i   zero = _mm512_setzero_si512();
const __mmask16 mask = _mm512_cmpneq_epi32_mask(x, zero);
return mask != 0;
}
---eof---

$ gcc --version
gcc (Debian 7.3.0-16) 7.3.0

$ gcc -O2 -S -mavx512f ktest.c && cat ktest.s

anynonzero_epi32:
vpxord  %zmm1, %zmm1, %zmm1
vpcmpd  $4, %zmm1, %zmm0, %k1
kmovw   %k1, %eax   # <<< HERE
testw   %ax, %ax    #
setne   %al
movzbl  %al, %eax
vzeroupper
ret

The problem is that GCC copies the content of the mask register k1 into
a GPR (using a KMOV instruction) and then performs the test. AVX512F has got
the instruction KTEST kx, ky which sets ZF and CF:

ZF = (kx AND ky) == 0
CF = (kx AND NOT ky) == 0

In this case we might use KTEST k1, k1 to set ZF when k1 == 0.
The procedure might then be compiled as:

anynonzero_epi32:
vpxord  %zmm1, %zmm1, %zmm1
vpcmpd  $4, %zmm1, %zmm0, %k1
xor %eax, %eax  #
ktestw  %k1, %k1    #
setne   %al #
vzeroupper
ret

[Bug target/85833] [AVX512] use mask registers instructions instead of scalar code

2018-05-22 Thread wojciech_mula at poczta dot onet.pl
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85833

--- Comment #3 from Wojciech Mula  ---
Uroš, thank you very much. I didn't pay attention to the AVX512 variant, as I
thought this was such a basic instruction that it would be available starting from AVX512F.

[Bug target/85073] New: [x86] extra check after BLSR

2018-03-25 Thread wojciech_mula at poczta dot onet.pl
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85073

Bug ID: 85073
   Summary: [x86] extra check after BLSR
   Product: gcc
   Version: 7.3.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: wojciech_mula at poczta dot onet.pl
  Target Milestone: ---

GCC is able to use the BLSR instruction in place of the expression (x - 1) & x
[which is REALLY nice, thank you :)], but it does not utilize the CPU flags set by
that instruction. Below is a simple example.

--bmi1.c--
int popcount(unsigned x) {
int c = 0;
while (x) {
c += 1;
x = (x - 1) & x;
}

return c;
}
--eof--

$ gcc --version
gcc (Debian 7.3.0-11) 7.3.0

$ gcc -march=skylake -O3 -S bmi1.c && cat bmi1.s

popcount:
.LFB0:
xorl    %eax, %eax
testl   %edi, %edi
je  .L4
.L3:
addl    $1, %eax
blsr    %edi, %edi <<< HERE
testl   %edi, %edi <<< and HERE
jne .L3
ret
.L4:
ret

BLSR sets the ZF flag if the result is zero. The subsequent TEST instruction is
not needed.

[Bug target/88798] AVX512BW code does not use bit-operations that work on mask registers

2022-01-31 Thread wojciech_mula at poczta dot onet.pl via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88798

--- Comment #6 from Wojciech Mula  ---
Hongtao, thank you for your patch and for pinging back! I checked the code from
this issue against version 11.2.0 (Debian 11.2.0-14), but there are still
KMOVQs before any bit ops are performed. Here is the output from `gcc -O3
-march=icelake-server -S`:

vpcmpub $0, .LC0(%rip), %zmm0, %k0
vpcmpub $0, .LC1(%rip), %zmm0, %k1
vpcmpub $0, .LC2(%rip), %zmm0, %k2
kmovq   %k0, %rcx
kmovq   %k1, %rax
orq %rcx, %rax
kmovq   %k2, %rdx
orq %rdx, %rax
ret

[Bug target/88798] AVX512BW code does not use bit-operations that work on mask registers

2022-02-07 Thread wojciech_mula at poczta dot onet.pl via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88798

--- Comment #8 from Wojciech Mula  ---
Thank you for the answer. So my question is: is it possible to delay the
conversion from kmasks into ints? I'm not a language lawyer, but I guess `x
binop y` has to be treated as `(int)x binop (int)y`. If that's true, we would have
to prove that `(int)(x avx512-binop y)` is equivalent to the latter expression.

[Bug target/114172] [13 only] ICE with riscv rvv VSETVL intrinsic

2024-03-28 Thread wojciech_mula at poczta dot onet.pl via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114172

Wojciech Mula changed:

           What    |Removed |Added
             CC    |        |wojciech_mula at poczta dot onet.pl

--- Comment #2 from Wojciech Mula  ---
Checked 13.2 from Debian:

$ riscv64-linux-gnu-gcc --version
riscv64-linux-gnu-gcc (Debian 13.2.0-12) 13.2.0

For Bruce's testcase the following invocation triggers a segfault (at -O1 and
-O2 there is no error):

$ riscv64-linux-gnu-gcc -march=rv64gcv -c 1.c -O3

Below is just the bottom of the stack obtained with gdb. There's an infinite
recursion somewhere around `riscv_vector::avl_info::operator==`.

#629078 0x00fa3372 in
riscv_vector::avl_info::operator==(riscv_vector::avl_info const&) const ()
#629079 0x00fa37f9 in ?? ()
#629080 0x00fa2543 in ?? ()
#629081 0x00fa3372 in
riscv_vector::avl_info::operator==(riscv_vector::avl_info const&) const ()
#629082 0x00fa37f9 in ?? ()
#629083 0x00fa2543 in ?? ()
#629084 0x00fa3372 in
riscv_vector::avl_info::operator==(riscv_vector::avl_info const&) const ()
#629085 0x00fa37f9 in ?? ()
#629086 0x00fa2543 in ?? ()
#629087 0x00fa3372 in
riscv_vector::avl_info::operator==(riscv_vector::avl_info const&) const ()
#629088 0x00fa37f9 in ?? ()
#629089 0x00fa2543 in ?? ()
#629090 0x00fa3372 in
riscv_vector::avl_info::operator==(riscv_vector::avl_info const&) const ()
#629091 0x00fa394b in ?? ()
#629092 0x00f9f588 in
riscv_vector::vector_insn_info::compatible_p(riscv_vector::vector_insn_info
const&) const ()
#629093 0x00fa0eb9 in
pass_vsetvl::compute_local_backward_infos(rtl_ssa::bb_info const*) ()
#629094 0x00fa8c6b in pass_vsetvl::lazy_vsetvl() ()
#629095 0x00fa8e1f in pass_vsetvl::execute(function*) ()
#629096 0x00b5e21b in execute_one_pass(opt_pass*) ()
#629097 0x00b5eac0 in ?? ()
#629098 0x00b5ead2 in ?? ()
#629099 0x00b5ead2 in ?? ()
#629100 0x00b5eaf9 in execute_pass_list(function*, opt_pass*) ()
#629101 0x00822588 in cgraph_node::expand() ()
#629102 0x00823afb in ?? ()
#629103 0x00825fd8 in symbol_table::finalize_compilation_unit() ()
#629104 0x00c29bad in ?? ()
#629105 0x006a4c97 in toplev::main(int, char**) ()
#629106 0x006a6a8b in main ()

[Bug c++/114747] New: [RISC-V RVV] Wrong SEW set for mixed-size intrinsics

2024-04-16 Thread wojciech_mula at poczta dot onet.pl via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114747

Bug ID: 114747
   Summary: [RISC-V RVV] Wrong SEW set for mixed-size intrinsics
   Product: gcc
   Version: 13.2.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: wojciech_mula at poczta dot onet.pl
  Target Milestone: ---

This is a distilled procedure from the simdutf project:

---
#include <riscv_vector.h>
#include <cstdint>
#include <cstddef>

size_t convert_latin1_to_utf16le(const char *src, size_t len, char16_t *dst) {
  char16_t *beg = dst;
  for (size_t vl; len > 0; len -= vl, src += vl, dst += vl) {
vl = __riscv_vsetvl_e8m4(len);
vuint8m4_t v = __riscv_vle8_v_u8m4((uint8_t*)src, vl);
__riscv_vse16_v_u16m8((uint16_t*)dst, __riscv_vzext_vf2_u16m8(v, vl), vl);
  }
  return dst - beg;
}
---

When compiled with gcc 13.2.0 with flags "-march=rv64gcv -O2" it sets a wrong
SEW:

---
convert_latin1_to_utf16le(char const*, unsigned long, char16_t*):
beq a1,zero,.L4
mv  a4,a2
.L3:
vsetvli a5,a1,e8,m4,ta,ma  # set SEW=8
vle8.v  v8,0(a0)
slli    a3,a5,1
vzext.vf2   v24,v8 # illegal instruction, as SEW/2 < 8
sub a1,a1,a5
vse16.v v24,0(a4)
add a0,a0,a5
add a4,a4,a3
bne a1,zero,.L3
sub a0,a4,a2
srai    a0,a0,1
ret
.L4:
li  a0,0
ret
---

The trunk available on godbolt.org (riscv64-unknown-linux-gnu-g++ 14.0.1
20240415) emits vsetvli with an e16 argument, which seems to be fine.

[Bug target/114809] New: [RISC-V RVV] Counting elements might be simpler

2024-04-22 Thread wojciech_mula at poczta dot onet.pl via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114809

Bug ID: 114809
   Summary: [RISC-V RVV] Counting elements might be simpler
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: wojciech_mula at poczta dot onet.pl
  Target Milestone: ---

Consider this simple procedure

---
#include <stddef.h>
#include <stdint.h>

size_t count_chars(const char *src, size_t len, char c) {
size_t count = 0;
for (size_t i=0; i < len; i++) {
count += src[i] == c;
}

return count;
}
---

Assembly for it (GCC 14.0, -march=rv64gcv -O3):

---
count_chars(char const*, unsigned long, char):
beq a1,zero,.L4
vsetvli a4,zero,e8,mf8,ta,ma
vmv.v.x v2,a2
vsetvli zero,zero,e64,m1,ta,ma
vmv.v.i v1,0
.L3:
vsetvli a5,a1,e8,mf8,ta,ma
vle8.v  v0,0(a0)
sub a1,a1,a5
add a0,a0,a5
vmseq.vv    v0,v0,v2
vsetvli zero,zero,e64,m1,tu,mu
vadd.vi v1,v1,1,v0.t
bne a1,zero,.L3
vsetvli a5,zero,e64,m1,ta,ma
li  a4,0
vmv.s.x v2,a4
vredsum.vs  v1,v1,v2
vmv.x.s a0,v1
ret
.L4:
li  a0,0
ret
---

The counting procedure might use `vcpop.m` instead of updating a vector of
counters (`v1`) and summing it at the end. This would move all the mode switches
outside the loop; a sketch follows.
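
A sketch of that alternative at the intrinsics level (my code; it assumes the standard RVV intrinsics `__riscv_vmseq` / `__riscv_vcpop` are available):

---
#include <riscv_vector.h>
#include <stddef.h>
#include <stdint.h>

/* sketch: accumulate the match count with vcpop.m in a scalar,
   so there is no e64 vector accumulator and no mode switch inside the loop */
size_t count_chars_vcpop(const char *src, size_t len, char c) {
    size_t count = 0;
    for (size_t vl; len > 0; len -= vl, src += vl) {
        vl = __riscv_vsetvl_e8m1(len);
        vuint8m1_t v = __riscv_vle8_v_u8m1((const uint8_t *)src, vl);
        vbool8_t eq = __riscv_vmseq_vx_u8m1_b8(v, (uint8_t)c, vl);
        count += __riscv_vcpop_m_b8(eq, vl);
    }
    return count;
}
---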

And there's a missing peephole optimization:

li  a4,0
vmv.s.x v2,a4

It can be:

vmv.s.x v2,zero

[Bug target/117421] New: [RISCV] Use byte comparison instead of word comparison

2024-11-02 Thread wojciech_mula at poczta dot onet.pl via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117421

Bug ID: 117421
   Summary: [RISCV] Use byte comparison instead of word comparison
   Product: gcc
   Version: 15.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: wojciech_mula at poczta dot onet.pl
  Target Milestone: ---

Consider this simple function:

---
#include <string_view>

bool ext_is_gzip(std::string_view ext) {
return ext == "gzip";
}
---

For the x86 target, GCC nicely inlines the compile-time constant and produces
code like this (from GCC 15, with `-O3 -march=icelake-server`):

---
ext_is_gzip(std::basic_string_view<char, std::char_traits<char> >):
xorl%eax, %eax
cmpq$4, %rdi
je  .L5
ret
.L5:
cmpl$1885960807, (%rsi)
sete%al
ret
---

However, for the RISC-V target, GCC emits a plain byte-by-byte comparison
(riscv64-unknown-linux-gnu-g++ (crosstool-NG UNKNOWN) 15.0.0 20241031
(experimental), with `-O3 -march=rv64gcv`):

---
ext_is_gzip(std::basic_string_view<char, std::char_traits<char> >):
addi    sp,sp,-16
sd  a0,0(sp)
sd  a1,8(sp)
li  a5,4
beq a0,a5,.L9
li  a0,0
addi    sp,sp,16
jr  ra
.L9:
lbu a4,0(a1)
li  a5,103
beq a4,a5,.L10
.L3:
li  a0,1
.L4:
xori    a0,a0,1
addi    sp,sp,16
jr  ra
.L10:
lbu a4,1(a1)
li  a5,122
bne a4,a5,.L3
lbu a4,2(a1)
li  a5,105
bne a4,a5,.L3
lbu a4,3(a1)
li  a5,112
li  a0,0
beq a4,a5,.L4
li  a0,1
j   .L4
---

My wild guess is that synthesizing huge compile-time constants has a high
default cost on RISC-V. However, when I checked what is emitted for
"gzip" & "pizg" given as u32 constants, I got:

---
   0:   677a7537                lui     a0,0x677a7
   4:   9705051b                addiw   a0,a0,-1680 # 677a6970

   8:   70698537                lui     a0,0x70698
   c:   a675051b                addiw   a0,a0,-1433 # 70697a67
---

A godbolt link for convenience: https://godbolt.org/z/e16bP369n.

[Bug target/109279] RISC-V: complex constants synthesized should be improved

2024-11-12 Thread wojciech_mula at poczta dot onet.pl via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109279

--- Comment #20 from Wojciech Mula  ---
This constant is worth checking (it appears in division by 10):

```
unsigned long ccd() {
    return 0xcccccccccccccccd;
}
```

riscv64-unknown-linux-gnu-g++ (crosstool-NG UNKNOWN) 15.0.0 2024
(experimental):

```
ccd():
li  a0,858992640
li  a5,858992640
addi    a0,a0,819
addi    a5,a5,818
slli    a0,a0,32
add a0,a0,a5
xori    a0,a0,-1
ret
```

clang 20:

```
ccd():
lui a0, 838861
addiw   a0, a0, -819
slli    a1, a0, 32
add a0, a0, a1
ret
```

[Bug target/117421] [RISCV] Use byte comparison instead of word comparison

2024-11-12 Thread wojciech_mula at poczta dot onet.pl via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117421

--- Comment #4 from Wojciech Mula  ---
There's no word-wise set-on-equal instruction, though, so I think this sequence
would be better:

```
lbu a0, 1(a1)
lbu a2, 0(a1)
lbu a3, 2(a1)
lb  a1, 3(a1)
xori    a0, a0, 'z'
xori    a2, a2, 'g'
xori    a3, a3, 'i'
xori    a1, a1, 'p'
or  a0, a0, a2
or  a1, a1, a3
or  a0, a0, a1
seqz    a0, a0
```

[Bug target/117421] [RISCV] Use byte comparison instead of word comparison

2024-11-12 Thread wojciech_mula at poczta dot onet.pl via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117421

--- Comment #3 from Wojciech Mula  ---
It's worth noting that Clang first synthesizes a 32-bit word from the individual
bytes and then uses a single comparison.

```
ext_is_gzip(std::basic_string_view<char, std::char_traits<char> >):
li  a2, 4
bne a0, a2, .LBB0_2
lbu a0, 1(a1)
lbu a2, 0(a1)
lbu a3, 2(a1)
lb  a1, 3(a1)
slli    a0, a0, 8
or  a0, a0, a2
slli    a3, a3, 16
slli    a1, a1, 24
or  a1, a1, a3
or  a0, a0, a1
lui a1, 460440
addiw   a1, a1, -1433
xor a0, a0, a1
seqz    a0, a0
ret
.LBB0_2:
li  a0, 0
ret
```

[Bug target/117421] [RISCV] Use byte comparison instead of word comparison

2024-11-08 Thread wojciech_mula at poczta dot onet.pl via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117421

--- Comment #2 from Wojciech Mula  ---
First of all, thanks for looking at this! 

> I should note that -mno-strict-align still does not do it but that is because 
> it might be slow still to do unaligned access.

OK, maybe `-mno-strict-align` should issue a warning in such cases?

[Bug target/119911] New: [RVV] Suboptimal code generation for multiple extracting 0-th elements of vector

2025-04-23 Thread wojciech_mula at poczta dot onet.pl via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119911

Bug ID: 119911
   Summary: [RVV] Suboptimal code generation for multiple
extracting 0-th elements of vector
   Product: gcc
   Version: 16.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: wojciech_mula at poczta dot onet.pl
  Target Milestone: ---

I observed the issue on GCC 14.2, but it's still visible on the godbolt trunk,
which is 16.0.0 20250423 (experimental).

Summary: when we have multiple `vmv.x.s` (move the 0th vector element into a
scalar register), GCC always emits a shift-left/shift-right pair to mask the
result down to its low bits (8 or 16). However, when there are several `vmv.x.s`
instances, it would be profitable to materialize the mask in a register (it is
a compile-time constant) and use a bit-and for the masking.

Clang performs this optimization.

Consider this simple function:

---test.cpp---
#include <riscv_vector.h>
#include <cstdint>

uint64_t sum_of_first_three(vuint16m1_t x) {
const uint64_t mask = 0xffff;
const auto vl = __riscv_vsetvlmax_e16m1();
return uint64_t(__riscv_vmv_x_s_u16m1_u16(x))
 + uint64_t(__riscv_vmv_x_s_u16m1_u16(__riscv_vslidedown(x, 1, vl)))
 + uint64_t(__riscv_vmv_x_s_u16m1_u16(__riscv_vslidedown(x, 2, vl)));
}
---eof---

When compiled with `-O3 -march=rv64gcv`, the assembly is:

---
sum_of_first_three(__rvv_uint16m1_t):
vsetvli a5,zero,e16,m1,ta,ma
vslidedown.vi   v10,v8,1
vslidedown.vi   v9,v8,2
vmv.x.s a5,v8
vmv.x.s a4,v10
vmv.x.s a0,v9
slli    a4,a4,48
slli    a5,a5,48
srli    a4,a4,48
srli    a5,a5,48
slli    a0,a0,48
add a5,a5,a4
srli    a0,a0,48
add a0,a5,a0
ret
---

godbolt link: https://godbolt.org/z/hPrM8vz4v

[Bug driver/109605] -fno-tree-vectorize does not disable vectorizer

2025-04-28 Thread wojciech_mula at poczta dot onet.pl via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109605

Wojciech Mula changed:

           What    |Removed |Added
             CC    |        |wojciech_mula at poczta dot onet.pl

--- Comment #3 from Wojciech Mula  ---
This is somewhat related. I needed to generate a particular procedure without
any vector instructions (the surrounding code is free to use RVV instructions).

But when the code uses a builtin function (here `std::memcpy`), GCC still emits
some vector instructions. The cure is passing `-fno-builtin` on the command line,
because the pragma does not accept that option.
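
For example (a hypothetical invocation, mirroring the cross-compilers used elsewhere in these reports):

$ riscv64-linux-gnu-g++ -O3 -march=rv64gcv -fno-builtin -S no-vector.cpp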

The attached sample code comes from the simdutf project (src/scalar/utf.f);
godbolt link for convenience: https://godbolt.org/z/Ya91he99v.

---no-vector.cpp--
#include <cstdint>
#include <cstddef>
#include <cstring>

#pragma GCC optimize ("no-tree-vectorize")
#pragma GCC optimize ("no-tree-loop-vectorize")
#pragma GCC optimize ("no-tree-slp-vectorize")
#pragma GCC optimize ("no-builtin") // not accepted by the compiler
bool validate(const char *buf, size_t len) noexcept {
  const uint8_t *data = reinterpret_cast<const uint8_t *>(buf);
  uint64_t pos = 0;
  uint32_t code_point = 0;
  while (pos < len) {
// check of the next 16 bytes are ascii.
uint64_t next_pos = pos + 16;
    if (next_pos <= len) { // if it is safe to read 16 more bytes, check that they are ascii
  uint64_t v1;
  std::memcpy(&v1, data + pos, sizeof(uint64_t));
  uint64_t v2;
  std::memcpy(&v2, data + pos + sizeof(uint64_t), sizeof(uint64_t));
  uint64_t v{v1 | v2};
  if ((v & 0x8080808080808080) == 0) {
pos = next_pos;
continue;
  }
}
unsigned char byte = data[pos];

    while (byte < 0b10000000) {
  if (++pos == len) {
return true;
  }
  byte = data[pos];
}

    if ((byte & 0b11100000) == 0b11000000) {
  next_pos = pos + 2;
  if (next_pos > len) {
return false;
  }
      if ((data[pos + 1] & 0b11000000) != 0b10000000) {
return false;
  }
  // range check
      code_point = (byte & 0b00011111) << 6 | (data[pos + 1] & 0b00111111);
  if ((code_point < 0x80) || (0x7ff < code_point)) {
return false;
  }
    } else if ((byte & 0b11110000) == 0b11100000) {
  next_pos = pos + 3;
  if (next_pos > len) {
return false;
  }
      if ((data[pos + 1] & 0b11000000) != 0b10000000) {
return false;
  }
      if ((data[pos + 2] & 0b11000000) != 0b10000000) {
return false;
  }
  // range check
      code_point = (byte & 0b00001111) << 12 |
                   (data[pos + 1] & 0b00111111) << 6 |
                   (data[pos + 2] & 0b00111111);
      if ((code_point < 0x800) || (0xffff < code_point) ||
  (0xd7ff < code_point && code_point < 0xe000)) {
return false;
  }
    } else if ((byte & 0b11111000) == 0b11110000) { // 0b11110000
  next_pos = pos + 4;
  if (next_pos > len) {
return false;
  }
      if ((data[pos + 1] & 0b11000000) != 0b10000000) {
return false;
  }
      if ((data[pos + 2] & 0b11000000) != 0b10000000) {
return false;
  }
      if ((data[pos + 3] & 0b11000000) != 0b10000000) {
return false;
  }
  // range check
      code_point =
          (byte & 0b00000111) << 18 | (data[pos + 1] & 0b00111111) << 12 |
          (data[pos + 2] & 0b00111111) << 6 | (data[pos + 3] & 0b00111111);
      if (code_point <= 0xffff || 0x10ffff < code_point) {
return false;
  }
} else {
  // we may have a continuation
  return false;
}
pos = next_pos;
  }
  return true;
}
---eof---

The head of the generated asm:

---
validate(char const*, unsigned long):
beq a1,zero,.L32
li  a4,2139062272
addi    a4,a4,-129
slli    a2,a4,32
addi    sp,sp,-16
add a2,a2,a4
li  a5,0
xori    a2,a2,-1
addi    a7,sp,8
vsetivli    zero,8,e8,mf2,ta,ma ## here
.L2:
addi    a3,a5,16
add t1,a0,a5
bltu    a1,a3,.L36
vle8.v  v1,0(t1) #
addi    a4,a5,8
add a4,a0,a4
vse8.v  v1,0(sp) #
vle8.v  v1,0(a4) #
ld  a4,0(sp)
vse8.v  v1,0(a7) #
ld  a6,8(sp)
or  a4,a4,a6
and a4,a4,a2
bne a4,zero,.L36
mv  a5,a3
.L6:
---

[Bug target/119040] New: [PPC/Altivec] Missing bit-level optimization (select)

2025-02-27 Thread wojciech_mula at poczta dot onet.pl via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119040

Bug ID: 119040
   Summary: [PPC/Altivec] Missing bit-level optimization (select)
   Product: gcc
   Version: unknown
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: wojciech_mula at poczta dot onet.pl
  Target Milestone: ---

This comes from real-world usage. Suppose we have a vector of words and we want to
move around some bit-fields of those words. We isolate the bit-fields with `and`,
then shift them to the desired positions in the final word; we never end up with
overlapping bit-fields.

Below is sample code that merges two 6-bit fields into a 12-bit one in 32-bit
elements:

---test.cpp---
#include <altivec.h>
#include <cstdint>

using vec_u32_t = __vector uint32_t;

vec_u32_t merge_2x6_bits(const vec_u32_t a, const vec_u32_t b) {
vec_u32_t t0 = vec_and(a, vec_splats(uint32_t(0x003f)));
vec_u32_t t1 = vec_and(b, vec_splats(uint32_t(0x3f00)));

vec_u32_t t2 = vec_sr(t1, vec_splats(uint32_t(2)));

return vec_or(t2, t0);
}
---eof---

GCC 14.2.0 with flags `-O3 -maltivec` produces the following code (I omitted
the constants .LC0 & .LC1):

lis 10,.LC0@ha
lis 9,.LC1@ha
la 10,.LC0@l(10)
la 9,.LC1@l(9)
lvx 13,0,10
vspltisw 0,2
lvx 1,0,9
vand 3,3,13
vsrw 3,3,0
vand 2,2,1
vor 2,3,2

Since the bit-fields do not overlap, the final `vand`/`vor` pair can be
replaced with `vsel`. Instead of `((a & mask1) >> 2) | (b & mask2)` we may
have `select(mask1 >> 2, a >> 2, b & mask2)` [provided the prototype of
`select` is select(condition, true, false)].
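
A sketch of that at the source level (my code; it reuses the vec_u32_t alias from the testcase above and assumes the usual vec_sel(false_src, true_src, mask) bit-select semantics from altivec.h):

---
// sketch: 0x0fc0 is the 0x3f00 mask already shifted right by 2 at compile time
vec_u32_t merge_2x6_bits_sel(const vec_u32_t a, const vec_u32_t b) {
    const vec_u32_t lo = vec_and(a, vec_splats(uint32_t(0x003f))); // bits 0..5 taken from a
    const vec_u32_t hi = vec_sr(b, vec_splats(uint32_t(2)));       // bits 6..11 of the result come from b
    return vec_sel(lo, hi, vec_splats(uint32_t(0x0fc0)));          // one vsel instead of vand+vor
}
---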

[Bug target/120141] New: [RVV] Noop are not removed

2025-05-06 Thread wojciech_mula at poczta dot onet.pl via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=120141

Bug ID: 120141
   Summary: [RVV] Noop are not removed
   Product: gcc
   Version: 16.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: wojciech_mula at poczta dot onet.pl
  Target Milestone: ---

I observed that RVV no-ops, like shifting by 0 or adding 0, are not removed from
the program.

I fully understand that a compiler cannot do it when `vsetvli` changes the mode
between operations. But in the sample program `v8` is written and then shifted
under the same vector mode.

Is there some non-obvious reason for this in the RVV spec?

Sample program:

---
#include <riscv_vector.h>

vuint16m1_t naive_avg(vuint16m1_t x, vuint16m1_t y) {
const auto vl = __riscv_vsetvlmax_e16m1();
const auto a = __riscv_vadd(x, y, vl);
return __riscv_vsrl(a, 0, vl);
}
---

Compiled with `-O3 -march=rv64gcv` it yields the following assembly:

---
naive_avg(__rvv_uint16m1_t, __rvv_uint16m1_t):
vsetvli a5,zero,e16,m1,ta,ma
vadd.vv v8,v8,v9
vsrl.vi v8,v8,0
ret
---