[Bug target/88798] New: AVX512BW code does not use bit-operations that work on mask registers
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88798

            Bug ID: 88798
           Summary: AVX512BW code does not use bit-operations that work on
                    mask registers
           Product: gcc
           Version: 8.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: wojciech_mula at poczta dot onet.pl
  Target Milestone: ---

Hi! This is an AVX512BW-related issue: the C compiler generates superfluous
moves from 64-bit mask registers to 64-bit GPRs and then performs the basic
bit operations there, while AVX512BW supports bit operations directly on mask
registers (instructions: korq, kandq, kxorq). I guess the main reason is that
C does not define a bit-or for the type __mmask64, so there is always an
implicit conversion to uint64_t.

Below is a sample program compiled for Cannon Lake --- a CPU that has (at
least) AVX512BW, AVX512VBMI and AVX512VL.

---perf.c---
#include <stdint.h>
#include <immintrin.h>

uint64_t any_whitespace(__m512i string) {
    return _mm512_cmpeq_epu8_mask(string, _mm512_set1_epi8(' '))
         | _mm512_cmpeq_epu8_mask(string, _mm512_set1_epi8('\n'))
         | _mm512_cmpeq_epu8_mask(string, _mm512_set1_epi8('\r'));
}
---eof---

$ gcc --version
gcc (Debian 8.2.0-13) 8.2.0

$ gcc perf.c -O3 -march=cannonlake -S
$ cat perf.s   # redacted
any_whitespace:
        vpcmpub $0, .LC0(%rip), %zmm0, %k1
        vpcmpub $0, .LC1(%rip), %zmm0, %k2
        vpcmpub $0, .LC2(%rip), %zmm0, %k3
        kmovq   %k1, %rcx
        kmovq   %k2, %rdx
        orq     %rcx, %rdx
        kmovq   %k3, %rax
        orq     %rdx, %rax
        vzeroupper
        ret

I'd rather expect to get something like:

any_whitespace:
        vpcmpub $0, .LC0(%rip), %zmm0, %k1
        vpcmpub $0, .LC1(%rip), %zmm0, %k2
        vpcmpub $0, .LC2(%rip), %zmm0, %k3
        korq    %k1, %k2, %k1
        korq    %k1, %k3, %k3
        kmovq   %k3, %rax
        vzeroupper
        ret

best regards
Wojciech
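A minimal source-level workaround sketch, not part of the original report: the
values can be kept in mask registers by combining them with the kmask
intrinsics, assuming the compiler provides the AVX512BW intrinsic
`_kor_mask64` (newer GCC and clang do).

---sketch---
#include <stdint.h>
#include <immintrin.h>

/* Sketch: combine the three masks with explicit kor operations so the
   compiler has no reason to move them into GPRs early. */
uint64_t any_whitespace_kor(__m512i string) {
    const __mmask64 m0 = _mm512_cmpeq_epu8_mask(string, _mm512_set1_epi8(' '));
    const __mmask64 m1 = _mm512_cmpeq_epu8_mask(string, _mm512_set1_epi8('\n'));
    const __mmask64 m2 = _mm512_cmpeq_epu8_mask(string, _mm512_set1_epi8('\r'));
    return (uint64_t)_kor_mask64(_kor_mask64(m0, m1), m2);  /* single kmovq at the end */
}
---eof---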
[Bug target/88798] AVX512BW code does not use bit-operations that work on mask registers
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88798

--- Comment #3 from Wojciech Mula ---
Sorry, I didn't find that bug; I think you may close this one.

BTW, I had checked the code on godbolt.org before submitting. I also tested
with their "GCC (trunk)", but the generated code is the same as for 8.2. The
trunk version is "g++ (GCC-Explorer-Build) 9.0.0 20190109 (experimental)" --
it seems to be a fresh build and should already include the fixes Andrew
mentioned.
[Bug tree-optimization/88868] New: [SSE] pshufb can be omitted for a specific pattern
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88868

            Bug ID: 88868
           Summary: [SSE] pshufb can be omitted for a specific pattern
           Product: gcc
           Version: 8.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: wojciech_mula at poczta dot onet.pl
  Target Milestone: ---

The SSSE3 instruction PSHUFB (and its AVX2 counterpart VPSHUFB) acts as a
no-operation when its argument is the sequence 0..15. Such an invocation does
not alter the shuffled register, thus the PSHUFB can be safely omitted.
BTW, clang does this optimization, but ICC doesn't.

---pshufb.c---
#include <immintrin.h>

__m128i shuffle(__m128i x) {
    const __m128i noop = _mm_setr_epi8(0, 1, 2, 3, 4, 5, 6, 7,
                                       8, 9, 10, 11, 12, 13, 14, 15);
    return _mm_shuffle_epi8(x, noop);
}
---eof---

$ gcc --version
gcc (Debian 8.2.0-13) 8.2.0

$ gcc -O3 -march=skylake -S pshufb.c
$ cat pshufb.s
shuffle:
        vpshufb .LC0(%rip), %xmm0, %xmm0
        ret
.LC0:
        .byte 0
        .byte 1
        .byte 2
        .byte 3
        .byte 4
        .byte 5
        .byte 6
        .byte 7
        .byte 8
        .byte 9
        .byte 10
        .byte 11
        .byte 12
        .byte 13
        .byte 14
        .byte 15

The expected output:

shuffle:
        ret
[Bug target/88916] New: [x86] suboptimal code generated for integer comparisons joined with boolean operators
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88916

            Bug ID: 88916
           Summary: [x86] suboptimal code generated for integer comparisons
                    joined with boolean operators
           Product: gcc
           Version: 8.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: wojciech_mula at poczta dot onet.pl
  Target Milestone: ---

Let's consider these two simple, yet pretty useful functions:

---test.c---
int both_nonnegative(long a, long b) {
    return (a >= 0) && (b >= 0);
}

int both_nonzero(unsigned long a, unsigned long b) {
    return (a > 0) && (b > 0);
}
---eof---

$ gcc --version
gcc (Debian 8.2.0-13) 8.2.0

$ gcc -O3 test.c -march=skylake -S
$ cat test.s
both_nonnegative:
        notq    %rdi
        movq    %rdi, %rax
        notq    %rsi
        shrq    $63, %rax
        shrq    $63, %rsi
        andl    %esi, %eax
        ret

both_nonzero:
        testq   %rdi, %rdi
        setne   %al
        xorl    %edx, %edx
        testq   %rsi, %rsi
        setne   %dl
        andl    %edx, %eax
        ret

I checked different target machines (haswell, broadwell and cannonlake), but
the result remained the same. GCC trunk on godbolt.org also produces the same
assembly.

The first function, `both_nonnegative`, can be rewritten as:

    (((unsigned long)a | (unsigned long)b) >> 63) ^ 1

yielding something like this:

both_nonnegative:
        orq     %rsi, %rdi
        movq    %rdi, %rax
        shrq    $63, %rax
        xorl    $1, %eax
        ret

It's also possible to use the expression
`(long)((unsigned long)a | (unsigned long)b) < 0`, but the assembly output is
almost the same.

The condition from `both_nonzero` can be expressed as:

    ((unsigned long)a | (unsigned long)b) != 0

which GCC compiles to:

both_nonzero:
        xorl    %eax, %eax
        orq     %rsi, %rdi
        setne   %al
        retq
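For reference, a compilable sketch of the first rewrite described above (this
is not compiler output, only the source-level equivalent; the `both_nonzero`
rewrite is retracted in comment #2 below):

---sketch---
/* Sketch: both inputs are non-negative exactly when the OR of the two
   sign bits is zero, so shift the OR right by 63 and flip the bit. */
int both_nonnegative_alt(long a, long b) {
    return (int)((((unsigned long)a | (unsigned long)b) >> 63) ^ 1);
}
---eof---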
[Bug tree-optimization/88916] [x86] suboptimal code generated for integer comparisons joined with boolean operators
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88916

--- Comment #2 from Wojciech Mula ---
(In reply to Richard Biener from comment #1)
> Confirmed.

The first case is OK, but the second one (for `both_nonzero`) is obviously
wrong. Sorry for that.
[Bug tree-optimization/88916] [x86] suboptimal code generated for integer comparisons joined with boolean operators
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88916

--- Comment #3 from Wojciech Mula ---
A similar case:

---sign.c---
int different_sign(long a, long b) {
    return (a >= 0 && b < 0) || (a < 0 && b >= 0);
}
---eof---

This is compiled into:

different_sign:
        notq    %rdi
        movq    %rdi, %rax
        notq    %rsi
        shrq    $63, %rax
        shrq    $63, %rsi
        xorl    %esi, %eax
        movzbl  %al, %eax
        ret

When expressed as a difference of the sign bits

    ((unsigned long)a ^ (unsigned long)b) >> 63

the code is way shorter:

different_sign:
        xorq    %rsi, %rdi
        movq    %rdi, %rax
        shrq    $63, %rax
        ret

BTW, I looked at the ARM assembly, and GCC also emits two shifts there, so the
observed behaviour is not limited to one target.
[Bug tree-optimization/89018] New: common subexpression present in both branches of condition is not factored out
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89018

            Bug ID: 89018
           Summary: common subexpression present in both branches of
                    condition is not factored out
           Product: gcc
           Version: 9.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: wojciech_mula at poczta dot onet.pl
  Target Milestone: ---

A common transformation used in a C conditional expression is not detected and
code is duplicated. Below are a few examples:

---condition.c---
long transform(long);

long negative_max(long a, long b) {
    return (a >= b) ? -a : -b;
}

long fun_max(long a, long b) {
    return (a >= b) ? 47*a : 47*b;
}

long transform_max(long a, long b) {
    return (a >= b) ? transform(a) : transform(b);
}
---eof---

In both branches a scalar value is part of the same expression, so it would be
more profitable if, for instance, "(a >= b) ? -a : -b" were compiled as
"-((a >= b) ? a : b)". Of course, a programmer might factor it out manually,
but in the case of macros or auto-generated code such silly repetition might
occur.

Below is the assembly code generated for x86 by a pretty fresh GCC 9. BTW, the
jump instruction in `fun_max` and `transform_max` could be replaced with a
conditional move.

$ gcc --version
gcc (GCC) 9.0.0 20190117 (experimental)

$ gcc -O3 -march=skylake -c -S condition.c && cat condition.s
negative_max:
        movq    %rdi, %rdx
        movq    %rsi, %rax
        negq    %rdx
        negq    %rax
        cmpq    %rsi, %rdi
        cmovge  %rdx, %rax
        ret

fun_max:
        cmpq    %rsi, %rdi
        jl      .L6
        imulq   $47, %rdi, %rax
        ret
.L6:
        imulq   $47, %rsi, %rax
        ret

transform_max:
        cmpq    %rsi, %rdi
        jge     .L11
        movq    %rsi, %rdi
.L11:
        jmp     transform
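For illustration, the manually factored variants the report asks the compiler
to derive (same semantics as the originals; this is only a source-level
sketch, not proposed GCC output):

---sketch---
long transform(long);

long negative_max_factored(long a, long b)  { return -((a >= b) ? a : b); }
long fun_max_factored(long a, long b)       { return 47 * ((a >= b) ? a : b); }
long transform_max_factored(long a, long b) { return transform((a >= b) ? a : b); }
---eof---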
[Bug target/89063] New: [x86] lack of support for BEXTR from BMI extension
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89063

            Bug ID: 89063
           Summary: [x86] lack of support for BEXTR from BMI extension
           Product: gcc
           Version: 9.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: wojciech_mula at poczta dot onet.pl
  Target Milestone: ---

The instruction BEXTR extracts an arbitrary unsigned bit field from a 32- or
64-bit value. As I see in `config/i386/i386.md`, there is support for the
immediate variant available in AMD's TBM (TARGET_TBM). Intel's variant takes
its parameters from a register. Although this variant won't be profitable in
all cases -- as we need an extra move to set up the bit-field parameters in a
register -- I bet bit-field-intensive code might benefit from BEXTR.

---bextr.c---
#include <stdint.h>
#include <x86intrin.h>

uint64_t test(uint64_t x) {
    const uint64_t a0 = (x & 0x3f);
    const uint64_t a1 = (x >> 11) & 0x3f;
    const uint64_t a2 = (x >> 22) & 0x3f;

    return a0 + a1 + a2;
}

uint64_t test_intrinsics(uint64_t x) {
    const uint64_t a0 = (x & 0x3f);
    const uint64_t a1 = _bextr_u64(x, 11, 6);
    const uint64_t a2 = _bextr_u64(x, 22, 6);

    return a0 + a1 + a2;
}
---eof---

$ gcc --version
gcc (GCC) 9.0.0 20190117 (experimental)

$ gcc -O3 -mbmi -march=skylake bextr.c -c && objdump -d bextr.o

0000000000000000 <test>:
   0:   48 89 fa                mov    %rdi,%rdx
   3:   48 c1 ea 0b             shr    $0xb,%rdx
   7:   48 89 f8                mov    %rdi,%rax
   a:   48 89 d1                mov    %rdx,%rcx
   d:   48 c1 e8 16             shr    $0x16,%rax
  11:   83 e0 3f                and    $0x3f,%eax
  14:   83 e1 3f                and    $0x3f,%ecx
  17:   48 8d 14 01             lea    (%rcx,%rax,1),%rdx
  1b:   83 e7 3f                and    $0x3f,%edi
  1e:   48 8d 04 3a             lea    (%rdx,%rdi,1),%rax
  22:   c3                      retq

0000000000000030 <test_intrinsics>:
  30:   b8 0b 06 00 00          mov    $0x60b,%eax
  35:   c4 e2 f8 f7 d7          bextr  %rax,%rdi,%rdx
  3a:   b8 16 06 00 00          mov    $0x616,%eax
  3f:   c4 e2 f8 f7 c7          bextr  %rax,%rdi,%rax
  44:   83 e7 3f                and    $0x3f,%edi
  47:   48 01 d0                add    %rdx,%rax
  4a:   48 01 f8                add    %rdi,%rax
  4d:   c3                      retq
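A small note on the constants seen above: the register form of BEXTR packs the
starting bit position into bits 7:0 and the field length into bits 15:8 of the
control operand, so 0x60b means "start 11, length 6" and 0x616 means "start 22,
length 6". A hypothetical helper (not part of any header) showing the packing:

---sketch---
#include <stdint.h>

/* Sketch: build the BEXTR control operand; start in bits 7:0, length in bits 15:8. */
static inline uint64_t bextr_control(unsigned start, unsigned len) {
    return (uint64_t)start | ((uint64_t)len << 8);   /* bextr_control(11, 6) == 0x60b */
}
---eof---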
[Bug target/89081] New: [x86] suboptimal code generated for condition expression returning negation
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89081

            Bug ID: 89081
           Summary: [x86] suboptimal code generated for condition expression
                    returning negation
           Product: gcc
           Version: 9.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: wojciech_mula at poczta dot onet.pl
  Target Milestone: ---

Let's consider this trivial function:

---clamp.c---
#include <stdint.h>

uint64_t clamp1(int64_t x) {
    return (x < 0) ? -x : 0;
}
---eof---

$ gcc --version
gcc (GCC) 9.0.0 20190117 (experimental)

$ gcc -O3 -march=skylake clamp.c -c -S && cat clamp.s
clamp1:
        movq    %rdi, %rax
        negq    %rax
        movl    $0, %edx
        testq   %rdi, %rdi
        cmovns  %rdx, %rax
        ret

This procedure could be way shorter, like this:

clamp1:
        xorq    %rax, %rax      # res = 0
        negq    %rdi            # -x, sets SF
        cmovns  %rdi, %rax
        ret

One thing I observed recently when looking at assembly is that GCC never
modifies the input registers %rdi or %rsi, it always makes copies of them --
thus the proposed shorter version is not possible. However, clang does modify
these registers, so it seems the ABI allows this.
[Bug target/85832] New: [AVX512] possible shorter code when comparing with vector of zeros
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85832

            Bug ID: 85832
           Summary: [AVX512] possible shorter code when comparing with vector
                    of zeros
           Product: gcc
           Version: 7.3.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: wojciech_mula at poczta dot onet.pl
  Target Milestone: ---

Consider this simple function, which yields a mask for non-zero elements:

---cmp.c---
#include <immintrin.h>

int fun(__m512i x) {
    return _mm512_cmpeq_epi32_mask(x, _mm512_setzero_si512());
}
---eof---

$ gcc --version
gcc (Debian 7.3.0-16) 7.3.0

$ gcc -O2 -S -mavx512f cmp.c && cat cmp.s
fun:
        vpxord     %zmm1, %zmm1, %zmm1    # <<< HERE
        vpcmpeqd   %zmm1, %zmm0, %k1      # <<<
        kmovw      %k1, %eax
        vzeroupper
        ret

Also 8.1.0 generates the same code (as checked on godbolt.org).

The pair of instructions VPXORD/VPCMPEQD can be replaced with a single
VPTESTMD %zmm0, %zmm0. VPTESTMD performs k1 := zmm0 AND zmm0, so it is
sufficient for comparing zmm0 with zeros.
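A hedged aside (not from the original report): VPTESTMD sets a mask bit for
the elements whose AND is non-zero, while the compare-equal-with-zero in the
source yields the mask of zero elements. The exact single-instruction match
for the source is VPTESTNMD, exposed in AVX512F as `_mm512_testn_epi32_mask`
(intrinsic name assumed from the headers; please verify):

---sketch---
#include <immintrin.h>

/* Sketch: one vptestnmd, no zero vector needed; mask bit set where element == 0. */
int fun_testn(__m512i x) {
    return _mm512_testn_epi32_mask(x, x);
}
---eof---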
[Bug target/85833] New: [AVX512] use mask registers instructions instead of scalar code
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85833

            Bug ID: 85833
           Summary: [AVX512] use mask registers instructions instead of
                    scalar code
           Product: gcc
           Version: 7.3.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: wojciech_mula at poczta dot onet.pl
  Target Milestone: ---

Here is a simple function which checks if there is any non-zero element in a
vector:

---ktest.c---
#include <immintrin.h>

int anynonzero_epi32(__m512i x) {
    const __m512i zero   = _mm512_setzero_si512();
    const __mmask16 mask = _mm512_cmpneq_epi32_mask(x, zero);

    return mask != 0;
}
---eof---

$ gcc --version
gcc (Debian 7.3.0-16) 7.3.0

$ gcc -O2 -S -mavx512f ktest.c && cat ktest.s
anynonzero_epi32:
        vpxord  %zmm1, %zmm1, %zmm1
        vpcmpd  $4, %zmm1, %zmm0, %k1
        kmovw   %k1, %eax         # <<< HERE
        testw   %ax, %ax          # <<<
        setne   %al
        movzbl  %al, %eax
        vzeroupper
        ret

The problem is that GCC copies the content of the mask register k1 into a GPR
(using a KMOV instruction) and then performs the test. AVX512F has got the
instruction KTEST kx, ky which sets ZF and CF:

    ZF = (kx AND ky) == 0
    CF = (kx AND NOT ky) == 0

In this case we might use KTEST k1, k1 to set ZF when k1 == 0. The procedure
might then be compiled as:

anynonzero_epi32:
        vpxord  %zmm1, %zmm1, %zmm1
        vpcmpd  $4, %zmm1, %zmm0, %k1
        xor     %eax, %eax        # <<<
        ktestw  %k1, %k1          # <<<
        setne   %al               # <<<
        vzeroupper
        ret
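A sketch, not the reporter's code: KTESTW itself is not in plain AVX512F (see
comment #3 below), but the same scalar test can already be avoided with
KORTESTW, exposed as the AVX512F intrinsic `_mm512_kortestz` (intrinsic name
assumed from the headers):

---sketch---
#include <immintrin.h>

/* Sketch: _mm512_kortestz(a, b) returns 1 when (a OR b) == 0, so
   "any non-zero element" is simply its negation applied to the mask. */
int anynonzero_epi32_kortest(__m512i x) {
    const __mmask16 mask = _mm512_cmpneq_epi32_mask(x, _mm512_setzero_si512());
    return !_mm512_kortestz(mask, mask);
}
---eof---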
[Bug target/85833] [AVX512] use mask registers instructions instead of scalar code
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85833

--- Comment #3 from Wojciech Mula ---
Uroš, thank you very much. I didn't pay attention to the AVX512 variant, as I
thought this instruction is so basic that it should be available from AVX512F.
[Bug target/85073] New: [x86] extra check after BLSR
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85073

            Bug ID: 85073
           Summary: [x86] extra check after BLSR
           Product: gcc
           Version: 7.3.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: wojciech_mula at poczta dot onet.pl
  Target Milestone: ---

GCC is able to use the BLSR instruction in place of the expression
(x - 1) & x [which is REALLY nice, thank you :)], but it does not utilize the
CPU flags set by the instruction. Below is a simple example.

---bmi1.c---
int popcount(unsigned x) {
    int c = 0;
    while (x) {
        c += 1;
        x = (x - 1) & x;
    }
    return c;
}
---eof---

$ gcc --version
gcc (Debian 7.3.0-11) 7.3.0

$ gcc -march=skylake -O3 -S bmi1.c && cat bmi1.s
popcount:
.LFB0:
        xorl    %eax, %eax
        testl   %edi, %edi
        je      .L4
.L3:
        addl    $1, %eax
        blsr    %edi, %edi      <<< HERE
        testl   %edi, %edi      <<< and HERE
        jne     .L3
        ret
.L4:
        ret

BLSR sets the ZF flag if the result is zero, so the subsequent TEST
instruction is not needed.
[Bug target/88798] AVX512BW code does not use bit-operations that work on mask registers
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88798

--- Comment #6 from Wojciech Mula ---
Hongtao, thank you for your patch and for pinging back! I checked the code
from this issue against version 11.2.0 (Debian 11.2.0-14), but there are still
KMOVQs before any bit operations are performed. Here is the output from
`gcc -O3 -march=icelake-server -S`:

        vpcmpub $0, .LC0(%rip), %zmm0, %k0
        vpcmpub $0, .LC1(%rip), %zmm0, %k1
        vpcmpub $0, .LC2(%rip), %zmm0, %k2
        kmovq   %k0, %rcx
        kmovq   %k1, %rax
        orq     %rcx, %rax
        kmovq   %k2, %rdx
        orq     %rdx, %rax
        ret
[Bug target/88798] AVX512BW code does not use bit-operations that work on mask registers
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88798

--- Comment #8 from Wojciech Mula ---
Thank you for the answer. Thus my question is: is it possible to delay the
conversion from kmasks into ints? I'm not a language lawyer, but I guess
`x binop y` has to be treated as `(int)x binop (int)y`. If that's true, we
will have to prove that `(int)(x avx512-binop y)` is equivalent to the latter
expression.
[Bug target/114172] [13 only] ICE with riscv rvv VSETVL intrinsic
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114172

Wojciech Mula changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |wojciech_mula at poczta dot
                   |                            |onet.pl

--- Comment #2 from Wojciech Mula ---
Checked 13.2 from Debian:

$ riscv64-linux-gnu-gcc --version
riscv64-linux-gnu-gcc (Debian 13.2.0-12) 13.2.0

For Bruce's testcase the following invocation triggers a segfault (-O1 and -O2
do not trigger the error):

$ riscv64-linux-gnu-gcc -march=rv64gcv -c 1.c -O3

Below is just the bottom of the stack obtained with gdb. There is an infinite
recursion somewhere around `riscv_vector::avl_info::operator==`.

#629078 0x00fa3372 in riscv_vector::avl_info::operator==(riscv_vector::avl_info const&) const ()
#629079 0x00fa37f9 in ?? ()
#629080 0x00fa2543 in ?? ()
#629081 0x00fa3372 in riscv_vector::avl_info::operator==(riscv_vector::avl_info const&) const ()
#629082 0x00fa37f9 in ?? ()
#629083 0x00fa2543 in ?? ()
#629084 0x00fa3372 in riscv_vector::avl_info::operator==(riscv_vector::avl_info const&) const ()
#629085 0x00fa37f9 in ?? ()
#629086 0x00fa2543 in ?? ()
#629087 0x00fa3372 in riscv_vector::avl_info::operator==(riscv_vector::avl_info const&) const ()
#629088 0x00fa37f9 in ?? ()
#629089 0x00fa2543 in ?? ()
#629090 0x00fa3372 in riscv_vector::avl_info::operator==(riscv_vector::avl_info const&) const ()
#629091 0x00fa394b in ?? ()
#629092 0x00f9f588 in riscv_vector::vector_insn_info::compatible_p(riscv_vector::vector_insn_info const&) const ()
#629093 0x00fa0eb9 in pass_vsetvl::compute_local_backward_infos(rtl_ssa::bb_info const*) ()
#629094 0x00fa8c6b in pass_vsetvl::lazy_vsetvl() ()
#629095 0x00fa8e1f in pass_vsetvl::execute(function*) ()
#629096 0x00b5e21b in execute_one_pass(opt_pass*) ()
#629097 0x00b5eac0 in ?? ()
#629098 0x00b5ead2 in ?? ()
#629099 0x00b5ead2 in ?? ()
#629100 0x00b5eaf9 in execute_pass_list(function*, opt_pass*) ()
#629101 0x00822588 in cgraph_node::expand() ()
#629102 0x00823afb in ?? ()
#629103 0x00825fd8 in symbol_table::finalize_compilation_unit() ()
#629104 0x00c29bad in ?? ()
#629105 0x006a4c97 in toplev::main(int, char**) ()
#629106 0x006a6a8b in main ()
[Bug c++/114747] New: [RISC-V RVV] Wrong SEW set for mixed-size intrinsics
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114747

            Bug ID: 114747
           Summary: [RISC-V RVV] Wrong SEW set for mixed-size intrinsics
           Product: gcc
           Version: 13.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c++
          Assignee: unassigned at gcc dot gnu.org
          Reporter: wojciech_mula at poczta dot onet.pl
  Target Milestone: ---

This is a distilled procedure from the simdutf project:

---
#include <stddef.h>
#include <stdint.h>
#include <riscv_vector.h>

size_t convert_latin1_to_utf16le(const char *src, size_t len, char16_t *dst) {
    char16_t *beg = dst;
    for (size_t vl; len > 0; len -= vl, src += vl, dst += vl) {
        vl = __riscv_vsetvl_e8m4(len);
        vuint8m4_t v = __riscv_vle8_v_u8m4((uint8_t*)src, vl);
        __riscv_vse16_v_u16m8((uint16_t*)dst, __riscv_vzext_vf2_u16m8(v, vl), vl);
    }
    return dst - beg;
}
---

When compiled with gcc 13.2.0 with the flags "-march=rv64gcv -O2" it sets a
wrong SEW:

---
convert_latin1_to_utf16le(char const*, unsigned long, char16_t*):
        beq     a1,zero,.L4
        mv      a4,a2
.L3:
        vsetvli a5,a1,e8,m4,ta,ma       # sets SEW=8
        vle8.v  v8,0(a0)
        slli    a3,a5,1
        vzext.vf2       v24,v8          # illegal instruction, as SEW/2 < 8
        sub     a1,a1,a5
        vse16.v v24,0(a4)
        add     a0,a0,a5
        add     a4,a4,a3
        bne     a1,zero,.L3
        sub     a0,a4,a2
        srai    a0,a0,1
        ret
.L4:
        li      a0,0
        ret
---

The trunk available on godbolt.org (riscv64-unknown-linux-gnu-g++ 14.0.1
20240415) emits vsetvli with an e16 argument, which seems to be fine.
[Bug target/114809] New: [RISC-V RVV] Counting elements might be simpler
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114809

            Bug ID: 114809
           Summary: [RISC-V RVV] Counting elements might be simpler
           Product: gcc
           Version: 14.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: wojciech_mula at poczta dot onet.pl
  Target Milestone: ---

Consider this simple procedure:

---
#include <stddef.h>
#include <stdint.h>

size_t count_chars(const char *src, size_t len, char c) {
    size_t count = 0;
    for (size_t i = 0; i < len; i++) {
        count += src[i] == c;
    }
    return count;
}
---

Assembly for it (GCC 14.0, -march=rv64gcv -O3):

---
count_chars(char const*, unsigned long, char):
        beq     a1,zero,.L4
        vsetvli a4,zero,e8,mf8,ta,ma
        vmv.v.x v2,a2
        vsetvli zero,zero,e64,m1,ta,ma
        vmv.v.i v1,0
.L3:
        vsetvli a5,a1,e8,mf8,ta,ma
        vle8.v  v0,0(a0)
        sub     a1,a1,a5
        add     a0,a0,a5
        vmseq.vv        v0,v0,v2
        vsetvli zero,zero,e64,m1,tu,mu
        vadd.vi v1,v1,1,v0.t
        bne     a1,zero,.L3
        vsetvli a5,zero,e64,m1,ta,ma
        li      a4,0
        vmv.s.x v2,a4
        vredsum.vs      v1,v1,v2
        vmv.x.s a0,v1
        ret
.L4:
        li      a0,0
        ret
---

The counting procedure might use `vcpop.m` instead of updating a vector of
counters (`v1`) and summing them at the end. This would move all mode switches
outside the loop.

And there's a missing peephole optimization:

        li      a4,0
        vmv.s.x v2,a4

It could be:

        vmv.s.x v2,zero
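A minimal intrinsics sketch of the vcpop.m idea described above (assuming the
standard RVV v1.0 intrinsic names `__riscv_vmseq_vx_u8m1_b8` and
`__riscv_vcpop_m_b8`; LMUL=m1 is used for brevity instead of mf8):

---
#include <stddef.h>
#include <stdint.h>
#include <riscv_vector.h>

/* Sketch: compare into a mask register and count the set bits with vcpop.m,
   so the per-iteration work stays in SEW=8 and no e64 accumulator is needed. */
size_t count_chars_vcpop(const char *src, size_t len, char c) {
    size_t count = 0;
    for (size_t vl; len > 0; len -= vl, src += vl) {
        vl = __riscv_vsetvl_e8m1(len);
        vuint8m1_t v = __riscv_vle8_v_u8m1((const uint8_t *)src, vl);
        vbool8_t eq  = __riscv_vmseq_vx_u8m1_b8(v, (uint8_t)c, vl);
        count += __riscv_vcpop_m_b8(eq, vl);
    }
    return count;
}
---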
[Bug target/117421] New: [RISCV] Use byte comparison instead of word comparison
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117421

            Bug ID: 117421
           Summary: [RISCV] Use byte comparison instead of word comparison
           Product: gcc
           Version: 15.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: wojciech_mula at poczta dot onet.pl
  Target Milestone: ---

Consider this simple function:

---
#include <string_view>

bool ext_is_gzip(std::string_view ext) {
    return ext == "gzip";
}
---

For the x86 target, GCC nicely inlines the compile-time constant and produces
code like this (from GCC 15, with `-O3 -march=icelake-server`):

---
ext_is_gzip(std::basic_string_view<char, std::char_traits<char> >):
        xorl    %eax, %eax
        cmpq    $4, %rdi
        je      .L5
        ret
.L5:
        cmpl    $1885960807, (%rsi)
        sete    %al
        ret
---

However, for the RISC-V target, GCC emits a plain byte-by-byte comparison
(riscv64-unknown-linux-gnu-g++ (crosstool-NG UNKNOWN) 15.0.0 20241031
(experimental), with `-O3 -march=rv64gcv`):

---
ext_is_gzip(std::basic_string_view<char, std::char_traits<char> >):
        addi    sp,sp,-16
        sd      a0,0(sp)
        sd      a1,8(sp)
        li      a5,4
        beq     a0,a5,.L9
        li      a0,0
        addi    sp,sp,16
        jr      ra
.L9:
        lbu     a4,0(a1)
        li      a5,103
        beq     a4,a5,.L10
.L3:
        li      a0,1
.L4:
        xori    a0,a0,1
        addi    sp,sp,16
        jr      ra
.L10:
        lbu     a4,1(a1)
        li      a5,122
        bne     a4,a5,.L3
        lbu     a4,2(a1)
        li      a5,105
        bne     a4,a5,.L3
        lbu     a4,3(a1)
        li      a5,112
        li      a0,0
        beq     a4,a5,.L4
        li      a0,1
        j       .L4
---

My wild guess is that by default we have a high cost for materializing large
compile-time constants on RISC-V. However, when I checked what is emitted for
"gzip" & "pizg" given as u32 constants, we get:

---
   0:   677a7537                lui     a0,0x677a7
   4:   9705051b                addiw   a0,a0,-1680 # 677a6970
   8:   70698537                lui     a0,0x70698
   c:   a675051b                addiw   a0,a0,-1433 # 70697a67
---

A godbolt link for convenience: https://godbolt.org/z/e16bP369n
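For reference, a portable source-level sketch of the word-wise comparison the
report asks for (hypothetical helper, not simdutf or libstdc++ code): load the
four bytes with memcpy and compare against the little-endian constant for
"gzip", which is the 1885960807 (0x70697a67) seen in the x86 output above.

---
#include <cstring>
#include <cstdint>
#include <string_view>

bool ext_is_gzip_word(std::string_view ext) {
    if (ext.size() != 4) return false;
    std::uint32_t w;
    std::memcpy(&w, ext.data(), sizeof(w));     // unaligned-safe 4-byte load
    return w == UINT32_C(0x70697a67);           // 'g' 'z' 'i' 'p' in little-endian order
}
---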
[Bug target/109279] RISC-V: complex constants synthesized should be improved
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109279

--- Comment #20 from Wojciech Mula ---
This constant is worth checking (it appears in division by 10):

```
unsigned long ccd() {
    return 0xcccccccccccccccd;
}
```

riscv64-unknown-linux-gnu-g++ (crosstool-NG UNKNOWN) 15.0.0 2024 (experimental):

```
ccd():
        li      a0,858992640
        li      a5,858992640
        addi    a0,a0,819
        addi    a5,a5,818
        slli    a0,a0,32
        add     a0,a0,a5
        xori    a0,a0,-1
        ret
```

clang 20:

```
ccd():
        lui     a0, 838861
        addiw   a0, a0, -819
        slli    a1, a0, 32
        add     a0, a0, a1
        ret
```
[Bug target/117421] [RISCV] Use byte comparison instead of word comparison
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117421

--- Comment #4 from Wojciech Mula ---
However, since there is no word-wise "set if equal" instruction, I think this
sequence would be better:

```
        lbu     a0, 1(a1)
        lbu     a2, 0(a1)
        lbu     a3, 2(a1)
        lb      a1, 3(a1)
        xori    a0, a0, 'z'
        xori    a2, a2, 'g'
        xori    a3, a3, 'i'
        xori    a1, a1, 'p'
        or      a0, a0, a2
        or      a1, a1, a3
        or      a0, a0, a1
        seqz    a0, a0
```
[Bug target/117421] [RISCV] Use byte comparison instead of word comparison
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117421

--- Comment #3 from Wojciech Mula ---
It's worth noting that Clang first synthesizes a 32-bit word from the
individual bytes and then uses a single comparison:

```
ext_is_gzip(std::basic_string_view<char, std::char_traits<char>>):
        li      a2, 4
        bne     a0, a2, .LBB0_2
        lbu     a0, 1(a1)
        lbu     a2, 0(a1)
        lbu     a3, 2(a1)
        lb      a1, 3(a1)
        slli    a0, a0, 8
        or      a0, a0, a2
        slli    a3, a3, 16
        slli    a1, a1, 24
        or      a1, a1, a3
        or      a0, a0, a1
        lui     a1, 460440
        addiw   a1, a1, -1433
        xor     a0, a0, a1
        seqz    a0, a0
        ret
.LBB0_2:
        li      a0, 0
        ret
```
[Bug target/117421] [RISCV] Use byte comparison instead of word comparison
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117421

--- Comment #2 from Wojciech Mula ---
First of all, thanks for looking at this!

> I should note that -mno-strict-align still does not do it but that is because
> it might be slow still to do unaligned access.

OK, maybe `-mno-strict-align` should issue a warning in such cases?
[Bug target/119911] New: [RVV] Suboptimal code generation for multiple extracting 0-th elements of vector
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119911

            Bug ID: 119911
           Summary: [RVV] Suboptimal code generation for multiple extracting
                    0-th elements of vector
           Product: gcc
           Version: 16.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: wojciech_mula at poczta dot onet.pl
  Target Milestone: ---

I observed the issue on GCC 14.2, but it's still visible on the godbolt trunk,
which is 16.0.0 20250423 (experimental).

Summary: when we have multiple `vmv.x.s` instructions (move the 0th vector
element into a scalar register), GCC always emits a shift-left followed by a
shift-right to mask the lower bits of the result (to 8 or 16 bits). However,
when there are more instances of `vmv.x.s`, it would be profitable to create
the mask in a register (it is a compile-time constant) and use bit-and for the
masking. Clang performs this optimization.

Consider this simple function:

---test.cpp---
#include <cstdint>
#include <riscv_vector.h>

uint64_t sum_of_first_three(vuint16m1_t x) {
    const uint64_t mask = 0xffff;
    const auto vl = __riscv_vsetvlmax_e16m1();

    return uint64_t(__riscv_vmv_x_s_u16m1_u16(x))
         + uint64_t(__riscv_vmv_x_s_u16m1_u16(__riscv_vslidedown(x, 1, vl)))
         + uint64_t(__riscv_vmv_x_s_u16m1_u16(__riscv_vslidedown(x, 2, vl)));
}
---eof---

When compiled with `-O3 -march=rv64gcv`, the assembly is:

---
sum_of_first_three(__rvv_uint16m1_t):
        vsetvli a5,zero,e16,m1,ta,ma
        vslidedown.vi   v10,v8,1
        vslidedown.vi   v9,v8,2
        vmv.x.s a5,v8
        vmv.x.s a4,v10
        vmv.x.s a0,v9
        slli    a4,a4,48
        slli    a5,a5,48
        srli    a4,a4,48
        srli    a5,a5,48
        slli    a0,a0,48
        add     a5,a5,a4
        srli    a0,a0,48
        add     a0,a5,a0
        ret
---

godbolt link: https://godbolt.org/z/hPrM8vz4v
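A hand-written sketch (not compiler output, register allocation illustrative)
of the mask-in-register variant described above: build 0xffff once and mask
each extracted element with a single and, saving one instruction per extract.

---
sum_of_first_three(__rvv_uint16m1_t):
        vsetvli a5,zero,e16,m1,ta,ma
        vslidedown.vi   v10,v8,1
        vslidedown.vi   v9,v8,2
        vmv.x.s a5,v8
        vmv.x.s a4,v10
        vmv.x.s a0,v9
        lui     a1,16           # a1 = 0x10000
        addi    a1,a1,-1        # a1 = 0xffff, built once
        and     a5,a5,a1
        and     a4,a4,a1
        and     a0,a0,a1
        add     a5,a5,a4
        add     a0,a5,a0
        ret
---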
[Bug driver/109605] -fno-tree-vectorize does not disable vectorizer
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109605

Wojciech Mula changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |wojciech_mula at poczta dot
                   |                            |onet.pl

--- Comment #3 from Wojciech Mula ---
This is somewhat related. I needed to generate a particular procedure without
any vector instruction (the surrounding code is free to use RVV instructions).
But when the code uses the builtin `memcpy`, GCC still emits some vector
instructions. The cure is setting `-fno-builtin`, because the pragma does not
accept that option.

The attached sample code comes from the simdutf project (src/scalar/utf.f),
godbolt link for convenience: https://godbolt.org/z/Ya91he99v

---no-vector.cpp---
#include <cstddef>
#include <cstdint>
#include <cstring>

#pragma GCC optimize ("no-tree-vectorize")
#pragma GCC optimize ("no-tree-loop-vectorize")
#pragma GCC optimize ("no-tree-slp-vectorize")
#pragma GCC optimize ("no-builtin") // not accepted by the compiler

bool validate(const char *buf, size_t len) noexcept {
  const uint8_t *data = reinterpret_cast<const uint8_t *>(buf);
  uint64_t pos = 0;
  uint32_t code_point = 0;
  while (pos < len) {
    // check if the next 16 bytes are ascii.
    uint64_t next_pos = pos + 16;
    if (next_pos <= len) {
      // if it is safe to read 16 more bytes, check that they are ascii
      uint64_t v1;
      std::memcpy(&v1, data + pos, sizeof(uint64_t));
      uint64_t v2;
      std::memcpy(&v2, data + pos + sizeof(uint64_t), sizeof(uint64_t));
      uint64_t v{v1 | v2};
      if ((v & 0x8080808080808080) == 0) {
        pos = next_pos;
        continue;
      }
    }
    unsigned char byte = data[pos];

    while (byte < 0b10000000) {
      if (++pos == len) { return true; }
      byte = data[pos];
    }

    if ((byte & 0b11100000) == 0b11000000) {
      next_pos = pos + 2;
      if (next_pos > len) { return false; }
      if ((data[pos + 1] & 0b11000000) != 0b10000000) { return false; }
      // range check
      code_point = (byte & 0b00011111) << 6 | (data[pos + 1] & 0b00111111);
      if ((code_point < 0x80) || (0x7ff < code_point)) { return false; }
    } else if ((byte & 0b11110000) == 0b11100000) {
      next_pos = pos + 3;
      if (next_pos > len) { return false; }
      if ((data[pos + 1] & 0b11000000) != 0b10000000) { return false; }
      if ((data[pos + 2] & 0b11000000) != 0b10000000) { return false; }
      // range check
      code_point = (byte & 0b00001111) << 12 |
                   (data[pos + 1] & 0b00111111) << 6 |
                   (data[pos + 2] & 0b00111111);
      if ((code_point < 0x800) || (0xffff < code_point) ||
          (0xd7ff < code_point && code_point < 0xe000)) {
        return false;
      }
    } else if ((byte & 0b11111000) == 0b11110000) { // 0b11110000
      next_pos = pos + 4;
      if (next_pos > len) { return false; }
      if ((data[pos + 1] & 0b11000000) != 0b10000000) { return false; }
      if ((data[pos + 2] & 0b11000000) != 0b10000000) { return false; }
      if ((data[pos + 3] & 0b11000000) != 0b10000000) { return false; }
      // range check
      code_point = (byte & 0b00000111) << 18 |
                   (data[pos + 1] & 0b00111111) << 12 |
                   (data[pos + 2] & 0b00111111) << 6 |
                   (data[pos + 3] & 0b00111111);
      if (code_point <= 0xffff || 0x10ffff < code_point) { return false; }
    } else {
      // we may have a continuation
      return false;
    }
    pos = next_pos;
  }
  return true;
}
---eof---

The head of the generated asm:

---
validate(char const*, unsigned long):
        beq     a1,zero,.L32
        li      a4,2139062272
        addi    a4,a4,-129
        slli    a2,a4,32
        addi    sp,sp,-16
        add     a2,a2,a4
        li      a5,0
        xori    a2,a2,-1
        addi    a7,sp,8
        vsetivli        zero,8,e8,mf2,ta,ma   ## here
.L2:
        addi    a3,a5,16
        add     t1,a0,a5
        bltu    a1,a3,.L36
        vle8.v  v1,0(t1)        #
        addi    a4,a5,8
        add     a4,a0,a4
        vse8.v  v1,0(sp)        #
        vle8.v  v1,0(a4)        #
        ld      a4,0(sp)
        vse8.v  v1,0(a7)        #
        ld      a6,8(sp)
        or      a4,a4,a6
        and     a4,a4,a2
        bne     a4,zero,.L36
        mv      a5,a3
.L6:
---
[Bug target/119040] New: [PPC/Altivec] Missing bit-level optimization (select)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119040

            Bug ID: 119040
           Summary: [PPC/Altivec] Missing bit-level optimization (select)
           Product: gcc
           Version: unknown
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: wojciech_mula at poczta dot onet.pl
  Target Milestone: ---

This comes from real-world usage. Suppose we have a vector of words and we
want to move around some bit-fields of those words. We isolate the bit-fields
with `and`, then shift them to the desired positions in the final word, but we
never end up with overlapping bit-fields.

Below is a sample that merges two 6-bit fields into a 12-bit one in 32-bit
elements:

---test.cpp---
#include <stdint.h>
#include <altivec.h>

using vec_u32_t = __vector uint32_t;

vec_u32_t merge_2x6_bits(const vec_u32_t a, const vec_u32_t b) {
    vec_u32_t t0 = vec_and(a, vec_splats(uint32_t(0x003f)));
    vec_u32_t t1 = vec_and(b, vec_splats(uint32_t(0x3f00)));
    vec_u32_t t2 = vec_sr(t1, vec_splats(uint32_t(2)));
    return vec_or(t2, t0);
}
---eof---

GCC 14.2.0 with flags `-O3 -maltivec` produces the following code (I omitted
the constants .LC0 & .LC1):

        lis      10,.LC0@ha
        lis      9,.LC1@ha
        la       10,.LC0@l(10)
        la       9,.LC1@l(9)
        lvx      13,0,10
        vspltisw 0,2
        lvx      1,0,9
        vand     3,3,13
        vsrw     3,3,0
        vand     2,2,1
        vor      2,3,2

Since the bit-fields do not overlap, the last sequence of `vand` and `vor` can
be replaced with `vsel`. Instead of `((a & mask1) >> 2) | (b & mask2)` we may
have `select(mask1 >> 2, a >> 2, b & mask2)` [provided the prototype of
`select` is select(condition, true_value, false_value)].
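A source-level sketch of the vec_sel form suggested above (same semantics as
`merge_2x6_bits`, one fewer logical operation): t2 has non-zero bits only at
positions 6..11, so selecting a's bits 0..5 into it reproduces the original
and/or pair.

---sketch---
/* Sketch: vec_sel(x, y, m) computes (x & ~m) | (y & m); with m = 0x3f and
   t2 having no bits below position 6, this equals t2 | (a & 0x3f). */
vec_u32_t merge_2x6_bits_sel(const vec_u32_t a, const vec_u32_t b) {
    const vec_u32_t t1 = vec_and(b, vec_splats(uint32_t(0x3f00)));
    const vec_u32_t t2 = vec_sr(t1, vec_splats(uint32_t(2)));
    return vec_sel(t2, a, vec_splats(uint32_t(0x003f)));   // take bits 0..5 from a
}
---eof---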
[Bug target/120141] New: [RVV] Noop are not removed
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=120141

            Bug ID: 120141
           Summary: [RVV] Noop are not removed
           Product: gcc
           Version: 16.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: wojciech_mula at poczta dot onet.pl
  Target Milestone: ---

I observed that RVV no-ops, like shifting by 0 or adding 0, are not removed
from the program. I fully understand that a compiler cannot do this when
`vsetvli` changes the mode between operations. But in the sample program `v8`
is written and then shifted under the same vector mode. Is there any
non-obvious reason for this that comes from the RVV spec?

Sample program:

---
#include <riscv_vector.h>

vuint16m1_t naive_avg(vuint16m1_t x, vuint16m1_t y) {
    const auto vl = __riscv_vsetvlmax_e16m1();
    const auto a  = __riscv_vadd(x, y, vl);
    return __riscv_vsrl(a, 0, vl);
}
---

Compiled with `-O3 -march=rv64gcv` it yields the following assembly:

---
naive_avg(__rvv_uint16m1_t, __rvv_uint16m1_t):
        vsetvli a5,zero,e16,m1,ta,ma
        vadd.vv v8,v8,v9
        vsrl.vi v8,v8,0
        ret
---