Re: [PATCH] Fix RTL simplifications of FFS, POPCOUNT and PARITY.
The motivation for the current design (requiring the result and the operand
to have the same mode) was from PR middle-end/50161.  The challenge there is
that when the RTL optimizers can determine that the operand is a constant,
simplify_rtx sees just a CONST_INT with VOIDmode and therefore doesn't know
what the original/intended mode of the operation is/should be.  Using
patterns with explicit truncates, sign or zero extensions avoids these
problems, but still allows backend insns where the result mode differs from
the operand mode.

> On 1 Jan 2023, at 21:03, Segher Boessenkool wrote:
> On Sun, Jan 01, 2023 at 03:55:26PM -, Roger Sayle wrote:
>> In 2011, the rtl.texi documentation was updated to reflect that the
>> modes of the RTX unary operations FFS, POPCOUNT and PARITY must
>> match those of their operands.
>
> Is that not a limitation we should try to get rid of?  It does not
> really make any sense after all.
>
>> Unfortunately, some of the transformations
>> in simplify-rtx.cc predate this tightening of RTL semantics, and have
>> not (until now) been updated/fixed.  i.e. The POPCOUNT and PARITY
>> optimizations were "correct" when I originally added them back in 2007.
>
> What would be needed to fix this the "other" way?  Just update the
> documentation?  Or is there some code that requires this strange
> constraint, maybe even some checking thing?
>
> If we *do* want this strange limitation, we really should have some RTL
> check that enforces it, makes sure we find it if the rules are broken.
>
> Segher
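As a sanity check of the value identities behind the simplifications in
question (POPCOUNT of a ZERO_EXTEND, and PARITY of a ZERO_EXTEND or
SIGN_EXTEND), the small standalone C program below exercises the ordinary
GCC builtins.  It is an illustration of the arithmetic only, not of the
simplify-rtx.cc code itself.

#include <assert.h>
#include <stdint.h>

int
main (void)
{
  for (int i = 0; i < 256; i++)
    {
      uint8_t narrow = (uint8_t) i;
      uint32_t zext = narrow;                               /* zero_extend QI->SI  */
      uint32_t sext = (uint32_t) (int32_t) (int8_t) narrow; /* sign_extend QI->SI  */

      /* popcount (zero_extend (x)) == popcount (x): no 1-bits are added.  */
      assert (__builtin_popcount (zext) == __builtin_popcount (narrow));

      /* parity (zero_extend (x)) == parity (x), for the same reason.  */
      assert (__builtin_parity (zext) == __builtin_parity (narrow));

      /* parity (sign_extend (x)) == parity (x): sign extension appends 24
         copies of the sign bit here, and 24 is even, so parity is
         preserved.  This relies on the width difference being even, as it
         is for the ordinary 8/16/32/64-bit modes.  */
      assert (__builtin_parity (sext) == __builtin_parity (narrow));
    }
  return 0;
}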
Re: [PATCH] Fix RTL simplifications of FFS, POPCOUNT and PARITY.
Hi Jeff,

> On 2 Jan 2023, at 15:45, Jeff Law wrote:
> On 1/1/23 08:55, Roger Sayle wrote:
>> In 2011, the rtl.texi documentation was updated to reflect that the
>> modes of the RTX unary operations FFS, POPCOUNT and PARITY must
>> match those of their operands.  Unfortunately, some of the transformations
>> in simplify-rtx.cc predate this tightening of RTL semantics, and have
>> not (until now) been updated/fixed.  i.e. The POPCOUNT and PARITY
>> optimizations were "correct" when I originally added them back in 2007.
>> Segher requested that I split this piece out from a fix for PR 106594 in
>> https://gcc.gnu.org/pipermail/gcc-patches/2022-September/601501.html
>>
>> This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
>> and make -k check, both with and without --target_board=unix{-m32},
>> with no new failures.  Ok for mainline?
>>
>> 2023-01-01  Roger Sayle
>>
>> gcc/ChangeLog
>> 	* gcc/simplify-rtx.cc (simplify_unary_operation_1) <case FFS>:
>> 	Avoid generating FFS with mismatched operand and result modes, by
>> 	using an explicit SIGN_EXTEND/ZERO_EXTEND instead.
>> 	<case POPCOUNT>: Likewise, for POPCOUNT of ZERO_EXTEND.
>> 	<case PARITY>: Likewise, for PARITY of {ZERO,SIGN}_EXTEND.
>
> ?!?  The docs still seem to indicate to me that the modes of the input
> and output operands can differ.
>
> Let's take PARITY as an example:
>
>> @cindex @code{parity@var{m}2} instruction pattern
>> @item @samp{parity@var{m}2}
>> Store into operand 0 the parity of operand 1, i.e.@: the number of 1-bits
>> in operand 1 modulo 2.  @var{m} is either a scalar or vector integer mode.
>> When it is a scalar, operand 1 has mode @var{m} but operand 0 can have
>> whatever scalar integer mode is suitable for the target.
>
> The mode of the pattern name has to match the mode of the input operand.
> The mode of the output operand can differ from the mode of the input
> operand.  We seem to have a disagreement on the documented semantics of
> these opcodes.

The documentation that you're looking at is the definition of the parity
optab in md.texi, not the definition of the PARITY rtx in rtl.texi.  The
distinction is subtle.  Hence a backend can define paritysiqi2, but in the
RTL pattern that it expands to, the unary PARITY operator must have the
same result mode as its operand mode, wrapped in either a truncate or an
extension if necessary.

I hope this helps.

Cheers,
Roger
--
[JAVA PATCH] Enable more array bounds check elimination
It has been a while since my last contribution.  The following patch allows
GCC's optimizers to more aggressively eliminate and optimize Java array
bounds checks.  The results are quite impressive, for example producing a
26% performance improvement on the sieve.java benchmark given at
http://keithlea.com/javabench/ on x86_64-pc-linux-gnu, reducing the runtime
for 1 million iterations from 31.5 seconds on trunk to 25.0 seconds with
this patch, in fact eliminating all array bounds checks.  This is close to
the 22.3s of an equivalent C/C++ implementation, and significantly closes
the gap to the Java HotSpot(TM) JIT at 23.0 seconds.

The approach is to provide sufficient information in the gimple generated
by the gcj front-end to allow the optimizers to do their thing.  For array
allocations of constant length, I propose generating an additional (cheap)
write to the length field of the array returned from _Jv_NewPrimArray,
which is then sufficient to allow this value to propagate throughout the
optimizers.

This is probably best explained by a simple example.  Consider the array
initializer below:

  private static int mk1[] = { 71, 85, 95 };

which gets compiled into the Java bytecode sequence below:

   0: iconst_3
   1: newarray       int
   3: dup
   4: iconst_0
   5: bipush         71
   7: iastore
   8: dup
   9: iconst_1
  10: bipush         85
  12: iastore
  13: dup
  14: iconst_2
  15: bipush         95
  17: iastore
  18: putstatic     #3        // Field mk1:[I
  21: return

Currently, the .004t.gimple generated by gcj for the array allocation is
the cryptic:

  #slot#0#0 = 3;
  #ref#0#2 = _Jv_NewPrimArray (&_Jv_intClass, #slot#0#0);
  #ref#1#4 = #ref#0#2;
  _ref_1_4.6 = #ref#1#4;

which unfortunately doesn't provide many clues for the middle-end, so we
end up generating the following .210t.optimized:

  void * _3 = _Jv_NewPrimArray (&_Jv_intClass, 3);
  int _4 = MEM[(struct int[] *)_3].length;
  unsigned int _5 = (unsigned int) _4;
  if (_4 == 0) goto ; else goto ;
  :
  _Jv_ThrowBadArrayIndex (0);
  :
  MEM[(int *)_3 + 12B] = 71;
  if (_5 == 1) goto ; else goto ;
  :
  _Jv_ThrowBadArrayIndex (1);
  :
  MEM[(int *)_3 + 16B] = 85;
  if (_5 == 2) goto ; else goto ;
  :
  _Jv_ThrowBadArrayIndex (2);
  :
  MEM[(int *)_3 + 20B] = 95;
  mk1 = _3;
  return;

which obviously contains three run-time array bounds checks.  These same
checks appear in the x86_64 assembly language:

        subq    $8, %rsp
        xorl    %eax, %eax
        movl    $3, %esi
        movl    $_Jv_intClass, %edi
        call    _Jv_NewPrimArray
        movl    8(%rax), %edx
        testl   %edx, %edx
        je      .L13
        cmpl    $1, %edx
        movl    $71, 12(%rax)
        je      .L14
        cmpl    $2, %edx
        movl    $85, 16(%rax)
        je      .L15
        movl    $95, 20(%rax)
        movq    %rax, _ZN9TestArray3mk1E(%rip)
        addq    $8, %rsp
        ret
.L13:
        xorl    %edi, %edi
        xorl    %eax, %eax
        call    _Jv_ThrowBadArrayIndex
.L15:
        movl    $2, %edi
        xorl    %eax, %eax
        call    _Jv_ThrowBadArrayIndex
.L14:
        movl    $1, %edi
        xorl    %eax, %eax
        call    _Jv_ThrowBadArrayIndex

With the patch below, we now generate the much more informative
.004t.gimple for this:

  D.926 = _Jv_NewPrimArray (&_Jv_intClass, 3);
  D.926->length = 3;

This additional write to the array length field of the newly allocated
array enables much more simplification.  The resulting .210t.optimized for
our array initialization now becomes:

  struct int[3] * _3;

  _3 = _Jv_NewPrimArray (&_Jv_intClass, 3);
  MEM[(int *)_3 + 8B] = { 3, 71, 85, 95 };
  mk1 = _3;
  return;

And the x86_64 assembly code is also much prettier:

        subq    $8, %rsp
        movl    $3, %esi
        movl    $_Jv_intClass, %edi
        xorl    %eax, %eax
        call    _Jv_NewPrimArray
        movdqa  .LC0(%rip), %xmm0
        movq    %rax, _ZN9TestArray3mk1E(%rip)
        movups  %xmm0, 8(%rax)
        addq    $8, %rsp
        ret

.LC0:
        .long   3
        .long   71
        .long   85
        .long   95
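To see why that single extra store is enough, the rough standalone C
analogue below may help.  It is an illustration only: the struct layout and
the helpers alloc_int_array and throw_bad_index are invented stand-ins, not
gcj's real data structures or runtime entry points.  Once a constant store
to the length field is visible after the allocation call, each bounds check
against a constant index becomes a comparison of two constants and can be
folded away.

#include <stdio.h>
#include <stdlib.h>

struct int_array { int length; int data[]; };

/* Stand-in for _Jv_NewPrimArray; kept out of line so the caller's
   optimizers treat the allocation itself as opaque.  */
static __attribute__ ((noinline)) struct int_array *
alloc_int_array (int n)
{
  struct int_array *a = malloc (sizeof *a + n * sizeof (int));
  if (!a)
    abort ();
  a->length = n;
  return a;
}

/* Stand-in for _Jv_ThrowBadArrayIndex.  */
static void
throw_bad_index (int i)
{
  fprintf (stderr, "bad array index %d\n", i);
  abort ();
}

static struct int_array *
make_mk1 (void)
{
  struct int_array *a = alloc_int_array (3);
  a->length = 3;                /* the extra, seemingly redundant store  */

  /* With a->length now known to be 3 in this function, each check below
     can be folded away at -O2, taking the throw calls with it.  */
  if (0 >= a->length) throw_bad_index (0);
  a->data[0] = 71;
  if (1 >= a->length) throw_bad_index (1);
  a->data[1] = 85;
  if (2 >= a->length) throw_bad_index (2);
  a->data[2] = 95;
  return a;
}

int
main (void)
{
  struct int_array *mk1 = make_mk1 ();
  printf ("%d %d %d\n", mk1->data[0], mk1->data[1], mk1->data[2]);
  free (mk1);
  return 0;
}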
Achieving this result required two minor tweaks.

The first is to allow the array length constant to reach the newarray
call, by allowing constants to remain on the quickstack.  This allows the
call to _Jv_NewPrimArray to have a constant integer argument instead of
the opaque #slot#0#0.  The second is that, in the code that constructs the
call to _Jv_NewPrimArray, we wrap it in a COMPOUND_EXPR, allowing us to
insert the superfluous, but helpful, write to the length field.

Whilst working on this improvement, I also noticed that the array bounds
checks we were initially generating could also be improved.  Currently, an
array bounds check in .004t.gimple looks like:

  D.925 = MEM[(struct int[] *)_ref_1_4.6].length;
  D.926 = (unsigned int) D.925;
  if (_slot_2_5.9 >= D.926) goto ; else goto ;
  :
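As an aside on the shape of that check, the cast to unsigned is the usual
trick that lets a single comparison reject both negative and too-large
indices.  A tiny standalone C illustration (the function below is invented
for the example, not gcj code):

#include <assert.h>

/* (unsigned) index < (unsigned) length is equivalent to
   index >= 0 && index < length whenever length is non-negative,
   because a negative index wraps to a huge unsigned value.  */
static int
in_bounds (int index, int length)
{
  return (unsigned int) index < (unsigned int) length;
}

int
main (void)
{
  assert (!in_bounds (-1, 3));   /* negative index rejected  */
  assert ( in_bounds ( 0, 3));
  assert ( in_bounds ( 2, 3));
  assert (!in_bounds ( 3, 3));   /* index == length rejected  */
  return 0;
}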
[JAVA PATCH] Builtin support for popcount* and bswap* functions
The following patch provides builtin support for byte swapping and bit
counting.  On suitable hardware, these generate the x86 popcount and bswap
instructions, as also generated by Sun's HotSpot(TM) JIT/JVM for these
method calls.

  java.lang.Integer.bitCount     -> __builtin_popcount
  java.lang.Long.bitCount        -> __builtin_popcountl
  java.lang.Short.reverseBytes   -> __builtin_bswap16
  java.lang.Integer.reverseBytes -> __builtin_bswap32
  java.lang.Long.reverseBytes    -> __builtin_bswap64

New test cases have been added to libjava to double check nothing breaks.

Whilst I was there, I noticed that the math builtins (many of which I added
support for back in 2003/2004) weren't marked as ECF_LEAF, indicating that
they don't/can't invoke any user-specified functions.  This has also been
fixed in the patch below.

The following patch has been tested on x86_64-pc-linux-gnu with a full
make bootstrap and make check with no new failures/regressions.
Ok for stage1 once it reopens?

Cheers,
Roger
--
Roger Sayle, Ph.D.
CEO and founder
NextMove Software Limited
Registered in England No. 07588305
Registered Office: Innovation Centre (Unit 23), Cambridge Science Park,
Cambridge CB4 0EY

2016-02-21  Roger Sayle

gcc/java:
	* builtins.c (java_builtins): Use popcount* and bswap* builtins to
	implement bitCount() and reverseBytes() methods in java.lang.Integer
	and friends.
	(initialize_builtins): Annotate math builtins with ECF_LEAF.  Call
	define_builtin for the new popcount* and bswap* builtins.

libjava:
	* testsuite/libjava.lang/BuiltinBitCount.java: New test case.
	* testsuite/libjava.lang/BuiltinReverseBytes.java: Likewise.

Index: builtins.c
===================================================================
--- builtins.c	(revision 233190)
+++ builtins.c	(working copy)
@@ -98,6 +98,11 @@
   { { "java.lang.Math" }, { "sin" }, NULL, BUILT_IN_SIN },
   { { "java.lang.Math" }, { "sqrt" }, NULL, BUILT_IN_SQRT },
   { { "java.lang.Math" }, { "tan" }, NULL, BUILT_IN_TAN },
+  { { "java.lang.Integer" }, { "bitCount" }, NULL, BUILT_IN_POPCOUNT },
+  { { "java.lang.Integer" }, { "reverseBytes" }, NULL, BUILT_IN_BSWAP32 },
+  { { "java.lang.Long" }, { "bitCount" }, NULL, BUILT_IN_POPCOUNTL },
+  { { "java.lang.Long" }, { "reverseBytes" }, NULL, BUILT_IN_BSWAP64 },
+  { { "java.lang.Short" }, { "reverseBytes" }, NULL, BUILT_IN_BSWAP16 },
   { { "java.lang.Float" }, { "intBitsToFloat" }, convert_real,
     (enum built_in_function) 0 },
   { { "java.lang.Double" }, { "longBitsToDouble" }, convert_real,
@@ -483,6 +488,7 @@
   tree double_ftype_double, double_ftype_double_double;
   tree float_ftype_float_float;
   tree boolean_ftype_boolean_boolean;
+  tree int_ftype_int;
   int i;
 
   for (i = 0; java_builtins[i].builtin_code != END_BUILTINS; ++i)
@@ -507,50 +513,77 @@
 				double_type_node, double_type_node, NULL_TREE);
   define_builtin (BUILT_IN_FMOD, "__builtin_fmod",
-		  double_ftype_double_double, "fmod", ECF_CONST);
+		  double_ftype_double_double, "fmod", ECF_CONST | ECF_LEAF);
   define_builtin (BUILT_IN_FMODF, "__builtin_fmodf",
-		  float_ftype_float_float, "fmodf", ECF_CONST);
+		  float_ftype_float_float, "fmodf", ECF_CONST | ECF_LEAF);
   define_builtin (BUILT_IN_ACOS, "__builtin_acos",
 		  double_ftype_double, "_ZN4java4lang4Math4acosEJdd",
-		  ECF_CONST);
+		  ECF_CONST | ECF_LEAF);
   define_builtin (BUILT_IN_ASIN, "__builtin_asin",
 		  double_ftype_double, "_ZN4java4lang4Math4asinEJdd",
-		  ECF_CONST);
+		  ECF_CONST | ECF_LEAF);
   define_builtin (BUILT_IN_ATAN, "__builtin_atan",
 		  double_ftype_double, "_ZN4java4lang4Math4atanEJdd",
-		  ECF_CONST);
+		  ECF_CONST | ECF_LEAF);
   define_builtin (BUILT_IN_ATAN2, "__builtin_atan2",
 		  double_ftype_double_double, "_ZN4java4lang4Math5atan2EJddd",
-		  ECF_CONST);
+		  ECF_CONST | ECF_LEAF);
   define_builtin (BUILT_IN_CEIL, "__builtin_ceil",
 		  double_ftype_double, "_ZN4java4lang4Math4ceilEJdd",
-		  ECF_CONST);
+		  ECF_CONST | ECF_LEAF);
   define_builtin (BUILT_IN_COS, "__builtin_cos",
 		  double_ftype_double, "_ZN4java4lang4Math3cosEJdd",
-		  ECF_CONST);
+		  ECF_CONST | ECF_LEAF);
   define_builtin (BUILT_IN_EXP, "__builtin_exp",
 		  double_ftype_double, "_ZN4java4lang4Math3expEJdd",
-		  ECF_CONST);
+		  ECF_CONST | ECF_LEAF);
   define_builtin (BUILT_IN_FLOOR, "__builtin_floor",
 		  double_ftype_double, "_ZN4java4lang4Math5floorEJdd",
-		  ECF_CONST);
+		  ECF_CONST | ECF_LEAF);
   define_builtin (BUILT_IN_LOG, "__builtin_log",
 		  double_ftype_double, "_ZN4java4lang4Math3logEJdd",
-		  ECF_CO
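For completeness, the builtins that the new java_builtins entries map onto
are ordinary GCC builtins and can be exercised directly from C.  The short
sketch below simply demonstrates them; it is an illustration only, and not
part of the patch or of the new libjava test cases:

#include <stdio.h>

int
main (void)
{
  /* java.lang.Integer.bitCount and java.lang.Long.bitCount  */
  printf ("%d\n", __builtin_popcount (0xffu));       /* prints 8   */
  printf ("%d\n", __builtin_popcountl (0xffffUL));   /* prints 16  */

  /* java.lang.Short/Integer/Long.reverseBytes  */
  printf ("0x%x\n", (unsigned) __builtin_bswap16 (0x1234));      /* 0x3412      */
  printf ("0x%x\n", (unsigned) __builtin_bswap32 (0x12345678u)); /* 0x78563412  */
  printf ("0x%llx\n",
          (unsigned long long) __builtin_bswap64 (0x1122334455667788ULL));
  return 0;
}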