Re: [PATCH] Fix RTL simplifications of FFS, POPCOUNT and PARITY.

2023-01-01 Thread ro...@nextmovesoftware.com


The motivation for the current design (requiring the result and the operand to 
have
the same mode) was from PR middle-end/50161.  The challenge there is that when
the RTL optimizers can determine that the operand is a constant, simplify_rtx 
sees
just a CONST_INT with VOIDmode and therefore doesn't know what the 
original/intended
mode of the operation is/should be.

Using patterns with explicit truncates, sign or zero extensions avoids these 
problems,
but still allows backend insns where the result mode differs from the operand 
mode.

> On 1 Jan 2023, at 21:03, Segher Boessenkool  
> wrote:
> On Sun, Jan 01, 2023 at 03:55:26PM -, Roger Sayle wrote:
>> In 2011, the rtl.texi documentation was updated to reflect that the
>> modes of the RTX unary operations FFS, POPCOUNT and PARITY must
>> match those of their operands.
> 
> Is that not a limitation we should try to get rid of?  It does not
> really make any sense after all.
> 
>> Unfortunately, some of the transformations
>> in simplify-rtx.cc predate this tightening of RTL semantics, and have
>> not (until now) been updated/fixed.  i.e. The POPCOUNT and PARITY
>> optimizations were "correct" when I originally added them back in 2007.
> 
> What would be needed to fix this the "other" way?  Just update the
> documentation?  Or is there some code that requires this strange
> constraint, maybe even some checking thing?
> 
> If we *do* want this strange limitation, we really should have some RTL
> check that enforces it, makes sure we find it if the rules are broken.
> 
> Segher



Re: [PATCH] Fix RTL simplifications of FFS, POPCOUNT and PARITY.

2023-01-02 Thread ro...@nextmovesoftware.com


Hi Jeff,

> On 2 Jan 2023, at 15:45, Jeff Law  wrote:
> On 1/1/23 08:55, Roger Sayle wrote:
>> In 2011, the rtl.texi documentation was updated to reflect that the
>> modes of the RTX unary operations FFS, POPCOUNT and PARITY must
>> match those of their operands.  Unfortunately, some of the transformations
>> in simplify-rtx.cc predate this tightening of RTL semantics, and have
>> not (until now) been updated/fixed.  i.e. The POPCOUNT and PARITY
>> optimizations were "correct" when I originally added them back in 2007.
>> Segher requested that I split this piece out from a fix for PR 106594 in
>> https://gcc.gnu.org/pipermail/gcc-patches/2022-September/601501.html
>> This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
>> and make -k check, both with and without --target_board=unix{-m32},
>> with no new failures.  Ok for mainline?
>> 2023-01-01  Roger Sayle  
>> gcc/ChangeLog
>>  * gcc/simplify-rtx.cc (simplify_unary_operation_1) :
>>  Avoid generating FFS with mismatched operand and result modes, by
>>  using an explicit SIGN_EXTEND/ZERO_EXTEND instead.
>>  : Likewise, for POPCOUNT of ZERO_EXTEND.
>>  : Likewise, for PARITY of {ZERO,SIGN}_EXTEND.
> ?!?  The docs still seem to indicate to me that the modes of the input and 
> output operands can differ.
> Let's take PARITY as an example:
> 
>> @cindex @code{parity@var{m}2} instruction pattern
>> @item @samp{parity@var{m}2}
>> Store into operand 0 the parity of operand 1, i.e.@: the number of 1-bits
>> in operand 1 modulo 2.
>> @var{m} is either a scalar or vector integer mode.  When it is a scalar,
>> operand 1 has mode @var{m} but operand 0 can have whatever scalar
>> integer mode is suitable for the target.  
> 
> The mode of the pattern name has to match the mode of the input operand.  The 
> mode of the
> output operand can differ from the mode of the input operand.  we seem to 
> have a disagreement
> on the documented semantics of these opcodes.

The documentation that you're looking at is the definition of the parity optab 
in
md.texi, not the definition of the PARITY rtx in rtl.texi.  The distinction is 
subtle.
Hence a backend can define paritysiqi2 but in the RTL pattern it expands to the
unary PARITY operator must have the same result type as its operand type,
wrapped in either a truncate or extension if necessary.

I hope this helps.

Cheers,
Roger
--



[JAVA PATCH] Enable more array bounds check elimination

2016-02-22 Thread ro...@nextmovesoftware.com

It has been a while since my last contribution.  The following patch allows 
GCC's optimizers
to more aggressively eliminate and optimize java array bounds checks.  The 
results are
quite impressive, for example producing a 26% performance improvement on the 
sieve.java
benchmark given at http://keithlea.com/javabench/ on x86_64-pc-linux-gnu, 
reducing the
runtime for 1 million iterations from 31.5 seconds on trunk, to 25.0s with this 
patch, in fact
eliminating all array bounds checks.  This is close to the 22.3s of an 
equivalent C/C++
implementation, and significantly closes the gap to Java HotSpot(TM) JIT at 
23.0 seconds.

The approach is to provide sufficient information in the gimple generated by 
the gcj front-end
to allow the optimizers to do their thing.  For array allocations of constant 
length, I propose
generating an additional (cheap) write to the array length field returned from 
_Jv_NewPrimArray,
which is then sufficient to allow this value to propagate throughout the 
optimizers.

This is probably best explained by a simple example.  Consider the array 
initializer below:

private static int mk1[] = { 71,85,95 };

which gets compiled into the java byte code sequence below:

   0: iconst_3
   1: newarray   int
   3: dup
   4: iconst_0
   5: bipush71
   7: iastore
   8: dup
   9: iconst_1
  10: bipush85
  12: iastore
  13: dup
  14: iconst_2
  15: bipush95
  17: iastore
  18: putstatic #3  // Field mk1:[I
  21: return

Currently, the .004t.gimple generated by gcj for the array allocation is the 
cryptic:

#slot#0#0 = 3;
#ref#0#2 = _Jv_NewPrimArray (&_Jv_intClass, #slot#0#0);
#ref#1#4 = #ref#0#2;
_ref_1_4.6 = #ref#1#4;

which unfortunately doesn't provide many clues for the middle-end, so we end up
generating the following .210t.optimized:

  void * _3 = _Jv_NewPrimArray (&_Jv_intClass, 3);
  int _4 = MEM[(struct int[] *)_3].length;
  unsigned int _5 = (unsigned int) _4;
  if (_4 == 0)
goto ;
  else
goto ;
  :
  _Jv_ThrowBadArrayIndex (0);
  :
  MEM[(int *)_3 + 12B] = 71;
  if (_5 == 1)
goto ;
  else
goto ;
  :
  _Jv_ThrowBadArrayIndex (1);
  :
  MEM[(int *)_3 + 16B] = 85;
  if (_5 == 2)
goto ;
  else
goto ;
  :
  _Jv_ThrowBadArrayIndex (2);
  :
  MEM[(int *)_3 + 20B] = 95;
  mk1 = _3;
  return;

which obviously contains three run-time array bounds checks.  These same checks 
appear
in the x86_64 assembly language:

subq$8, %rsp
xorl%eax, %eax
movl$3, %esi
movl$_Jv_intClass, %edi
call_Jv_NewPrimArray
movl8(%rax), %edx
testl   %edx, %edx
je  .L13
cmpl$1, %edx
movl$71, 12(%rax)
je  .L14
cmpl$2, %edx
movl$85, 16(%rax)
je  .L15
movl$95, 20(%rax)
movq%rax, _ZN9TestArray3mk1E(%rip)
addq$8, %rsp
ret
.L13:   xorl%edi, %edi
xorl%eax, %eax
call_Jv_ThrowBadArrayIndex
.L15:   movl$2, %edi
xorl%eax, %eax
call_Jv_ThrowBadArrayIndex
.L14:   movl$1, %edi
xorl%eax, %eax
call_Jv_ThrowBadArrayIndex


With the patch below, we now generate the much more informative .004t.gimple 
for this:

D.926 = _Jv_NewPrimArray (&_Jv_intClass, 3);
D.926->length = 3;

This additional write to the array length field of the newly allocated array 
enables much
more simplification.  The resulting .210t.optimized for our array 
initialization now becomes:

  struct int[3] * _3;
  _3 = _Jv_NewPrimArray (&_Jv_intClass, 3);
  MEM[(int *)_3 + 8B] = { 3, 71, 85, 95 };
  mk1 = _3;
  return;

And the x86_64 assembly code is also much prettier:

subq$8, %rsp
movl$3, %esi
movl$_Jv_intClass, %edi
xorl%eax, %eax
call_Jv_NewPrimArray
movdqa  .LC0(%rip), %xmm0
movq%rax, _ZN9TestArray3mk1E(%rip)
movups  %xmm0, 8(%rax)
addq$8, %rsp
ret
.LC0:   .long 3
.long 71
.long 85
.long 95


Achieving this result required two minor tweaks.   The first is to allow the 
array length constant
to reach the newarray call, by allowing constants to remain on the quickstack.  
This allows the
call to _Jv_NewPrimArray to have a constant integer argument instead of the 
opaque #slot#0#0.
Then in the code that constructs the call to _Jv_NewPrimArray we wrap it in a 
COMPOUND_EXPR
allowing us to insert the superfluous, but helpful, write to the length field.

Whilst working on this improvement I also noticed that the array bounds checks 
we were
initially generating could also be improved.  Currently, an array bound check 
in 004t.gimple
looks like:

D.925 = MEM[(struct int[] *)_ref_1_4.6].length;
D.926 = (unsigned int) D.925;
if (_slot_2_5.9 >= D.926) goto ; else goto ;
:
   

[JAVA PATCH] Builtin support for popcount* and bswap* functions

2016-02-22 Thread ro...@nextmovesoftware.com

The following patch provides builtin support for byte swapping and bit counting.
On suitable hardware, these generate the x86 popcount instructions, as also
generated by the SUN HotSpot JIT/JVM for these method calls.

java.lang.Integer.bitCount -> __builtin_popcount
java.lang.Long.bitCount -> __builtin_popcountl

java.lang.Short.reverseBytes -> __builtin_bswap16
java.lang.Integer.reverseBytes -> __builtin_bswap32
java.lang.Long.reverseBytes -> __builtin_bswap64

New test cases have been added to libjava to double check nothing breaks.
Whist I was there I noticed that the math builtins (many of which I added 
support for
back in 2003/2004) weren't marked as ECF_LEAF, indicating that they don't/can't
invoke any user specified functions.   This has also been fixed in the patch 
below.

The following patch has been tested on x86_64-pc-linux-gnu with a full make 
bootstrap
and make check with no new failures/regressions.

Ok for stage1 once it reopens?

Cheers,

Roger
--
Roger Sayle, Ph.D.
CEO and founder
NextMove Software Limited
Registered in England No. 07588305
Registered Office: Innovation Centre (Unit 23), Cambridge Science Park, 
Cambridge CB4 0EY

2016-02-21  Roger Sayle  

gcc/java:
* builtins.c (java_builtins): Use popcount* and bswap* builtins to
implement bitCount() and reverseBytes() methods in java.lang.Integer
and friends.
(initialize_builtins): Annotate math builtins with ECF_LEAF.  Call
define_builtin for the new popcount* and bswap* builtins.

libjava:
* testsuite/libjava.lang/BuiltinBitCount.java: New test case.
* testsuite/libjava.lang/BuiltinReverseBytes.java: Likewise.

 
Index: builtins.c
===
--- builtins.c  (revision 233190)
+++ builtins.c  (working copy)
@@ -98,6 +98,11 @@
   { { "java.lang.Math" }, { "sin" }, NULL, BUILT_IN_SIN },
   { { "java.lang.Math" }, { "sqrt" }, NULL, BUILT_IN_SQRT },
   { { "java.lang.Math" }, { "tan" }, NULL, BUILT_IN_TAN },
+  { { "java.lang.Integer" }, { "bitCount" }, NULL, BUILT_IN_POPCOUNT },
+  { { "java.lang.Integer" }, { "reverseBytes" }, NULL, BUILT_IN_BSWAP32 },
+  { { "java.lang.Long" }, { "bitCount" }, NULL, BUILT_IN_POPCOUNTL },
+  { { "java.lang.Long" }, { "reverseBytes" }, NULL, BUILT_IN_BSWAP64 },
+  { { "java.lang.Short" }, { "reverseBytes" }, NULL, BUILT_IN_BSWAP16 },
   { { "java.lang.Float" }, { "intBitsToFloat" }, convert_real,
 (enum built_in_function) 0 },
   { { "java.lang.Double" }, { "longBitsToDouble" }, convert_real,
@@ -483,6 +488,7 @@
   tree double_ftype_double, double_ftype_double_double;
   tree float_ftype_float_float;
   tree boolean_ftype_boolean_boolean;
+  tree int_ftype_int;
   int i;
 
   for (i = 0; java_builtins[i].builtin_code != END_BUILTINS; ++i)
@@ -507,50 +513,77 @@
double_type_node, double_type_node, NULL_TREE);
 
   define_builtin (BUILT_IN_FMOD, "__builtin_fmod",
- double_ftype_double_double, "fmod", ECF_CONST);
+ double_ftype_double_double, "fmod", ECF_CONST | ECF_LEAF);
   define_builtin (BUILT_IN_FMODF, "__builtin_fmodf",
- float_ftype_float_float, "fmodf", ECF_CONST);
+ float_ftype_float_float, "fmodf", ECF_CONST | ECF_LEAF);
 
   define_builtin (BUILT_IN_ACOS, "__builtin_acos",
  double_ftype_double, "_ZN4java4lang4Math4acosEJdd",
- ECF_CONST);
+ ECF_CONST | ECF_LEAF);
   define_builtin (BUILT_IN_ASIN, "__builtin_asin",
  double_ftype_double, "_ZN4java4lang4Math4asinEJdd",
- ECF_CONST);
+ ECF_CONST | ECF_LEAF);
   define_builtin (BUILT_IN_ATAN, "__builtin_atan",
  double_ftype_double, "_ZN4java4lang4Math4atanEJdd",
- ECF_CONST);
+ ECF_CONST | ECF_LEAF);
   define_builtin (BUILT_IN_ATAN2, "__builtin_atan2",
  double_ftype_double_double, "_ZN4java4lang4Math5atan2EJddd",
- ECF_CONST);
+ ECF_CONST | ECF_LEAF);
   define_builtin (BUILT_IN_CEIL, "__builtin_ceil",
  double_ftype_double, "_ZN4java4lang4Math4ceilEJdd",
- ECF_CONST);
+ ECF_CONST | ECF_LEAF);
   define_builtin (BUILT_IN_COS, "__builtin_cos",
  double_ftype_double, "_ZN4java4lang4Math3cosEJdd",
- ECF_CONST);
+ ECF_CONST | ECF_LEAF);
   define_builtin (BUILT_IN_EXP, "__builtin_exp",
  double_ftype_double, "_ZN4java4lang4Math3expEJdd",
- ECF_CONST);
+ ECF_CONST | ECF_LEAF);
   define_builtin (BUILT_IN_FLOOR, "__builtin_floor",
  double_ftype_double, "_ZN4java4lang4Math5floorEJdd",
- ECF_CONST);
+ ECF_CONST | ECF_LEAF);
   define_builtin (BUILT_IN_LOG, "__builtin_log",
  double_ftype_double, "_ZN4java4lang4Math3logEJdd",
- ECF_CO