[RFC] Add new flag to specify output constraint in match.pd

2020-08-20 Thread Feng Xue OS via Gcc
Hi,

  There is a match-folding issue derived from pr94234.  A piece of code like:

  int foo (int n)
  {
 int t1 = 8 * n;
 int t2 = 8 * (n - 1);

 return t1 - t2;
  }

 It can be perfectly caught by the rule "(A * C) +- (B * C) -> (A +- B) * C",
 and be folded to the constant "8". But this folding will fail if both t1 and
 t2 have multiple uses, as in the following code.

  int foo (int n)
  {
 int t1 = 8 * n;
 int t2 = 8 * (n - 1);

 use_fn (t1, t2);
 return t1 - t2;
  }

Given an expression with non-single-use operands, folding it would introduce
duplicated computation in most situations, and is deemed to be unprofitable.
But folding is always beneficial if the final result is a constant or an
existing SSA value.

And the rule is:
  (simplify
   (plusminus (mult:cs@3 @0 @1) (mult:cs@4 @0 @2))
   (if ((!ANY_INTEGRAL_TYPE_P (type)
 || TYPE_OVERFLOW_WRAPS (type)
 || (INTEGRAL_TYPE_P (type)
 && tree_expr_nonzero_p (@0)
 && expr_not_equal_to (@0, wi::minus_one (TYPE_PRECISION (type)))))
/* If @1 +- @2 is constant require a hard single-use on either
   original operand (but not on both).  */
&& (single_use (@3) || single_use (@4)))   <- control whether match or not
(mult (plusminus @1 @2) @0)))

The current matcher only provides a way to check conditions before folding,
but no mechanism to affect the decision after folding. If there were one, for
the above case, we could let the fold proceed when we find the result is a
constant.

Like the way input operands are described using flags, we could also add
a new flag to specify this kind of constraint on the output, namely that
we expect it to be a simple gimple value.

Proposed syntax is

  (opcode:v{ condition } )

The char "v" stands for gimple value, if more descriptive, other char is
preferred. "condition" enclosed by { } is an optional c-syntax condition
expression. If present, only when "condition" is met, matcher will check
whether folding result is a gimple value using
gimple_simplified_result_is_gimple_val ().

Since there is no SSA concept in GENERIC, this is only for GIMPLE-match,
not GENERIC-match.

With this syntax, the rule is changed to

#Form 1:
  (simplify
   (plusminus (mult:cs@3 @0 @1) (mult:cs@4 @0 @2))
   (if ((!ANY_INTEGRAL_TYPE_P (type)
 || TYPE_OVERFLOW_WRAPS (type)
 || (INTEGRAL_TYPE_P (type)
 && tree_expr_nonzero_p (@0)
 && expr_not_equal_to (@0, wi::minus_one (TYPE_PRECISION (type)))))
    (if (!single_use (@3) && !single_use (@4))
     (mult:v (plusminus @1 @2) @0)
     (mult (plusminus @1 @2) @0))))

#Form 2:
  (simplify
   (plusminus (mult:cs@3 @0 @1) (mult:cs@4 @0 @2))
   (if ((!ANY_INTEGRAL_TYPE_P (type)
 || TYPE_OVERFLOW_WRAPS (type)
 || (INTEGRAL_TYPE_P (type)
 && tree_expr_nonzero_p (@0)
 && expr_not_equal_to (@0, wi::minus_one (TYPE_PRECISION (type)))))
    (mult:v{ !single_use (@3) && !single_use (@4) } (plusminus @1 @2) @0)))

This is just a proposal and has not been implemented yet. I hope to get your
comments on it.

Best Regards,
Feng

Re: [RFC] Add new flag to specify output constraint in match.pd

2020-08-23 Thread Feng Xue OS via Gcc
>>   There is a match-folding issue derived from pr94234.  A piece of code like:
>> 
>>   int foo (int n)
>>   {
>>  int t1 = 8 * n;
>>  int t2 = 8 * (n - 1);
>> 
>>  return t1 - t2;
>>   }
>> 
>>  It can be perfectly caught by the rule "(A * C) +- (B * C) -> (A +- B) * 
>> C", and
>>  be folded to constant "8". But this folding will fail if both v1 and v2 have
>>  multiple uses, as the following code.
>> 
>>   int foo (int n)
>>   {
>>  int t1 = 8 * n;
>>  int t2 = 8 * (n - 1);
>> 
>>  use_fn (t1, t2);
>>  return t1 - t2;
>>   }
>> 
>>  Given an expression with non-single-use operands, folding it will introduce
>>  duplicated computation in most situations, and is deemed to be unprofitable.
>>  But it is always beneficial if final result is a constant or existing SSA 
>> value.
>> 
>>  And the rule is:
>>   (simplify
>>(plusminus (mult:cs@3 @0 @1) (mult:cs@4 @0 @2))
>>(if ((!ANY_INTEGRAL_TYPE_P (type)
>> || TYPE_OVERFLOW_WRAPS (type)
>> || (INTEGRAL_TYPE_P (type)
>> && tree_expr_nonzero_p (@0)
>> && expr_not_equal_to (@0, wi::minus_one (TYPE_PRECISION 
>> (type)
>>/* If @1 +- @2 is constant require a hard single-use on either
>>   original operand (but not on both).  */
>>&& (single_use (@3) || single_use (@4)))   <- control whether 
>> match or not
>> (mult (plusminus @1 @2) @0)))
>> 
>>  Current matcher only provides a way to check something before folding,
>>  but no mechanism to affect decision after folding. If has, for the above
>>  case, we can let it go when we find result is a constant.
> 
> :s already has a counter-measure where it still folds if the output is at
> most one operation. So this transformation has a counter-counter-measure
> of checking single_use explicitly. And now we want a counter^3-measure...
> 
The counter-measures are a key factor in matching cost.  ":s" seems to be
somewhat coarse-grained, and here we do need more control over it.

But ideally, we could decouple these counter-measures from the definitions of
the match rules, and let the gimple matcher make a more reasonable
match-or-not decision based on them. Anyway, that is another story.

>>  Like the way to describe input operand using flags, we could also add
>>  a new flag to specify this kind of constraint on output that we expect
>>  it is a simple gimple value.
>> 
>>  Proposed syntax is
>> 
>>   (opcode:v{ condition } )
>> 
>>  The char "v" stands for gimple value, if more descriptive, other char is
>>  preferred. "condition" enclosed by { } is an optional c-syntax condition
>>  expression. If present, only when "condition" is met, matcher will check
>>  whether folding result is a gimple value using
>>  gimple_simplified_result_is_gimple_val ().
>> 
>>  Since there is no SSA concept in GENERIC, this is only for GIMPLE-match,
>>  not GENERIC-match.
>> 
>>  With this syntax, the rule is changed to
>> 
>>  #Form 1:
>>   (simplify
>>(plusminus (mult:cs@3 @0 @1) (mult:cs@4 @0 @2))
>>(if ((!ANY_INTEGRAL_TYPE_P (type)
>> || TYPE_OVERFLOW_WRAPS (type)
>> || (INTEGRAL_TYPE_P (type)
>> && tree_expr_nonzero_p (@0)
>> && expr_not_equal_to (@0, wi::minus_one (TYPE_PRECISION 
>> (type))
>>( if (!single_use (@3) && !single_use (@4))
>>   (mult:v (plusminus @1 @2) @0)))
>>   (mult (plusminus @1 @2) @0)
> 
> That seems to match what you can do with '!' now (that's very recent).
>>  #Form 2:
>>   (simplify
>>(plusminus (mult:cs@3 @0 @1) (mult:cs@4 @0 @2))
>>(if ((!ANY_INTEGRAL_TYPE_P (type)
>> || TYPE_OVERFLOW_WRAPS (type)
>> || (INTEGRAL_TYPE_P (type)
>> && tree_expr_nonzero_p (@0)
>> && expr_not_equal_to (@0, wi::minus_one (TYPE_PRECISION 
>> (type))
>>   (mult:v{ !single_use (@3) && !single_use (@4 } (plusminus @1 @2) @0
> 
> Indeed, something more flexible than '!' would be nice, but I am not so
> sure about this version. If we are going to allow inserting code after
> resimplification and before validation, maybe we should go even further
> and let people insert arbitrary code there...
Yes. I think so, and I tried to do that.  For sure, the result of
resimplification must be the core data referenced by such user code. But it
is hard to unify the result representation for the GIMPLE and GENERIC
matchers. One more consideration is how often this generalized usage would
occur.

Thanks,
Feng

Re: [RFC] Add new flag to specify output constraint in match.pd

2020-09-02 Thread Feng Xue OS via Gcc
>
>>
>> >>   There is a match-folding issue derived from pr94234.  A piece of code 
>> >> like:
>> >>
>> >>   int foo (int n)
>> >>   {
>> >>  int t1 = 8 * n;
>> >>  int t2 = 8 * (n - 1);
>> >>
>> >>  return t1 - t2;
>> >>   }
>> >>
>> >>  It can be perfectly caught by the rule "(A * C) +- (B * C) -> (A +- B) * 
>> >> C", and
>> >>  be folded to constant "8". But this folding will fail if both v1 and v2 
>> >> have
>> >>  multiple uses, as the following code.
>> >>
>> >>   int foo (int n)
>> >>   {
>> >>  int t1 = 8 * n;
>> >>  int t2 = 8 * (n - 1);
>> >>
>> >>  use_fn (t1, t2);
>> >>  return t1 - t2;
>> >>   }
>> >>
>> >>  Given an expression with non-single-use operands, folding it will 
>> >> introduce
>> >>  duplicated computation in most situations, and is deemed to be 
>> >> unprofitable.
>> >>  But it is always beneficial if final result is a constant or existing 
>> >> SSA value.
>> >>
>> >>  And the rule is:
>> >>   (simplify
>> >>(plusminus (mult:cs@3 @0 @1) (mult:cs@4 @0 @2))
>> >>(if ((!ANY_INTEGRAL_TYPE_P (type)
>> >> || TYPE_OVERFLOW_WRAPS (type)
>> >> || (INTEGRAL_TYPE_P (type)
>> >> && tree_expr_nonzero_p (@0)
>> >> && expr_not_equal_to (@0, wi::minus_one (TYPE_PRECISION 
>> >> (type)
>> >>/* If @1 +- @2 is constant require a hard single-use on either
>> >>   original operand (but not on both).  */
>> >>&& (single_use (@3) || single_use (@4)))   <- control whether 
>> >> match or not
>> >> (mult (plusminus @1 @2) @0)))
>> >>
>> >>  Current matcher only provides a way to check something before folding,
>> >>  but no mechanism to affect decision after folding. If has, for the above
>> >>  case, we can let it go when we find result is a constant.
>> >
>> > :s already has a counter-measure where it still folds if the output is at
>> > most one operation. So this transformation has a counter-counter-measure
>> > of checking single_use explicitly. And now we want a counter^3-measure...
>> >
>> Counter-measure is key factor to matching-cost.  ":s" seems to be somewhat
>> coarse-grained. And here we do need more control over it.
>>
>> But ideally, we could decouple these counter-measures from definitions of
>> match-rule, and let gimple-matcher get a more reasonable match-or-not
>> decision based on these counters. Anyway, it is another story.
>>
>> >>  Like the way to describe input operand using flags, we could also add
>> >>  a new flag to specify this kind of constraint on output that we expect
>> >>  it is a simple gimple value.
>> >>
>> >>  Proposed syntax is
>> >>
>> >>   (opcode:v{ condition } )
>> >>
>> >>  The char "v" stands for gimple value, if more descriptive, other char is
>> >>  preferred. "condition" enclosed by { } is an optional c-syntax condition
>> >>  expression. If present, only when "condition" is met, matcher will check
>> >>  whether folding result is a gimple value using
>> >>  gimple_simplified_result_is_gimple_val ().
>> >>
>> >>  Since there is no SSA concept in GENERIC, this is only for GIMPLE-match,
>> >>  not GENERIC-match.
>> >>
>> >>  With this syntax, the rule is changed to
>> >>
>> >>  #Form 1:
>> >>   (simplify
>> >>(plusminus (mult:cs@3 @0 @1) (mult:cs@4 @0 @2))
>> >>(if ((!ANY_INTEGRAL_TYPE_P (type)
>> >> || TYPE_OVERFLOW_WRAPS (type)
>> >> || (INTEGRAL_TYPE_P (type)
>> >> && tree_expr_nonzero_p (@0)
>> >> && expr_not_equal_to (@0, wi::minus_one (TYPE_PRECISION 
>> >> (type))
>> >>( if (!single_use (@3) && !single_use (@4))
>> >>   (mult:v (plusminus @1 @2) @0)))
>> >>   (mult (plusminus @1 @2) @0)
>> >
>> > That seems to match what you can do with '!' now (that's very recent).
>
> It's also what :s does but a slight bit more "local".  When any operand is
> marked :s and it has more than a single-use we only allow simplifications
> that do not require insertion of extra stmts.  So basically the above pattern
> doesn't behave any different than if you omit your :v.  Only if you'd
> place :v on an inner expression there would be a difference.  Correlating
> the inner expression we'd not want to insert new expressions for with
> a specific :s (or multiple ones) would be a more natural extension of what
> :s provides.
>
> Thus, for the above case (Form 1), you do not need :v at all and :s works.

Between ":s" and ":v", there is a subtle difference. ":s" only ensures interior
transform does not insert any new stmts, but this is not true for final one.

Code snippet generated for (A * C) +- (B * C) -> (A+-B) * C:
 
  gimple_seq *lseq = seq;
  if (lseq
  && (!single_use (captures[0])
  || !single_use (captures[3])))
lseq = NULL;
  if (__builtin_expect (!dbg_cnt (match), 0)) goto next_after_fail621;
  if (__builtin_expect (dump_file && (dump_flags & TDF_FOLDING), 0))  
fprintf (dump_file, "Applying patte

Re: [RFC] Add new flag to specify output constraint in match.pd

2020-09-02 Thread Feng Xue OS via Gcc
>>
>> >
>> >>
>> >> >>   There is a match-folding issue derived from pr94234.  A piece of 
>> >> >> code like:
>> >> >>
>> >> >>   int foo (int n)
>> >> >>   {
>> >> >>  int t1 = 8 * n;
>> >> >>  int t2 = 8 * (n - 1);
>> >> >>
>> >> >>  return t1 - t2;
>> >> >>   }
>> >> >>
>> >> >>  It can be perfectly caught by the rule "(A * C) +- (B * C) -> (A +- 
>> >> >> B) * C", and
>> >> >>  be folded to constant "8". But this folding will fail if both v1 and 
>> >> >> v2 have
>> >> >>  multiple uses, as the following code.
>> >> >>
>> >> >>   int foo (int n)
>> >> >>   {
>> >> >>  int t1 = 8 * n;
>> >> >>  int t2 = 8 * (n - 1);
>> >> >>
>> >> >>  use_fn (t1, t2);
>> >> >>  return t1 - t2;
>> >> >>   }
>> >> >>
>> >> >>  Given an expression with non-single-use operands, folding it will 
>> >> >> introduce
>> >> >>  duplicated computation in most situations, and is deemed to be 
>> >> >> unprofitable.
>> >> >>  But it is always beneficial if final result is a constant or existing 
>> >> >> SSA value.
>> >> >>
>> >> >>  And the rule is:
>> >> >>   (simplify
>> >> >>(plusminus (mult:cs@3 @0 @1) (mult:cs@4 @0 @2))
>> >> >>(if ((!ANY_INTEGRAL_TYPE_P (type)
>> >> >> || TYPE_OVERFLOW_WRAPS (type)
>> >> >> || (INTEGRAL_TYPE_P (type)
>> >> >> && tree_expr_nonzero_p (@0)
>> >> >> && expr_not_equal_to (@0, wi::minus_one (TYPE_PRECISION 
>> >> >> (type)
>> >> >>/* If @1 +- @2 is constant require a hard single-use on either
>> >> >>   original operand (but not on both).  */
>> >> >>&& (single_use (@3) || single_use (@4)))   <- control 
>> >> >> whether match or not
>> >> >> (mult (plusminus @1 @2) @0)))
>> >> >>
>> >> >>  Current matcher only provides a way to check something before folding,
>> >> >>  but no mechanism to affect decision after folding. If has, for the 
>> >> >> above
>> >> >>  case, we can let it go when we find result is a constant.
>> >> >
>> >> > :s already has a counter-measure where it still folds if the output is 
>> >> > at
>> >> > most one operation. So this transformation has a counter-counter-measure
>> >> > of checking single_use explicitly. And now we want a 
>> >> > counter^3-measure...
>> >> >
>> >> Counter-measure is key factor to matching-cost.  ":s" seems to be somewhat
>> >> coarse-grained. And here we do need more control over it.
>> >>
>> >> But ideally, we could decouple these counter-measures from definitions of
>> >> match-rule, and let gimple-matcher get a more reasonable match-or-not
>> >> decision based on these counters. Anyway, it is another story.
>> >>
>> >> >>  Like the way to describe input operand using flags, we could also add
>> >> >>  a new flag to specify this kind of constraint on output that we expect
>> >> >>  it is a simple gimple value.
>> >> >>
>> >> >>  Proposed syntax is
>> >> >>
>> >> >>   (opcode:v{ condition } )
>> >> >>
>> >> >>  The char "v" stands for gimple value, if more descriptive, other char 
>> >> >> is
>> >> >>  preferred. "condition" enclosed by { } is an optional c-syntax 
>> >> >> condition
>> >> >>  expression. If present, only when "condition" is met, matcher will 
>> >> >> check
>> >> >>  whether folding result is a gimple value using
>> >> >>  gimple_simplified_result_is_gimple_val ().
>> >> >>
>> >> >>  Since there is no SSA concept in GENERIC, this is only for 
>> >> >> GIMPLE-match,
>> >> >>  not GENERIC-match.
>> >> >>
>> >> >>  With this syntax, the rule is changed to
>> >> >>
>> >> >>  #Form 1:
>> >> >>   (simplify
>> >> >>(plusminus (mult:cs@3 @0 @1) (mult:cs@4 @0 @2))
>> >> >>(if ((!ANY_INTEGRAL_TYPE_P (type)
>> >> >> || TYPE_OVERFLOW_WRAPS (type)
>> >> >> || (INTEGRAL_TYPE_P (type)
>> >> >> && tree_expr_nonzero_p (@0)
>> >> >> && expr_not_equal_to (@0, wi::minus_one (TYPE_PRECISION 
>> >> >> (type))
>> >> >>( if (!single_use (@3) && !single_use (@4))
>> >> >>   (mult:v (plusminus @1 @2) @0)))
>> >> >>   (mult (plusminus @1 @2) @0)
>> >> >
>> >> > That seems to match what you can do with '!' now (that's very recent).
>> >
>> > It's also what :s does but a slight bit more "local".  When any operand is
>> > marked :s and it has more than a single-use we only allow simplifications
>> > that do not require insertion of extra stmts.  So basically the above 
>> > pattern
>> > doesn't behave any different than if you omit your :v.  Only if you'd
>> > place :v on an inner expression there would be a difference.  Correlating
>> > the inner expression we'd not want to insert new expressions for with
>> > a specific :s (or multiple ones) would be a more natural extension of what
>> > :s provides.
>> >
>> > Thus, for the above case (Form 1), you do not need :v at all and :s works.
>>
>> Between ":s" and ":v", there is a subtle difference. ":s" only ensures 
>> interior
>> transform does not insert any new stmts, but this is not true for final one.
>>
>> Code s

[RFC] A memory gathering optimization for loop

2021-01-13 Thread Feng Xue OS via Gcc
1. Background

In a loop, it is optimal if only one memory stream is activated, that
is, all memory operations sequentially access one data region. But that
is not always the case, for example when traversing a linked list or
manipulating discrete arrays. In this scenario, the loop would contain
multiple scattered memory accesses, which could trigger inefficient
multi-way hardware prefetching, thus making the cpu cache hit rate drop.
The tracker pr98598 (https://gcc.gnu.org/bugzilla//show_bug.cgi?id=98598)
gives a description of one related problem that we mean to target.

2. Memory gathering optimization (MGO)

For nested loops, if scattered memory accesses inside the inner loop remain
unchanged in each iteration of the outer loop, we can dynamically gather
the result data into a sequential memory region when the first access in the
inner loop takes place. This way hardware prefetching can be reduced,
and the cpu cache hit rate can be improved. We name this optimization MGO
(memory gathering optimization). An example is given below; it is a little
different from pr98598, and cannot be optimized by loop distribution.

  struct A { int **p1; };

  int foo (int n, struct A *array)
  {
  int sum = 0;

  for (int i = 0; i < n; i++) {
  for (int j = 0; j < i; j++) {  /* iteration count is i */
  int **p1 = array[j].p1;
  int  *p2 = *p1;

  sum += *p2;
  }
  }

  return sum;
  }

Although this example seems to be pretty artificial, this kind of data
reuse in nested loops is often seen in real applications, especially those
written in Fortran or using the C++ STL.

To simplify explanation of this memory gathering optimization, pseudo
code on original nested loops is given as:

  outer-loop ( ) { /* data in inner loop are also invariant in outer loop.  */
  ...
  inner-loop (iter, iter_count) {  /* inner loop to apply MGO */

  Type1 v1 = LOAD_FN1 (iter);
  Type2 v2 = LOAD_FN2 (v1);
  Type3 v3 = LOAD_FN3 (v2);
  ... 
  iter = NEXT_FN (iter);
  }

 ...
  }

"LOAD_FNx()" is a conceptual function that translates its argument to a
result, and is composed of a sequence of operations in which there is
only one memory load. It is capable of representing simple memory
dereference like "iter->field", or more complicated expression like
"array[iter * 2 + cst].field".

Likewise, "NEXT_FN()" is also a conceptual function which defines how an
iterator is advanced. It is able to describe simple integer iterator,
list iterator, or even pure-function-based iterator on advanced stuff
like hashset/tree.

Then the code will be transformed to:

  typedef struct cache_elem {
  bool   init;
  Type1  c_v1;
  Type2  c_v2;
  Type3  c_v3;
  } cache_elem;

  cache_elem *cache_arr = calloc (iter_count, sizeof(cache_elem));

  outer-loop () {
  ...
  size_t cache_idx = 0;

  inner-loop (iter, iter_count) {

  if (!cache_arr[cache_idx].init) {
  v1 = LOAD_FN1 (iter);
  v2 = LOAD_FN2 (v1);
  v3 = LOAD_FN3 (v2);

  cache_arr[cache_idx].init = true;
  cache_arr[cache_idx].c_v1 = v1;
  cache_arr[cache_idx].c_v2 = v2;
  cache_arr[cache_idx].c_v3 = v3;
  }
  else {
  v1 = cache_arr[cache_idx].c_v1;
  v2 = cache_arr[cache_idx].c_v2;
  v3 = cache_arr[cache_idx].c_v3;
  }
  ...
  cache_idx++;
  iter = NEXT_FN (iter);
  }
  ...
  }

  free (cache_arr);

In the transformation, the cache space is compiler-generated temporary
storage, but it comes from the user heap. We find this trick to be very
uncommon in gcc, and we’d like to make sure whether there are any special
considerations or restrictions on it. Using alloca() to get space from the
stack is an alternative, but we should be cautious to avoid incurring stack
overflow from non-user logic.

"iter_count" stands for an upper bound for inner-loop iteration count,
which must be an outer-loop invariant expression, in that it determines
data count in cache space, and the space is allocated before outer-loop.

In the example below, we can simply get such a bound for inner-loop 1 from
its niter_desc, while for inner-loop 2 it is not that easy; but it is
possible to infer via comparison predicate information that the loop's
iteration count never exceeds "n".

  foo (int n, int m)
  {
  for (int i = 0; i < n; i++ ) {  /* outer-loop 1 */
  ...
  for (int j = 0; j < m; j++) {   /* inner-loop 1 */
  ...
  }
  ...
  }

  for (int i = 0; i < n; i ++) { /* outer-loop 2 */
  ...
  for (int j = 0; j < i; j++) {  /* inner-loop 2 */
  ...
  }
  ...
  }
  }

3. Cache count threshold

We expect the cache space to work as a normal stack temporary, and not to
impact execution of the program so much, especially when heap

Re: [RFC] A memory gathering optimization for loop

2021-01-17 Thread Feng Xue OS via Gcc
>> -Original Message-
>> From: Feng Xue OS 
>> Sent: Thursday, January 14, 2021 12:28 PM
>> To: gcc@gcc.gnu.org
>> Cc: JiangNing OS ; Hao Liu OS
>> 
>> Subject: [RFC] A memory gathering optimization for loop
>>
>> 1. Background
>>
>> In a loop, it is optimal if only one memory stream is activated, that is, all
>> memory operations sequentially access one data region. But that is always
>> not the case, such as traversing link list and manipulating discrete arrays. 
>> In
>> this scenario, the loop would contain multiple scattered memory accesses,
>> which could trigger inefficient multi-way hardware prefetching, thus making
>> cpu cache hit rate drop. The tracker pr98598 
>> (https://gcc.gnu.org/bugzilla//show_bug.cgi?id=98598)
>> gives a description on one related problem that we are meant to target.
>>
>> 2. Memory gathering optimization (MGO)
>>
>> For nested loops, if scattered memory accesses inside inner loop remain
>> unchanged in each iteration of outer loop, we can dynamically gather result
>> data into a sequential memory region when the first access in the inner loop
>> takes place. This way the hardware prefetching can be reduced, and cpu
>> cache hit rate can be improved. We name this optimization MGO (memory
>> gathering optimization). An example is given as below, which is a little
>> different from pr98598, and could not be optimized by loop distribution.
>>
>>   struct A { int **p1; };
>>
>>   int foo (int n, struct A *array)
>>   {
>>   int sum = 0;
>>
>>   for (int i = 0; i < n; i++) {
>>   for (int j = 0; j < i; j++) {  /* iteration count is i */
>>   int **p1 = array[j].p1;
>>   int  *p2 = *p1;
>>
>>   sum += *p2;
>>   }
>>   }
>>
>>   return sum;
>>   }
>>
>>


> The number of memory streams in this loop is limited (practically 1), how 
> would the transform be affected when issuing prefetches for array at a well 
> placed spot?
>
> That is, what's the hazard we avoid (besides the obvious data dependence
> shortening in case the load from array is in cache?)
>
> Richard.
>

In the example, evaluating "sum += *p2" at each iteration involves three
memory loads, and each of them forms a memory stream (so at least 3 in
total, not 1), since they touch three discrete regions. Spatial locality
only exists for the first load "array[j].p1", so some kind of prefetch,
either via hardware or software, works well for it; for the other two,
prefetch would be of no use, and the cpu cache cannot speculate where those
loads will go.

So the intention of MGO is to put those discontinuous and unpredictable
memory accesses together to form one new sequential memory stream
in favor of the cpu cache.

And if we statically find a hazard due to a write dependence, MGO
will not be applied.

Thanks,
Feng

>> Although this example seems to be pretty artificial, the kind of data reuse 
>> in
>> nested loops is often seen in real applications, especially written by 
>> Fortran,
>> and C++ STL.
>>
>> To simplify explanation of this memory gathering optimization, pseudo code
>> on original nested loops is given as:
>>
>>   outer-loop ( ) { /* data in inner loop are also invariant in outer loop.  
>> */
>>   ...
>>   inner-loop (iter, iter_count) {  /* inner loop to apply MGO */
>>
>>   Type1 v1 = LOAD_FN1 (iter);
>>   Type2 v2 = LOAD_FN2 (v1);
>>   Type3 v3 = LOAD_FN3 (v2);
>>   ...
>>   iter = NEXT_FN (iter);
>>   }
>>
>>  ...
>>   }
>>
>> "LOAD_FNx()" is a conceptual function that translates its argument to a 
>> result,
>> and is composed of a sequence of operations in which there is only one
>> memory load. It is capable of representing simple memory dereference like
>> "iter->field", or more complicated expression like "array[iter * 2 + 
>> cst].field".
>>
>> Likewise, "NEXT_FN()" is also a conceptual function which defines how an
>> iterator is advanced. It is able to describe simple integer iterator, list 
>> iterator,
>> or even pure-function-based iterator on advanced stuff like hashset/tree.
>>
>> Then the code will be transformed to:
>>
>>   typedef struct cache_elem {
>>   bool   init;
>>   Type1  c_v1;
>>   Type2  c_v2;
>>   Type3  c_v3;
>>   } cache_elem;
>>
>>   cache_elem *cache_arr = calloc (iter_count, sizeof(cache_elem));
>>
>>   outer-loop () {
>>   ...
>>   size_t cache_idx = 0;
>>
>>   inner-loop (iter, iter_count) {
>>
>>   if (!cache_arr[cache_idx]->init) {
>>   v1 = LOAD_FN1 (iter);
>>   v2 = LOAD_FN2 (v1);
>>   v3 = LOAD_FN3 (v2);
>>
>>   cache_arr[cache_idx]->init = true;
>>   cache_arr[cache_idx]->c_v1 = v1;
>>   cache_arr[cache_idx]->c_v2 = v2;
>>   cache_arr[cache_idx]->c_v3 = v3;
>>   }
>>   else {
>>   v1 = cache_arr[cache_idx]->c_v1;
>>   v2 = cache_arr[cache_idx]->c_v2;
>>   v3

Re: [RFC] A memory gathering optimization for loop

2021-05-06 Thread Feng Xue OS via Gcc
>> To simplify explanation of this memory gathering optimization, pseudo
>> code on original nested loops is given as:
>>
>>   outer-loop ( ) { /* data in inner loop are also invariant in outer loop.  
>> */
>>   ...
>>   inner-loop (iter, iter_count) {  /* inner loop to apply MGO */
>>
>>   Type1 v1 = LOAD_FN1 (iter);
>>   Type2 v2 = LOAD_FN2 (v1);
>>   Type3 v3 = LOAD_FN3 (v2);
>>   ...
>>   iter = NEXT_FN (iter);
>>   }
>>
>>  ...
>>   }
>>
>> "LOAD_FNx()" is a conceptual function that translates its argument to a
>> result, and is composed of a sequence of operations in which there is
>> only one memory load. It is capable of representing simple memory
>> dereference like "iter->field", or more complicated expression like
>> "array[iter * 2 + cst].field".
>>
>> Likewise, "NEXT_FN()" is also a conceptual function which defines how an
>> iterator is advanced. It is able to describe simple integer iterator,
>> list iterator, or even pure-function-based iterator on advanced stuff
>> like hashset/tree.
>>
>> Then the code will be transformed to:
>>
>>   typedef struct cache_elem {
>>   bool   init;
>>   Type1  c_v1;
>>   Type2  c_v2;
>>   Type3  c_v3;
>>   } cache_elem;
>>
>>   cache_elem *cache_arr = calloc (iter_count, sizeof(cache_elem));
>>
>>   outer-loop () {
>>   ...
>>   size_t cache_idx = 0;
>>
>>   inner-loop (iter, iter_count) {
>>
>>   if (!cache_arr[cache_idx].init) {
>>   v1 = LOAD_FN1 (iter);
>>   v2 = LOAD_FN2 (v1);
>>   v3 = LOAD_FN3 (v2);
>>
>>   cache_arr[cache_idx].init = true;
>>   cache_arr[cache_idx].c_v1 = v1;
>>   cache_arr[cache_idx].c_v2 = v2;
>>   cache_arr[cache_idx].c_v3 = v3;
>>   }
>>   else {
>>   v1 = cache_arr[cache_idx].c_v1;
>>   v2 = cache_arr[cache_idx].c_v2;
>>   v3 = cache_arr[cache_idx].c_v3;
>>   }
>>   ...
>>   cache_idx++;
>>   iter = NEXT_FN (iter);
>>   }
>>   ...
>>   }
>>
>>   free (cache_arr);
> 
> Why do you need the init field?  Isn't the array initialized
> sequentially?

This is meant to support a more complicated pattern, in which the
actual iteration count of the inner loop depends on the outer loop,
and "iter_count" only represents an upper bound on the count.

for (i = 0; i < N; i++)
  {
for (j = 0; j < i; j++) 
   {
   Type1 v1 = LOAD_FN1 (j);
   Type2 v2 = LOAD_FN2 (v1);
   Type3 v3 = LOAD_FN3 (v2);

   ...

   condition = ...
   }

if (condition)
  break;
  }

In the above case, "iter_count" is N, which is not the exact trip count
of the inner loop. Execution might leave the loop on a certain condition
before going through all loads, so we should not cache all values in
one step: some loads might be invalid when the condition is not met.
The "init" field tracks whether a cache element has been properly
filled, and allows switching at runtime between "do caching" and
"use caching".
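To make the role of the "init" flag concrete, here is a minimal, compilable sketch of the cached form in plain C++. The names `load_fn`, `cache_elem` and `sum_triangular` are illustrative stand-ins for the pseudo code in this thread, not GCC output; the triangular inner loop (j < i) reproduces the situation where iter_count is only an upper bound.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

static int data[8] = {1, 2, 3, 4, 5, 6, 7, 8};
// Stand-in for LOAD_FN1: one memory load, invariant in the outer loop.
static int load_fn(int iter) { return data[iter] * 10; }

struct cache_elem {
    bool init;
    int  c_v1;
};

// Inner trip count varies per outer iteration (j < i), so n is only an
// upper bound -- exactly the case that requires the "init" flag.
int sum_triangular(int n)
{
    cache_elem *cache_arr =
        static_cast<cache_elem *>(std::calloc(n, sizeof(cache_elem)));
    if (!cache_arr)
        return -1; // a real transform would fall back to the original loop

    int sum = 0;
    for (int i = 0; i < n; i++) {
        std::size_t cache_idx = 0;
        for (int j = 0; j < i; j++) {
            int v1;
            if (!cache_arr[cache_idx].init) {   // first visit to this slot
                v1 = load_fn(j);
                cache_arr[cache_idx].init = true;
                cache_arr[cache_idx].c_v1 = v1;
            } else {                            // slot already cached
                v1 = cache_arr[cache_idx].c_v1;
            }
            sum += v1;
            cache_idx++;
        }
    }
    std::free(cache_arr);
    return sum;
}
```

With n = 4 the slots for j = 0..2 are loaded once on their first visit and reused on later outer iterations.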

> You need to keep the original loop and use it if calloc fails.  The
> transformation is not valid in general because calloc is not an
> async-signal-safe function, and GCC shouldn't introduce such calls if
> they are not present in the original source code.  For
> -fnon-call-exceptions, you need to introduce unwinding code that calls
> free.

This is an issue that we missed. The only way left is to use alloca(),
but that may not be suitable for caching large data.

> Can you change this optimization so that it can use a fixed-size buffer?
> This would avoid all issues around calling calloc.  If iter_count can be
> very large, allocating that much extra memory might not be beneficial
> anyway.

And we would need additional code to synchronize access to this buffer.
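For comparison, a fixed-size variant might look like the sketch below (all names hypothetical): iterations beyond the cap simply load directly, and because the buffer has static storage it is shared between callers, which is exactly why extra synchronization code would be needed in a real transform.

```cpp
#include <cassert>

constexpr int CACHE_CAP = 4; // compile-time cap instead of calloc(iter_count)

static int data[8] = {1, 2, 3, 4, 5, 6, 7, 8};
static int load_fn(int iter) { return data[iter] * 10; }

int sum_fixed_cache(int outer_count, int iter_count)
{
    // Static storage: shared across calls/threads, hence the need to
    // synchronize access if this were emitted by the compiler.
    static bool init[CACHE_CAP];
    static int  c_v1[CACHE_CAP];

    for (int k = 0; k < CACHE_CAP; k++) // reset per call in this sketch
        init[k] = false;

    int sum = 0;
    for (int i = 0; i < outer_count; i++)
        for (int iter = 0; iter < iter_count; iter++) {
            int v1;
            if (iter < CACHE_CAP) {      // slot available in fixed cache
                if (!init[iter]) {
                    v1 = load_fn(iter);
                    init[iter] = true;
                    c_v1[iter] = v1;
                } else {
                    v1 = c_v1[iter];
                }
            } else {
                v1 = load_fn(iter);      // beyond the cap: no caching
            }
            sum += v1;
        }
    return sum;
}
```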

Thanks,
Feng

Question about non-POD class type

2021-05-14 Thread Feng Xue OS via Gcc
Sorry, sent to the wrong mailing list.


From: Feng Xue OS
Sent: Friday, May 14, 2021 4:30 PM
To: gcc-patc...@gcc.gnu.org
Subject: Question about non-POD class type

For an instance of a non-POD class, can I always assume that any
operation on it must be type-safe, and that any wrong or even tricky
code that violates this is UB per the C++ spec? For example, here are
some ways:

 union {
Type1  *p1;
Type2  *p2;
};

or

union {
Type1 t1;
Type2 t2;
};

or

Type1 *p1 = ...;
void *p = p1;
Type2 *p2 = (Type2 *) p;
p2->xxx;
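For what it's worth, all three forms above access an object through an unrelated type and run afoul of the strict-aliasing/active-union-member rules; the portable way to reinterpret an object representation in standard C++ is to copy the bytes. A small sketch (the float/uint32_t pairing is just an example, not from the original question):

```cpp
#include <cstdint>
#include <cstring>

// Reading a float's bits through a uint32_t* (or a union trick) would
// violate strict aliasing; memcpy of the object representation is
// well-defined for trivially copyable types.
static std::uint32_t bits_of(float f)
{
    std::uint32_t u;
    static_assert(sizeof u == sizeof(float), "sizes must match");
    std::memcpy(&u, &f, sizeof u);
    return u;
}
```

GCC and Clang compile this memcpy down to a plain register move, so nothing is lost versus the UB forms.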

Feng


[RFC] Whole Program Devirtualization

2021-08-20 Thread Feng Xue OS via Gcc
1. Background

C++ devirtualization is meant to determine all possible class type
identities of a virtual dynamic call, and to transform it into a direct
static call, or into some kind of statement with lower runtime cost.
The current implementation in GCC adopts a somewhat conservative
scheme, originating from a safe but defensive assumption: the class
hierarchy of a type (except for a FINAL class or a class in an anonymous
namespace) is incomplete, and might be extended by an external library
or module. Therefore, most qualified virtual calls are only partially
devirtualized into a speculative form of direct call as follows,
guarded by a function address check and a failsafe branch.

if (obj->vfn_ptr == &T::FN)
obj->T::FN();
else
obj->vfn_ptr();
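Standard C++ has no way to spell the vptr-slot comparison directly, but the shape of the guarded call can be imitated at source level with a typeid check. This is only an analogy of what the compiler emits, using hypothetical types T and U:

```cpp
#include <typeinfo>

struct T {
    virtual ~T() = default;
    virtual int FN() { return 1; }
};
struct U : T {
    int FN() override { return 2; }
};

// Speculative devirtualization: guess the dynamic type is T and call
// T::FN directly (a non-virtual, inlinable call); otherwise fall back
// to ordinary virtual dispatch.
int call_fn(T *obj)
{
    if (typeid(*obj) == typeid(T))
        return obj->T::FN();   // direct, devirtualized call
    else
        return obj->FN();      // failsafe virtual call
}
```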

Here, full devirtualization is required if we want vcalls concisely
transformed with no guard test. In theory, when whole-program
compilation is enforced, we could enable such devirtualization based
on a closed-world assumption about the contained class types. But
"whole program" does not ensure that the class hierarchy of a type
never spans dependent C++ libraries (libstdc++ being one), which could
result in incorrect devirtualization. An example is given below to
demonstrate the problem.

// Has been pre-compiled to a library
class Base {
virtual void method() = 0;
};

class Derive_in_Library : public Base {
virtual void method()  { ... }
};

Base *get_object() 
{
return new Derive_in_Library();
}

---
  
// User source code to compile
class Base {
virtual void method() = 0;
};

class Derive : public Base {
virtual void method()  { ... }
};

void foo()
{
  Base *obj = get_object();
 
  obj->method();
}

If there is no way to inspect the entire class hierarchy formed by the
relevant types in the library and the user source code, whole-program
analysis would improperly treat 'Derive' as the sole descendant of
'Base', and trigger devirtualization of 'obj->method' to
'Derive::method', which is an incorrect transformation. To address this
complication, we should figure out which types are referenced by
regular object files and libraries, and exclude them from the
candidates for full devirtualization. This RFC discusses some
implementable approaches to identify whether a type is qualified for
full devirtualization under the whole-program assumption (hence also
called whole-program devirtualization, WPD for short), and new
optimizations inspired by whole-program type analysis.


2. Whole-program local type

In this RFC, we only care about polymorphic classes (classes containing
virtual functions), since only this kind of type is meaningful to
devirtualization. A class type is identified as local in terms of
whole-program scope iff there is no instantiation of the type or its
derived types in the regular object files and libraries to link with.
A whole-program local type is safe to apply WPD to. Here are two
mechanisms to perform this check on a given type:

2.1 Typeinfo symbol resolution by linker-plugin

Essentially, devirtualization is driven by analysis on vtable lifetime,
which begins once the vtable is attached to a new complete object or
sub-object during instantiation of a class or its derived class.

Seemingly, we could find out whether a class type is referenced in an
object file or library by tracking the availability of its vtable
symbol. But the vtable might be purged while we are still interested in
its owning class type. Referring to the library code in the above
example, the vtable of 'Base' is unused since it participates in the
construction of neither 'Base' nor 'Derive_in_Library', but we still
must know whether 'Base' is live in the library to ensure a correct
result.

Moreover, once a class uses virtual inheritance, it has multiple
vtables (including construction vtables); looking up a class via its
vtable symbol requires a one-to-one correspondence between them, so
this approach does not work.

Besides the vtable symbol, class instantiation also creates references
to the typeinfo symbols of the class itself and all its parent classes.
At the same time, each class has a unique typeinfo, which can act as a
desirable type marker and be used to perform type lookup in object
files and libraries. Still, the approach is not 100% safe, specifically
when typeinfo symbols are invisible or missing, for example when the
libraries to link against were built with -fno-weak or -fno-rtti. But
at least for libstdc++, these symbols should be present for the sake of
allowing dynamic_cast in user source code.

The LTO linker plugin works with the linker to do a whole-program
symbol resolution before LTO/WPA happens, and attaches binding
information to symbols. Given a typeinfo symbol, if it is resolved as
LDPR_PREVAILING_DEF_IRONLY (only referenced from IR code, with no
references from regular objects), its corresponding class type is
deemed to be whole-program local.

Re: GCC [RFC] Whole Program Devirtualization

2021-08-22 Thread Feng Xue OS via Gcc
We are not going to create a new devirtualization framework from
scratch; we just hope it to be an enhancement of the current
speculative devirtualization. The process does not need to parse native
code in the library, but only resorts to the existing lightweight
symbol resolution done by the LTO prelinker. And C++ virtual
dispatching is expected to be translated to GIMPLE IR from C++ source;
if a user attempts to hand-craft it using embedded asms, that should be
considered UB with respect to the C++ ABI.

Compile time of whole-program analysis is not as terrible as you think;
basically, it is realistically acceptable even when the code base is
very large. As far as I know, Google enables WPD in builds of Chrome,
though that is based on LLVM.

Thanks,
Feng


From: Basile Starynkevitch 
Sent: Friday, August 20, 2021 8:36 PM
To: Feng Xue OS
Cc: basile.starynkevi...@cea.fr; gcc@gcc.gnu.org
Subject: GCC [RFC] Whole Program Devirtualization

Hello Feng Xue OS


Your project is interesting, but ambitious.

I think the major points are:

whole program analysis. Static analysis tools like https://frama-c.com/ or
https://github.com/bstarynk/bismon/ could be relevant. Projects like
https://www.decoder-project.eu/ could be relevant. With cross-compilation,
things become harder.

abstract interpretation might be relevant (but difficult and costly to
implement). See Wikipedia.

size of the whole program being analyzed. If the entire program (including
system libraries like libc) has e.g. fewer than ten thousand routines and
fewer than a million GIMPLE instructions in total, it makes sense. But if the
entire program is as large as the Linux kernel, the GCC compiler, or the
Firefox browser (all have many millions of lines of source code), you
probably won't be able to do whole-program devirtualization in a few years
of human work.


computed gotos or labels as values (see
https://gcc.gnu.org/onlinedocs/gcc/Labels-as-Values.html for more) make this
difficult. But they do exist, and could well be hidden in GNU glibc or
libstdc++ internal code.

asm statements are difficult. They usually appear inside your libc. How would
you deal with them?

Can you afford a month of computer time to compile a large piece of software
with your whole-program devirtualizer? In most cases, no, but Pitrat's book
Artificial Beings - the conscience of a conscious machine (ISBN 9781848211018)
suggests cases where it might make sense (he describes a "compiler-like
system" which runs for a month of CPU time).

My recommendation would be to first code a simple GCC plugin as a proof of
concept, which rejects programs that could not realistically be
devirtualized, and otherwise stores a representation of them somewhere (in
some database, perhaps). I worked 3 years full time on
https://github.com/bstarynk/bismon/ to achieve a similar goal (and I don't
claim to have succeeded, and I don't have any more funding). My guess is that
some of that code could be useful to you (then contact me by email both at
work basile.starynkevi...@cea.fr and at home
bas...@starynkevitch.net)

The most important thing: limit your ambition at first. Write a document (at
least an internal one) stating what you won't do.


Cheers

--
Basile Starynkevitch  

(only my opinions / the opinions are mine alone)
92340 Bourg-la-Reine, France
web page: starynkevitch.net/Basile/




PING: [RFC] Whole Program Devirtualization

2021-08-31 Thread Feng Xue OS via Gcc
Honza,

  What do you think about the proposal in this RFC? Thanks a lot.

Best Regards,
Feng

From: Martin Liška 
Sent: Friday, August 20, 2021 9:45 PM
To: Feng Xue OS; gcc@gcc.gnu.org
Cc: JiangNing OS; Jan Hubicka
Subject: Re: [RFC] Whole Program Devirtualization

... adding Honza to CC who spent quite some time on devirtualization pass.

Martin


PING^2: [RFC] Whole Program Devirtualization

2021-09-13 Thread Feng Xue OS via Gcc
Ping again.

Thanks,
Feng


From: Feng Xue OS 
Sent: Wednesday, September 1, 2021 9:46 AM
To: Martin Liška; Jan Hubicka; gcc@gcc.gnu.org
Cc: JiangNing OS
Subject: PING: [RFC] Whole Program Devirtualization

Honza,

  What do you think about the proposal in this RFC? Thanks a lot.

Best Regards,
Feng

From: Martin Liška 
Sent: Friday, August 20, 2021 9:45 PM
To: Feng Xue OS; gcc@gcc.gnu.org
Cc: JiangNing OS; Jan Hubicka
Subject: Re: [RFC] Whole Program Devirtualization

... adding Honza to CC who spent quite some time on devirtualization pass.

Martin