rfc: another switch optimization idea

2013-03-25 Thread Dinar Temirbulatov
Hi,
We noticed some performance gains from not using jump tables for some
simple switch statements. Here is the idea: check whether the switch
statement can be expanded with conditional instructions. In that case
the jump table should be avoided, since the remaining branch instructions
can be eliminated in later passes (replaced by conditional execution).

   For example:
   switch (i)
   {
     case 1: sum += 1; break;
     case 2: sum += 4; break;
     case 3: sum += 5; break;
     case 4: sum += 10; break;
   }

With a jump table, the following code is generated (ARM assembly):

   ldrcc pc, [pc, r0, lsl #2]
   b .L5
   .L0:
        .word .L1
        .word .L2
        .word .L3
        .word .L4

   .L1:
        add r3, #1
        b .L5
   .L2:
        add r3, #4
        b .L5
   .L3:
        add r3, #5
        b .L5
   .L4:
        add r3, #10
   .L5:

Although this code runs in constant time, it can be improved with
conditional execution, which avoids the branching entirely:

   cmp r0, #1
   addeq r3, #1
   cmp r0, #2
   addeq r3, #4
   cmp r0, #3
   addeq r3, #5
   cmp r0, #4
   addeq r3, #10

Although this version executes more instructions, it does not disrupt the
CPU pipeline, since no branching is performed.
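
For illustration only (this code is not part of the patch), here is the
same idea expressed at the C source level; the second function is a
branch-free form equivalent to what conditional execution gives us:

   /* Standalone C illustration, not from the patch: the switch above vs.
      an equivalent branch-free form that maps naturally onto ARM
      conditional execution.  */
   int sum_switch (int i, int sum)
   {
     switch (i)
       {
       case 1: sum += 1;  break;
       case 2: sum += 4;  break;
       case 3: sum += 5;  break;
       case 4: sum += 10; break;
       }
     return sum;
   }

   int sum_conditional (int i, int sum)
   {
     /* Each comparison selects an addend; no control flow is needed.  */
     sum += (i == 1) * 1;
     sum += (i == 2) * 4;
     sum += (i == 3) * 5;
     sum += (i == 4) * 10;
     return sum;
   }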

The original version of the patch was developed by Alexey Kravets. I
measured performance improvements/regressions using the SPEC CPU2000
integer benchmarks on Samsung's Exynos 5250. Here are the results:

before:
                       Base      Base      Base      Peak      Peak      Peak
   Benchmarks          Ref Time  Run Time  Ratio     Ref Time  Run Time  Ratio
   164.gzip            1400      287       487*      1400      288       485*
   175.vpr             1400      376       373*      1400      374       374*
   176.gcc             1100      121       912*      1100      118       933*
   181.mcf             1800      242       743*      1800      251       718*
   186.crafty          1000      159       628*      1000      165       608*
   197.parser          1800      347       518*      1800      329       547*
   252.eon             1300      960       135*      1300      960       135*
   253.perlbmk         1800      214       842*      1800      212       848*
   254.gap             1100      138       797*      1100      136       806*
   255.vortex          1900      253       750*      1900      255       744*
   256.bzip2           1500      237       632*      1500      230       653*
   300.twolf                               X                             X
   SPECint_base2000                        561
   SPECint2000                                                           563

After:
                       Base      Base      Base      Peak      Peak      Peak
   Benchmarks          Ref Time  Run Time  Ratio     Ref Time  Run Time  Ratio
   164.gzip            1400      286       490*      1400      288       486*
   175.vpr             1400      213       656*      1400      215       650*
   176.gcc             1100      119       923*      1100      118       933*
   181.mcf             1800      247       730*      1800      251       717*
   186.crafty          1000      145       688*      1000      150       664*
   197.parser          1800      296       608*      1800      275       654*
   252.eon                                 X                             X
   253.perlbmk         1800      206       872*      1800      211       853*
   254.gap             1100      133       825*      1100      131       838*
   255.vortex          1900      241       789*      1900      239       797*
   256.bzip2           1500      235       638*      1500      226       663*
   300.twolf                               X                             X

The error in 252.eon was due to an incorrect setup. Also, "if (count >
3*PARAM_VALUE (PARAM_SWITCH_JUMP_TABLES_BB_OPS_LIMIT))" does not look
correct; it is probably better to move this check to an earlier stage,
just before GIMPLE expansion, and to record the preferred expansion
strategy (jump table or not) for every switch statement, so we avoid
dealing with the RTL altogether.
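
As a rough illustration of the kind of per-switch check I have in mind
(hypothetical code, not the patch: the function name, parameters and
threshold are placeholders standing in for values read from the switch
statement and from PARAM_SWITCH_JUMP_TABLES_BB_OPS_LIMIT):

   /* Hypothetical sketch, not the patch's code: decide per switch, before
      GIMPLE expansion, whether conditional expansion is preferable to a
      jump table.  ncases, ops_per_case and bb_ops_limit stand in for
      values the real check would read from the switch statement and from
      PARAM_SWITCH_JUMP_TABLES_BB_OPS_LIMIT.  */
   static int
   prefer_conditional_expansion_p (unsigned ncases, unsigned ops_per_case,
                                   unsigned bb_ops_limit)
   {
     /* Only small switches with cheap bodies are worth expanding as a
        chain of conditionally executed instructions; everything else
        keeps the jump table.  The exact threshold is what needs tuning.  */
     return ncases * ops_per_case <= bb_ops_limit;
   }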

                     thanks, Dinar.


switch.patch
Description: Binary data


Re: rfc: another switch optimization idea

2013-03-28 Thread Dinar Temirbulatov
Sorry, the numbers were too good; something was wrong in my setup.
 thanks, Dinar.
>> [SPEC CPU2000 before/after tables quoted from the previous message snipped]


Re: OpenACC support in 4.9

2013-05-11 Thread Dinar Temirbulatov
Another interesting use case for OpenACC and OpenMP is mixing both
standards' annotations on the same loop:
// Compute matrix multiplication.
#pragma omp parallel for default(none) shared(A,B,C,size)
#pragma acc kernels pcopyin(A[0:size][0:size],B[0:size][0:size]) \
  pcopyout(C[0:size][0:size])
  for (int i = 0; i < size; ++i) {
    for (int j = 0; j < size; ++j) {
      float tmp = 0.;
      for (int k = 0; k < size; ++k) {
        tmp += A[i][k] * B[k][j];
      }
      C[i][j] = tmp;
    }
  }
This means that, when both standards are enabled, the OpenACC pragmas
should be parsed before the OpenMP pass changes the annotated GIMPLE
statements irrecoverably. In my view this use case could be handled, for
example, in the following way: we could add a temporary variable, say
"expand_gimple_with_openmp", and transform the example above into
something like this just before the OpenMP pass:


if (expand_gimple_with_openmp) {
#pragma omp parallel for default(none) shared(A,B,C,size)
  for (int i = 0; i < size; ++i) {
    for (int j = 0; j < size; ++j) {
      float tmp = 0.;
      for (int k = 0; k < size; ++k) {
        tmp += A[i][k] * B[k][j];
      }
      C[i][j] = tmp;
    }
  }
} else {
#pragma acc kernels pcopyin(A[0:size][0:size],B[0:size][0:size]) \
  pcopyout(C[0:size][0:size])
  for (int i = 0; i < size; ++i) {
    for (int j = 0; j < size; ++j) {
      float tmp = 0.;
      for (int k = 0; k < size; ++k) {
        tmp += A[i][k] * B[k][j];
      }
      C[i][j] = tmp;
    }
  }
}
Later, in the Graphite pass, we could determine that the statement is a
SCoP, produce a kernel for it, and then set the expand_gimple_with_openmp
heuristic to false so that the OpenMP version of the loop is eliminated,
or vice versa. But we have to make sure that the optimization passes do
not transform our OpenACC GIMPLE in a way that makes it impossible to
parallelize.
   thanks, Dinar.

On Fri, May 10, 2013 at 2:06 PM, Tobias Burnus wrote:
> Jakub Jelinek wrote:
> [Fallback generation of CPU code]
>>
>> If one uses the OpenMP 4.0 accelerator pragmas, then that is the required
>> behavior: if the code is for whatever reason not possible to run on the
>> accelerator, it should be executed on the host [...]
>
> (I haven't checked, but is this a compile time or run-time requirement?)
>
>
>> Otherwise, the OpenMP runtime as well as the pragmas have a way to choose
>> which accelerator you want to run something on, as device id (integer), so
>> the OpenMP runtime library should maintain the list of supported
>> accelerators (say if you have two Intel MIC cards, and two AMD GPGPU
> devices), and probably we'll need a compiler switch to say which kinds
> of accelerators we want to generate code for, plus the runtime could have
>> dlopened plugins for each of the accelerator kinds.
>
>
> At least two OpenACC implementations I know of fail hard when the GPU is not
> available (nonexistent, or the /dev/... node lacks the right permissions).
> And three of them fail at compile time with an error message if an
> expression within a device section cannot be handled (e.g. calling some
> non-device/non-inlinable function).
>
> While it is convenient to have CPU fallback, it would be nice to know
> whether some code actually uses the accelerator - both at compile time and
> at run time. Otherwise, one thinks the GPU is used - without realizing
> that it isn't because, e.g., the device permissions are wrong - or one forgot
> to declare a certain function as a target function.
>
> Besides having a flag which tells the compiler for which accelerator the
> code should be generated, additional flags should also be handled, e.g. for
> different versions of the accelerator. For instance, one accelerator model
> of the same series might support double-precision variables while another
> might not. - I assume that falling back to the CPU if the accelerator
> doesn't support a certain feature won't work and one will get an error in
> this case.
>
>
> Is there actually the need to handle multiple accelerators simultaneously?
> My impression is that both OpenACC and OpenMP 4 assume that there is only
> one kind of accelerator available besides the host. If I missed some fine
> print or something else requires that there are multiple different
> accelerators, it will get more complicated - especially for those code
> sections where the user didn't explicitly specify which one should be used.
>
>
> Finally, one should think about debugging. It is not really clear (to me)
> how to handle this best, but as the compiler generates quite some additional
> code (e.g. for copying the data around) and as printf debugging doesn't work
> on GPUs, it is not that easy. I wonder whether there should be an optional
> library like libgomp_debug which adds additional sanity checks (e.g. related
> to copying data to/from the GPU) and which allows printing diagnostic
> output when one sets an environment variable.
>
> Tobias


bad reassociation with signed integer code after PR45232.

2012-09-24 Thread Dinar Temirbulatov
Hi,
I noticed a minor regression with signed integer operations in "the
proprietary" code since
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45232. Of course, I could use
the "-fwrapv" flag, but my question is: why couldn't we add overflow
checking in, for example, int_const_binop() in fold-const.c for signed
integers to restore the original behavior of the reassociation pass?
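
To make this concrete, here is an illustrative example (not taken from
the code I measured) of the kind of expression involved: combining the
two constants is safe here, but in general the reassociation pass has to
prove that regrouping signed operands cannot introduce intermediate
overflow, which is undefined for int, whereas for unsigned operands the
regrouping is always valid.

/* Illustrative only: folding the constants into (a + b) + 12 requires
   reasoning about signed overflow, since intermediate overflow is
   undefined behaviour for int; for the unsigned variant the same
   regrouping is unconditionally valid.  */
int f_signed (int a, int b)
{
  return (a + 5) + (b + 7);
}

unsigned f_unsigned (unsigned a, unsigned b)
{
  return (a + 5u) + (b + 7u);
}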
  thanks, Dinar.


RFC: IPACP function cloning without LTO

2013-03-06 Thread Dinar Temirbulatov
Hi,
The current implementation of IPA-CP doesn't allow cloning a function if
its caller(s) are located in another object file. Of course, there is no
such problem if we can use LTO, but it would be very interesting to have
this functionality even without LTO. That could be achieved if, for
example, we called the cloned instance of the function from the original
instance in the function prologue.
Here is what I mean:

int func(int a, ...)
{
  if (a == some_constant)
    func.constprop.0();
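
A slightly fuller, self-contained sketch of the same idea (hypothetical
code: SOME_CONSTANT, func_constprop_0 and the bodies are only
illustrative, not real IPA-CP output):

/* Hypothetical illustration of the dispatch-in-prologue idea;
   func_constprop_0 stands for the clone IPA-CP would generate for
   a == SOME_CONSTANT, and the bodies are made up.  */
#define SOME_CONSTANT 42

static int
func_constprop_0 (void)
{
  /* Specialized body with the known constant folded in.  */
  return SOME_CONSTANT * 2;
}

int
func (int a)
{
  /* Check added in the prologue of the original, externally visible
     function: forward the known-constant case to the clone.  */
  if (a == SOME_CONSTANT)
    return func_constprop_0 ();

  return a * 2;  /* general path */
}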

 thanks, Dinar.


Re: RFC: IPACP function cloning without LTO

2013-03-07 Thread Dinar Temirbulatov
On Wed, Mar 6, 2013 at 4:43 PM, Martin Jambor wrote:
> Hi,
>
> On Wed, Mar 06, 2013 at 04:00:52PM +0400, Dinar Temirbulatov wrote:
>> Hi,
>> The current implementation of IPA-CP doesn't allow cloning a function
>> if its caller(s) are located in another object file.
>
> That is not exactly true.  With -fipa-cp-clone (default at -O3),
> IPA-CP is happy to clone a function that is callable from outside of
> the current compilation unit.  Of course, only calls from within the
> CU are redirected without LTO.
Yes, but that would still require manually preparing a CU for a selected
set of objects.

> And code size may grow significantly,
> which is why IPA-CP does this only if it deems the estimated
> cost/benefit ratio to still be quite good.
>
>> Of course,
>> there is no such problem if we can use LTO, but it would be very
>> interesting to have this functionality even without LTO. That could be
>> achieved if, for example, we called the cloned instance of the function
>> from the original instance in the function prologue.
>> Here is what I mean:
>>
>> int func(int a, ...)
>> {
>>   if (a == some_constant)
>>     func.constprop.0();
>>
>>  thanks, Dinar.
>
> well, you could just as well put the quick version right into the
> original function (and execute the original in the else branch).  If
> it is small and you did this in an early pass, IPA-SPLIT might even
> help inliner to inline it into known callers.
Yes, function cloning is just one example here.

>
> The tough part, however, is determining when this is such a good idea.
> Do you have any particular situation in mind?
I don't have one. But for function cloning, for example,
good_cloning_opportunity_p() is a good place to start.

>
> Thanks,
>
> Martin


Move STV (scalars_to_vector) RTL pass from i386 to target independent code

2020-12-09 Thread Dinar Temirbulatov via Gcc
Hi,
I have observed that the STV2 pass gives a ~20% improvement on CPU2006
456.hmmer, mostly by transforming V4SI operations. Looking at the pass
itself, it looks like it could be made architecture-independent at the
RTL level, since the pass deals only with non-wide integer operations. I
think it might be useful on other targets as well.
   Thanks, Dinar.