rfc: another switch optimization idea
Hi,

We noticed some performance gains from not using a jump table for some simple switch statements. Here is the idea: check whether the switch statement can be expanded with conditional instructions. In that case a jump table should be avoided, since the branch instructions can then be eliminated in later passes (replaced by conditional execution). For example:

    switch (i)
      {
      case 1: sum += 1; break;
      case 2: sum += 4; break;
      case 3: sum += 5; break;
      case 4: sum += 10; break;
      }

With a jump table, the following code is generated (ARM assembly):

        ldrcc pc, [pc, r0, lsl #2]
        b .L5
    .L0:
        .word .L1
        .word .L2
        .word .L3
        .word .L4
    .L1:
        add r3, #1
        b .L5
    .L2:
        add r3, #4
        b .L5
    .L3:
        add r3, #5
        b .L5
    .L4:
        add r3, #10
    .L5:

Although this code has constant complexity, it can be improved by conditional execution, which avoids the implicit branching:

        cmp r0, #1
        addeq r3, #1
        cmp r0, #2
        addeq r3, #4
        cmp r0, #3
        addeq r3, #5
        cmp r0, #4
        addeq r3, #10

Although this version executes more instructions, it does not disturb the CPU pipeline, since no branching is performed. The original version of the patch was developed by Alexey Kravets. I measured some performance improvements/regressions using the SPEC CPU2000 integer benchmark on Samsung's Exynos 5250.
Here are the results.

Before:

                      Base       Base       Base     Peak       Peak       Peak
    Benchmarks        Ref Time   Run Time   Ratio    Ref Time   Run Time   Ratio
    164.gzip          1400       287        487*     1400       288        485*
    175.vpr           1400       376        373*     1400       374        374*
    176.gcc           1100       121        912*     1100       118        933*
    181.mcf           1800       242        743*     1800       251        718*
    186.crafty        1000       159        628*     1000       165        608*
    197.parser        1800       347        518*     1800       329        547*
    252.eon           1300       960        135*     1300       960        135*
    253.perlbmk       1800       214        842*     1800       212        848*
    254.gap           1100       138        797*     1100       136        806*
    255.vortex        1900       253        750*     1900       255        744*
    256.bzip2         1500       237        632*     1500       230        653*
    300.twolf         X                              X
    SPECint_base2000                        561
    SPECint2000                                                            563

After:

    164.gzip          1400       286        490*     1400       288        486*
    175.vpr           1400       213        656*     1400       215        650*
    176.gcc           1100       119        923*     1100       118        933*
    181.mcf           1800       247        730*     1800       251        717*
    186.crafty        1000       145        688*     1000       150        664*
    197.parser        1800       296        608*     1800       275        654*
    252.eon           X                              X
    253.perlbmk       1800       206        872*     1800       211        853*
    254.gap           1100       133        825*     1100       131        838*
    255.vortex        1900       241        789*     1900       239        797*
    256.bzip2         1500       235        638*     1500       226        663*
    300.twolf         X                              X

The error in 252.eon was due to an incorrect setup. Also, "if (count > 3*PARAM_VALUE (PARAM_SWITCH_JUMP_TABLES_BB_OPS_LIMIT))" does not look correct; it is probably better to move this code to an earlier stage, just before gimple expansion, and record the preferred expansion (jump table or not) for every switch statement, to avoid dealing with RTL altogether.

thanks, Dinar.

Attachment: switch.patch (binary data)
Re: rfc: another switch optimization idea
Sorry, the numbers were too good to be true; something was wrong in my setup.

thanks, Dinar.
Re: OpenACC support in 4.9
Another interesting use case for OpenACC and OpenMP is mixing both standards' annotations on the same loop:

    // Compute matrix multiplication.
    #pragma omp parallel for default(none) shared(A,B,C,size)
    #pragma acc kernels pcopyin(A[0:size][0:size],B[0:size][0:size]) \
                        pcopyout(C[0:size][0:size])
    for (int i = 0; i < size; ++i) {
      for (int j = 0; j < size; ++j) {
        float tmp = 0.;
        for (int k = 0; k < size; ++k) {
          tmp += A[i][k] * B[k][j];
        }
        C[i][j] = tmp;
      }
    }

This means that OpenACC pragmas should be parsed before the OpenMP pass (in case both standards are enabled), before the OpenMP pass changes the annotated GIMPLE statements irrecoverably. In my view this use case could be handled, for example, in this way: we could add a temporary variable, for example "expand_gimple_with_openmp", and change the example above to something like this just before the OpenMP pass:

    if (expand_gimple_with_openmp) {
      #pragma omp parallel for default(none) shared(A,B,C,size)
      for (int i = 0; i < size; ++i) {
        for (int j = 0; j < size; ++j) {
          float tmp = 0.;
          for (int k = 0; k < size; ++k) {
            tmp += A[i][k] * B[k][j];
          }
          C[i][j] = tmp;
        }
      }
    } else {
      #pragma acc kernels pcopyin(A[0:size][0:size],B[0:size][0:size]) \
                          pcopyout(C[0:size][0:size])
      for (int i = 0; i < size; ++i) {
        for (int j = 0; j < size; ++j) {
          float tmp = 0.;
          for (int k = 0; k < size; ++k) {
            tmp += A[i][k] * B[k][j];
          }
          C[i][j] = tmp;
        }
      }
    }

Later, at the Graphite pass, we could recognize that the statement is a SCoP, produce a kernel for it, and then assume that the expand_gimple_with_openmp heuristic is false, so the OpenMP version of the loop can be eliminated (or vice versa). But we have to make sure that optimization passes do not change the OpenACC GIMPLE in a way that makes it unparallelizable.

thanks, Dinar.
On Fri, May 10, 2013 at 2:06 PM, Tobias Burnus wrote:
> Jakub Jelinek wrote:
> [Fallback generation of CPU code]
>> If one uses the OpenMP 4.0 accelerator pragmas, then that is the required
>> behavior; if the code is for whatever reason not possible to run on the
>> accelerator, it should be executed on the host [...]
>
> (I haven't checked, but is this a compile-time or a run-time requirement?)
>
>> Otherwise, the OpenMP runtime as well as the pragmas have a way to choose
>> which accelerator you want to run something on, as a device id (integer),
>> so the OpenMP runtime library should maintain the list of supported
>> accelerators (say, if you have two Intel MIC cards and two AMD GPGPU
>> devices), and probably we'll need a compiler switch to say which kinds
>> of accelerators we want to generate code for; plus the runtime could have
>> dlopened plugins for each of the accelerator kinds.
>
> At least two OpenACC implementations I know of fail hard when the GPU is
> not available (nonexistent, or /dev/... does not have the right
> permissions). And three of them fail at compile time with an error message
> if an expression within a device section is not possible (e.g. calling
> some non-device/non-inlinable function).
>
> While it is convenient to have CPU fallback, it would be nice to know
> whether some code actually uses the accelerator, both at compile time and
> at run time. Otherwise, one thinks the GPU is used without realizing that
> it isn't because, e.g., the device permissions are wrong, or one forgot to
> declare a certain function as a target function.
>
> Besides having a flag which tells the compiler for which accelerator the
> code should be generated, additional flags should also be handled, e.g.
> for different versions of the accelerator. For instance, one accelerator
> model of the same series might support double-precision variables while
> another might not. I assume that falling back to the CPU if the
> accelerator doesn't support a certain feature won't work, and one will get
> an error in this case.
>
> Is there actually a need to handle multiple accelerators simultaneously?
> My impression is that both OpenACC and OpenMP 4 assume that there is only
> one kind of accelerator available besides the host. If I missed some fine
> print, or something else requires that there be multiple different
> accelerators, it will get more complicated, especially for those code
> sections where the user didn't explicitly specify which one should be
> used.
>
> Finally, one should think about debugging. It is not really clear (to me)
> how to handle this best, but as the compiler generates quite some
> additional code (e.g. for copying the data around), and as printf
> debugging doesn't work on GPUs, it is not that easy. I wonder whether
> there should be an optional library like libgomp_debug which adds
> additional sanity checks (e.g. related to copying data to/from the GPU)
> and which allows printing diagnostic output when one sets an environment
> variable.
>
> Tobias
bad reassociation with signed integer code after PR45232.
Hi,

I noticed a minor regression with signed integer operations in "the proprietary" code since http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45232. Of course, I could use the "-fwrapv" flag, but my question is: why couldn't we add overflow checking for signed integers, for example in int_const_binop() in fold-const.c, to restore the original behavior of the reassociation pass?

thanks, Dinar.
RFC: IPACP function cloning without LTO
Hi,

The current implementation of IPA-CP doesn't allow cloning a function if its caller(s) are located in another object file. Of course, there is no such problem if we can use LTO, but it would be very interesting to have this functionality even without LTO. This could be changed if, for example, we called the cloned instance of the function from the original instance in the function prologue. Here is what I mean:

    int func (int a, ...)
    {
      if (a == some_constant)
        return func.constprop.0 (...);
      /* ... general code ... */
    }

thanks, Dinar.
Re: RFC: IPACP function cloning without LTO
On Wed, Mar 6, 2013 at 4:43 PM, Martin Jambor wrote:
> Hi,
>
> On Wed, Mar 06, 2013 at 04:00:52PM +0400, Dinar Temirbulatov wrote:
>> Hi,
>> The current implementation of IPACP doesn't allowed to clone function
>> if caller(s) to that function is located in another object.
>
> That is not exactly true. With -fipa-cp-clone (default at -O3),
> IPA-CP is happy to clone a function that is callable from outside of
> the current compilation unit. Of course, only calls from within the
> CU are redirected without LTO.

Yes, but that would still require manual preparation of a CU for a selected number of objects.

> And code size may grow significantly,
> which is why IPA-CP does this only if it deems the estimated
> cost/benefit ratio to still be quite good.
>
>> Of course,
>> no such problems if we could utilized LTO. And it is very interesting
>> to have such functionality of compiler even without LTO. It could be
>> changed, if for example we could call to the cloned instance of that
>> function from the original instance of function in the function
>> prolog:
>> Here is what I mean:
>>
>> int func(int a, .)
>> {
>> if (a==some_constant)
>> func.constprop.0();
>>
>> thanks, Dinar.
>
> well, you could just as well put the quick version right into the
> original function (and execute the original in the else branch). If
> it is small and you did this in an early pass, IPA-SPLIT might even
> help the inliner to inline it into known callers.

Yes, function cloning is just one example here.

> The tough part, however, is determining when this is such a good idea.
> Do you have any particular situation in mind?

I don't have one, but for function cloning, good_cloning_opportunity_p() is a good place to start.

> Thanks,
>
> Martin
Move STV(scalars_to_vector) RTL pass from i386 to target independent
Hi,

I have observed that the STV2 pass improved CPU2006 456.hmmer by ~20%, mostly by transforming V4SI operations. Looking at the pass itself, it looks like it could be turned into an architecture-independent RTL pass, since it deals only with non-wide integer operations. I think it might be useful on other targets as well?

Thanks, Dinar.