Re: prevent optimisation of variables on SSA
Hi,

I will try to explain in more detail: I am trying to implement support
for transactions in GCC, in particular for a Software Transactional
Memory library called tinySTM. Since transactions have the nice
property that they can abort and execute again, we need some kind of
checkpointing at the beginning of a transaction. The checkpointing is
done in two parts: on one hand, setjmp is used to capture the register
state; on the other hand, temporary variables are introduced by GCC to
capture the values of live-in variables of the transaction and play
them back to the original variables in case of an abort. In terms of IR
code, the beginning of a transaction is translated into this:

  txn_handle.12_4 = __builtin_stm_new ();
  jmp_buf.13_5 = __builtin_stm_get_env (txn_handle.12_4);
  ssj_value.14_6 = _setjmp (jmp_buf.13_5);
  if (ssj_value.14_6 == 0)
    goto ();
  else
    goto ();
  # SUCC: 4 (false) 3 (true)

  # BLOCK 3
  # PRED: 2 (true)
  :;
  __builtin_printf ("SAVE_BB\n");
  txn_save_m.19_15 = m_1;
  txn_save_a.20_20 = a_2;
  txn_save_b.21_25 = b_3;
  fake_var.18_13 = __builtin_stm_dummy ("RECOVER_BB\n");
  if (fake_var.18_13)
    goto ();
  else
    goto ;
  # SUCC: 4 (true) 5 (false)

  # BLOCK 4
  # PRED: 2 (false) 3 (true)
  # txn_save_b.21_26 = PHI
  # txn_save_a.20_21 = PHI
  # txn_save_m.19_16 = PHI
  :;
  __builtin_printf ("RECOVER_BB\n");
  m_17 = txn_save_m.19_16;
  a_22 = txn_save_a.20_21;
  b_27 = txn_save_b.21_26;
  # SUCC: 5 (fallthru)

  # BLOCK 5
  # PRED: 4 (fallthru) 3 (false)
  # m_18 = PHI
  __builtin_stm_start (txn_handle.12_4, jmp_buf.13_5, &0);

The first part acquires a handle for the transaction from the STM, then
a jump buffer is created by the STM and the setjmp is executed. Since
the longjmp executed by the STM (in case of an abort) returns to this
location, we simply evaluate the return value of the setjmp call and
know what to execute. When we return from the call to setjmp we want to
capture the values of the live-in variables, which are a, b and m in
this example. When we return from a longjmp we want to restore the
values from the temporary variables (txn_save_a/b/m) to the original
variables (a/b/m).

My problem with the SSA optimisations is that (after the SSA
final_cleanup pass) the initialisation of the variables a and b is
moved into the SAVE_BB and RECOVER_BB basic blocks - both end up
looking like this:

  a = 5;
  b = 10;

and the txn_save_* variables are optimised "away". Is there any way to
prevent this kind of optimisation on SSA form for a particular group of
variables? Here I would like the txn_save_* variables to behave as
containers for the values of the real variables and be sure that these
variables are not touched by any optimisations.

Thank you very much for your ideas!

Regards,
Martin
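At the source level, the pattern this GIMPLE implements corresponds
roughly to the following C sketch. The stm_new/stm_get_env/stm_start
functions stand in for the __builtin_stm_* calls of the modified
compiler; their prototypes and the hand-written save/restore (which the
compiler-generated txn_save_* temporaries replace) are assumptions made
for illustration only.

  /* Sketch of the setjmp-based checkpointing pattern described above.
     stm_new/stm_get_env/stm_start are stand-ins for the __builtin_stm_*
     calls; prototypes are assumed.  */
  #include <setjmp.h>

  extern void *stm_new (void);
  extern jmp_buf *stm_get_env (void *handle);
  extern void stm_start (void *handle, jmp_buf *env, int flags);

  static void
  transaction_begin_sketch (int *a, int *b, int *m)
  {
    void *txn_handle = stm_new ();
    jmp_buf *env = stm_get_env (txn_handle);
    int save_a = 0, save_b = 0, save_m = 0;  /* the txn_save_* temporaries */

    if (_setjmp (*env) == 0)
      {
        /* SAVE_BB: first entry, checkpoint the live-in values.  */
        save_a = *a;
        save_b = *b;
        save_m = *m;
      }
    else
      {
        /* RECOVER_BB: re-entered via longjmp after an abort, restore.  */
        *a = save_a;
        *b = save_b;
        *m = save_m;
      }

    stm_start (txn_handle, env, 0);
  }

Note that in plain C the save_* locals would have to be volatile (or
otherwise protected) to be guaranteed to survive the longjmp, which is
essentially the property being asked for above for the
compiler-generated temporaries.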
Re: prevent optimisation of variables on SSA
On Mon, Aug 18, 2008 at 1:15 PM, Martin Schindewolf <[EMAIL PROTECTED]> wrote:
> Hi,
>
> I will try to explain in more detail: I am trying to implement support
> for transactions in GCC, in particular for a Software Transactional
> Memory library called tinySTM. Since transactions have the nice
> property that they can abort and execute again, we need some kind of
> checkpointing at the beginning of a transaction. The checkpointing is
> done in two parts: on one hand, setjmp is used to capture the register
> state; on the other hand, temporary variables are introduced by GCC to
> capture the values of live-in variables of the transaction and play
> them back to the original variables in case of an abort. In terms of
> IR code, the beginning of a transaction is translated into this:
>
>   txn_handle.12_4 = __builtin_stm_new ();
>   jmp_buf.13_5 = __builtin_stm_get_env (txn_handle.12_4);
>   ssj_value.14_6 = _setjmp (jmp_buf.13_5);
>   if (ssj_value.14_6 == 0)
>     goto ();
>   else
>     goto ();
>   # SUCC: 4 (false) 3 (true)
>
>   # BLOCK 3
>   # PRED: 2 (true)
>   :;
>   __builtin_printf ("SAVE_BB\n");
>   txn_save_m.19_15 = m_1;
>   txn_save_a.20_20 = a_2;
>   txn_save_b.21_25 = b_3;
>   fake_var.18_13 = __builtin_stm_dummy ("RECOVER_BB\n");
>   if (fake_var.18_13)
>     goto ();
>   else
>     goto ;
>   # SUCC: 4 (true) 5 (false)
>
>   # BLOCK 4
>   # PRED: 2 (false) 3 (true)
>   # txn_save_b.21_26 = PHI
>   # txn_save_a.20_21 = PHI
>   # txn_save_m.19_16 = PHI
>   :;
>   __builtin_printf ("RECOVER_BB\n");
>   m_17 = txn_save_m.19_16;
>   a_22 = txn_save_a.20_21;
>   b_27 = txn_save_b.21_26;
>   # SUCC: 5 (fallthru)
>
>   # BLOCK 5
>   # PRED: 4 (fallthru) 3 (false)
>   # m_18 = PHI
>   __builtin_stm_start (txn_handle.12_4, jmp_buf.13_5, &0);
>
> The first part acquires a handle for the transaction from the STM,
> then a jump buffer is created by the STM and the setjmp is executed.
> Since the longjmp executed by the STM (in case of an abort) returns to
> this location, we simply evaluate the return value of the setjmp call
> and know what to execute. When we return from the call to setjmp we
> want to capture the values of the live-in variables, which are a, b
> and m in this example. When we return from a longjmp we want to
> restore the values from the temporary variables (txn_save_a/b/m) to
> the original variables (a/b/m).
>
> My problem with the SSA optimisations is that (after the SSA
> final_cleanup pass) the initialisation of the variables a and b is
> moved into the SAVE_BB and RECOVER_BB basic blocks - both end up
> looking like this:
>
>   a = 5;
>   b = 10;
>
> and the txn_save_* variables are optimised "away". Is there any way to
> prevent this kind of optimisation on SSA form for a particular group
> of variables? Here I would like the txn_save_* variables to behave as
> containers for the values of the real variables and be sure that these
> variables are not touched by any optimisations.

Well, if you are inserting this abnormal control flow after going into
SSA then you need to make sure all SSA variables affected are marked as
SSA_NAME_OCCURS_IN_ABNORMAL_PHI, and that SSA names from a single base
variable do not have overlapping life-ranges. Which means what you are
doing is not really going to be easy in SSA form, and you should rather
try to do it before going into SSA. And I would suggest you use the
exception handling machinery of GCC to do it and not insert
setjmp/longjmp calls yourself or try to track the complete program
state.

Richard.

> Thank you very much for your ideas!
>
> Regards,
> Martin
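For reference, applying the marking Richard mentions is a one-line
macro per SSA name; a minimal sketch of how a checkpointing pass could
apply it to every name in a statement is below. This is written against
the 4.3-era tree-SSA API; the helper name is made up, and the usual GCC
source prologue (config.h, system.h, coretypes.h, tm.h) is assumed.

  /* Minimal sketch, not actual GCC code: mark every SSA name appearing
     in STMT as occurring in an abnormal PHI, as needed for names that
     are live across the setjmp-style re-entry point.  */
  #include "tree.h"
  #include "tree-flow.h"

  static void
  mark_stmt_names_abnormal (tree stmt)
  {
    tree op;
    ssa_op_iter iter;

    FOR_EACH_SSA_TREE_OPERAND (op, stmt, iter, SSA_OP_ALL_OPERANDS)
      SSA_NAME_OCCURS_IN_ABNORMAL_PHI (op) = 1;
  }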
Creating own Backend: Segmentation fault in mark_jump_label_1
Hello everybody,

I am creating a GCC backend. I have already written the *.md, *.h and
*.c files and I have compiled the GCC that includes my backend. But
when I try to compile a simple C file with my GCC, I get a segmentation
fault. I tried to debug it but I don't get the point. The error occurs
in mark_jump_label_1, which is called with a null pointer as the rtx x.
gdb told me:

  Program received signal SIGSEGV, Segmentation fault.
  0x081c5d48 in mark_jump_label_1 (x=0x0, insn=0xb7b77118, in_mem=0 '\0',
      is_target=0 '\0') at ../.././gcc/jump.c:987
  987       RTX_CODE code = GET_CODE (x);
  (gdb) where
  #0  0x081c5d48 in mark_jump_label_1 (x=0x0, insn=0xb7b77118, in_mem=0 '\0',
      is_target=0 '\0') at ../.././gcc/jump.c:987
  #1  0x081c60e0 in mark_jump_label_1 (x=0xb7b70e28, insn=0xb7b77118,
      in_mem=0 '\0', is_target=0 '\0') at ../.././gcc/jump.c:1108
  #2  0x081c5d2f in mark_jump_label (x=0xb7b70e28, insn=0xb7b77118, in_mem=0)
      at ../.././gcc/jump.c:974
  #3  0x081c4af4 in mark_all_labels (f=0xb7bdb120) at ../.././gcc/jump.c:194
  #4  0x081c48f7 in rebuild_jump_labels (f=0xb7bdb120) at ../.././gcc/jump.c:86
  #5  0x08418fdf in tree_expand_cfg () at ../.././gcc/cfgexpand.c:1935
  #6  0x08203198 in execute_one_pass (pass=0x856a640) at ../.././gcc/passes.c:1122
  #7  0x082032e4 in execute_pass_list (pass=0x856a640) at ../.././gcc/passes.c:1175
  #8  0x082ca0fd in tree_rest_of_compilation (fndecl=0xb7bd7a10)
      at ../.././gcc/tree-optimize.c:404
  #9  0x083db7bd in cgraph_expand_function (node=0xb7b73280)
      at ../.././gcc/cgraphunit.c:1157
  #10 0x083da461 in cgraph_assemble_pending_functions ()
      at ../.././gcc/cgraphunit.c:523
  #11 0x083da72d in cgraph_finalize_function (decl=0xb7bd7a10, nested=0 '\0')
      at ../.././gcc/cgraphunit.c:641
  #12 0x0805a470 in finish_function () at ../.././gcc/c-decl.c:6803
  #13 0x080a7f9b in c_parser_declaration_or_fndef (parser=0xb7be2b28,
      fndef_ok=1 '\001', empty_ok=1 '\001', nested=0 '\0',
      start_attr_ok=1 '\001') at ../.././gcc/c-parser.c:1424
  #14 0x080a7920 in c_parser_external_declaration (parser=0xb7be2b28)
      at ../.././gcc/c-parser.c:1179
  #15 0x080a762c in c_parser_translation_unit (parser=0xb7be2b28)
      at ../.././gcc/c-parser.c:1081
  #16 0x080b44c1 in c_parse_file () at ../.././gcc/c-parser.c:8011
  #17 0x0809b675 in c_common_parse_file (set_yydebug=0)
      at ../.././gcc/c-opts.c:1291
  #18 0x082861fa in compile_file () at ../.././gcc/toplev.c:1042
  #19 0x08287bc4 in do_compile () at ../.././gcc/toplev.c:2247
  #20 0x08287c26 in toplev_main (argc=2, argv=0xbfa608c4)
      at ../.././gcc/toplev.c:2279
  #21 0x080bba42 in main (argc=Cannot access memory at address 0x0)
      at ../.././gcc/main.c:35

Which target macro or which part of the machine description could cause
such an error?

Thank you for your help!
Balthasar
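One way to narrow this down is to catch the NULL operand right where it
is dereferenced and dump the insn that carries it. The following is a
debugging sketch only, not a fix, that could temporarily be dropped
into mark_jump_label_1 in gcc/jump.c:

  /* Debugging aid: report which insn contains a NULL sub-rtx while
     jump labels are being marked -- typically a jump or branch pattern
     from the new .md whose label or condition operand was never filled
     in.  */
  if (x == NULL_RTX)
    {
      fprintf (stderr, "NULL sub-rtx while marking jump labels in:\n");
      debug_rtx (insn);       /* print the full insn to stderr */
      gcc_unreachable ();     /* stop here instead of segfaulting */
    }

With the offending insn printed by debug_rtx, it is usually possible to
tell which define_insn or define_expand in the machine description
emitted it and which operand was left empty.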
GCC 4.3.2 Status Report (2008-08-18)
Status
======

Although the number of open regression PRs is increasing, probably as
more people start to use 4.3 branch, there are currently no open P1 PRs
and it does not look like there are any serious PRs open for
regressions relative to 4.3.0 or 4.3.1. Thus, I propose to make
4.3.2-rc1 tomorrow, with a release or -rc2 following about a week
later. The process of fixing regressions can continue for subsequent
4.3 releases, which I expect at approximately two-month intervals until
4.4.0 is released.

Starting 17:00 UTC tomorrow, all changes to 4.3 branch will need to be
approved by a release manager until the 4.3.2 release is out. (Fixes
that might be appropriate while in this state would include those for
wrong-code or rejects-valid regressions relative to 4.3.0 or 4.3.1, or
fixes for such regressions relative to previous releases that can be
seen to be particularly safe.)

Quality Data
============

Priority          #   Change from Last Report
--------        ---   -----------------------
P1                0   -  3
P2              116   +  3
P3                7   +  7
--------        ---   -----------------------
Total           123   +  7

Previous Report
===============

http://gcc.gnu.org/ml/gcc/2008-08/msg00132.html

I will send the next report for the 4.3 branch at the time of the 4.3.2
release or -rc2.

-- 
Joseph S. Myers
[EMAIL PROTECTED]
vectorizer question
The attached testcase yields (on a Core 2 Duo, GCC trunk):

  > gfortran -O3 -ftree-vectorize -ffast-math -march=native test.f90
  > time ./a.out
  real    0m3.414s

  > ifort -xT -O3 test.f90
  > time ./a.out
  real    0m1.556s

The assembly contains:

           ifort   gfortran
  mulpd    140     0
  mulsd    0       280

so the reason seems to be that ifort vectorizes the following code
(full testcase attached):

  SUBROUTINE collocate_core_6(res,coef_xyz,pol_x,pol_y,pol_z,cmax,kg,jg)
    IMPLICIT NONE
    INTEGER, PARAMETER :: wp = SELECTED_REAL_KIND ( 14, 200 )
    integer, PARAMETER :: lp=6
    real(wp), INTENT(OUT):: res
    integer, INTENT(IN) :: cmax,kg,jg
    real(wp), INTENT(IN):: pol_x(0:lp,-cmax:cmax)
    real(wp), INTENT(IN):: pol_y(1:2,0:lp,-cmax:0)
    real(wp), INTENT(IN):: pol_z(1:2,0:lp,-cmax:0)
    real(wp), INTENT(IN):: coef_xyz(((lp+1)*(lp+2)*(lp+3))/6)
    real(wp) :: coef_xy(2,(lp+1)*(lp+2)/2)
    real(wp) :: coef_x(4,0:lp)
  [...]
    coef_x(1:2,4)=coef_x(1:2,4)+coef_xy(1:2,12)*pol_y(1,1,jg)
    coef_x(3:4,4)=coef_x(3:4,4)+coef_xy(1:2,12)*pol_y(2,1,jg)
    coef_x(1:2,5)=coef_x(1:2,5)+coef_xy(1:2,13)*pol_y(1,1,jg)
    coef_x(3:4,5)=coef_x(3:4,5)+coef_xy(1:2,13)*pol_y(2,1,jg)
    coef_x(1:2,0)=coef_x(1:2,0)+coef_xy(1:2,14)*pol_y(1,2,jg)
    coef_x(3:4,0)=coef_x(3:4,0)+coef_xy(1:2,14)*pol_y(2,2,jg)
    coef_x(1:2,1)=coef_x(1:2,1)+coef_xy(1:2,15)*pol_y(1,2,jg)
    coef_x(3:4,1)=coef_x(3:4,1)+coef_xy(1:2,15)*pol_y(2,2,jg)
    coef_x(1:2,2)=coef_x(1:2,2)+coef_xy(1:2,16)*pol_y(1,2,jg)
    coef_x(3:4,2)=coef_x(3:4,2)+coef_xy(1:2,16)*pol_y(2,2,jg)
    coef_x(1:2,3)=coef_x(1:2,3)+coef_xy(1:2,17)*pol_y(1,2,jg)
    coef_x(3:4,3)=coef_x(3:4,3)+coef_xy(1:2,17)*pol_y(2,2,jg)
    coef_x(1:2,4)=coef_x(1:2,4)+coef_xy(1:2,18)*pol_y(1,2,jg)
    coef_x(3:4,4)=coef_x(3:4,4)+coef_xy(1:2,18)*pol_y(2,2,jg)
    coef_x(1:2,0)=coef_x(1:2,0)+coef_xy(1:2,19)*pol_y(1,3,jg)
    coef_x(3:4,0)=coef_x(3:4,0)+coef_xy(1:2,19)*pol_y(2,3,jg)
  [...]

Either ifort is able to interpret the short array sections as such, or
it realizes that these very short implicit loops are nevertheless
favourable for vectorization.

Is there a trick to get gcc to vectorize these loops, or is there some
technology missing for this?

Should I file a PR for this (this is somewhat similar to PR31079 and
PR31021)?
Thanks in advance, Joost SUBROUTINE collocate_core_6(res,coef_xyz,pol_x,pol_y,pol_z,cmax,kg,jg) IMPLICIT NONE INTEGER, PARAMETER :: wp = SELECTED_REAL_KIND ( 14, 200 ) integer, PARAMETER :: lp=6 real(wp), INTENT(OUT):: res integer, INTENT(IN) :: cmax,kg,jg real(wp), INTENT(IN):: pol_x(0:lp,-cmax:cmax) real(wp), INTENT(IN):: pol_y(1:2,0:lp,-cmax:0) real(wp), INTENT(IN):: pol_z(1:2,0:lp,-cmax:0) real(wp), INTENT(IN):: coef_xyz(((lp+1)*(lp+2)*(lp+3))/6) real(wp) :: coef_xy(2,(lp+1)*(lp+2)/2) real(wp) :: coef_x(4,0:lp) coef_xy=0.0_wp coef_xy(:,1)=coef_xy(:,1)+coef_xyz(1)*pol_z(:,0,kg) coef_xy(:,2)=coef_xy(:,2)+coef_xyz(2)*pol_z(:,0,kg) coef_xy(:,3)=coef_xy(:,3)+coef_xyz(3)*pol_z(:,0,kg) coef_xy(:,4)=coef_xy(:,4)+coef_xyz(4)*pol_z(:,0,kg) coef_xy(:,5)=coef_xy(:,5)+coef_xyz(5)*pol_z(:,0,kg) coef_xy(:,6)=coef_xy(:,6)+coef_xyz(6)*pol_z(:,0,kg) coef_xy(:,7)=coef_xy(:,7)+coef_xyz(7)*pol_z(:,0,kg) coef_xy(:,8)=coef_xy(:,8)+coef_xyz(8)*pol_z(:,0,kg) coef_xy(:,9)=coef_xy(:,9)+coef_xyz(9)*pol_z(:,0,kg) coef_xy(:,10)=coef_xy(:,10)+coef_xyz(10)*pol_z(:,0,kg) coef_xy(:,11)=coef_xy(:,11)+coef_xyz(11)*pol_z(:,0,kg) coef_xy(:,12)=coef_xy(:,12)+coef_xyz(12)*pol_z(:,0,kg) coef_xy(:,13)=coef_xy(:,13)+coef_xyz(13)*pol_z(:,0,kg) coef_xy(:,14)=coef_xy(:,14)+coef_xyz(14)*pol_z(:,0,kg) coef_xy(:,15)=coef_xy(:,15)+coef_xyz(15)*pol_z(:,0,kg) coef_xy(:,16)=coef_xy(:,16)+coef_xyz(16)*pol_z(:,0,kg) coef_xy(:,17)=coef_xy(:,17)+coef_xyz(17)*pol_z(:,0,kg) coef_xy(:,18)=coef_xy(:,18)+coef_xyz(18)*pol_z(:,0,kg) coef_xy(:,19)=coef_xy(:,19)+coef_xyz(19)*pol_z(:,0,kg) coef_xy(:,20)=coef_xy(:,20)+coef_xyz(20)*pol_z(:,0,kg) coef_xy(:,21)=coef_xy(:,21)+coef_xyz(21)*pol_z(:,0,kg) coef_xy(:,22)=coef_xy(:,22)+coef_xyz(22)*pol_z(:,0,kg) coef_xy(:,23)=coef_xy(:,23)+coef_xyz(23)*pol_z(:,0,kg) coef_xy(:,24)=coef_xy(:,24)+coef_xyz(24)*pol_z(:,0,kg) coef_xy(:,25)=coef_xy(:,25)+coef_xyz(25)*pol_z(:,0,kg) coef_xy(:,26)=coef_xy(:,26)+coef_xyz(26)*pol_z(:,0,kg) coef_xy(:,27)=coef_xy(:,27)+coef_xyz(27)*pol_z(:,0,kg) coef_xy(:,28)=coef_xy(:,28)+coef_xyz(28)*pol_z(:,0,kg) coef_xy(:,1)=coef_xy(:,1)+coef_xyz(29)*pol_z(:,1,kg) coef_xy(:,2)=coef_xy(:,2)+coef_xyz(30)*pol_z(:,1,kg) coef_xy(:,3)=coef_xy(:,3)+coef_xyz(31)*pol_z(:,1,kg) coef_xy(:,4)=coef_xy(:,4)+coef_xyz(32)*pol_z(:,1,kg) coef_xy(:,5)=coef_xy(:,5)+coef_xyz(33)*pol_z(:,1,kg) coef_xy(:,6)=coef_xy(:,6)+coef_xyz(34)*pol_z(:,1,kg) coef_xy(:,8)=coef_xy(:,8)+coef_xyz(35)*pol_z(:,1,kg) coef_xy(:,9)=coef_xy(:,9)+coef_xyz(36)*pol_z(:,1,kg)
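For illustration only, here is a hypothetical C analogue of one pair of
the 2-wide updates discussed above (it is not part of the attached
Fortran testcase). This is the kind of straight-line, fixed-length code
an SLP-style vectorizer would ideally turn into packed mulpd/addpd
instead of scalar mulsd/addsd:

  /* Hypothetical C analogue of one pair of the 2-wide Fortran updates,
     e.g. coef_x(1:2,k) and coef_x(3:4,k).  */
  void
  update_pair (double coef_x[4], const double coef_xy[2],
               const double pol_y[2])
  {
    /* coef_x(1:2,k) = coef_x(1:2,k) + coef_xy(1:2,n) * pol_y(1,m,jg) */
    coef_x[0] += coef_xy[0] * pol_y[0];
    coef_x[1] += coef_xy[1] * pol_y[0];
    /* coef_x(3:4,k) = coef_x(3:4,k) + coef_xy(1:2,n) * pol_y(2,m,jg) */
    coef_x[2] += coef_xy[0] * pol_y[1];
    coef_x[3] += coef_xy[1] * pol_y[1];
  }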
Re: vectorizer question
2008/8/18 VandeVondele Joost <[EMAIL PROTECTED]>: > > The attached testcase yields (on a core2 duo, gcc trunk): > >> gfortran -O3 -ftree-vectorize -ffast-math -march=native test.f90 >> time ./a.out > > real0m3.414s > >> ifort -xT -O3 test.f90 >> time ./a.out > > real0m1.556s > > The assembly contains: > >ifort gfortran > mulpd 140 0 > mulsd 0280 > > so the reason seems that ifort vectorizes the following code (full testcase > attached): > > SUBROUTINE collocate_core_6(res,coef_xyz,pol_x,pol_y,pol_z,cmax,kg,jg) > > IMPLICIT NONE > INTEGER, PARAMETER :: wp = SELECTED_REAL_KIND ( 14, 200 ) > integer, PARAMETER :: lp=6 >real(wp), INTENT(OUT):: res >integer, INTENT(IN) :: cmax,kg,jg >real(wp), INTENT(IN):: pol_x(0:lp,-cmax:cmax) >real(wp), INTENT(IN):: pol_y(1:2,0:lp,-cmax:0) >real(wp), INTENT(IN):: pol_z(1:2,0:lp,-cmax:0) >real(wp), INTENT(IN):: coef_xyz(((lp+1)*(lp+2)*(lp+3))/6) >real(wp) :: coef_xy(2,(lp+1)*(lp+2)/2) >real(wp) :: coef_x(4,0:lp) > > [...] >coef_x(1:2,4)=coef_x(1:2,4)+coef_xy(1:2,12)*pol_y(1,1,jg) >coef_x(3:4,4)=coef_x(3:4,4)+coef_xy(1:2,12)*pol_y(2,1,jg) >coef_x(1:2,5)=coef_x(1:2,5)+coef_xy(1:2,13)*pol_y(1,1,jg) >coef_x(3:4,5)=coef_x(3:4,5)+coef_xy(1:2,13)*pol_y(2,1,jg) >coef_x(1:2,0)=coef_x(1:2,0)+coef_xy(1:2,14)*pol_y(1,2,jg) >coef_x(3:4,0)=coef_x(3:4,0)+coef_xy(1:2,14)*pol_y(2,2,jg) >coef_x(1:2,1)=coef_x(1:2,1)+coef_xy(1:2,15)*pol_y(1,2,jg) >coef_x(3:4,1)=coef_x(3:4,1)+coef_xy(1:2,15)*pol_y(2,2,jg) >coef_x(1:2,2)=coef_x(1:2,2)+coef_xy(1:2,16)*pol_y(1,2,jg) >coef_x(3:4,2)=coef_x(3:4,2)+coef_xy(1:2,16)*pol_y(2,2,jg) >coef_x(1:2,3)=coef_x(1:2,3)+coef_xy(1:2,17)*pol_y(1,2,jg) >coef_x(3:4,3)=coef_x(3:4,3)+coef_xy(1:2,17)*pol_y(2,2,jg) >coef_x(1:2,4)=coef_x(1:2,4)+coef_xy(1:2,18)*pol_y(1,2,jg) >coef_x(3:4,4)=coef_x(3:4,4)+coef_xy(1:2,18)*pol_y(2,2,jg) >coef_x(1:2,0)=coef_x(1:2,0)+coef_xy(1:2,19)*pol_y(1,3,jg) >coef_x(3:4,0)=coef_x(3:4,0)+coef_xy(1:2,19)*pol_y(2,3,jg) > [...] > > either it is able to interpret the short vectors as such, or it realizes > that these very short implicit loops are nevertheless favourable for > vectorization. > > Is there a trick to get gcc vectorize these loops, or is there some > technology missing for this ? > > Should I file a PR for this (this is somewhat similar to PR31079 and > PR31021)? It would be nice to have a stand-alone testcase for this, so please file a bugreport. Thanks, Richard.
Re: GCC 4.3.2 Status Report (2008-08-18)
> Although the number of open regression PRs is increasing, probably as
> more people start to use 4.3 branch, there are currently no open P1
> PRs and it does not look like there are any serious PRs open for
> regressions relative to 4.3.0 or 4.3.1.

PR rtl-opt/36998 is marked "blocker" though.

-- 
Eric Botcazou
Re: GCC 4.3.2 Status Report (2008-08-18)
On Mon, Aug 18, 2008 at 5:30 PM, Eric Botcazou <[EMAIL PROTECTED]> wrote:
>> Although the number of open regression PRs is increasing, probably as
>> more people start to use 4.3 branch, there are currently no open P1
>> PRs and it does not look like there are any serious PRs open for
>> regressions relative to 4.3.0 or 4.3.1.
>
> PR rtl-opt/36998 is marked "blocker" though.

But it is "fixed" on the branch and the trunk.

Richard.
Re: vectorizer question
> It would be nice to have a stand-alone testcase for this, so please
> file a bugreport.

I've opened PR37150 for this.

Thanks,
Joost
Re: GCC 4.3.2 Status Report (2008-08-18)
> But it is "fixed" on the branch and the trunk.

Then it should probably not be "blocker" anymore.

-- 
Eric Botcazou
Re: GCC 4.3.2 Status Report (2008-08-18)
On Mon, Aug 18, 2008 at 5:39 PM, Eric Botcazou <[EMAIL PROTECTED]> wrote:
>> But it is "fixed" on the branch and the trunk.
>
> Then it should probably not be "blocker" anymore.

"blocker" doesn't have any meaning for the release management process,
but yes, feel free to lower that.

Richard.
Re: [PATCH]: GCC Scheduler support for R10000 on MIPS
Kumba wrote:
> Richard Sandiford wrote:
>> OK otherwise.  Do you have a copyright assignment on file?
>
> Nope.  Is there something I need to fill out and e-mail to someone?

Yes there is. I'm not sure if Richard can cause them to be sent to you,
but certainly requesting copyright assignment documents on
gcc@gcc.gnu.org would work. It can often take many weeks to get them
processed, so starting as soon as possible would be a good idea.

> Do I need to put my name and the name of the author of the very
> original gcc-3.0 patch in this file as well?

It would depend on whether any of the original patch code remains. If
so, a copyright assignment from the original author would probably be
required as well (at least that is my understanding).

David Daney
Re: [PATCH][RFT] Optimization pass-pipeline re-organization [3/n]
Richard Guenther wrote:
> The most interesting pass change is the removal of the first
> DOM/phi-cprop pair.  DOM mostly deals with jump-threading at this
> place and for tramp3d catches 473 threads on top of the 2555 ones
> performed by the VRP pass that runs right before the first DOM.

Ultimately, jump threading is all I really want DOM to do long term
anyway -- the redundancy elimination performed by DOM can be better
handled elsewhere. The interesting question is whether there is
redundancy elimination left in DOM that is unique and, if not, whether
we can get the same overall effect if we remove the redundancy
elimination from DOM and instead use existing passes for redundancy
elimination. Same overall effect doesn't mean identical code, but
reasonably close and without incurring a significant compile-time hit.

Regardless, I fully support reducing the number of calls into DOM as
well as trimming the set of transformations performed by DOM.

jeff
Re: [PATCH][RFT] Optimization pass-pipeline re-organization [3/n]
On Mon, 18 Aug 2008, Jeff Law wrote:

> Richard Guenther wrote:
> >
> > The most interesting pass change is the removal of the first
> > DOM/phi-cprop pair.  DOM mostly deals with jump-threading at this
> > place and for tramp3d catches 473 threads on top of the 2555 ones
> > performed by the VRP pass that runs right before the first DOM.
>
> Ultimately, jump threading is all I really want DOM to do long term
> anyway --

Yes, I agree ...

> the redundancy elimination performed by DOM can be better handled
> elsewhere.  The interesting question is whether there is redundancy
> elimination left in DOM that is unique and, if not, whether we can get
> the same overall effect if we remove the redundancy elimination from
> DOM and instead use existing passes for redundancy elimination.  Same
> overall effect doesn't mean identical code, but

... though I am not sure (I didn't investigate) how much of the
redundancy elimination code feeds the jump threading parts. I was
thinking of moving the jump threading parts over to SCCVN instead,
given that VRP lacks capabilities with regard to symbolic conditions.

> reasonably close and without incurring a significant compile-time hit.
>
> Regardless, I fully support reducing the number of calls into DOM as
> well as trimming the set of transformations performed by DOM.

Thanks. It seems the last DOM pass doesn't do very much either, so I'll
be playing with removing that as a last step of cleaning up the
pipeline, even if that might need another run of FRE instead of DOM.

Richard.
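For readers following along in the sources: the pair in question is
registered in init_optimization_passes in gcc/passes.c. The fragment
below is only an illustration of that kind of change; apart from VRP
preceding the first DOM (stated above), the neighbouring passes are an
assumption, not taken from the actual patch.

  /* Illustrative fragment of init_optimization_passes (gcc/passes.c),
     not copied from the patch.  */
  NEXT_PASS (pass_vrp);
  /* Removed by the pipeline re-organization discussed above:
  NEXT_PASS (pass_dominator);
  NEXT_PASS (pass_phi_only_cprop);
  */
  NEXT_PASS (pass_dce);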
[graphite] Get graphite backend working again
Hi Sebastian, hi Jan, hi graphite folks,

while trying to work on graphite I saw that the last changes made code
generation fail in every case.

1. Cloog-PPL creates many unnecessary conditional statements.
=============================================================

Problem since: cloog-ppl change.

At the moment cloog-ppl generates at least one unnecessary condition in
every loop. For example, in scop-0.c:

  for (s_1=0;s_1<=T_4-1;s_1++) {
    if ((s_1 >= 0) && (s_1 <= T_4-1)) {
      for (s_3=0;s_3<=199;s_3++) {
        if ((s_1 >= 0) && (s_1 <= T_4-1) && (s_3 >= 0) && (s_3 <= 199)) {
          if ((s_1 >= 0) && (s_1 <= T_4-1) && (s_3 >= 0) && (s_3 <= 199)) {
            if ((s_1 >= 0) && (s_1 <= T_4-1) && (s_3 >= 0) && (s_3 <= 199)) {
              S5(s_1,s_3) ;
            }
          }
        }
      }
    }
  }

As the current code generation does not support conditional statements,
we fail to generate code for them.

Who will work on that one? Sebastian, do you think cloog-ppl can be
fixed easily? Or can we support conditional statements (guards) in
graphite codegen? It seems to me that most of the infrastructure is
already available.

2. Gloog expects that only one edge leads to the SCoP.
======================================================

Problem since: scop_limit_scops(), which fixed parameter detection.

As at the moment every SCoP is surrounded by a loop:

--
We limit all SCoPs to SCoPs that are completely surrounded by a loop.
Example:

  for (i            |
    {               |
      for (j        |  SCoP 1
      for (k        |
    }               |

  * SCoP frontier, as this line is not surrounded by any loop. *

  for (l            |  SCoP 2

This is necessary as scalar evolution and parameter detection need an
outermost loop to initialize parameters correctly.
--

So the SCoP entry is always a loop header that has more than one
predecessor edge, and again we fail to generate code, as graphite at
the moment expects only a single entry and a single exit edge for each
SCoP.

My thoughts about this representation: I understand that a SCoP should
be single entry/single exit, but I am not sure if the representation
with edges is the right one. As far as I can see, we have something
like this in the original code:

  1 3        <- 3 = difficult bb
  |/
  2          <- 2 = condition header   -
  |\                                   |
  | 4  5     <- 5 = loop header        |
  | |  |\                              |  SCoP
  | |  |/                              |  2 -> (8)
  | |  6                               |
  | |  |                               |
  | |  7                               -
  |  \/
  8          <- 8 = difficult bb

So we have {2, 4, 5, 6, 7} in our SCoP. The single entry point is
"bb_2"; this means that in the original code all entry edges {1->2,
3->2} lead to "bb_2". The single exit point is "bb_8", as in the
original code all edges leaving the SCoP {4->8, 7->8} lead to "bb_8".

In graphite, however, we freely move around all bbs in the SCoP and
generate a new CFG part:

  20      <- newly generated to regenerate control flow
  |\
  | 5
  | |\
  | |/
  | 6
  | |
  | 7
  |/
  21      <- newly generated to regenerate control flow
  |\
  | 4
  |/
  2

So how do we connect this? We need to know all bbs that lead to the
first bb in the new CFG part (here bb_20); these are scop->entry->preds
(here bb_2->preds = {bb_1, bb_3}). And we have to know what follows the
last bbs of the new CFG; the next bb will be scop->exit (here bb_8).

So I think a single edge is not sufficient to represent the SCoP entry.
We could use a VEC of entry edges (here {1->2, 3->2}), or a VEC of bbs
leading to the SCoP (here {1, 3}), or we keep the current solution and
use the first bb in the original code (here bb_2) and its list of
predecessors (here bb_2->preds = {1->2, 3->2}). The exit seems to be
easy: we just need a basic block to jump to, so we keep this bb in
scop->exit (here bb_8).

Is someone working on this one? Otherwise I will think about it.

See you
Tobias
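To make the first option concrete, here is a hedged sketch of what a
"VEC of entry edges" representation could look like. The structure
layout, field names, and helper are illustrative only, not graphite's
actual code; the usual GCC source prologue and basic-block.h/vec.h are
assumed.

  /* Sketch of a scop representation with a vector of entry edges
     instead of a single entry edge.  Names are made up.  */
  typedef struct scop
  {
    VEC (edge, heap) *entry_edges;  /* every edge entering the SCoP,
                                       e.g. {1->2, 3->2} above        */
    basic_block exit;               /* single exit block, e.g. bb_8   */
    /* ... bbs, scattering, ... */
  } *scop_p;

  /* Redirect all recorded entry edges to the first bb of the newly
     generated CFG part (bb_20 in the example above).  */
  static void
  connect_new_entry (scop_p s, basic_block new_entry)
  {
    unsigned i;
    edge e;

    for (i = 0; VEC_iterate (edge, s->entry_edges, i, e); i++)
      redirect_edge_and_branch (e, new_entry);
  }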
RE: [graphite] Get graphite backend working again
> As the current code generation does not support conditional
> statements, we fail to generate code for them.
>
> Who will work on that one?

I am working on this.

- Jan
Re: [PATCH]: GCC Scheduler support for R10000 on MIPS
David Daney wrote:
> Yes there is.  I'm not sure if Richard can cause them to be sent to
> you, but certainly requesting copyright assignment documents on
> gcc@gcc.gnu.org would work.  It can often take many weeks to get them
> processed, so starting as soon as possible would be a good idea.

I'll submit a request outside of this thread later on then. Thanks for
the info! As far as processing time goes, will that impact getting this
patch committed? My understanding is that the deadline for new features
in gcc-4.4 is the end of August.

> It would depend on whether any of the original patch code remains.  If
> so, a copyright assignment from the original author would probably be
> required as well (at least that is my understanding).

That's an interesting thought. I largely left his code intact while I
used this patch throughout gcc-3.2, 3.3, and 3.4 (I simply updated it
to apply to the subsequent releases in the 3.x series). I only
converted it to DFA in gcc-4.0, which was largely a complete rewrite,
using the original patch as a guide to learn how DFA changed things
around. Plus, the core information comes from the Vr1 manual anyway,
available from www.necel.com. Considering the very original patch was
posted to gcc-patches years ago for gcc-3.0 consideration, I would be
surprised if the e-mail address of that submission still reaches him.

Thoughts?

-- 
Joshua Kinard
Gentoo/MIPS
[EMAIL PROTECTED]

"The past tempts us, the present confuses us, the future frightens us.
And our lives slip away, moment by moment, lost in that vast, terrible
in-between." --Emperor Turhan, Centauri Republic
RE: [graphite] Get graphite backend working again [scalar variable handling]
Hi Jan,

On Mon, 2008-08-18 at 17:37 -0500, Sjodin, Jan wrote:
> > As the current code generation does not support conditional
> > statements, we fail to generate code for them.
> >
> > Who will work on that one?
>
> I am working on this.

Great!

I would like to improve the way we handle scalar variables and ivs
during the graphite transformation. I am not sure if I got it right and
where to put my code in the backend, so it would be great if you could
read over my ideas.

The problem:

In graphite we represent the complete loop structure using polyhedrons
to apply our transformations. To go back to gimple we have to generate
new loop structures and delete the old loop structures. One step is to
remove the old ivs and generate new ones.

Example:

1. Original bb:

  D.1918_7 = a[i_22];
  D.1919_8 = D.1918_7 + 1;
  a[j_21] = D.1919_8;
  j_9 = j_21 + 1;

2. Add new ivs:

  graphiteIV_1 = PHI (0, graphiteIV_2)
  D.1918_7 = a[i_22];
  D.1919_8 = D.1918_7 + 1;
  a[j_21] = D.1919_8;
  j_9 = j_21 + 1;
  graphiteIV_2 = graphiteIV_1 + 1

3. Move iv usages to the new ivs:

  graphiteIV_1 = PHI (0, graphiteIV_2)
  D.1918_7 = a[i_22];
  D.1919_8 = D.1918_7 + 1;
  a[graphiteIV_1] = D.1919_8;
  j_9 = j_21 + 1;
  graphiteIV_2 = graphiteIV_1 + 1

4. Remove the old ivs (here {j_9, j_21}):

  D.1918_7 = a[i_22];
  D.1919_8 = D.1918_7 + 1;
  a[graphiteIV_1] = D.1919_8;

The problem seems to become more complicated if two or more ivs exist.

How I would like to solve it:
=============================

During the gimple->graphite transformation we ignore all scalar
variables that we can represent using scev (these include all the ivs).
Here {D.1918_7, D.1919_8} are not representable, as they depend on the
value of a[i_22]; therefore we cannot ignore them. But {j_9, j_21} are
representable and we ignore them. So during graphite all ivs are
virtually removed.

The next step is during code generation: we remove all scalar variables
we ignored before and recreate, for the places where they were used,
new scalar variables that depend on the new ivs.

Why remove all and not just the ivs? There are two points:

1. Detecting ivs is a lot of work, so I prefer this general algorithm,
   as I hope it is easier to implement and runs faster.

2. If we can ignore the scalar variables, we can also ignore their
   dependencies, so we have more freedom to schedule the bbs.

See you
Tobi

P.S.: To start working on my code I would like to get block-0.c
running. Jan, can I help you with anything?
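The "can scev represent it" test could be built on the existing
scalar-evolution API. Below is a rough sketch only: the helper name is
made up, the usual GCC source prologue is assumed, and a real
implementation would need additional checks, e.g. that the instantiated
evolution does not reduce to a value defined inside the SCoP (the
D.1918_7 case above).

  /* Hedged sketch, not graphite's actual code: decide whether a scalar
     SSA name can be ignored during the gimple -> graphite translation
     because scalar evolution can reconstruct it at code-generation
     time.  */
  #include "cfgloop.h"
  #include "tree-chrec.h"
  #include "tree-scalar-evolution.h"

  static bool
  scalar_is_scev_representable_p (struct loop *loop, tree var)
  {
    tree scev = analyze_scalar_evolution (loop, var);

    scev = instantiate_parameters (loop, scev);
    return !chrec_contains_undetermined (scev);
  }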