Hi, During the last weeks I have experimented a bit with the AVR back-end. IMO there presently are two areas where it is worth to concentrating efforts on: 1.) cc0 -> CCmode transition 2.) splitting of HI- and SI-mode operations so that the RTL finally gets some similarity with the actually existing AVR instructions.
I'd like to discuss with you on how to address these issues best because I think it's better to have a good plan of what to do before starting programing. Concerning 1.) Ian Lance Taylor has made a couple of suggestions on how to make the transition easier for the back-end maintainers. So it seems that there is already some activity around. For issue 2.) Richard Henderson has recently posted a patch that implements a "subreg-lowering" pass run directly after expand that gives the register allocator the freedom to handle and allocate the individual subreg expressions individually (could give some tremendous improvements for AVR, e.g. when dealing with expressions combining expressions with different modes). IMO the two issues 1.) and 2.) are somewhat linked since, IMO it would be a good idea to implement a cc0->CCmode transition method that does not make it difficult to use Richard Henderson's patch later-on. 1.) Concerning the cc0 -> CCmode transition, IIUC, there are two main problems: A) We must make sure the reload does not insert condition-code (CC) clobbering instructions in between a CC setter and a CC user (i.e. conditional branches). B) We must find a way to teach GCC to re-use as frequently as possible the condition codes that are set by arithmetic and logic operations. IMO, the easiest and most straight-forward solution of A) for the avr target would be to follow Ian's suggestion: > Ian Lance Taylor wrote: > >> Is there a way that makes it possible that only reload uses the patterns >> that save and restore the condition code while everywhere else the usual >> patterns are generated? >> In case of the AVR target, most of the time reload will be >> able to use processor instructions that do not require the save/restore >> operations (it could freely access addresses up to a constant offset of 64 >> from the stack pointer) and the costly operations would not show up very >> frequently (only for large offsets). > >You could do it by writing your insn patterns as expanders which check >reload_in_progress || reload_completed and under those circumstances >emit insns which do not clobber the condition codes. > >Ian I would expect that in the rare case that reload needs to access data beyond the 64-Byte boundary it could be tolerated to expand the memory access to a sequence of type "(set (temp_register) (SR)) (memory_move_instruction_instruction_inserted_by_reload) (set (SR) (temp_register))" One would have to confirm that, but I assume that the remaining passes after reload would be sufficiently smart to optimize away the two copy operations for saving and restoring the status register in case that they are not necessary.? E.g. I think that it is justified to assume that if above reload-generated memory access is followed by an instruction that clobbers the condition code, both of the status register operations that are embracing the memory move would be deleted. Possibly these two instructions would also be deleted, if the memory move instruction does not clobber the condition code. Doing this, I think that one could resolve the most difficult issue A) of cc0->CCmode conversion without having to face what has been called "the combinatorical pattern explosion" in the mailing list archives. Concerning issue B) Ian Lance Taylor suggests, IIUC, to implement a new pass after reload: After reload, it is known which alternative of the different instructions that possibly set CC has been chosen and one could find out whether some of the compare instructions could be deleted. IMO, if this new pass is available by the time the cc0->CCmode transition is implemented for AVR, one should probably try to use it. Meanwhile I'd like to suggest an approach that tries to remove the unnecessary compare instructions already before reload. I.e. what I am thinking about is to try to use the (G)CSE passes to identify situations where arithmetic instructions calculate the same condition code that is again calculated later-on by compare or test instructions. Disadvantage would be that at this time one would not be able to know which alternative of an instruction will be chosen by reload. In my present opinion, however, this is an issue that is not a very big problem for AVR. Also, IMO, one should probably try hard to identify the "single-bit-test-and-branch" patterns before reload. The condition-code re-use issue is the point, where, IMO, the link to the subreg-lowering 2.) shows up. After, e.g., breaking down a HI mode "sub" operation into two QI mode "sub" and "sub-with-carry"s at expand, I consider it to be extremely difficult to make the mid-end smart enough to identify that a the end of the QI "sub-with-carry" the condition code is set according to the corresponding HImode substract operation. For DImode operations the mid-end would already need to take 8 (!) Instructions into account for finding out what the calculated condition code actually represents. This, also, will be a major difficulty when considering Ian's suggested optimizer pass after reload. IMO condition code re-use will also be an issue for the bigger targets (e.g. x86) when dealing with DI mode operations. My suggestion to address this issue is: Let's not try to make the mid-end hyper smart, but let's give the back-end designer a possibility to guide the mid-end by inserting helpful information in the RTL. In order to illustrate what I am thinking about, let's take above example with the minus:HI. I'd suggest to expand a subhi3 operation with 3 register operands into a sequence like ; copying the input operands in individual QImode variables (set (reg:QI 1) (subreg:QI (operands[1]) 0)) (set (reg:QI 2) (subreg:QI (operands[1]) 1)) (set (reg:QI 3) (subreg:QI (operands[2]) 0)) (set (reg:QI 4) (subreg:QI (operands[2]) 1)) ; first add (parallel [ (set (reg:QI 5)(minus:QI (reg:QI 1) (reg:QI 3)) (set (reg:CC_carry SR) (unspec:CC_carry [(reg:QI 1) (reg:QI 3)] carry_from_sub_code))]) ; second add (with carry) (parallel [ (set (reg:QI 6) (minus:QI (minus:QI (reg:QI 2) (reg:QI 4)) (reg:CC_carry SR)) (set (reg:CC_cc SR) (unspec:CC_cary [(reg:QI 1) (reg:QI 3) (reg:CC_carry SR)] carry_from_sub_with_carry_code))]) ; write back the result into the output operands (set (subreg:QI (operands[0]) 0) (reg:QI 5)) (set (subreg:QI (operands[0]) 1) (reg:QI 6)) ; Additional "Marker" instructions to be used by GCSE (parallel [ (use (reg:CC_cc SR)) (set (reg:CC_cc SR) (compare:HI (operands[1]) (operands[2])) (note "please delete the entire embracing parallel instruction before register life-time analysis by a new pass: It pretends to use operands 1 and 2 while in fact this instruction does nothing except from giving hints to GCSE.") ]) (parallel [ (use (reg:CC_cc SR)) (set (reg:CC_cc SR) (compare:HI (operands[0]) (const_0)) (note "please delete the entire embracing parallel instruction before register life-time analysis by a new pass: It pretends to use operand[0] while in fact this instruction does nothing except from giving hints to GCSE.") ]) (parallel [ (use (operands[0])) (set (operands[0]) (minus:HI (operands[1]) (operands[2])) (note "please delete the entire embracing parallel instruction before register life-time analysis by a new pass: It pretends to use operands 1 and 2 while in fact this instruction does nothing except from giving hints to GCSE.") ]) IIUC, (g)cse would be well adapted to deal with the probably unnecessary move instructions without any serious problems. It also will be able to identify if a later compare instruction calculates the same condition code as the minus operation had calculated. When having to choose between (a) teaching the mid-end to understand what add/substract-with-carry instructions actually do and (b) using a method like above to give hints to the present mid-end, (b) is my favorite approach. The simplified pass sequence before reload would then look similar to - expand - Richard's subreg lowering pass - gcse and friends - ( possible place 1 for marker instruction removal ) - combine - ( possible place 2 for marker instruction removal ) - register allocation and reload I know that the RTL generated by the back-end will get much more complex in comparison to the present case. But IMO this would be part of the price one always will have to pay when exposing the complexity of the subreg-lowering to the RTL optimizers. In order to be able to identify the case of single-bit tests, it might be possible to implement a second to the optimizing procedure: By using a special optimizer switch it would be nice if one could change the pass flow before reload by - expand - gcse and friends - combine - ( marker instruction removal ) - Richard's subreg lowering pass - gcse and friends -re run - combine -re-run - register allocation and reload This way, it would be possibly easier for the combine pass to find out whether a conditional branch in fact is a "branch on a single bit condition" since it would still find the information in the RTL which kind of original aggregate-mode expressions (i.e. HImode and larger) correspond to the zoo of atomic (i.e. QImode) instructions expand had generated. I think that in case that above ideas work, it will be the AVR-back-end that would benefit most. But I also think that a similar approach could be very for all of the other targets that have only 16 bit instructions, e.g. like MSP430 and HC12. Possibly a similar procedure might also be helpful for the bigger targets when dealing with DI mode expressions! So I am having hope that addressing this issue justifies changes in gcc's mid end. Hoping to receive many comments and additional ideas from more competent people ... best regards Björn