On Wed, Jul 6, 2011 at 4:28 PM, William J. Schmidt
<wschm...@linux.vnet.ibm.com> wrote:
> On Wed, 2011-07-06 at 15:16 +0200, Richard Guenther wrote:
>> On Tue, Jul 5, 2011 at 3:59 PM, William J. Schmidt
>> <wschm...@linux.vnet.ibm.com> wrote:
>> > (Sorry for the late response; yesterday was a holiday here.)
>> >
>> > On Mon, 2011-07-04 at 16:21 +0200, Richard Guenther wrote:
>> >> On Thu, Jun 30, 2011 at 4:39 PM, William J. Schmidt
>> >> <wschm...@linux.vnet.ibm.com> wrote:
>> >> > This is the first of three patches related to lowering addressing
>> >> > expressions to MEM_REFs and TARGET_MEM_REFs in late gimple. This patch
>> >> > contains the new pass together with supporting changes in existing
>> >> > modules. The second patch contains an independent change to the RTL
>> >> > forward propagator to keep it from undoing an optimization made in the
>> >> > first patch. The third patch contains new test cases and changes to
>> >> > existing test cases.
>> >> >
>> >> > Although I've broken it up into three patches to make the review easier,
>> >> > it would be best to commit at least the first and third together to
>> >> > avoid regressions. The second can stand alone.
>> >> >
>> >> > I've done regression tests on powerpc64 and x86_64, and have asked
>> >> > Andreas Krebbel to test against the IBM z (390) platform. I've done
>> >> > performance regression testing on powerpc64. The only performance
>> >> > regression of note is the 2% degradation to 188.ammp due to loss of
>> >> > field disambiguation information. As discussed in another thread,
>> >> > fixing this introduces more complexity than it's worth.
>> >>
>> >> Are there also performance improvements? What about code size?
>> >
>> > Yes, there are performance improvements. I've been running cpu2000 on
>> > 32- and 64-bit powerpc64. Thirteen tests show measurable improvements
>> > between 1% and 9%, with 187.facerec showing the largest improvements for
>> > both 32 and 64.
>> >
>> > I don't have formal code size results, but anecdotally from code
>> > crawling, I have seen code size either neutral or slightly improved.
>> > The largest code size improvements I've seen were on 32-bit code where
>> > the commoning allowed removal of a number of sign-extend and zero-extend
>> > instructions that were otherwise not seen to be redundant.
>>
>> >> I tried to get an understanding of what kind of optimizations this patch
>> >> produces based on the set of testcases you added, but I have a hard
>> >> time here. Can you outline some please?
>> >>
>> >
>> > The primary goal is to clean up code such as is shown in the original
>> > post of PR46556. In late 2007 there were some changes made to address
>> > canonicalization that caused the code gen to be suboptimal on powerpc64.
>> > At that time you and others suggested a pattern recognizer prior to
>> > expand as probably the best solution, similar to what IVopts is doing.
>>
>> The PR46556 case looks quite simple.
>
> It certainly is. I was personally curious whether there were other
> suboptimal sequences that might be hiding out there, that a more general
> approach might expose. There was a comment at the end of the bugzilla
> about a pass to expose target addressing modes in gimple for this
> purpose. When I first started looking at this, I looked for some
> feedback from the community about whether that should be done, and got a
> few favorable comments along with one negative one. So that's how we
> got on this road...
>
>>
>> > By using the same mem_ref generation machinery used by IVopts, together
>> > with local CSE, the goal was to ensure base registers are properly
>> > shared so that optimal code is generated, particularly for cases of
>> > shared addressability to structures and arrays. I also observed cases
>> > where it was useful to extend the sharing across the dominator tree.
>>
>> As you are doing IV selection per individual statement only, using
>> the affine combination machinery looks quite a big hammer to me.
>> Especially as it is hard to imagine what the side-effects are, apart
>> from re-associating dependencies that do not fit the MEM-REF and
>> making the MEM-REF as complicated as permitted by the target.
>>
>> What I thought originally when suggesting to do something similar
>> to IVOPTs was to build a list of candidates and uses and optimize
>> that set using a cost function similar to how IVOPTs does.
>>
> OK, reading back I can see that now...
>
>> Doing addressing-mode selection locally per statement seems like
>> more a task for a few pattern matchers, for example in tree-ssa-forwprop.c
>> (for its last invocation). One pattern would be that of PR46556,
>> MEM[(p + ((n + 16)*4))], which we can transform to
>> TARGET_MEM_REF[x + 64] with x = p + n*4 if ((n + 16)*4) was
>> single-use. The TARGET_MEM_REF generation can easily
>> re-use the address-description and target-availability checks from
>> tree-ssa-address.c. I would be at least interested in whether
>> handling the pattern from PR46556 alone (or maybe with a few
>> similar other cases) is responsible for the performance improvements.
>>
> Hm, but I don't think forwprop sees the code in this form. At the time
> the last pass of forwprop runs, the gimple for the original problem is:
>
>   D.1997_3 = p_1(D)->a[n_2(D)];
>   D.1998_4 = p_1(D)->c[n_2(D)];
>   D.1999_5 = p_1(D)->b[n_2(D)];
>   foo (D.1997_3, D.1998_4, D.1999_5);
>
> But perhaps this is just a matter of changing the patterns to recognize?
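For context, a testcase along these lines reproduces the GIMPLE quoted
above. It is a reconstruction from the dump and from the
MEM[(p + ((n + 16)*4))] pattern cited earlier, not copied from the PR;
the int element type and the array bound of 16 are assumptions, chosen
so that p->b[n] sits at byte offset (n + 16)*4:

  /* Reconstructed PR46556-style testcase; field types and array bound
     are assumptions, not taken from the PR itself.  */
  struct x
  {
    int a[16];
    int b[16];
    int c[16];
  };

  extern void foo (int, int, int);

  void
  bar (struct x *p, int n)
  {
    foo (p->a[n], p->c[n], p->b[n]);
  }

All three accesses share the base p and the index n, which is what makes
the base-register sharing discussed above profitable.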
Ah, so we still have the ARRAY_REFs here. Yeah, well - then the issue
boils down to get_inner_reference canonicalizing the offset according to
what fold-const.c implements, and we simply emit the offset in that form
(if you look at the normal_inner_ref: case in expand_expr_real*). Thus,
with the above form at expansion time, the matching would be applied
there (or during our RTL address-generation in fwprop.c ...).

Another idea is of course to lower all handled_component_p memory
accesses - something I did on the old mem-ref branch and I believe I
also suggested at some point. Without such lowering, none of the address
generation is exposed to the GIMPLE optimizers. The simplest way of
doing the lowering is to do something like

  base = get_inner_reference (rhs, ..., &bitpos, &offset, ...);
  base = fold_build2 (POINTER_PLUS_EXPR, ..., base, offset);
  base = force_gimple_operand (base, ... is_gimple_mem_ref_addr);
  tem = fold_build2 (MEM_REF, TREE_TYPE (rhs), base,
                     build_int_cst (get_alias_ptr_type (rhs), bitpos));

at some point. I didn't end up doing that because of the loss of alias
information.
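To make the quoted sketch concrete, a fleshed-out version might look
like the following. This is only an illustration against the 4.6-era
internal APIs, not the pass under review: the helper name lower_mem_ref
is invented, reference_alias_ptr_type is my reading of the
get_alias_ptr_type shorthand above, and the bit position is assumed to
be byte-aligned (bitfields would need extra care):

  /* Illustrative lowering of a handled_component_p reference REF to a
     MEM_REF, inserting address computations before GSI.  A sketch
     only; error handling and the bitfield case are omitted.  */

  static tree
  lower_mem_ref (tree ref, gimple_stmt_iterator *gsi)
  {
    HOST_WIDE_INT bitsize, bitpos;
    tree offset, base, addr, mem;
    enum machine_mode mode;
    int unsignedp, volatilep;
    gimple_seq stmts = NULL;

    /* Split REF into a base object, a variable byte offset, and a
       constant bit position.  */
    base = get_inner_reference (ref, &bitsize, &bitpos, &offset,
                                &mode, &unsignedp, &volatilep, false);

    /* Take the address of the base and add in the variable part.  */
    addr = build_fold_addr_expr (base);
    if (offset)
      addr = fold_build2 (POINTER_PLUS_EXPR, TREE_TYPE (addr),
                          addr, fold_convert (sizetype, offset));

    /* Gimplify the address so it is valid as a MEM_REF operand; the
       preparation statements go before the statement being lowered.  */
    addr = force_gimple_operand (addr, &stmts, true, NULL_TREE);
    gsi_insert_seq_before (gsi, stmts, GSI_SAME_STMT);

    /* Fold the constant byte offset into the MEM_REF, keeping the
       alias pointer type of the original reference.  */
    mem = fold_build2 (MEM_REF, TREE_TYPE (ref), addr,
                       build_int_cst (reference_alias_ptr_type (ref),
                                      bitpos / BITS_PER_UNIT));
    return mem;
  }

The last step is where the loss of alias information mentioned above
shows up: once the component path is flattened to a pointer and a byte
offset, field-based disambiguation is gone.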
> Without running experiments yet, my guess is that this would pick up
> some of the benefit, but not all. As you say, the effects are difficult
> to predict; I've seen some surprisingly good results from CSEing some
> complex addressing, but I also had to implement several tweaks to avoid
> detrimental effects (such as stepping on what IVopts already got right).
> It may be that some of the surprising positives (such as removal of
> extra sign-extend instructions) indicate some cleanup work that should
> be done in the back end instead.

Indeed - or at some other point in the tree middle-end. For example, we
do not re-associate POINTER_PLUS_EXPR at all, which loses quite some CSE
opportunities for addresses (tree-ssa-reassoc.c would be the place to
enhance).

>> Ideally we'd of course have a cost-driven machinery that considers
>> (part of) the whole function.
>>
>> > Secondarily, I noticed that once this was cleaned up we still had the
>> > suboptimal code generation for the zero-offset mem refs scenario
>> > outlined in the code commentary. The extra logic to clean this up
>> > helps keep register pressure down. I've observed some spill code
>> > reduction when this is in place. Again, using expression availability
>> > from dominating blocks helps here in some cases as well.
>>
>> Yeah, the case is quite odd and doesn't really fit existing optimizers
>> given that the CSE opportunity is hidden within the TARGET_MEM_REF ...
>>
>> >> I still do not like the implementation of yet another CSE machinery
>> >> given that we already have two. I think most of the need for CSE
>> >> comes from the use of the affine combination framework and
>> >> force_gimple_operand. In fact I'd be interested to see cases that
>> >> are optimized that could not be handled by a combine-like
>> >> pattern matcher?
>> >>
>> >
>> > I'm somewhat puzzled by this comment. Back on 2/4, I posted a skeletal
>> > outline for this pass and asked for comments. You indicated a concern
>> > that the affine combination expansion would un-CSE a lot of
>> > expressions, and that I should keep track of local candidates during
>> > this pass to ensure that base registers, etc., would be properly
>> > shared. It seemed to me that a quick and dirty CSE of addressing
>> > expressions would be the best way to handle that. I posted another
>> > RFC on 2/24 with an early implementation of CSE constrained to a
>> > single block, and indicated shortly thereafter my intent to extend
>> > this to a dominator walk. I didn't receive any negative comments
>> > about using CSE at that time; but then I didn't get much response
>> > at all.
>>
>> Sorry about that - I agree that doing a local CSE is one possible
>> solution, but given that we are only considering a statement at a time
>> it looks like quite a bloated approach. A local CSE capability would
>> indeed be useful for other passes in GCC as well, for example loop
>> unrolling, which exposes lots of garbage to later passes. I thought of
>> re-using the machinery that DOM provides, avoiding yet another CSE
>> implementation.
>>
>> >
>> > I probably should have pushed harder to get comments at that point,
>> > but I was very new to the community and was feeling my way through
>> > the protocols; I didn't feel comfortable pushing.
>>
>> And pushing doesn't always help either.
>
> That's been my general experience; I try to be careful about nagging
> busy people. I think my timing was poor at the beginning of this
> project, as you guys were quite busy with closing off the 4.6 release.
>
>>
>> > In any case, yes, the primary need for CSE is as a result of the
>> > affine combination framework, which creates a large amount of
>> > redundant code. I can certainly take a look at Michael's suggestion
>> > to move the pass earlier and allow pass-dominator to handle the
>> > cleanup, and add the zero-offset mem-ref optimization to that or a
>> > subsequent pass. Do you agree that this would be a preferable
>> > approach?
>>
>> It would definitely be preferable to a new CSE implementation. I'm not
>> sure the zero-offset optimization easily fits in DOM though.
>
> I took a look at this yesterday and I think it fits easily enough; about
> the same as it fit into my prototype pass. The available expressions
> logic is pretty similar to what I was doing, so transplanting the code
> wouldn't be difficult. The only issue I saw is that there isn't a pure
> lookup function that doesn't place an expression on the avail-exprs
> stack, but that's not hard to add.
>
>> What I'd really like to better understand is what transformations are
>> the profitable ones and if we can handle them statement-locally
>> within our combine machinery (thus, tree-ssa-forwprop.c). I can
>> cook up a forwprop patch to fix PR46556 in case you are interested
>> in more details.
>>
> Thanks...let me study forwprop a bit first -- I'm trying to get my head
> wrapped around as much of the middle end as I can, and this is a good
> excuse to delve into another piece.
>
> I think perhaps the best way forward may be to use my prototype as a
> baseline to compare against additions to the combine logic, so we can
> tease out what some of the unexpected gains are, and think about how
> best to achieve them more simply.

Yeah. I will resurrect the simple pass I had to lower all memory
accesses; that will give us another toy to play with.

Thanks,
Richard.

> I'll have a look at forwprop and pester you (lightly :) if I have
> questions.
>
> Thanks,
> Bill
>
>> Thanks,
>> Richard.
>>
>> >> Thanks,
>> >> Richard.