On Tue, Sep 24, 2013 at 5:31 AM, Jan Hubicka <hubi...@ucw.cz> wrote:
>> Hi Honza,
>>
>> I am finally getting back to working on this after a few weeks of
>> working on some other priorities.
>
> I am also trying to return to this, so good timing ;)
> Martin has got smaller C++ programs (Inkscape) to not touch the cold
> segment during startup with FDO (w/o partitioning). Firefox still does; I
> think the problem is lost samples due to different linker decisions even
> with LTO (i.e. the linker picks an object from a .a library at
> profile-generate time that it never picks later).
>
> I plan to look into that today.
>>
>> Did you mean to commit the above change? I see that it went in as part
>> of r202258 but doesn't show up in the ChangeLog entry for that
>> revision.
>
> Yes, I meant to check it in, but did not mean to do so w/o a ChangeLog.
> I will fix that.
>>
>> >
>> > In other cases it was mostly loop unrolling in combination with jump
>> > threading. So I modified my script to separately report when failure
>> > happens for a test trained once and for a test trained a hundred times.
>>
>> Thanks for the linker script. I reproduced your results. I looked at a
>> couple of cases. The first was one that failed after 1 training run only
>> (20000910-2.c). It was due to jump threading, which you noted was a
>> problem. For this one I think we can handle it in the partitioning,
>> since there is an FDO insanity that we could probably treat more
>> conservatively when splitting.
>
> We should fix the roundoff issues - when I was introducing the
> frequency/probability/count system I made the assumption that parts of
> programs with very low counts do not matter, since they are not part of a
> hot spot (and I cared only about the hot spot). Now we care about
> identifying unlikely executed spots and we need to fix this.
>>
>> I looked at one that failed after 100 as well (20031204-1.c). In this
>> case, it was due to expansion, which was creating multiple branches/bbs
>> from a logical OR and guessing incorrectly on how to assign the
>> counts:
>>
>> if (octets == 4 && (*cp == ':' || *cp == '\0')) {
>>
>> The (*cp == ':' || *cp == '\0') part looked like the following going
>> into RTL expansion:
>>
>> [20031204-1.c : 31:33] _29 = _28 == 58;
>> [20031204-1.c : 31:33] _30 = _28 == 0;
>> [20031204-1.c : 31:33] _31 = _29 | _30;
>> [20031204-1.c : 31:18] if (_31 != 0)
>>   goto <bb 16>;
>> else
>>   goto <bb 19>;
>>
>> where the result of the OR was always true, so bb 16 had a count of
>> 100 and bb 19 a count of 0. When it was expanded, the expanded version
>> of the above turned into 2 bbs with a branch in between. Both
>> comparisons were done in the first bb, but the first bb checked
>> whether the result of the *cp == '\0' compare was true, and if not
>> branched to the check for whether the *cp == ':' compare was true. It
>> gave the branch to the second check against ':' a count of 0, so that
>> bb got a count of 0 and was split out, and it put the count of 100 on
>> the fall-through, assuming the compare with '\0' always evaluated to
>> true. In reality, this OR condition was always true because *cp was ':',
>> not '\0'. Therefore, the count of 0 on the second block with the check
>> for ':' was incorrect, we ended up trying to execute it, and failed.
>>
>> Presumably we had the correct profile data for both blocks, but the
>> accuracy was reduced when the OR was represented as a logical
>> computation with a single branch.
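To make the count-assignment issue concrete, here is a minimal standalone sketch (plain C++, not GCC's expansion code; the function name and the simple scaling are assumptions made only for illustration) of how the profiled count of the combined OR gets distributed over the two tests that expansion creates. It contrasts the "first compare always decides" guess that produced the zero-count block with the 50-50 guess suggested just below.

// Standalone illustration (not GCC internals): distributing the profiled
// count of the combined OR across the two tests that expansion creates.
#include <cstdio>

struct SplitCounts {
  long long first_test_true;   // fall-through: first compare assumed true
  long long second_test_count; // count left on the block doing the 2nd compare
};

// or_true_count: how often the combined condition (_31) was true (here 100).
// p_first: guessed probability that the first test alone already decides it.
static SplitCounts split_or_count(long long or_true_count, double p_first) {
  SplitCounts s;
  s.first_test_true = static_cast<long long>(or_true_count * p_first);
  s.second_test_count = or_true_count - s.first_test_true;
  return s;
}

int main() {
  // The guess in the failing case: the '\0' compare always decides, so the
  // ':' block gets a count of 0 and is split out as cold, even though at
  // run time *cp was ':' and that block was executed every time.
  SplitCounts bad = split_or_count(100, 1.0);
  // A 50-50 guess keeps a non-zero count on both blocks for this count.
  SplitCounts even = split_or_count(100, 0.5);
  std::printf("always-first guess: fallthrough=%lld second=%lld\n",
              bad.first_test_true, bad.second_test_count);
  std::printf("50-50 guess:        fallthrough=%lld second=%lld\n",
              even.first_test_true, even.second_test_count);
  return 0;
}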
>> We could change the expansion code to do something different, e.g.
>> treat it as a 50-50 branch. But we would still end up with integer
>> truncation issues when there was a single training run. That could be
>> dealt with conservatively in the bbpart code as I suggested for the
>> jump threading issue above, i.e. a cold block with incoming non-cold
>> edges would conservatively not be marked cold for splitting.
>>
>> >
>> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20000422-1.c
>> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20000910-2.c
>> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20020413-1.c
>> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20030903-1.c
>> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20031204-1.c
>> > FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20031204-1.c
>> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20060420-1.c
>> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20060905-1.c
>> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20120427-1.c
>> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20120427-2.c
>> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20120808-1.c
>> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20121108-1.c
>> > FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20121108-1.c
>> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/920501-6.c
>> > FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/920501-6.c
>> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/920726-1.c
>> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/981001-1.c
>> > FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/981001-1.c
>> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/990628-1.c
>> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/991216-2.c
>> > FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/991216-2.c
>> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/cmpdi-1.c
>> > FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/cmpdi-1.c
>> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/float-floor.c
>> > FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/float-floor.c
>> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/pr33870-1.c
>> > FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/pr33870-1.c
>> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/pr33870.c
>> > FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/pr33870.c
>> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/pr36093.c
>> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/pr37573.c
>> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/pr43784.c
>> > FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/pr43784.c
>> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/switch-1.c
>> > FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/switch-1.c
>> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/va-arg-22.c
>> > FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/va-arg-22.c
>> >
>> > FAIL1 is failure after one run, FAIL is failure after 100 training runs.
>> > We should take a look at the FAILs and see if there are bugs to fix.
>> > For FAIL1 I think it is kind of a design problem: while implementing
>> > counts & frequencies the idea was that small counts do not matter, so
>> > integer arithmetic is all right.
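The FAIL1 class above comes down to exactly this kind of roundoff. A minimal standalone sketch (not GCC code; the base of 10000 and the truncating division are illustrative assumptions) of why a single training run is enough to lose a count entirely:

// Standalone illustration (not GCC code) of the roundoff problem with
// integer counts: probabilities are integer fractions of a fixed base, and
// counts are apportioned with truncating integer division.
#include <cstdio>

const long long PROB_BASE = 10000;

// Count that flows along an edge taken with probability prob/PROB_BASE.
static long long apply_probability(long long count, long long prob) {
  return count * prob / PROB_BASE; // truncates
}

int main() {
  // With 100 training runs a 50-50 split is merely a little lossy ...
  std::printf("count 100, p=50%%: %lld\n",
              apply_probability(100, PROB_BASE / 2));
  // ... but with a single training run both halves truncate to 0, so a
  // block that really executes looks never-executed and may be split out
  // as cold.
  std::printf("count 1,   p=50%% (edge A): %lld\n",
              apply_probability(1, PROB_BASE / 2));
  std::printf("count 1,   p=50%% (edge B): %lld\n",
              apply_probability(1, PROB_BASE - PROB_BASE / 2));
  return 0;
}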
>> > I wonder if with the current C++ wonderland we can't simply switch
>> > counts to a better representation - either sreal or fixed point with
>> > capping (the integer overflow issues are tiring, too).
>>
>> It also seems like we should be able to detect the profile insanities
>> caused by integer truncation and handle them conservatively. That
>> being said, I see some sreal uses already in the profile.c code, so
>> presumably we could use this for the counts as well if it turns out to
>> be necessary?
>
> Yes, I was thinking about this, too. We would need to do some evaluation
> of the compile-time implications, but switching counts from gcov_type to
> a profile_counter_t that is a typedef to sreal seems a sane idea.
>
> We could switch the CFG code first; there should not be many hot spots
> where counts are involved. We can turn the common calculations we already
> moved into macros into out-of-line functions.
>
> We will also need to invent a REG representation for them. Now we have
> INT_LIST for that; we could have an SREAL list and introduce SREAL as a
> valid RTX argument. This can be done incrementally.
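Roughly, such a profile_counter_t could look like the following standalone sketch - the class and method names just mirror the proposal above, double is only a stand-in for sreal, and the cap value is an arbitrary assumption:

// Standalone sketch (invented names; double stands in for GCC's sreal class)
// of a profile_counter_t replacing gcov_type: counts survive scaling by small
// probabilities instead of truncating to zero, and addition saturates at a
// cap instead of overflowing.
#include <algorithm>
#include <cstdio>

const double kCountCap = 1e30; // arbitrary cap chosen for the sketch

class profile_counter_t {
  double v_; // would be sreal in the actual proposal
public:
  explicit profile_counter_t(double v = 0) : v_(std::min(v, kCountCap)) {}
  profile_counter_t operator+(profile_counter_t o) const {
    return profile_counter_t(std::min(v_ + o.v_, kCountCap));
  }
  // Scale by a branch probability in [0, 1]; no integer truncation.
  profile_counter_t apply_probability(double prob) const {
    return profile_counter_t(v_ * prob);
  }
  bool nonzero() const { return v_ > 0; }
  double value() const { return v_; }
};

int main() {
  profile_counter_t one_run(1);
  profile_counter_t half = one_run.apply_probability(0.5);
  // A single training run split 50-50 keeps 0.5 on each edge instead of 0,
  // so neither successor is mistaken for never-executed.
  std::printf("count after 50%% edge: %g (nonzero: %d)\n",
              half.value(), (int)half.nonzero());
  return 0;
}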
>> BTW, Rong also implemented his runtime patch to do the COMDAT profile
>> merging. However, that ended up having some issues that were solvable
>> but would have caused us to lose all context sensitivity from COMDATs
>> inlined during the profile-gen build. I am going to go back to solving
>> this in the profile-use phase as we discussed in the separate thread
>> on the COMDAT inlining patch I had been working on.
>
> Yes, let's move ahead with this, too. I think I should dig out the change
> that made frequencies be guessed again.
>
> As for COMDAT merging, I would like to see the patch. I am experimenting
> now with a patch to also privatize COMDATs during -fprofile-generate to
> avoid the problems with lost profiles mentioned above.
>

Do you mean you privatize every COMDAT function in the profile-generate
build? We discussed this idea internally and we thought it would not work
for large applications (like in Google) due to size.

> As for context sensitivity, one approach would be to have two sets of
> counters for every comdat - one merged globally and one counting local
> instances. We can then privatize always, and at the profile read-in stage
> just clone every comdat and have two instances - one for the offline copy
> and one for inlining.
>

In my implementation, I also allow multiple sets of COMDAT profiles
co-existing in one compilation. Due to the auxiliary modules in LIPO, I
actually have more than two. But I'm wondering how you determine which
profile to use for each call site -- the inline decision may not be the
same for the profile-generate and profile-use compilations.

> This is not different from how I always wanted to handle GNU extern
> inlines (they also have this issue - when you do not inline one, the unit
> does not see any profile for it).
>
> We can just tie the two functions together so the "inline" version stays
> around prior to inlining, and then have the linker redirect to the inline
> version instead of the offline version in such cases. It already knows how
> to skip aliases, and this is not terribly different from that.
>
> Honza
>>
>> Thanks,
>> Teresa
>>
>> >
>> > Honza
>>
>> --
>> Teresa Johnson | Software Engineer | tejohn...@google.com | 408-460-2413
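As a postscript, here is a minimal standalone sketch of the two-counter-set idea for COMDATs discussed above (all types, names, and the mangled symbol are invented for illustration; this is not the actual patch): each COMDAT keeps one counter set merged across every module's copy and one counting only the local instance, so that profile read-in can give the offline clone the merged counts and the clone considered for inlining the local, context-sensitive counts.

// Standalone sketch (invented types and names, not the actual patch) of
// keeping two counter sets per COMDAT function: one merged across every
// module's copy and one counting only the local instance.
#include <cstdio>
#include <map>
#include <string>
#include <vector>

struct ComdatProfile {
  std::vector<long long> merged; // counts merged across all modules' copies
  std::vector<long long> local;  // counts from this module's instance only
};

// Merge-time update: one module's copy of the comdat contributes its counters.
static void contribute(ComdatProfile &p, const std::vector<long long> &ctrs,
                       bool from_this_module) {
  if (p.merged.size() < ctrs.size()) p.merged.resize(ctrs.size());
  if (p.local.size() < ctrs.size()) p.local.resize(ctrs.size());
  for (size_t i = 0; i < ctrs.size(); ++i) {
    p.merged[i] += ctrs[i];
    if (from_this_module) p.local[i] += ctrs[i];
  }
}

int main() {
  std::map<std::string, ComdatProfile> profiles;
  // This module's instance of the comdat ran 10 times ...
  contribute(profiles["_ZN3FooIiE3barEv"], {10, 10}, true);
  // ... copies in other modules ran 90 more times in total.
  contribute(profiles["_ZN3FooIiE3barEv"], {90, 40}, false);

  const ComdatProfile &p = profiles["_ZN3FooIiE3barEv"];
  // Profile read-in would clone the comdat: the offline copy gets the merged
  // counts, while the clone kept for inlining keeps the local counts and so
  // preserves this unit's context sensitivity.
  std::printf("offline copy, counter 0: %lld\n", p.merged[0]);
  std::printf("inline copy,  counter 0: %lld\n", p.local[0]);
  return 0;
}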