On Mon, May 19, 2014 at 9:14 AM, Uros Bizjak <ubiz...@gmail.com> wrote:
> On Mon, May 19, 2014 at 6:48 AM, Jan Hubicka <hubi...@ucw.cz> wrote:
>>> > Thanks for the pointer, there is indeed the recommendation in
>>> > optimization manual [1], section 3.6.4, where it is said:
>>> >
>>> > --quote--
>>> > Misaligned data access can incur significant performance penalties.
>>> > This is particularly true for cache line splits. The size of a cache
>>> > line is 64 bytes in the Pentium 4 and other recent Intel processors,
>>> > including processors based on Intel Core microarchitecture.
>>> > An access to data unaligned on 64-byte boundary leads to two memory
>>> > accesses and requires several μops to be executed (instead of one).
>>> > Accesses that span 64-byte boundaries are likely to incur a large
>>> > performance penalty, the cost of each stall generally are greater on
>>> > machines with longer pipelines.
>>> >
>>> > ...
>>> >
>>> > A 64-byte or greater data structure or array should be aligned so that
>>> > its base address is a multiple of 64.
>>> > Sorting data in decreasing size order is one heuristic for assisting
>>> > with natural alignment. As long as 16-byte boundaries (and cache
>>> > lines) are never crossed, natural alignment is not strictly necessary
>>> > (though it is an easy way to enforce this).
>>> > --/quote--
>>> >
>>> > So, this part has nothing to do with AVX512, but with cache line
>>> > width. And we do have a --param "l1-cache-line-size=64", detected with
>>> > -march=native that could come handy here.
>>> >
>>> > This part should be rewritten (and commented) with the information
>>> > above in mind.
>>>
>>> Like in the patch below. Please note, that the block_tune setting for
>>> the nocona is wrong, -march=native on my trusted old P4 returns:
>>>
>>> --param "l1-cache-size=16" --param "l1-cache-line-size=64" --param
>>> "l2-cache-size=2048" "-mtune=nocona"
>>>
>>> which is consistent with the above quote from manual.
>>>
>>> 2014-01-02 Uros Bizjak <ubiz...@gmail.com>
>>>
>>> * config/i386/i386.c (ix86_data_alignment): Calculate max_align
>>> from prefetch_block tune setting.
>>> (nocona_cost): Correct size of prefetch block to 64.
>>>
>> Uros,
>> I am looking into libreoffice size and the data alignment seems to make a
>> huge difference. Data section has grown from 5.8MB to 6.3MB in between
>> GCC 4.8 and 4.9, while clang produces 5.2MB.
>>
>> The two patches I posted to not align vtables and RTTI reduce it to 5.7MB,
>> but perhaps we want to revisit the alignment rules. The optimization manuals
>> usually care only about performance critical loops. Perhaps we can make the
>> rules to align only bigger data structures, or so at least for -O2.
>
> Based on the above quote, "Misaligned data access can incur
> significant performance penalties." and the fact that this particular
> alignment rule has some compatibility issues with previous versions of
> gcc (these were later fixed by Jakub), I'd rather leave this rule as
> is. However, if the access is from the cold section, we can perhaps
> avoid extra alignment, while avoiding those compatibility issues.
>
It is excessive to align

struct foo
{
  int x1;
  int x2;
  char x3;
  int x4;
  int x5;
  char x6;
  int x7;
  int x8;
};

to 32 bytes and align

struct foo
{
  int x1;
  int x2;
  char x3;
  int x4;
  int x5;
  char x6;
  int x7[9];
  int x8;
};

to 64 bytes. What performance gain does it provide?

--
H.J.