On Mon, May 19, 2014 at 9:14 AM, Uros Bizjak <ubiz...@gmail.com> wrote:
> On Mon, May 19, 2014 at 6:48 AM, Jan Hubicka <hubi...@ucw.cz> wrote:
>>> > Thanks for the pointer, there is indeed the recommendation in
>>> > optimization manual [1], section 3.6.4, where it is said:
>>> >
>>> > --quote--
>>> > Misaligned data access can incur significant performance penalties.
>>> > This is particularly true for cache line
>>> > splits. The size of a cache line is 64 bytes in the Pentium 4 and
>>> > other recent Intel processors, including
>>> > processors based on Intel Core microarchitecture.
>>> > An access to data unaligned on 64-byte boundary leads to two memory
>>> > accesses and requires several
>>> > µops to be executed (instead of one). Accesses that span 64-byte
>>> > boundaries are likely to incur a large
>>> > performance penalty, the cost of each stall generally are greater on
>>> > machines with longer pipelines.
>>> >
>>> > ...
>>> >
>>> > A 64-byte or greater data structure or array should be aligned so that
>>> > its base address is a multiple of 64.
>>> > Sorting data in decreasing size order is one heuristic for assisting
>>> > with natural alignment. As long as 16-byte
>>> > boundaries (and cache lines) are never crossed, natural alignment
>>> > is not strictly necessary (though
>>> > it is an easy way to enforce this).
>>> > --/quote--
>>> >
>>> > So, this part has nothing to do with AVX512, but with cache line
>>> > width. And we do have a --param "l1-cache-line-size=64", detected with
>>> > -march=native, that could come in handy here.
>>> >
>>> > This part should be rewritten (and commented) with the information
>>> > above in mind.
>>>
>>> Like in the patch below. Please note that the block_tune setting for
>>> nocona is wrong; -march=native on my trusty old P4 returns:
>>>
>>> --param "l1-cache-size=16" --param "l1-cache-line-size=64" --param
>>> "l2-cache-size=2048" "-mtune=nocona"
>>>
>>> which is consistent with the above quote from the manual.
>>>
>>> 2014-01-02  Uros Bizjak  <ubiz...@gmail.com>
>>>
>>>     * config/i386/i386.c (ix86_data_alignment): Calculate max_align
>>>     from prefetch_block tune setting.
>>>     (nocona_cost): Correct size of prefetch block to 64.
>>>
>> Uros,
>> I am looking into LibreOffice size and the data alignment seems to make a huge
>> difference. The data section has grown from 5.8MB to 6.3MB between GCC 4.8
>> and 4.9, while clang produces 5.2MB.
>>
>> The two patches I posted to not align vtables and RTTI reduce it to 5.7MB,
>> but perhaps we want to revisit the alignment rules.  The optimization manuals
>> usually care only about performance-critical loops.  Perhaps we can make the
>> rules align only bigger data structures, at least for -O2.
>
> Based on the above quote, "Misaligned data access can incur
> significant performance penalties." and the fact that this particular
> alignment rule had some compatibility issues with previous versions of
> gcc (these were later fixed by Jakub), I'd rather leave this rule as
> is. However, if the access is from the cold section, we can perhaps
> avoid extra alignment, while avoiding those compatibility issues.
>

It is excessive to align

struct foo
{
  int x1;
  int x2;
  char x3;
  int x4;
  int x5;
  char x6;
  int x7;
  int x8;
};

to 32 bytes and align

struct foo
{
  int x1;
  int x2;
  char x3;
  int x4;
  int x5;
  char x6;
  int x7[9];
  int x8;
};

to 64 bytes.  What performance gain does it provide?

-- 
H.J.
