On Tue, Dec 11, 2012 at 10:48 AM, Richard Earnshaw <rearn...@arm.com> wrote:
> On 11/12/12 09:45, Richard Biener wrote:
>>
>> On Mon, Dec 10, 2012 at 10:07 PM, Andi Kleen <a...@firstfloor.org> wrote:
>>>
>>> Jan Hubicka <hubi...@ucw.cz> writes:
>>>
>>>> Note that I think Core has similar characteristics - at least for string
>>>> operations it fares well with unaligned accesses.
>>>
>>>
>>> Nehalem and later have very fast unaligned vector loads.  There's
>>> still some penalty when they cross cache lines, however.
>>>
>>> IIRC the rule of thumb is to use unaligned accesses for 128-bit
>>> vectors, but to avoid them for 256-bit vectors because the
>>> cache-line-crossing penalty is larger on Sandy Bridge and more
>>> likely with the larger vectors.
>>
>>
>> Yes, I think the rule was that using the unaligned instruction
>> variants carries no penalty when the actual access is aligned, but
>> that aligned accesses are still faster than unaligned accesses.
>> Thus peeling for alignment _is_ a win.  I also seem to remember
>> that the story for unaligned stores vs. unaligned loads is usually
>> different.
>
>
> Yes, it's generally the case that unaligned loads are slightly more
> expensive than unaligned stores, since the stores can often merge in a store
> buffer with little or no penalty.

It was the other way around on AMD CPUs, AFAIK - unaligned stores
forced flushes of the store buffers.  Which is why the vectorizer
first and foremost tries to align stores.

Richard.

> R.
>
>
