On Tue, Dec 11, 2012 at 10:48 AM, Richard Earnshaw <rearn...@arm.com> wrote:
> On 11/12/12 09:45, Richard Biener wrote:
>>
>> On Mon, Dec 10, 2012 at 10:07 PM, Andi Kleen <a...@firstfloor.org> wrote:
>>>
>>> Jan Hubicka <hubi...@ucw.cz> writes:
>>>
>>>> Note that I think Core has similar characteristics - at least for
>>>> string operations it fares well with unaligned accesses.
>>>
>>> Nehalem and later have very fast unaligned vector loads. There's still
>>> some penalty when they cross cache lines, however.
>>>
>>> IIRC the rule of thumb is to use unaligned accesses for 128-bit
>>> vectors, but to avoid them for 256-bit vectors, because the
>>> cache-line-crossing penalty is larger on Sandy Bridge and more likely
>>> with the larger vectors.
>>
>> Yes, I think the rule was that using the unaligned instruction variants
>> carries no penalty when the actual access is aligned, but that aligned
>> accesses are still faster than unaligned accesses. Thus peeling for
>> alignment _is_ a win. I also seem to remember that the story for
>> unaligned stores vs. unaligned loads is usually different.
>
> Yes, it's generally the case that unaligned loads are slightly more
> expensive than unaligned stores, since the stores can often merge in a
> store buffer with little or no penalty.
It was the other way around on AMD CPUs AFAIK - unaligned stores forced
flushes of the store buffers, which is why the vectorizer first and
foremost tries to align stores.

Richard.

> R.