Ian Romanick <i...@freedesktop.org> writes:

> On 03/11/2013 07:56 AM, Jose Fonseca wrote:
>> I'm surprised this is faster.
>>
>> In particular, for big things we'll be touching memory twice.
>>
>> Did you measure the speed up?
>
> The second hit is cache-hot, so it may not be too expensive.  I suspect 
> memcpy is optimized to fill the cache in a more efficient manner than 
> the old loop.  Since the old loop did a read and a bit-wise or, it's 
> also possible the compiler generated some really dumb code.  We'd have 
> to look at the assembly output to know.

This is readpixels.  You're probably reading from uncached memory
(assuming the driver didn't do something clever), so you want to read
the biggest possible word at a time (memcpy, not 32 bits in a loop), or,
if you're on a Core 2 or newer CPU, use movntdqa for the reads so you
get streaming performance.

If anyone's interested, there's some code in the movntdqa branch of my
tree (against the ugly old span code, pre-automake), and in the movnt
branch of my tree (which does the automake integration and is much
prettier, but movntdqa is the instruction you want).
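
For reference, here's a minimal sketch of what a movntdqa streaming-load
copy looks like (this is not the code from either branch, just an
illustration).  It assumes SSE4.1 (-msse4.1), 16-byte-aligned pointers,
and a size that's a multiple of 64 bytes; a real version has to handle
the unaligned head and tail.

/* Minimal sketch of a streaming-load copy out of uncached /
 * write-combined memory using MOVNTDQA (_mm_stream_load_si128).
 * Assumes SSE4.1, 16-byte-aligned dst and src, and n a multiple of 64.
 */
#include <smmintrin.h>
#include <stddef.h>

static void
streaming_load_copy(void *dst, void *src, size_t n)
{
   __m128i *d = (__m128i *) dst;
   __m128i *s = (__m128i *) src;
   size_t i;

   for (i = 0; i < n / 16; i += 4) {
      /* Pull a full 64-byte streaming-load buffer's worth of data
       * before storing any of it, so the reads stay back to back. */
      __m128i r0 = _mm_stream_load_si128(&s[i + 0]);
      __m128i r1 = _mm_stream_load_si128(&s[i + 1]);
      __m128i r2 = _mm_stream_load_si128(&s[i + 2]);
      __m128i r3 = _mm_stream_load_si128(&s[i + 3]);

      _mm_store_si128(&d[i + 0], r0);
      _mm_store_si128(&d[i + 1], r1);
      _mm_store_si128(&d[i + 2], r2);
      _mm_store_si128(&d[i + 3], r3);
   }

   /* MOVNTDQA loads from WC memory are weakly ordered, so fence before
    * relying on ordering with other accesses to that memory. */
   _mm_mfence();
}

The win only shows up when the source really is WC/uncached (e.g. a GPU
BO mapped for readback); for plain cached memory an ordinary memcpy is
already about as good as it gets.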
