On 03/11/2013 11:30 AM, Jose Fonseca wrote:
----- Original Message -----
On 03/11/2013 07:56 AM, Jose Fonseca wrote:
I'm surprised this is faster.
In particular, for big things we'll be touching memory twice.
Did you measure the speed up?
The second hit is cache-hot, so it may not be too expensive.
Yes, but the size in question is 1900x1200, i.e., 9MB, which will thrash the
L1 and L2 caches, and won't even fit in the L3 cache of several processors.
But it's doing it line-by-line, right? So 1900 pixels * 4 bytes per pixel is only ~8KB.
I'm afraid we'd be optimizing some cases at expense of others.
That is probably true either way. To optimize this for everything, we'd
need a lot more tests.
I think that at the very least we should do this in 16KB/32KB or so chunks to
avoid thrashing the lower-level caches.
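To make the chunking idea concrete, here is a rough sketch of a copy that does the memcpy and the alpha fix-up one chunk at a time, so the second pass touches data that should still be in cache. The function name, the 16KB chunk size, and the assumption that alpha lives in the top byte of each 32-bit pixel are all illustrative choices, not anything taken from the patch; the chunk size in particular would need measuring.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed chunk size; the right value would have to be measured. */
enum { CHUNK_BYTES = 16 * 1024 };

/* Hypothetical helper, not Mesa code: copy `count` 32-bit pixels and force
 * the top (alpha) byte of each one to 0xff, one chunk at a time. */
static void
copy_set_alpha_chunked(uint32_t *dst, const uint32_t *src, size_t count)
{
   const size_t chunk_pixels = CHUNK_BYTES / sizeof *src;

   assert(count == 0 || (dst != NULL && src != NULL));

   while (count > 0) {
      const size_t n = count < chunk_pixels ? count : chunk_pixels;
      size_t i;

      /* First pass: plain copy of one chunk. */
      memcpy(dst, src, n * sizeof *src);

      /* Second pass: set alpha while the chunk is still cache-hot. */
      for (i = 0; i < n; i++)
         dst[i] |= 0xff000000u;

      src += n;
      dst += n;
      count -= n;
   }
}
```

Calling this once per row behaves like the line-by-line version for narrow surfaces, and caps the working set for wide ones.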
I suspect memcpy is optimized to fill the cache in a more efficient manner than
the old loop. Since the old loop did a read and a bitwise OR, it's
also possible the compiler generated some really dumb code. We'd have
to look at the assembly output to know.
As Patrick suggests, there's probably an SSE2 method to do this even
faster. That may be worth investigating.
An SSE2 version is quite easy with intrinsics:
__m128i pixels = _mm_loadu_si128((const __m128i *)src); // could use _mm_load_si128 with some alignment checks
pixels = _mm_or_si128(pixels, _mm_set1_epi32(0xff000000));
_mm_storeu_si128((__m128i *)dst, pixels);
src += sizeof(__m128i) / sizeof *src;
dst += sizeof(__m128i) / sizeof *dst;
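The fragment above handles one group of four pixels. For reference, a complete loop might look roughly like the following sketch. The function name is made up, it assumes src and dst point at 32-bit pixels, and it needs an x86 target with SSE2 enabled (for 32-bit GCC builds that means something like -msse2). The scalar loop at the end handles widths that are not a multiple of four.

```c
#include <assert.h>
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not Mesa code: copy `count` 32-bit pixels from src
 * to dst, forcing the top (alpha) byte of each pixel to 0xff. */
static void
copy_set_alpha_sse2(uint32_t *dst, const uint32_t *src, size_t count)
{
   const __m128i alpha = _mm_set1_epi32((int)0xff000000u);
   size_t i = 0;

   assert(count == 0 || (dst != NULL && src != NULL));

   /* Four pixels per iteration; unaligned loads and stores, so no
    * alignment requirement on either pointer. */
   for (; i + 4 <= count; i += 4) {
      __m128i pixels = _mm_loadu_si128((const __m128i *)(src + i));
      pixels = _mm_or_si128(pixels, alpha);
      _mm_storeu_si128((__m128i *)(dst + i), pixels);
   }

   /* Scalar tail for the remaining 0..3 pixels. */
   for (; i < count; i++)
      dst[i] = src[i] | 0xff000000u;
}
```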
The hard part is the runtime check for SSE2 support...
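One possible way to do that check, at least with GCC and Clang, is the __builtin_cpu_supports builtin. This is only a sketch: the wrapper name is made up, the builtin is compiler-specific and x86-only, and it is not what Mesa's own CPU detection code does. On other compilers or architectures the wrapper just reports false.

```c
#include <stdbool.h>

/* Hypothetical wrapper, not Mesa code: report whether the running CPU
 * supports SSE2. Relies on a GCC/Clang builtin; elsewhere returns false. */
static bool
cpu_has_sse2(void)
{
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__i386__) || defined(__x86_64__))
   /* Harmless if the runtime has already initialized the CPU model. */
   __builtin_cpu_init();
   return __builtin_cpu_supports("sse2") != 0;
#else
   return false;
#endif
}
```

The result could be cached once at startup and used to pick between the SSE2 path and the plain C loop.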
We could start by doing something like this for 64-bit builds. SSE2 is
always available there. :) If we're using the intrinsics anyway, it's
probably even better to use PREFETCHNTA on the read.
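As a rough illustration of both points, the sketch below enables the SSE2 path only when compiling for x86-64 and issues a non-temporal prefetch (PREFETCHNTA via _mm_prefetch) on the source once per 64-byte cache line. The function name, the HAVE_SSE2_ALWAYS macro, and the prefetch distance are assumptions for the example; whether the prefetch helps at all would have to be measured.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(_M_X64)
#include <xmmintrin.h>   /* _mm_prefetch, _MM_HINT_NTA */
#include <emmintrin.h>   /* SSE2 intrinsics */
#define HAVE_SSE2_ALWAYS 1   /* hypothetical macro for this sketch */
#else
#define HAVE_SSE2_ALWAYS 0
#endif

/* Prefetch distance in pixels (64 pixels = 256 bytes). A guess, not a
 * measured value. */
#define PREFETCH_AHEAD 64

/* Hypothetical helper, not Mesa code: copy pixels and set alpha to 0xff,
 * using SSE2 plus PREFETCHNTA on x86-64 and plain C everywhere else. */
static void
copy_set_alpha_x64(uint32_t *dst, const uint32_t *src, size_t count)
{
   size_t i = 0;

   assert(count == 0 || (dst != NULL && src != NULL));

#if HAVE_SSE2_ALWAYS
   {
      const __m128i alpha = _mm_set1_epi32((int)0xff000000u);

      for (; i + 4 <= count; i += 4) {
         __m128i pixels;

         /* One prefetch per 64-byte line (16 pixels), kept inside the
          * source buffer. */
         if ((i & 15) == 0 && i + PREFETCH_AHEAD < count)
            _mm_prefetch((const char *)(src + i + PREFETCH_AHEAD),
                         _MM_HINT_NTA);

         pixels = _mm_loadu_si128((const __m128i *)(src + i));
         pixels = _mm_or_si128(pixels, alpha);
         _mm_storeu_si128((__m128i *)(dst + i), pixels);
      }
   }
#endif

   /* Scalar tail on x86-64, or the whole copy on other targets. */
   for (; i < count; i++)
      dst[i] = src[i] | 0xff000000u;
}
```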
Mesa has some code for detecting CPU capabilities, but I don't think it
has been updated in ages... It looks like src/mesa/x86/common_x86.c
detects MMX and SSE, but there's no code for anything after that.
mesa-dev mailing list