----- Original Message -----
> On 03/11/2013 11:30 AM, Jose Fonseca wrote:
> > ----- Original Message -----
> >> On 03/11/2013 07:56 AM, Jose Fonseca wrote:
> >>> I'm surprised this is faster.
> >>>
> >>> In particular, for big things we'll be touching memory twice.
> >>>
> >>> Did you measure the speed up?
> >>
> >> The second hit is cache-hot, so it may not be too expensive.
> >
> > Yes, but the size in question is 1900x1200, i.e., 9MB, which will trash L1-L2
> > caches, and won't even fit on the L3 cache of several processors.
>
> But it's doing it line-by-line, right?  So 1900 * 4bpp is only ~8kb.

Oh, I missed that. That looks quite sensible then.

> > I'm afraid we'd be optimizing some cases at the expense of others.
>
> That is probably true either way.  To optimize this for everything, we'd
> need a lot more tests.
>
> > I think that at the very least we should do this in 16KB/32KB or so chunks to
> > avoid trashing the lower level caches.
> >
> >> I suspect
> >> memcpy is optimized to fill the cache in a more efficient manner than
> >> the old loop.  Since the old loop did a read and a bit-wise or, it's
> >> also possible the compiler generated some really dumb code.  We'd have
> >> to look at the assembly output to know.
> >>
> >> As Patrick suggests, there's probably an SSE2 method to do this even
> >> faster.  That may be worth investigating.
> >
> > An SSE2 version is quite easy with intrinsics:
> >
> >   // could use _mm_load_si128 with some checks
> >   __m128i pixels = _mm_loadu_si128((const __m128i *)src);
> >   pixels = _mm_or_si128(pixels, _mm_set1_epi32(0xff000000));
> >   _mm_storeu_si128((__m128i *)dst, pixels);
> >   src += sizeof(__m128i) / sizeof *src;
> >   dst += sizeof(__m128i) / sizeof *dst;
> >
> > the hard part is the runtime check for sse2 support...
>
> We could start by doing something like this for 64-bit builds.  SSE2 is
> always available there. :)  If we're using the intrinsics anyway, it's
> probably even better to use PREFETCHNTA on the read.

Yes, that would avoid trashing the cache with one-time reads.

> Mesa has some code for detecting CPU capabilities, but I don't think it
> has been updated in ages...  It looks like src/mesa/x86/common_x86.c
> detects MMX and SSE, but there's no code for anything after that.

Gallium too.  Should move that into somewhere shareable...

Jose
_______________________________________________
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/mesa-dev
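
For reference, here is a minimal sketch of how the intrinsics snippet quoted in this thread might look as a complete per-row helper. It is not the patch under discussion. The name copy_row_set_alpha, the 32-bit pixel layout with alpha in the top byte, and the compile-time SSE2 guard (standing in for the "64-bit builds only" idea) are all illustrative assumptions. It leaves out PREFETCHNTA and any runtime CPU detection.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: SSE2 is selected at build time only (always true on x86-64).
 * No runtime CPU detection is attempted here. */
#if defined(__SSE2__) || defined(_M_X64) || \
    (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>   /* SSE2 intrinsics */
#define HAVE_SSE2_AT_COMPILE_TIME 1
#endif

/* Hypothetical helper: copy n 32-bit pixels from src to dst, forcing the
 * top byte (assumed to hold alpha) to 0xff. */
static void
copy_row_set_alpha(uint32_t *dst, const uint32_t *src, size_t n)
{
   size_t i = 0;

   assert(n == 0 || (dst != NULL && src != NULL));

#ifdef HAVE_SSE2_AT_COMPILE_TIME
   {
      const __m128i alpha = _mm_set1_epi32((int)0xff000000u);

      /* Four pixels per iteration. Unaligned load/store is used, so the
       * rows need no particular alignment. */
      for (; i + 4 <= n; i += 4) {
         __m128i pixels = _mm_loadu_si128((const __m128i *)(src + i));
         pixels = _mm_or_si128(pixels, alpha);
         _mm_storeu_si128((__m128i *)(dst + i), pixels);
      }
   }
#endif

   /* Scalar tail; also handles the whole row when SSE2 is not enabled. */
   for (; i < n; i++)
      dst[i] = src[i] | 0xff000000u;
}
```

Called once per row, e.g. copy_row_set_alpha(dst_row, src_row, width), this does a single pass over a working set of about one scanline, which is the case the line-by-line argument relies on.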