On Mon, Mar 11, 2013 at 1:30 PM, Jose Fonseca <jfons...@vmware.com> wrote:
> ----- Original Message -----
> > On 03/11/2013 07:56 AM, Jose Fonseca wrote:
> > > I'm surprised this is faster.
> > >
> > > In particular, for big things we'll be touching memory twice.
> > >
> > > Did you measure the speed up?
> >
> > The second hit is cache-hot, so it may not be too expensive.
>
> Yes, but the size in question is 1900x1200, i.e., 9MB, which will trash
> L1-L2 caches, and won't even fit on the L3 cache of several processors.
>
> I'm afraid we'd be optimizing some cases at the expense of others.
>
> I think that at the very least we should do this in 16KB/32KB or so chunks
> to avoid trashing the lower level caches.
>
> > I suspect
> > memcpy is optimized to fill the cache in a more efficient manner than
> > the old loop. Since the old loop did a read and a bit-wise or, it's
> > also possible the compiler generated some really dumb code. We'd have
> > to look at the assembly output to know.
> >
> > As Patrick suggests, there's probably an SSE2 method to do this even
> > faster. That may be worth investigating.
>
> An SSE2 version is quite easy with intrinsics:
>
>   __m128i pixels = _mm_loadu_si128((const __m128i *)src); // could use _mm_load_si128 with some checks
>   pixels = _mm_or_si128(pixels, _mm_set1_epi32(0xff000000));
>   _mm_storeu_si128((__m128i *)dst, pixels);
>   src += sizeof(__m128i) / sizeof *src;
>   dst += sizeof(__m128i) / sizeof *dst;
>
> the hard part is the runtime check for sse2 support...

At least for x86-64, no runtime check is needed, since SSE2 is part of the
baseline instruction set there. The mesa/x86 folder already contains runtime
CPU feature detection code; I was just browsing it.

Patrick
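
To make the two ideas in this thread concrete, here is a rough sketch, not
the actual patch. It assumes 32-bit pixels with alpha in the top byte, and
the function names (copy_set_alpha, copy_set_alpha_chunked) are made up for
illustration. The first function uses the SSE2 loop quoted above when SSE2 is
known at build time (as it is on x86-64) and falls back to a scalar loop
otherwise. The second does the memcpy-then-OR approach in small chunks so the
second pass stays in the lower cache levels.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* SSE2 is baseline on x86-64; on 32-bit x86 a runtime check would be needed
 * instead of this build-time test. */
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>
#define HAVE_SSE2_AT_BUILD 1
#else
#define HAVE_SSE2_AT_BUILD 0
#endif

#define ALPHA_MASK 0xff000000u

/* Hypothetical helper: copy n 32-bit pixels from src to dst, forcing the
 * alpha byte to 0xff. Memory is touched once. */
void copy_set_alpha(uint32_t *dst, const uint32_t *src, size_t n)
{
    size_t i = 0;
#if HAVE_SSE2_AT_BUILD
    const __m128i alpha = _mm_set1_epi32((int)ALPHA_MASK);
    /* Four pixels per iteration, unaligned loads and stores. */
    for (; i + 4 <= n; i += 4) {
        __m128i pixels = _mm_loadu_si128((const __m128i *)(src + i));
        pixels = _mm_or_si128(pixels, alpha);
        _mm_storeu_si128((__m128i *)(dst + i), pixels);
    }
#endif
    /* Scalar tail (or the whole copy when SSE2 is unavailable). */
    for (; i < n; i++)
        dst[i] = src[i] | ALPHA_MASK;
}

/* Hypothetical helper: memcpy followed by an OR pass, done one chunk at a
 * time so the OR pass hits data that is still cache-hot. */
void copy_set_alpha_chunked(uint32_t *dst, const uint32_t *src,
                            size_t n, size_t chunk_pixels)
{
    assert(chunk_pixels > 0);
    while (n > 0) {
        size_t m = n < chunk_pixels ? n : chunk_pixels;
        memcpy(dst, src, m * sizeof *dst);
        for (size_t i = 0; i < m; i++)
            dst[i] |= ALPHA_MASK;
        dst += m;
        src += m;
        n -= m;
    }
}
```

With 4-byte pixels, a chunk_pixels value of 4096 corresponds to the 16KB
chunk size suggested above. Which variant is actually faster would still have
to be measured.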
_______________________________________________
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/mesa-dev