Subject: Re: IV selection is messed up

------- Comment #8 from tbptbp at gmail dot com 2006-09-18 05:52 -------
On 17 Sep 2006 22:48:12 -0000, rakdver at gcc dot gnu dot org
<[EMAIL PROTECTED]> wrote:
> Regarding the "-fprefetch-loop-arrays's heuristic is way off the mark" part,
> gcc badly overestimates the size of the loop (it guesses 300 insns). I will
> check what I can do with that.

Provided I understand what you meant, it's the other way around: with
-fprefetch-loop-arrays, gcc's prefetch distance is much too short.
If I remember correctly, that testcase takes a bunch of cycles per iteration
on my K8 (Opteron 252), and you have to prefetch at the very least 256 bytes
away to make that profitable; with gcc-4.2-20060908 the distance is less
than 128.

That testcase is pretty silly anyway. Here's what I get with the real code
and -fprefetch-loop-arrays:

  4011c2: movdqa     (%ecx),%xmm2
  4011c6: lea        0x10(%ecx),%eax
  4011c9: movdqa     %xmm6,%xmm4
  4011cd: dec        %edx
  4011ce: movdqa     %xmm2,%xmm0
  4011d2: mov        %eax,%ecx
  4011d4: prefetcht0 (%eax)
  4011d7: movdqa     %xmm6,%xmm1
  4011db: punpckldq  %xmm2,%xmm0
  4011df: punpckhdq  %xmm2,%xmm2
  4011e3: movdqa     %xmm0,%xmm3
  4011e7: punpcklqdq %xmm0,%xmm3
  4011eb: punpckhqdq %xmm0,%xmm0
  4011ef: pcmpgtd    %xmm3,%xmm4
  4011f3: pcmpgtd    %xmm0,%xmm1
  4011f7: paddd      0x10(%esp),%xmm4
  4011fd: paddd      %xmm1,%xmm4
  401201: movdqa     %xmm5,%xmm1
  401205: pcmpgtd    %xmm3,%xmm1
  401209: movdqa     %xmm1,%xmm3
  40120d: movdqa     %xmm5,%xmm1
  401211: paddd      %xmm7,%xmm3
  401215: pcmpgtd    %xmm0,%xmm1
  401219: movdqa     %xmm6,%xmm0
  40121d: paddd      %xmm1,%xmm3
  401221: movdqa     %xmm2,%xmm1
  401225: punpcklqdq %xmm2,%xmm1
  401229: punpckhqdq %xmm2,%xmm2
  40122d: pcmpgtd    %xmm1,%xmm0
  401231: paddd      %xmm0,%xmm4
  401235: movdqa     %xmm6,%xmm0
  401239: pcmpgtd    %xmm2,%xmm0
  40123d: paddd      %xmm0,%xmm4
  401241: movdqa     %xmm5,%xmm0
  401245: movaps     %xmm4,0x10(%esp)
  40124a: pcmpgtd    %xmm1,%xmm0
  40124e: paddd      %xmm0,%xmm3
  401252: movdqa     %xmm5,%xmm0
  401256: pcmpgtd    %xmm2,%xmm0
  40125a: paddd      %xmm0,%xmm3
  40125e: movdqa     %xmm3,%xmm7
  401262: jne        4011c2 <kdlib::AEBH::streaming_sampling(kdlib::AEBH::streaming_node_t const&, kdlib::AEBH::sampler3D_t const&)+0x52>

Each iteration takes about 8 cycles when not starved, and prefetching isn't
a win unless done at least 4 or 8 cachelines away, so this one is nothing
but a hindrance.

-- 
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=28919
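
To put numbers on the distance argument: the loop advances 16 bytes per iteration (the `lea 0x10(%ecx),%eax`), and the `prefetcht0 (%eax)` in the listing appears to touch only the address the very next iteration loads. With 64-byte cache lines, 4 lines ahead is 256 bytes, i.e. 16 iterations, or roughly 128 cycles of lead time at about 8 cycles per iteration. The following is only a rough sketch of what a hand-placed prefetch at that distance could look like, not the real kdlib code: `count_below` and `prefetch_iterations_ahead` are hypothetical names, the loop body is a scalar stand-in for the SSE2 compares, and the 64-byte line size and 4-line distance are assumptions taken from the figures above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumptions: 64-byte cache lines, prefetch 4 lines (256 bytes) ahead,
   16 bytes consumed per iteration as in the disassembled loop. */
enum { CACHE_LINE_BYTES = 64, PREFETCH_LINES_AHEAD = 4, STRIDE_BYTES = 16 };

/* Number of loop iterations that separate a prefetch from its use. */
static size_t prefetch_iterations_ahead(size_t lines, size_t line_bytes,
                                        size_t stride_bytes)
{
    assert(stride_bytes != 0);
    return (lines * line_bytes) / stride_bytes;
}

/* Hypothetical stand-in for the real loop: counts the values below
   `threshold`, handling 16 bytes (4 ints) per outer iteration. */
static long count_below(const int32_t *data, size_t n, int32_t threshold)
{
    const size_t step = STRIDE_BYTES / sizeof(int32_t);
    const size_t line_elems = CACHE_LINE_BYTES / sizeof(int32_t);
    const size_t ahead_elems = PREFETCH_LINES_AHEAD * line_elems;
    long count = 0;

    for (size_t i = 0; i < n; i += step) {
        /* Issue one read prefetch per cache line, 4 lines ahead of the
           current position, and skip it near the end of the array. */
        if (i % line_elems == 0 && i + ahead_elems < n)
            __builtin_prefetch(&data[i + ahead_elems], 0, 3);

        for (size_t j = i; j < i + step && j < n; ++j)
            count += data[j] < threshold;
    }
    return count;
}
```

Whether 4 or 8 lines is the better distance would have to be measured on the target machine; the sketch only shows where the distance enters the code.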