------- Comment #8 from tbptbp at gmail dot com  2006-09-18 05:52 -------
Subject: Re:  IV selection is messed up

On 17 Sep 2006 22:48:12 -0000, rakdver at gcc dot gnu dot org
<[EMAIL PROTECTED]> wrote:
> Regarding the "-fprefetch-loop-arrays's heuristic is way off the mark" part,
> gcc badly overestimates the size of the loop (it guesses 300 insns).  I will
> check what I can do with that.
If I understand you correctly, it's the other way around: with
-fprefetch-loop-arrays, gcc's prefetch distance is much too short.
If I remember correctly, that testcase takes a bunch of cycles per
iteration on my K8 (Opteron 252), and you have to prefetch at the very
least 256 bytes ahead for it to be profitable; gcc-4.2-20060908
prefetches less than 128 bytes ahead.
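
To make "prefetch distance" concrete, here is a minimal sketch of a
hand-placed prefetch using GCC's __builtin_prefetch. The function name,
the PREFETCH_BYTES knob and the summing loop are made up for
illustration; this is neither the testcase nor what the compiler
emits, and 256 is only the lower bound mentioned above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical tuning knob: how far ahead of the current load to prefetch.
   256 bytes is the minimum quoted above for this K8, not a GCC default. */
#define PREFETCH_BYTES 256

/* Sum n 32-bit values, requesting data PREFETCH_BYTES ahead of each load. */
int64_t sum_with_prefetch(const int32_t *data, size_t n)
{
    const size_t ahead = PREFETCH_BYTES / sizeof data[0];
    int64_t total = 0;

    assert(data != NULL || n == 0);
    for (size_t i = 0; i < n; ++i) {
        /* Stay inside the array so we never form an out-of-range pointer. */
        if (i + ahead < n)
            __builtin_prefetch(&data[i + ahead], 0 /* read */, 3 /* keep in all cache levels */);
        total += data[i];
    }
    return total;
}
```

This issues one prefetch per element for simplicity; a tuned loop
would issue one per cache line.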

That testcase is pretty silly anyway.
Here's what I get with the real code and -fprefetch-loop-arrays:

  4011c2:       movdqa (%ecx),%xmm2
  4011c6:       lea    0x10(%ecx),%eax
  4011c9:       movdqa %xmm6,%xmm4
  4011cd:       dec    %edx
  4011ce:       movdqa %xmm2,%xmm0
  4011d2:       mov    %eax,%ecx
  4011d4:       prefetcht0 (%eax)
  4011d7:       movdqa %xmm6,%xmm1
  4011db:       punpckldq %xmm2,%xmm0
  4011df:       punpckhdq %xmm2,%xmm2
  4011e3:       movdqa %xmm0,%xmm3
  4011e7:       punpcklqdq %xmm0,%xmm3
  4011eb:       punpckhqdq %xmm0,%xmm0
  4011ef:       pcmpgtd %xmm3,%xmm4
  4011f3:       pcmpgtd %xmm0,%xmm1
  4011f7:       paddd  0x10(%esp),%xmm4
  4011fd:       paddd  %xmm1,%xmm4
  401201:       movdqa %xmm5,%xmm1
  401205:       pcmpgtd %xmm3,%xmm1
  401209:       movdqa %xmm1,%xmm3
  40120d:       movdqa %xmm5,%xmm1
  401211:       paddd  %xmm7,%xmm3
  401215:       pcmpgtd %xmm0,%xmm1
  401219:       movdqa %xmm6,%xmm0
  40121d:       paddd  %xmm1,%xmm3
  401221:       movdqa %xmm2,%xmm1
  401225:       punpcklqdq %xmm2,%xmm1
  401229:       punpckhqdq %xmm2,%xmm2
  40122d:       pcmpgtd %xmm1,%xmm0
  401231:       paddd  %xmm0,%xmm4
  401235:       movdqa %xmm6,%xmm0
  401239:       pcmpgtd %xmm2,%xmm0
  40123d:       paddd  %xmm0,%xmm4
  401241:       movdqa %xmm5,%xmm0
  401245:       movaps %xmm4,0x10(%esp)
  40124a:       pcmpgtd %xmm1,%xmm0
  40124e:       paddd  %xmm0,%xmm3
  401252:       movdqa %xmm5,%xmm0
  401256:       pcmpgtd %xmm2,%xmm0
  40125a:       paddd  %xmm0,%xmm3
  40125e:       movdqa %xmm3,%xmm7
  401262:       jne    4011c2 <kdlib::AEBH::streaming_sampling(kdlib::AEBH::streaming_node_t const&, kdlib::AEBH::sampler3D_t const&)+0x52>

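Each iteration takes about 8 cycles when not starved, and prefetching
isn't a win unless done at least 4 or 8 cachelines away. Here the
prefetch targets only 16 bytes past the current load
(lea 0x10(%ecx),%eax; prefetcht0 (%eax)), so this one is nothing but
a hindrance.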
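
As a rough back-of-the-envelope check, the numbers in the listing can
be turned into cycles of lookahead. The helpers below are hypothetical
names, and the 64-byte cache line is my assumption for the K8; the 16
bytes per iteration come from the movdqa load and the lea 0x10 step,
and the 8 cycles from the estimate above.

```c
#include <assert.h>

/* Bytes of lookahead for a given number of cache lines. */
unsigned lookahead_bytes(unsigned lines, unsigned line_bytes)
{
    return lines * line_bytes;
}

/* Iterations that run before the loop reaches a prefetched address. */
unsigned lookahead_iterations(unsigned bytes_ahead, unsigned bytes_per_iter)
{
    assert(bytes_per_iter > 0);
    return bytes_ahead / bytes_per_iter;
}

/* Cycles available to the prefetch before the data is needed. */
unsigned lookahead_cycles(unsigned iterations, unsigned cycles_per_iter)
{
    return iterations * cycles_per_iter;
}
```

With those assumptions, 4 cache lines is 256 bytes, i.e. 16 iterations
or about 128 cycles of lookahead, while the emitted prefetch at 16
bytes gives a single iteration, about 8 cycles.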


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=28919
