On Mon, Mar 8, 2010 at 9:49 AM, Piotr Wyderski <piotr.wyder...@gmail.com> wrote:
> I have the following code:
>
>    struct bounding_box {
>
>        pack4sf m_Mins;
>        pack4sf m_Maxs;
>
>        void set(__v4sf v_mins, __v4sf v_maxs) {
>
>            m_Mins = v_mins;
>            m_Maxs = v_maxs;
>        }
>    };
>
>    struct bin {
>
>        bounding_box m_Box[3];
>        pack4si m_NL;
>        pack4sf m_AL;
>    };
>
>    static const std::size_t bin_count = 16;
>    bin aBins[bin_count];
>
>    for(std::size_t i = 0; i != bin_count; ++i) {
>
>        bin& b = aBins[i];
>
>        b.m_Box[0].set(g_VecInf, g_VecMinusInf);
>        b.m_Box[1].set(g_VecInf, g_VecMinusInf);
>        b.m_Box[2].set(g_VecInf, g_VecMinusInf);
>        b.m_NL = __v4si{ 0, 0, 0, 0 };
>    }
>
> where pack4sf/si are union-based wrappers for __v4sf/si.
> GCC 4.5 on Core i7/Cygwin with
>
> -O3 -fno-lto -msse -msse2 -mfpmath=sse -march=native -mtune=native
> -fomit-frame-pointer
>
> completely unrolled the loop into 112 movdqa instructions,
> which is "a bit" too aggressive. Should I file a bug report?
> The processor has an 18 instructions long prefetch queue
> and the loop is perfectly predictable by the built-in branch
> prediction circuitry, so translating it as is would result in huge
> fetch/decode bandwidth reduction. Is there something like
> "#pragma nounroll" to selectively disable this optimization?

No, only --param max-completely-peel-times (which is 16) or
--param max-completely-peeled-insns (which probably should then be
way lower than the current 400).

Richard.

> Best regards
> Piotr Wyderski
>
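
For anyone trying the two parameters named above: they are passed on the g++ command line as --param name=value. Below is a minimal sketch of the loop from the question that can be compiled both ways to compare the generated code. It is an assumption-laden stand-in, not the original program: plain float/int arrays replace the pack4sf/pack4si wrappers (which were not posted), reset_bins is a hypothetical name, and the parameter values in the comments are illustrative rather than tuned. Whether a given value actually keeps the loop rolled depends on the GCC version, so check the assembly (for example with -S).

```cpp
// Hypothetical build lines (values are illustrative only):
//   g++ -O3 -S bins.cpp
//   g++ -O3 -S --param max-completely-peel-times=8 bins.cpp
//   g++ -O3 -S --param max-completely-peeled-insns=64 bins.cpp
#include <cstddef>
#include <limits>

// Plain-array stand-ins for the pack4sf / pack4si wrappers from the post.
struct bounding_box {
    float m_Mins[4];
    float m_Maxs[4];

    // Broadcasts one value into each lane, like assigning a 4-wide vector.
    void set(float mins, float maxs) {
        for (std::size_t k = 0; k != 4; ++k) {
            m_Mins[k] = mins;
            m_Maxs[k] = maxs;
        }
    }
};

struct bin {
    bounding_box m_Box[3];
    int m_NL[4];
    float m_AL[4];  // left untouched, as in the original loop
};

static const std::size_t bin_count = 16;

// Resets every bin to an "empty" box (mins = +inf, maxs = -inf) and
// zeroes m_NL, mirroring the 16-iteration loop in the question.
void reset_bins(bin (&bins)[bin_count]) {
    const float inf = std::numeric_limits<float>::infinity();
    for (std::size_t i = 0; i != bin_count; ++i) {
        bin& b = bins[i];
        b.m_Box[0].set(inf, -inf);
        b.m_Box[1].set(inf, -inf);
        b.m_Box[2].set(inf, -inf);
        for (std::size_t k = 0; k != 4; ++k) {
            b.m_NL[k] = 0;
        }
    }
}
```

Since the loop runs 16 times and the peel-times limit is stated above as 16, a value below 16 is the one to try first; lowering max-completely-peeled-insns instead limits by the size of the unrolled body.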