On Mon, Mar 8, 2010 at 9:49 AM, Piotr Wyderski <piotr.wyder...@gmail.com> wrote:
> I have the following code:
>
>    struct bounding_box {
>
>        pack4sf m_Mins;
>        pack4sf m_Maxs;
>
>        void set(__v4sf v_mins, __v4sf v_maxs) {
>
>            m_Mins = v_mins;
>            m_Maxs = v_maxs;
>        }
>    };
>
>    struct bin {
>
>        bounding_box m_Box[3];
>        pack4si      m_NL;
>        pack4sf      m_AL;
>    };
>
>    static const std::size_t bin_count = 16;
>    bin aBins[bin_count];
>
>    for(std::size_t i = 0; i != bin_count; ++i) {
>
>        bin& b = aBins[i];
>
>        b.m_Box[0].set(g_VecInf, g_VecMinusInf);
>        b.m_Box[1].set(g_VecInf, g_VecMinusInf);
>        b.m_Box[2].set(g_VecInf, g_VecMinusInf);
>        b.m_NL = __v4si{ 0, 0, 0, 0 };
>    }
>
> where pack4sf/si are union-based wrappers for __v4sf/si.
> GCC 4.5 on Core i7/Cygwin with
>
> -O3 -fno-lto -msse -msse2 -mfpmath=sse -march=native -mtune=native
> -fomit-frame-pointer
>
> completely unrolled the loop into 112 movdqa instructions,
> which is "a bit" too aggressive. Should I file a bug report?
> The processor has an 18-instruction prefetch queue
> and the loop is perfectly predictable by the built-in branch
> prediction circuitry, so translating it as is would result in huge
> fetch/decode bandwidth reduction. Is there something like
> "#pragma nounroll" to selectively disable this optimization?

No, only --param max-completely-peel-times (which is 16)
or --param max-completely-peeled-insns (which probably should
then be way lower than the current 400).
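
For illustration only, here is a minimal sketch of how those parameters
might be applied to a reduced version of the quoted loop. The typedefs
v4sf/v4si, the union layout of pack4sf/pack4si, and the helper init_bins
are assumptions standing in for code the original post does not show, and
the parameter values in the build line are arbitrary examples rather than
recommended settings. Each iteration writes seven 16-byte vectors (six for
the boxes, one for m_NL), so 16 fully peeled iterations account for the
112 stores reported.

```cpp
// Sketch under stated assumptions; not the poster's actual code.
//
// Possible build line (values are illustrative only):
//   g++ -O3 -msse2 --param max-completely-peel-times=4 \
//       --param max-completely-peeled-insns=32 bins.cpp
#include <cstddef>
#include <limits>

// GCC vector extensions standing in for __v4sf / __v4si.
typedef float v4sf __attribute__((vector_size(16)));
typedef int   v4si __attribute__((vector_size(16)));

// Hypothetical union-based wrappers; the real ones are not shown in the thread.
union pack4sf { v4sf v; float f[4]; };
union pack4si { v4si v; int   i[4]; };

struct bounding_box {
    pack4sf m_Mins;
    pack4sf m_Maxs;

    void set(v4sf v_mins, v4sf v_maxs) {
        m_Mins.v = v_mins;
        m_Maxs.v = v_maxs;
    }
};

struct bin {
    bounding_box m_Box[3];
    pack4si      m_NL;
    pack4sf      m_AL;   // left uninitialised, as in the original loop
};

static const std::size_t bin_count = 16;

// Hypothetical helper wrapping the loop from the original post.
void init_bins(bin (&bins)[bin_count]) {
    const float inf = std::numeric_limits<float>::infinity();
    const v4sf vec_inf       = {  inf,  inf,  inf,  inf };
    const v4sf vec_minus_inf = { -inf, -inf, -inf, -inf };
    const v4si zero          = { 0, 0, 0, 0 };

    // 7 vector stores per iteration, 16 iterations.
    for (std::size_t i = 0; i != bin_count; ++i) {
        bin& b = bins[i];
        b.m_Box[0].set(vec_inf, vec_minus_inf);
        b.m_Box[1].set(vec_inf, vec_minus_inf);
        b.m_Box[2].set(vec_inf, vec_minus_inf);
        b.m_NL.v = zero;
    }
}
```

Both parameters apply to the whole translation unit, so lowering them
affects every loop in the file, not just this one. Comparing the assembly
from -S with and without the --param options is one way to check whether
the loop was kept.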

Richard.

> Best regards
> Piotr Wyderski
>
