On Fri, Nov 20, 2015 at 8:21 AM, Jim Wilson <jim.wil...@linaro.org> wrote:
> A cygwin hosted cross compiler to aarch64-linux, compiling a C version
> of linpack with -Ofast, produces code that runs 17% slower than a
> linux hosted compiler.  The problem shows up in the vect dump, where
> some different vectorization optimization decisions were made by the
> cygwin compiler than the linux compiler.  That happened because
> tree-vect-data-refs.c calls qsort in vect_analyze_data_ref_accesses,
> and the newlib and glibc qsort routines sort the list differently.  I
> can reproduce the same problem on linux by adding the newlib qsort
> sources to a gcc build.  For an x86_64 target, I see about a 30%
> performance loss using the newlib qsort.
>
> The qsort trouble turns out to be a problem in the qsort comparison
> function, dr_group_sort_cmp.  It does this:
>   if (!operand_equal_p (DR_BASE_ADDRESS (dra), DR_BASE_ADDRESS (drb), 0))
>     {
>       cmp = compare_tree (DR_BASE_ADDRESS (dra), DR_BASE_ADDRESS (drb));
>       if (cmp != 0)
>         return cmp;
>     }
> operand_equal_p calls STRIP_NOPS, so it will consider two trees to be
> the same even if they have NOP_EXPR.  However, compare_tree is not
> calling STRIP_NOPS, so it handles trees with NOP_EXPRs differently
> than trees without.  The result is that depending on which array entry
> gets used as the qsort pivot point, you can get very different sorts.
> The newlib qsort happens to choose a bad pivot for this testcase.
> The glibc qsort happens to choose a good one.  This then triggers a
> cascading problem
> in vect_analyze_data_ref_accesses which assumes that array entries
> that pass the operand_equal_p test for the base address will end up
> adjacent, and will only vectorize in that case.
>
> For a contrived example, suppose we have four entries to sort: (plus Y
> 8), (mult A 4), (pointer_plus Z 16), and (nop (mult A 4)).  Suppose we
> choose the mult as the pivot point. The plus sorts before because
> tree_code plus is less than mult. The pointer_plus sorts after for the
> same reason. The nop sorts equal. So we end up with plus, mult, nop,
> pointer_plus. The mult and nop are then combined into the same
> vectorization group.  Now suppose we choose the pointer_plus as the
> pivot point. The plus and mult sort before. The nop sorts after. The
> final result is plus, mult, pointer_plus, nop. And we fail to
> vectorize as the mult and nop are not adjacent as they should be.
>
> When I modify compare_tree to call STRIP_NOPS, this problem goes away.
> I get the same sort from both the newlib and glibc qsort functions,
> and I get the same linpack performance from a cygwin hosted compiler
> and a linux hosted compiler.
>
> This patch was tested with an x86_64 bootstrap and make check.  There
> were no regressions.  I've also done a SPEC CPU2000 run with and
> without the patch on aarch64-linux; there is no performance change.
> And I've verified it by building linpack for aarch64-linux with a
> cygwin hosted cross compiler, an x86_64 hosted cross compiler, and an
> aarch64 native compiler.

Ok.

Thanks,
Richard.

> Jim
