On Fri, Nov 20, 2015 at 8:21 AM, Jim Wilson <jim.wil...@linaro.org> wrote:
> A cygwin hosted cross compiler to aarch64-linux, compiling a C version
> of linpack with -Ofast, produces code that runs 17% slower than a
> linux hosted compiler. The problem shows up in the vect dump, where
> some different vectorization optimization decisions were made by the
> cygwin compiler than the linux compiler. That happened because
> tree-vect-data-refs.c calls qsort in vect_analyze_data_ref_accesses,
> and the newlib and glibc qsort routines sort the list differently. I
> can reproduce the same problem on linux by adding the newlib qsort
> sources to a gcc build. For an x86_64 target, I see about a 30%
> performance loss using the newlib qsort.
>
> The qsort trouble turns out to be a problem in the qsort comparison
> function, dr_group_sort_cmp. It does this:
>   if (!operand_equal_p (DR_BASE_ADDRESS (dra), DR_BASE_ADDRESS (drb), 0))
>     {
>       cmp = compare_tree (DR_BASE_ADDRESS (dra), DR_BASE_ADDRESS (drb));
>       if (cmp != 0)
>         return cmp;
>     }
> operand_equal_p calls STRIP_NOPS, so it will consider two trees to be
> the same even if they have NOP_EXPR. However, compare_tree is not
> calling STRIP_NOPS, so it handles trees with NOP_EXPRs differently
> than trees without. The result is that depending on which array entry
> gets used as the qsort pivot point, you can get very different sorts.
> The newlib qsort happens to be accidentally choosing a bad pivot for
> this testcase. The glibc qsort happens to be accidentally choosing a
> good pivot for this testcase. This then triggers a cascading problem
> in vect_analyze_data_ref_accesses which assumes that array entries
> that pass the operand_equal_p test for the base address will end up
> adjacent, and will only vectorize in that case.
>
> For a contrived example, suppose we have four entries to sort: (plus Y
> 8), (mult A 4), (pointer_plus Z 16), and (nop (mult A 4)). Suppose we
> choose the mult as the pivot point. The plus sorts before because
> tree_code plus is less than mult. The pointer_plus sorts after for the
> same reason. The nop sorts equal. So we end up with plus, mult, nop,
> pointer_plus. The mult and nop are then combined into the same
> vectorization group. Now suppose we choose the pointer_plus as the
> pivot point. The plus and mult sort before. The nop sorts after. The
> final result is plus, mult, pointer_plus, nop. And we fail to
> vectorize as the mult and nop are not adjacent as they should be.
>
> When I modify compare_tree to call STRIP_NOPS, this problem goes away.
> I get the same sort from both the newlib and glibc qsort functions,
> and I get the same linpack performance from a cygwin hosted compiler
> and a linux hosted compiler.
>
> This patch was tested with an x86_64 bootstrap and make check. There
> were no regressions. I've also done a SPEC CPU2000 run with and
> without the patch on aarch64-linux, there is no performance change.
> And I've verified it by building linpack for aarch64-linux with cygwin
> hosted cross compiler, x86_64 hosted cross compiler, and an aarch64
> native compiler.
Ok.

Thanks,
Richard.

> Jim
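
For readers who want to replay the contrived four-entry example from the quoted message, here is a minimal sketch in C. It is a toy model, not GCC code. Every name in it (toy_tree, toy_strip_nops, toy_cmp_inconsistent, toy_cmp_consistent, toy_sort, toy_run) is hypothetical, the "tree codes" are a four-value enum ordered as the example assumes, and the sort is a simple stable three-way quicksort whose top-level pivot is chosen by the caller. It only illustrates why a comparator that strips nops for the equality test but not for the ordering can give pivot-dependent results.

```c
#include <stddef.h>
#include <string.h>

#define TOY_MAX 8 /* sketch only: longer arrays are left untouched */

/* Toy stand-ins for tree codes, ordered as in the quoted example. */
enum toy_code { TOY_PLUS, TOY_MULT, TOY_POINTER_PLUS, TOY_NOP };

struct toy_tree {
  enum toy_code code;
  const struct toy_tree *op; /* operand wrapped by a TOY_NOP, else NULL */
  const char *name;
};

typedef int (*toy_cmp_fn)(const struct toy_tree *, const struct toy_tree *);

/* Hypothetical analogue of STRIP_NOPS: look through nop wrappers. */
static const struct toy_tree *toy_strip_nops(const struct toy_tree *t) {
  while (t->code == TOY_NOP && t->op != NULL)
    t = t->op;
  return t;
}

static int toy_compare_code(const struct toy_tree *a, const struct toy_tree *b) {
  return (a->code > b->code) - (a->code < b->code);
}

/* Equality ignores nops, ordering does not (the shape described above). */
int toy_cmp_inconsistent(const struct toy_tree *a, const struct toy_tree *b) {
  if (toy_strip_nops(a) != toy_strip_nops(b)) {
    int cmp = toy_compare_code(a, b);
    if (cmp != 0)
      return cmp;
  }
  return 0;
}

/* Both the equality test and the ordering look through nops. */
int toy_cmp_consistent(const struct toy_tree *a, const struct toy_tree *b) {
  return toy_compare_code(toy_strip_nops(a), toy_strip_nops(b));
}

/* Stable three-way quicksort. The top-level pivot index is supplied by the
   caller; sub-arrays use their first element as the pivot. */
static void toy_sort(const struct toy_tree **v, size_t n, size_t pivot,
                     toy_cmp_fn cmp) {
  const struct toy_tree *lt[TOY_MAX], *eq[TOY_MAX], *gt[TOY_MAX];
  size_t nl = 0, ne = 0, ng = 0, i;
  if (n < 2 || n > TOY_MAX)
    return;
  const struct toy_tree *p = v[pivot];
  for (i = 0; i < n; i++) {
    int c = cmp(v[i], p);
    if (c < 0)
      lt[nl++] = v[i];
    else if (c > 0)
      gt[ng++] = v[i];
    else
      eq[ne++] = v[i];
  }
  toy_sort(lt, nl, 0, cmp);
  toy_sort(gt, ng, 0, cmp);
  memcpy(v, lt, nl * sizeof *v);
  memcpy(v + nl, eq, ne * sizeof *v);
  memcpy(v + nl + ne, gt, ng * sizeof *v);
}

/* Sort plus, mult, pointer_plus, nop(mult) with the given top-level pivot
   and write the resulting order to out as space-separated names. */
void toy_run(size_t pivot, toy_cmp_fn cmp, char *out, size_t outlen) {
  static const struct toy_tree plus = {TOY_PLUS, NULL, "plus"};
  static const struct toy_tree mult = {TOY_MULT, NULL, "mult"};
  static const struct toy_tree pplus = {TOY_POINTER_PLUS, NULL, "pointer_plus"};
  static const struct toy_tree nop = {TOY_NOP, &mult, "nop"};
  const struct toy_tree *v[4] = {&plus, &mult, &pplus, &nop};
  size_t i;
  toy_sort(v, 4, pivot, cmp);
  out[0] = '\0';
  for (i = 0; i < 4; i++) {
    if (i > 0)
      strncat(out, " ", outlen - strlen(out) - 1);
    strncat(out, v[i]->name, outlen - strlen(out) - 1);
  }
}
```

Calling toy_run with pivot index 1 (the mult) or 2 (the pointer_plus) and toy_cmp_inconsistent reproduces the two orderings described in the quoted message. With toy_cmp_consistent, both pivots leave mult and nop adjacent.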