Hi,
This patch lowers the minimum number of iterations that the runtime check
requires of an outer loop before it decides the loop is worth
parallelizing.
The current minimum number of iterations for all loops is MIN_PER_THREAD *
number of threads, where MIN_PER_THREAD is arbitrarily set to 100.
This prevents some of the promising loops of SPEC2006 from getting
parallelized.
I changed the minimum bound for outer loops to 2 * number of threads, under
the assumption that even if an outer loop has few iterations, the loops
nested inside it provide enough work to make parallelization worthwhile.
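
To make the two bounds concrete, here is a minimal standalone sketch of the
threshold arithmetic. It is not code from tree-parloops.c:
min_parallel_iterations and its parameters are hypothetical names used only
for illustration, and the constants are the ones quoted in this mail.

```c
#include <stdbool.h>

/* Value of MIN_PER_THREAD as quoted in this mail.  */
#define MIN_PER_THREAD 100

/* Hypothetical helper (not part of GCC): the smallest iteration count
   for which the runtime check would select the parallel version.
   HAS_INNER_LOOP stands in for the patch's test of loop->inner.  */
static unsigned
min_parallel_iterations (bool has_inner_loop, unsigned n_threads)
{
  /* Outer loops need only 2 iterations per thread; innermost loops
     keep the old MIN_PER_THREAD requirement.  */
  unsigned per_thread = has_inner_loop ? 2 : MIN_PER_THREAD;
  return per_thread * n_threads;
}
```

With 6 threads, an innermost loop still needs at least 600 iterations,
while an outer loop needs only 12.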
This indeed allowed many more loops to get parallelized, resulting in
substantial performance improvements for SPEC2006 benchmarks, measured on
a 6-core Power7 with 4-way SMT per core.
I compared the trunk with -O3 + autopar (parallelizing with 6 threads) vs.
the trunk with -O3 and vectorization disabled.
None of the benchmarks shows any significant degradation.
The speedup shown for libquantum with autopar was already obtained with
previous versions of autopar and is unrelated to this patch, but the patch
does not degrade it either.
These are the speedups I collected:
462.libquantum 2.5 X
410.bwaves 3.3 X
436.cactusADM 4.5 X
459.GemsFDTD 1.27 X
481.wrf 1.25 X
Bootstrap and testsuite (with -ftree-parallelize-loops=4) pass
successfully.
SPEC2006 showed no regressions.
OK for trunk?
Thanks,
razya
2012-05-08 Razya Ladelsky <[email protected]>
* tree-parloops.c (gen_parallel_loop): Change
many_iterations_cond for outer loops.
Index: tree-parloops.c
===================================================================
--- tree-parloops.c (revision 186667)
+++ tree-parloops.c (working copy)
@@ -1732,6 +1732,7 @@ gen_parallel_loop (struct loop *loop, htab_t reduc
unsigned prob;
location_t loc;
gimple cond_stmt;
+ unsigned int m_p_thread=2;
/* From
@@ -1792,9 +1793,15 @@ gen_parallel_loop (struct loop *loop, htab_t reduc
if (stmts)
gsi_insert_seq_on_edge_immediate (loop_preheader_edge (loop), stmts);
- many_iterations_cond =
- fold_build2 (GE_EXPR, boolean_type_node,
- nit, build_int_cst (type, MIN_PER_THREAD * n_threads));
+ if (loop->inner)
+ m_p_thread=2;
+ else
+ m_p_thread=MIN_PER_THREAD;
+
+ many_iterations_cond =
+ fold_build2 (GE_EXPR, boolean_type_node,
+ nit, build_int_cst (type, m_p_thread * n_threads));
+
many_iterations_cond
= fold_build2 (TRUTH_AND_EXPR, boolean_type_node,
invert_truthvalue (unshare_expr (niter->may_be_zero)),