> -----Original Message-----
> From: Jakub Jelinek <ja...@redhat.com>
> Sent: 04 November 2024 21:44
> To: Prathamesh Kulkarni <prathame...@nvidia.com>
> Cc: Richard Biener <richard.guent...@gmail.com>; Richard Biener
> <rguent...@suse.de>; gcc@gcc.gnu.org; Thomas Schwinge
> <tschwi...@baylibre.com>
> Subject: Re: [RFC] Enabling SVE with offloading to nvptx
>
> External email: Use caution opening links or attachments
>
>
> On Sat, Nov 02, 2024 at 03:53:34PM +0000, Prathamesh Kulkarni wrote:
> > The attached patch adds a new bitfield needs_max_vf_lowering to loop,
> > and sets that in expand_omp_simd for loops that need delayed lowering
> > of safelen and omp simd arrays. The patch defines a new macro
> > OMP_COMMON_MAX_VF (arbitrarily set to 16), as a placeholder value for
> > max_vf (instead of INT_MAX), and is later replaced by appropriate
> > max_vf during omp_adjust_max_vf pass. Does that look OK ?
>
> No.
> The thing is, if user doesn't specify safelen, it defaults to infinity
> (which we represent as INT_MAX), if user specifies it, then that is
> the maximum for it (currently in OpenMP specification it is just an
> integral value, so can't be a poly int).
> And then the lowering uses the max_vf as another limit, what the hw
> can do at most and sizes the magic arrays with it. So, one needs to
> use minimum of what user specified and what the hw can handle.
> So using 16 as some magic value is just wrong, safelen(16) can be
> specified in the source as well, or safelen(8), or safelen(32) or
> safelen(123).
>
> Thus, the fact that the hw minimum hasn't been determined yet needs to
> be represented in some other flag, not in loop->safelen value, and
> before that is determined, loop->safelen should then represent what
> the user wrote (or was implied) and the later pass should use minimum
> from loop->safelen and the picked hw maximum. Of course if the picked
> hw maximum is POLY_INT-ish, the big question is how to compare that
> against the user supplied integer value, either one can just handle
> the INT_MAX (aka infinity) special case, or say query the backend on
> what is the maximum value of the POLY_INT at runtime and only use the
> POLY_INT if it is always known to be smaller or equal to the user
> supplied safelen.
>
> Another thing (already mentioned in the thread Andrew referenced) is
> that max_vf is used in two separate places. One is just to size of
> the magic arrays and one of the operands of the minimum (the other is
> user specified safelen). In this case, it is generally just fine to
> pick later value than strictly necessary (as long as it is never
> larger than user supplied safelen).
> The other case is simd modifier on schedule clause. That value should
> better be the right one or slightly larger, but not too much.
> I think currently we just use the INTEGER_CST we pick as the maximum,
> if this sizing is deferred, maybe it needs to be another internal
> function that asks the value (though, it can refer to a loop vf in
> another function, which complicates stuff).
>
> Regarding Richi's question, I'm afraid the OpenMP simd loop lowering
> can't be delayed until some later pass.

Hi Jakub,
Thanks for the suggestions!
The attached patch makes the following changes:

(1) Delays setting of safelen for offloading by introducing a new bitfield
needs_max_vf_lowering in loop, which is true with offloading enabled, and
safelen is then set to min(safelen, max_vf) for the target later in
omp_device_lower pass.
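For concreteness, the kind of loop this is about looks like below (just an
illustrative example I made up, not a testcase from the patch):

/* Illustration only: a simd loop with safelen inside a target region.
   loop->safelen comes from safelen(32); as I understand the lowering, the
   privatized 't' is backed by an "omp simd array" of length max_vf; with
   offloading, the max_vf appropriate for the device is only known after
   streaming out, so both are now lowered late.  */
#define N 1024
int a[N], b[N];

void
foo (void)
{
  int t;
  #pragma omp target map(tofrom: a) map(to: b)
  #pragma omp simd safelen(32) private(t)
  for (int i = 0; i < N; i++)
    {
      t = b[i] + 1;
      a[i] = t * 2;
    }
}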
Comparing a user-specified safelen with a poly_int max_vf may not always be
possible at compile time (say 32 vs 16+16x), and even if we determined the
runtime VL from -mcpu flags, I guess relying on that wouldn't be portable
across SVE implementations?
The patch works around this by taking constant_lower_bound (max_vf) and
comparing that with safelen instead, with the downside that
constant_lower_bound (max_vf) will not be the optimal max_vf for an SVE
target that implements SIMD width > 128 bits.
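To illustrate the comparison problem and the fallback (a schematic fragment
using the poly-int API, not code taken from the patch):

/* Schematic only; assumes the usual GCC internal headers.  */
poly_uint64 max_vf = omp_max_vf (false);      /* e.g. 16 + 16x on SVE.  */
uint64_t user_safelen = 32;                   /* from safelen(32).  */

/* 16+16x <= 32 holds for x < 2 but not for x >= 2, so neither ordering
   is known at compile time.  */
bool le = known_le (max_vf, user_safelen);    /* false */
bool gt = known_gt (max_vf, user_safelen);    /* false */

/* Fallback used by the patch: a constant that is correct for every x,
   but pessimistic on SVE implementations wider than 128 bits.  */
uint64_t bound = constant_lower_bound (max_vf);   /* 16 */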
(2) Since max_vf is used as the length of omp simd arrays, it gets streamed
out to the device, and device compilation fails during streaming-in if
max_vf is a poly_int (16+16x) and the device's NUM_POLY_INT_COEFFS < 2
(which is what motivated my patch). The patch addresses this by simply
setting the length to a placeholder value (INT_MAX?) in
lower_rec_simd_input_clauses if offloading is enabled; it is later set to
the appropriate value in omp_device_lower pass.

(3) Andrew's patches seem to already handle adjusting chunk_size for the
schedule clause with the simd modifier, by introducing a new internal
function .GOMP_MAX_VF, which is then replaced by the target's max_vf. To
keep it consistent with safelen, the patch here uses
constant_lower_bound (max_vf) too.

The patch passes libgomp testing for AArch64/nvptx offloading (with and
without GPU). Does it look OK ?

Thanks,
Prathamesh
>
>         Jakub
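For reference, the kind of construct (3) is about (again just an
illustrative example, not from the patch or Andrew's patches):

/* Illustration only: with the simd modifier, the chunk size is adjusted
   using the simd width, which Andrew's patches expose to the middle end
   as the .GOMP_MAX_VF internal call; on the device that call now gets
   replaced by constant_lower_bound (omp_max_vf (false)).  */
void
bar (int *a, int n)
{
  #pragma omp target teams distribute parallel for simd \
	  schedule(simd:static, 5) map(tofrom: a[0:n])
  for (int i = 0; i < n; i++)
    a[i] += 1;
}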
Delay lowering safelen if offloading is enabled.

gcc/ChangeLog:

	* cfgloop.h (loop): New bitfield needs_max_vf_lowering.
	* omp-expand.cc (expand_omp_simd): Set loop->needs_max_vf_lowering
	to result of is_in_offload_region.
	* omp-low.cc (lower_rec_simd_input_clauses): Set a placeholder value
	for length of omp simd array if offloading is enabled and max_vf is
	non-constant poly_int.
	* omp-offload.cc (adjust_max_vf): New function.
	(execute_omp_device_lower): Call adjust_max_vf, and use
	constant_lower_bound on result of omp_max_vf while replacing call
	to .GOMP_MAX_VF.

Signed-off-by: Prathamesh Kulkarni <prathame...@nvidia.com>

diff --git a/gcc/cfgloop.h b/gcc/cfgloop.h
index 30b5e40d0d9..f7b57e594fd 100644
--- a/gcc/cfgloop.h
+++ b/gcc/cfgloop.h
@@ -233,6 +233,12 @@ public:
      flag_finite_loops or similar pragmas state.  */
   unsigned finite_p : 1;
 
+  /* True if SIMD loop needs delayed lowering of artefacts like
+     safelen and length of omp simd arrays that depend on target's
+     max_vf.  This is true for offloading, when max_vf is computed after
+     streaming out to device.  */
+  unsigned needs_max_vf_lowering: 1;
+
   /* The number of times to unroll the loop.  0 means no information given,
      just do what we always do.  A value of 1 means do not unroll the loop.
      A value of USHRT_MAX means unroll with no specific unrolling factor.
diff --git a/gcc/omp-expand.cc b/gcc/omp-expand.cc
index 80fb1843445..5946713ac3f 100644
--- a/gcc/omp-expand.cc
+++ b/gcc/omp-expand.cc
@@ -7171,6 +7171,8 @@ expand_omp_simd (struct omp_region *region, struct omp_for_data *fd)
       loop->latch = cont_bb;
       add_loop (loop, l1_bb->loop_father);
       loop->safelen = safelen_int;
+      loop->needs_max_vf_lowering = is_in_offload_region (region);
+
       if (simduid)
 	{
 	  loop->simduid = OMP_CLAUSE__SIMDUID__DECL (simduid);
diff --git a/gcc/omp-low.cc b/gcc/omp-low.cc
index 70a2c108fbc..0f68876a2bc 100644
--- a/gcc/omp-low.cc
+++ b/gcc/omp-low.cc
@@ -4660,7 +4660,16 @@ lower_rec_simd_input_clauses (tree new_var, omp_context *ctx,
     }
   else
     {
-      tree atype = build_array_type_nelts (TREE_TYPE (new_var), sctx->max_vf);
+      poly_uint64 nelts;
+      /* FIXME: If offloading is enabled, and max_vf is poly_int, avoid
+	 using it as length of omp simd array, and use a placeholder value
+	 instead to avoid streaming out poly_int to device.  The arrays
+	 will be set to correct length later in omp_device_lower pass.  */
+      if (omp_maybe_offloaded_ctx (ctx) && !sctx->max_vf.is_constant ())
+	nelts = INT_MAX;
+      else
+	nelts = sctx->max_vf;
+      tree atype = build_array_type_nelts (TREE_TYPE (new_var), nelts);
       tree avar = create_tmp_var_raw (atype);
       if (TREE_ADDRESSABLE (new_var))
 	TREE_ADDRESSABLE (avar) = 1;
diff --git a/gcc/omp-offload.cc b/gcc/omp-offload.cc
index 372b019f9d6..9d6467cc6a6 100644
--- a/gcc/omp-offload.cc
+++ b/gcc/omp-offload.cc
@@ -2618,6 +2618,75 @@ find_simtpriv_var_op (tree *tp, int *walk_subtrees, void *)
   return NULL_TREE;
 }
 
+/* Compute max_vf for target, and accordingly set loop->safelen and length
+   of omp simd arrays.  */
+
+static void
+adjust_max_vf (function *fun)
+{
+  if (!fun->has_simduid_loops)
+    return;
+
+  poly_uint64 max_vf = omp_max_vf (false);
+
+  /* FIXME: Since loop->safelen has to be an integer, it's not always possible
+     to compare against poly_int.  For eg 32 and 16+16x are not comparable at
+     compile-time because 16+16x <= 32 for x < 2, but 16+16x > 32 for x >= 2.
+     Even if we could get runtime VL based on -mcpu/-march, that would not be
+     portable across other SVE archs.
+
+     For now, use constant_lower_bound (max_vf), as a "safer approximation" to
+     max_vf that avoids these issues, with the downside that it will be
+     suboptimal max_vf for SVE archs implementing SIMD width > 128 bits.  */
+
+  uint64_t max_vf_int;
+  if (!max_vf.is_constant (&max_vf_int))
+    max_vf_int = constant_lower_bound (max_vf);
+
+  calculate_dominance_info (CDI_DOMINATORS);
+  for (auto loop: loops_list (fun, 0))
+    {
+      if (!loop->needs_max_vf_lowering)
+	continue;
+
+      if (loop->safelen > max_vf_int)
+	loop->safelen = max_vf_int;
+
+      basic_block *bbs = get_loop_body (loop);
+      for (unsigned i = 0; i < loop->num_nodes; i++)
+	for (auto gsi = gsi_start_bb (bbs[i]); !gsi_end_p (gsi); gsi_next (&gsi))
+	  {
+	    gcall *stmt = dyn_cast<gcall *> (gsi_stmt (gsi));
+	    tree lhs = NULL_TREE;
+
+	    if (stmt
+		&& gimple_call_internal_p (stmt)
+		&& (gimple_call_internal_fn (stmt) == IFN_GOMP_SIMD_LANE)
+		&& ((lhs = gimple_call_lhs (stmt)) != NULL_TREE))
+	      {
+		imm_use_iterator use_iter;
+		gimple *use_stmt;
+
+		FOR_EACH_IMM_USE_STMT (use_stmt, use_iter, lhs)
+		  {
+		    gassign *assign_stmt = dyn_cast<gassign *> (use_stmt);
+		    if (!assign_stmt)
+		      continue;
+
+		    tree assign_stmt_lhs = gimple_assign_lhs (assign_stmt);
+		    if (TREE_CODE (assign_stmt_lhs) == ARRAY_REF)
+		      {
+			tree array = TREE_OPERAND (assign_stmt_lhs, 0);
+			tree& max = TYPE_MAX_VALUE (TYPE_DOMAIN (TREE_TYPE (array)));
+			max = size_int (loop->safelen - 1);
+			relayout_decl (array);
+		      }
+		  }
+	      }
+	  }
+    }
+}
+
 /* Cleanup uses of SIMT placeholder internal functions: on non-SIMT targets,
    VF is 1 and LANE is 0; on SIMT targets, VF is folded to a constant, and
    LANE is kept to be expanded to RTL later on.  Also cleanup all other SIMT
@@ -2627,6 +2696,8 @@ find_simtpriv_var_op (tree *tp, int *walk_subtrees, void *)
 static unsigned int
 execute_omp_device_lower ()
 {
+  adjust_max_vf (cfun);
+
   int vf = targetm.simt.vf ? targetm.simt.vf () : 1;
   bool regimplify = false;
   basic_block bb;
@@ -2755,7 +2826,10 @@ execute_omp_device_lower ()
	      rhs = build_int_cst (type, vf);
	      break;
	    case IFN_GOMP_MAX_VF:
-	      rhs = build_int_cst (type, omp_max_vf (false));
+	      /* Use constant_lower_bound, to keep max_vf consistent with
+		 adjust_max_vf above.  */
+	      rhs = build_int_cst (type,
+				   constant_lower_bound (omp_max_vf (false)));
	      break;
	    case IFN_GOMP_SIMT_ORDERED_PRED:
	      rhs = vf == 1 ? integer_zero_node : NULL_TREE;