On 27/12/2024 12:29, Prathamesh Kulkarni wrote:
-----Original Message-----
From: Jakub Jelinek <ja...@redhat.com>
Sent: 17 December 2024 19:09
To: Prathamesh Kulkarni <prathame...@nvidia.com>
Cc: Andrew Stubbs <a...@baylibre.com>; Richard Biener
<richard.guent...@gmail.com>; Richard Biener <rguent...@suse.de>;
gcc@gcc.gnu.org; Thomas Schwinge <tschwi...@baylibre.com>
Subject: Re: [RFC] Enabling SVE with offloading to nvptx
External email: Use caution opening links or attachments
On Mon, Dec 02, 2024 at 11:17:08AM +0000, Prathamesh Kulkarni wrote:
--- a/gcc/cfgloop.h
+++ b/gcc/cfgloop.h
@@ -233,6 +233,12 @@ public:
flag_finite_loops or similar pragmas state. */
unsigned finite_p : 1;
+ /* True if SIMD loop needs delayed lowering of artefacts like
+ safelen and length of omp simd arrays that depend on target's
+ max_vf. This is true for offloading, when max_vf is computed
after
+ streaming out to device. */
+ unsigned needs_max_vf_lowering: 1;
Consistency, finite_p above uses space before :, the above line
doesn't.
--- a/gcc/omp-expand.cc
+++ b/gcc/omp-expand.cc
@@ -7170,6 +7170,10 @@ expand_omp_simd (struct omp_region *region,
struct omp_for_data *fd)
loop->latch = cont_bb;
add_loop (loop, l1_bb->loop_father);
loop->safelen = safelen_int;
+ loop->needs_max_vf_lowering = is_in_offload_region (region);
+ if (loop->needs_max_vf_lowering)
+ cfun->curr_properties &= ~PROP_gimple_lomp_dev;
Do you really need this for non-SVE arches?
I mean, could you not set loop->needs_max_vf_lowering if maximum
number of poly_int coeffs is 1? Or if omp_max_vf returns constant or
something similar?
Well, I guess the issue is not really about VLA vectors but when host and
device have
different max_vf, and selecting optimal max_vf is not really possible during
omp-low/omp-expand,
since we don't have device's target info available at this point. Andrew's
recent patch works around this
limitation by searching for "amdgcn" in OFFLOAD_TARGET_NAMES in omp_max_vf, but
I guess a more general solution
would be to delay lowering max_vf after streaming-out to device irrespective of
VLA/VLS vectors ?
For AArch64/nvptx offloading with SVE, where host is VLA and device is VLS, the
issue is more pronounced (failing to compile),
compared to offloading from VLS host to VLS device (selecting sub-optimal
max_vf).
That patch fixed a couple of cases. The name matching was only used for
the case where an oversized VF was harmless. The other case where making
the VF too large would reserve excess memory was deferred to the device
compiler.
In general, deferring decisions is probably a good idea, but it's not
always possible, or optimal, and in the above case it certainly wasn't
the easy option. There's already precedent for doing the name match in
the SIMT VF code (for NVPTX), so it was easier and sufficient to do the
same.
Andrew