On 27/12/2024 12:29, Prathamesh Kulkarni wrote:


-----Original Message-----
From: Jakub Jelinek <ja...@redhat.com>
Sent: 17 December 2024 19:09
To: Prathamesh Kulkarni <prathame...@nvidia.com>
Cc: Andrew Stubbs <a...@baylibre.com>; Richard Biener
<richard.guent...@gmail.com>; Richard Biener <rguent...@suse.de>;
gcc@gcc.gnu.org; Thomas Schwinge <tschwi...@baylibre.com>
Subject: Re: [RFC] Enabling SVE with offloading to nvptx

External email: Use caution opening links or attachments


On Mon, Dec 02, 2024 at 11:17:08AM +0000, Prathamesh Kulkarni wrote:
--- a/gcc/cfgloop.h
+++ b/gcc/cfgloop.h
@@ -233,6 +233,12 @@ public:
       flag_finite_loops or similar pragmas state.  */
    unsigned finite_p : 1;

+  /* True if SIMD loop needs delayed lowering of artefacts like
+     safelen and length of omp simd arrays that depend on target's
+     max_vf.  This is true for offloading, when max_vf is computed
after
+     streaming out to device.  */
+  unsigned needs_max_vf_lowering: 1;

Consistency, finite_p above uses space before :, the above line
doesn't.

--- a/gcc/omp-expand.cc
+++ b/gcc/omp-expand.cc
@@ -7170,6 +7170,10 @@ expand_omp_simd (struct omp_region *region,
struct omp_for_data *fd)
        loop->latch = cont_bb;
        add_loop (loop, l1_bb->loop_father);
        loop->safelen = safelen_int;
+      loop->needs_max_vf_lowering = is_in_offload_region (region);
+      if (loop->needs_max_vf_lowering)
+     cfun->curr_properties &= ~PROP_gimple_lomp_dev;

Do you really need this for non-SVE arches?
I mean, could you not set loop->needs_max_vf_lowering if maximum
number of poly_int coeffs is 1?  Or if omp_max_vf returns constant or
something similar?
Well, I guess the issue is not really about VLA vectors but when host and 
device have
different max_vf, and selecting optimal max_vf is not really possible during 
omp-low/omp-expand,
since we don't have device's target info available at this point. Andrew's 
recent patch works around this
limitation by searching for "amdgcn" in OFFLOAD_TARGET_NAMES in omp_max_vf, but 
I guess a more general solution
would be to delay lowering max_vf after streaming-out to device irrespective of 
VLA/VLS vectors ?
For AArch64/nvptx offloading with SVE, where host is VLA and device is VLS, the 
issue is more pronounced (failing to compile),
compared to offloading from VLS host to VLS device (selecting sub-optimal 
max_vf).

That patch fixed a couple of cases. The name matching was only used for the case where an oversized VF was harmless. The other case where making the VF too large would reserve excess memory was deferred to the device compiler.

In general, deferring decisions is probably a good idea, but it's not always possible, or optimal, and in the above case it certainly wasn't the easy option. There's already precedent for doing the name match in the SIMT VF code (for NVPTX), so it was easier and sufficient to do the same.

Andrew

Reply via email to