On Thu, 13 Feb 2025 06:46:10 PST (-0800), jeffreya...@gmail.com wrote:


On 2/13/25 1:47 AM, Robin Dapp wrote:
Other thoughts?

The docs seem to hint TARGET_SCHED_CAN_SPECULATE_INSN is meant for stuff
we can't/don't model in the pipeline, but I have no idea how to model
the VL=0 case there.
Maybe so, but what Edwin is doing looks sensible enough.  It wouldn't be
the first time a hook got (ab)used in ways that weren't part of the
original intent.
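
A rough sketch of the idea, assuming the change under discussion uses the hook to refuse speculation of vsetvl-class instructions. This is a standalone Python model, not GCC code: `Insn`, `NO_SPECULATE_KINDS`, `can_speculate_insn` and `hoistable` are hypothetical names made up for illustration.

```python
from dataclasses import dataclass

# Hypothetical instruction classes for this model only; not GCC's insn attributes.
NO_SPECULATE_KINDS = {"vsetvl"}


@dataclass(frozen=True)
class Insn:
    text: str  # assembly text, for display
    kind: str  # coarse instruction class


def can_speculate_insn(insn: Insn) -> bool:
    """Model of a can-speculate predicate: refuse vsetvl-class insns."""
    return insn.kind not in NO_SPECULATE_KINDS


def hoistable(block):
    """Return the insns of a later block that may move above the branch."""
    return [insn for insn in block if can_speculate_insn(insn)]


later_block = [
    Insn("vsetvli t1, t2, ...", "vsetvl"),
    Insn("addi a0, a0, 1", "arith"),
]
print([insn.text for insn in hoistable(later_block)])
```

In this model the vsetvli stays below the branch while the ordinary arithmetic insn remains a candidate for hoisting.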

I don't fully understand what's happening.  So the hoisting is being done
speculatively here?  And it just happens to be "bad" because that might
cause a VL=0 case.  But are we sure a lack of speculation cannot cause
such cases?
Yes/No.  The scheduler certainly has code to avoid hoisting when doing
so would change semantics.  That's not what's happening here.

I'd have to put it in a debugger or read the full dumps with some crazy
scheduler dump verbosity setting to be sure, but what I suspect is
happening is the scheduler is processing a multi-block region
(effectively an extended basic block).  In this scenario the scheduler
can pull insns from a later block into an earlier block, including past
a conditional branch as long as it doesn't change program semantics.

(Sorry to keep crossing the threads here, there's just a lot in this one and stuff gets truncated.)

FWIW, that's what tripped up my "maybe there's a functional bug here" thought. It looks like the scheduler is seeing

   bne t0, x0, end
   vsetvli t1, t2, ...
   vsetvli x0, t2, ...
   ...
 end:
   vsetvli x0, t2, ...

and thinking it's safe to schedule that like

   vsetvli t1, t2, ...
   bne t0, x0, end
   vsetvli x0, t2, ...
   ...
 end:
   vsetvli x0, t2, ...

which I'd assumed is because the scheduler sees both execution paths overwriting the vector control registers and thus thinks it's safe to move the first vsetvli to execute speculatively. From reading "6. Configuration-Setting Instructions" in vector.md that seems intentional, though, so maybe it's all just fine?
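
To make the "both paths overwrite the vector control registers" reasoning concrete, here is a small Python model of the two orderings above. It is only a sketch under stated assumptions: `VLMAX` is an arbitrary value, every vsetvli is assumed to use the same vtype, and vl is modeled simply as min(AVL, VLMAX), which ignores the freedom the spec gives implementations for AVL values above VLMAX. The helper names are made up for this example.

```python
VLMAX = 8  # assumed maximum vector length for the chosen vtype


def vsetvli(state, rd, rs1):
    """Model 'vsetvli rd, rs1, ...': set vl from the AVL in rs1,
    then copy vl into rd unless rd is x0."""
    regs = state["regs"]
    state["vl"] = min(regs[rs1], VLMAX)
    if rd != "x0":
        regs[rd] = state["vl"]


def run(hoisted, t0, t2):
    """Execute the snippet and return (final vl, final t1)."""
    state = {"vl": None, "regs": {"t0": t0, "t1": 0, "t2": t2}}
    if hoisted:
        vsetvli(state, "t1", "t2")   # first vsetvli moved above the branch
    if state["regs"]["t0"] == 0:     # 'bne t0, x0, end' falls through
        if not hoisted:
            vsetvli(state, "t1", "t2")
        vsetvli(state, "x0", "t2")
    vsetvli(state, "x0", "t2")       # end:
    return state["vl"], state["regs"]["t1"]


for t0 in (0, 1):
    for t2 in (0, 3, 20):
        print(t0, t2, run(False, t0, t2), run(True, t0, t2))
```

In this model the final vl is the same for both orderings on both paths. What differs is that on the taken path the hoisted version executes an extra vsetvli (including with AVL=0) and writes t1, which is only harmless if t1 is dead there.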

Also, why doesn't the vsetvl pass fix the situation?  IMHO we need to
understand the problem more thoroughly before changing things.
In the end LCM minimizes the number of vsetvls and inserts them at the
"earliest" point.  If that is not sufficient I'd say we need to modify
the constraints (maybe on a per-uarch basis)?
The vsetvl pass is LCM based.  So it's not allowed to add a vsetvl on a
path that didn't have a vsetvl before.  Consider this simple graph.

     0
    / \
   2-->3

If we have need for a vsetvl in bb2, but not bb0 or bb3, then the vsetvl
will land in bb2.  bb0 is not a valid insertion point for the vsetvl
pass because the path 0->3 doesn't strictly need a vsetvl.  That's
inherent in the LCM algorithm (anticipatable).
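
The anticipatability argument can be checked with a tiny dataflow computation over that graph. This is a simplified Python sketch, not the vsetvl pass itself: it tracks a single boolean "needs a vsetvl" fact per block, assumes every block is transparent, and the names `SUCCS`, `NEEDS` and `anticipatable` are invented for the example.

```python
# Edges of the graph above: 0->2, 0->3, 2->3.
SUCCS = {0: [2, 3], 2: [3], 3: []}
# Only bb2 needs a vsetvl.
NEEDS = {0: False, 2: True, 3: False}


def anticipatable(succs, needs):
    """Backward dataflow: a vsetvl is anticipatable at a block's exit only
    if every path leaving the block needs one."""
    antic_in = {b: True for b in succs}  # optimistic start

    def out_of(b):
        # Exit blocks anticipate nothing; otherwise all successors must agree.
        return bool(succs[b]) and all(antic_in[s] for s in succs[b])

    changed = True
    while changed:
        changed = False
        for b in succs:
            new_in = needs[b] or out_of(b)
            if new_in != antic_in[b]:
                antic_in[b] = new_in
                changed = True
    antic_out = {b: out_of(b) for b in succs}
    return antic_in, antic_out


antic_in, antic_out = anticipatable(SUCCS, NEEDS)
print(antic_out[0])  # whether the end of bb0 is a valid insertion point
```

With the values above, the exit of bb0 is not anticipatable because the 0->3 path has no need for a vsetvl, which is the restriction described in the text. If bb3 also needed one, bb0 would become a valid insertion point.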

The scheduler has no such limitations.  The scheduler might create a
scheduling region out of blocks 0 and 2.  In that scenario, insns from
block 2 may speculate into block 0 as long as doing so doesn't change
semantics.

Ya. The combination of the scheduler moving a vsetvli before the branch (IIUC from bb2 to bb0 here) and the vsetvli merging causes it to look like the whole vsetvli was moved before the branch.

I'm not sure why the scheduler doesn't move both vsetvli instructions to execute speculatively, but otherwise this seems to be behaving as designed. It's just tripping up the VL=0 cases for us.

On a separate note:  How about we move the vsetvl pass after sched2?
Then we could at least rely on LCM doing its work uninhibited and wouldn't
reorder vsetvls afterwards.  Or do we somehow rely on rtl_dce and BB
reorder to run afterwards?

That won't help with the problem here but might with others.
It's a double-edged sword.  If you defer placement until after
scheduling, then the vsetvls can wreak havoc with whatever schedule that
sched2 came up with.  It won't matter much for out of order designs, but
potentially does for others.

Maybe that's a broad uarch split point here? For OOO designs we'd want to rely on HW scheduling and thus avoid hoisting possibly-expensive vsetvli instructions (where they'd need to execute in HW because of the side effects), while on in-order designs we'd want to aggressively schedule vsetvli instructions because we can't rely on HW scheduling to hide the latency.

In theory at sched2 time the insn stream should be fixed.  There are
practical/historical exceptions, but changes to the insn stream after
that point are discouraged.

We were just talking about this in our toolchain team meeting, and it seems like both GCC and LLVM are in similar spots here -- essentially the required set of vsetvli instructions depends very strongly on scheduling, so trying to do them independently is just always going to lead to sub-par results. It feels kind of like we want some scheduling-based cost feedback in the vsetvli pass (or the other way around if they're in the other order) to get better results.

Maybe that's too much of a time sink for the OOO machines, though? If we've got HW scheduling then the SW just has to be in the ballpark and everything should be fine.
