On Thu, 13 Feb 2025 07:38:13 PST (-0800), jeffreya...@gmail.com wrote:


On 2/13/25 8:19 AM, Robin Dapp wrote:
The vsetvl pass is LCM based.  So it's not allowed to add a vsetvl on a
path that didn't have a vsetvl before.  Consider this simple graph.

      0
     / \
    2-->3

If we have need for a vsetvl in bb2, but not bb0 or bb3, then the vsetvl
will land in bb2.  bb0 is not a valid insertion point for the vsetvl
pass because the path 0->3 doesn't strictly need a vsetvl.  That's
inherent in the LCM algorithm (anticipatable).
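
To make the anticipatability point concrete, here is a rough sketch on the three-block graph above. It is a toy model only: the `SUCCS` encoding and the `anticipatable_in` helper are made up for illustration and are not GCC's LCM code or the RISC-V vsetvl pass, and transparency/kill sets are ignored.

```python
# Toy model of the "anticipatable" test used by LCM.
# Illustrative only: this is not GCC's LCM code or the RISC-V vsetvl pass.

# CFG from the diagram: bb0 -> bb2, bb0 -> bb3, bb2 -> bb3.
SUCCS = {0: [2, 3], 2: [3], 3: []}


def anticipatable_in(succs, needs):
    """Return the blocks at whose entry the vsetvl is anticipatable.

    A block qualifies if it needs the vsetvl itself, or if every
    successor qualifies (so the vsetvl is needed on all onward paths).
    Transparency/kills are ignored to keep the sketch small.
    """
    ant_in = {bb: True for bb in succs}  # optimistic start, then shrink
    changed = True
    while changed:
        changed = False
        for bb, out_edges in succs.items():
            # A block with no successors has nothing anticipated at exit.
            ant_out = bool(out_edges) and all(ant_in[s] for s in out_edges)
            new = (bb in needs) or ant_out
            if new != ant_in[bb]:
                ant_in[bb] = new
                changed = True
    return {bb for bb, ok in ant_in.items() if ok}


# Only bb2 needs a vsetvl.
only_bb2 = anticipatable_in(SUCCS, needs={2})
print(sorted(only_bb2))
```

With the demand in bb2 only, the set contains just bb2: bb0 drops out because its successor bb3 never needs the vsetvl, which is the 0->3 path argument above.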

Yeah, I remember the same issue with the rounding-mode setter placement.
Yes.  For VXRM placement, under the right circumstances we pretend there
is a need for the VXRM state at the first instruction in the first BB.
That enables very aggressive hoisting by LCM in those limited cases.




Wouldn't that be fixable by requiring a dummy/wildcard/dontcare vsetvl in bb3
(or any other block that doesn't require one)?  Such a dummy vsetvl would be
fusible with every other vsetvl.  If there are dummy vsetvls remaining after
LCM just delete them?

Just thinking out loud, the devil will be in the details.
But in Vineet's case they want to avoid speculation as that can result
in a vl=0 case.  If we had a dummy fusible vsetvl in bb3, then that
would allow movement into bb0 which is undesirable.
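
As a rough illustration of why the dummy would open up bb0, here is a toy check on the same three-block graph. The `hoistable_to_entry` helper and the `SUCCS` encoding are hypothetical, not anything from GCC, and the model treats a dummy demand exactly like a real one, which is an assumption about how such a wildcard would be fed to LCM.

```python
# Toy model only: not GCC's LCM implementation or the vsetvl pass.
from functools import lru_cache

# Same CFG as the diagram: bb0 -> bb2, bb0 -> bb3, bb2 -> bb3.
SUCCS = {0: (2, 3), 2: (3,), 3: ()}


def hoistable_to_entry(needs):
    """True if a vsetvl is anticipatable at the entry of bb0.

    `needs` holds the blocks with a real or dummy vsetvl demand.
    The recursion is only valid because this CFG is acyclic.
    """
    @lru_cache(maxsize=None)
    def ant_in(bb):
        succs = SUCCS[bb]
        return bb in needs or (bool(succs) and all(ant_in(s) for s in succs))

    return ant_in(0)


real_only = hoistable_to_entry(frozenset({2}))      # demand in bb2 only
with_dummy = hoistable_to_entry(frozenset({2, 3}))  # plus a dummy in bb3
print(real_only, with_dummy)
```

In this model the demand in bb2 alone leaves bb0 off limits, while adding the dummy in bb3 makes bb0 a legal insertion point, i.e. the vsetvl would execute on the 0->3 path that never needed it.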

Ya, I think we confused everyone because there's really two vsetvli/branch movement things we've been talking about and they're kind of the opposite.

There's the issue this patch works around, where we found some vsetvli instances that set VL=0 in unrolled loops. That makes some of our hardware people upset. Turns out the reduced test case has the branches to early-out of the unrolled loop when VL would be 0, so just banning vsetvli speculation fixes the issue. It's kind of an indirect way to solve a uarch-specific problem, so who knows if it'll be worth doing.

Then there's the vsetvli loop-invariant hoisting / vector tail generation thing we were talking about in the meeting this week. Having the vsetvli in the loop made a different subset of our hardware people upset. That's kind of the opposite optimization, though we'd want to avoid the VL=0 case. They're both "Vineet's bug", the hardware people tend to call Vineet when they get upset ;)

WRT a question Palmer asked earlier in the thread.  I went back and
reviewed the code/docs around the hook Edwin is using.  My reading is a
bit different and that what Edwin is doing is perfectly fine.

Awesome, thanks. So I think if this is sane enough to run experiments we can at least try that out and see what happens.

Jeff


