On 20.08.2014 at 14:33, Connor Abbott wrote:
On Tue, Aug 19, 2014 at 11:57 PM, Christian König
<deathsim...@vodafone.de> wrote:
I think we can fix this by introducing new structured variants of the
branch instruction in a way that doesn't alter the fundamental structure
of the IR.  E.g. an if branch could look like:

ifbr i1 <cond>, label <iftrue>, label <iffalse>, label <join>

Where both branches are guaranteed to converge at <join>.  Sure, this
will require fixing many assumptions, but on the one hand it's not
immediately required (as we can address this problem for the time being
using the same solution AMD uses) and on the other hand it's still less
work than starting from scratch.

Well, I wrote the structurizer pass in LLVM that you are talking about here,
and from my experience you really don't want any structured form of control
flow in the IR.

Structured control flow is just a specialized form of unstructured control
flow, and even if it looks rather awkward at first glance, it is indeed
simpler to destructurize the compiler-generated control flow for
optimization and structurize it again for instruction selection.
That's interesting. I still think that with the right infrastructure,
having structured control flow really isn't that bad, and it prevents
optimizations from doing work like optimizing "if (foo) { break; }"
into a single conditional branch when clearly that's not very
productive. I would suspect that LLVM just isn't very good at
structured control flow since it wasn't designed that way, and that's
why it seems hard to work with.

Well, maybe I should note that a lot of closed-source drivers use LLVM for their internal IR, and as far as I know more or less all of them have a rather structured form of control flow.

The problem with LLVM really isn't its IR, because it isn't designed to be CPU-centric as you obviously think; the problem is rather that LLVM doesn't have a stable interface and is a rather fast-moving project.

Actually, for R600 for example, you do want to optimize a pattern like "if (foo) { break; }" into a conditional branch, because if you look at the ISA you see that the LOOP_BREAK pattern is able to take an additional condition to apply to the current execution mask.

When you design a hardware-independent IR, looking at the backend hardware level like you do right now is actually the completely wrong approach. What you need to do is make the IR as simple as possible and then allow specialized operations on it to translate it into the desired machine code.

In other words, the logic necessary for code generation shouldn't be inside the IR, because then the IR is specialized to this specific problem. Instead, the logic needs to be in the tools that surround the IR.

Regards,
Christian.


The only reason I've annotated the LLVM IR with specialized intrinsics for
the SI backend was laziness and I wouldn't do that again given the chance.

And it's very likely that these backends, which probably aren't using
SSA due to the aforementioned difficulties, will also benefit from
having modifiers already folded for them - this is something that's
already a problem for i965 vec4 backend and that NIR will help a lot.

Well, I have the impression that much of the reason why the i965 vec4
backend has lagged behind so much in comparison with the fs backend is
precisely because it's so annoying to optimize vec4 code.  It seems
painful to me that you have this built into the core instruction set so
generic optimization passes will have to be explicitly aware of it.  I
wouldn't be surprised if the i965 vec4 benefited at least as much from
scalarizing the code, performing optimizations there, and re-vectorizing
afterwards.
We thought about doing something like that, but I don't think it's
really that much of a burden when it comes to the rest of the IR. Most
of the difficulty of working with a vec4 representation comes from the
fact that instructions can partially update their outputs, and once we
convert to SSA that problem goes away since there are no partial
updates in SSA. Coming out of SSA is where the difficulty lies, but I
still think that's a solvable problem, just a difficult one. Plus,
there's the problem of how to do the vectorization - you could do it
in SSA, but then you still have the hard bit of coming out of SSA and
so you're back to square one, or you could do it once you're out of
SSA but then it's a lot harder to reason about since you're back to
having partial updates.


Completely agree.

Being able to do vectorization in an IR is important, but you shouldn't try
to handle backend-specific swizzle operations and vectorization restrictions
in the IR. Just look at the swizzle restrictions of R600, for example: I
really can't imagine that you want to represent this in a common IR shared
between all the different drivers.

Regards,
Christian.

On 20.08.2014 at 08:33, Francisco Jerez wrote:

Connor Abbott <cwabbo...@gmail.com> writes:

On Tue, Aug 19, 2014 at 11:40 AM, Francisco Jerez <curroje...@riseup.net>
wrote:

Tom Stellard <t...@stellard.net> writes:

On Tue, Aug 19, 2014 at 11:04:59AM -0400, Connor Abbott wrote:

On Mon, Aug 18, 2014 at 8:52 PM, Michel Dänzer <mic...@daenzer.net> wrote:

On 19.08.2014 01:28, Connor Abbott wrote:

On Mon, Aug 18, 2014 at 4:32 AM, Michel Dänzer <mic...@daenzer.net> wrote:

On 16.08.2014 09:12, Connor Abbott wrote:

I know what you might be thinking right now. "Wait, *another* IR? Don't
we already have like 5 of those, not counting all the driver-specific
ones? Isn't this stuff complicated enough already?" Well, there are some
pretty good reasons to start afresh (again...). In the years we've been
using GLSL IR, we've come to realize that, in fact, it's not what we
want *at all* to do optimizations on.

Did you evaluate using LLVM IR instead of inventing yet another one?


--
Earthling Michel Dänzer            |                  http://www.amd.com
Libre software enthusiast          |                Mesa and X developer

Yes. See

http://lists.freedesktop.org/archives/mesa-dev/2014-February/053502.html

and

http://lists.freedesktop.org/archives/mesa-dev/2014-February/053522.html

I know Ian can't deal with LLVM for some reason. I was wondering if
*you* evaluated it, and if so, why you rejected it.



Well, first of all, the fact that Ian and Ken don't want to use it
means that any plan to use LLVM for the Intel driver is dead in the
water anyways - you can translate NIR into LLVM if you want, but for
i965 we want to share optimizations between our 2 backends (FS and
vec4) that we can't do today in GLSL IR so this is what we want to use
for that, and since nobody else does anything with the core GLSL
compiler except when they have to, when we start moving things out of
GLSL IR this will probably replace GLSL IR as the infrastructure that
all Mesa drivers use. But with that in mind, here are a few reasons
why we wouldn't want to use LLVM:

* LLVM wasn't built to understand structured CFGs, meaning that you
need to re-structurize it using a pass that's fragile and prone to
break if some other pass "optimizes" the shader in a way that makes it
non-structured (i.e. not expressible in terms of loops and if
statements). This loss of information also means that passes that need
to know things like, for example, the loop nesting depth need to do an
analysis pass whereas with NIR you can just walk up the control flow
tree and count the number of loops we hit.
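
A minimal sketch of the "walk up the control flow tree" idea, assuming a
hypothetical tree of control-flow nodes with parent pointers. The CFNode
class and its "kind" strings are illustrative only and are not NIR's actual
data structures:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CFNode:
    """Hypothetical structured control-flow node (not the real NIR type)."""
    kind: str                           # "function", "if", "loop" or "block"
    parent: Optional["CFNode"] = None


def loop_depth(node: CFNode) -> int:
    """Count the enclosing loops by following parent pointers to the root."""
    depth = 0
    cur = node.parent
    while cur is not None:
        if cur.kind == "loop":
            depth += 1
        cur = cur.parent
    return depth


# function { loop { if { loop { block } } } }
func = CFNode("function")
outer = CFNode("loop", func)
branch = CFNode("if", outer)
inner = CFNode("loop", branch)
body = CFNode("block", inner)
print(loop_depth(body))
```

No separate analysis pass is needed here because the nesting is stored in
the tree itself; with a flat CFG the same number would have to be
recomputed from the edges.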

LLVM has a pass to structurize the CFG.  We use it in the radeon
drivers, and it is run after all of the other LLVM optimizations which have
no concept of structured CFG.  It's not bug free, but it works really
well even with all of the complex OpenCL kernels we throw at it.

Your point about losing information when the CFG is de-structurized is
valid, but for things like loop depth, I'm not sure why we couldn't write an
LLVM analysis pass for this (if one doesn't already exist).

I don't think this is such a big deal either.  At least the
structurization pass used on newer AMD hardware isn't "fragile" in the
way you seem to imply -- AFAIK (unlike the old AMDIL heuristic
algorithm) it's guaranteed to give you a valid structurized output no
matter what the previous optimization passes have done to the CFG,
modulo bugs.  I admit that the situation is nevertheless suboptimal.
Ideally this information wouldn't get lost along the way.  For the long
term we may want to represent structured control flow directly in the IR
as you say, I just don't see how reinventing the IR saves us any work if
we could just fix the existing one.

It seems to me that something like how we represent control flow is a
pretty fundamental part of the IR - it affects any optimization pass
that needs to do anything beyond adding and removing instructions. How
would you fix that, especially given that LLVM is primarily designed
for CPUs where you don't want to be restricted to structured control
flow at all? It seems like our goals (preserve the structure) conflict
with the way LLVM has been designed.

I think we can fix this by introducing new structured variants of the
branch instruction in a way that doesn't alter the fundamental structure
of the IR.  E.g. an if branch could look like:

ifbr i1 <cond>, label <iftrue>, label <iffalse>, label <join>

Where both branches are guaranteed to converge at <join>.  Sure, this
will require fixing many assumptions, but on the one hand it's not
immediately required (as we can address this problem for the time being
using the same solution AMD uses) and on the other hand it's still less
work than starting from scratch.
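
To make the "guaranteed to converge at <join>" property concrete, here is a
rough sketch of a checker for it. Note that ifbr is only the instruction
proposed above and does not exist in LLVM; the successor-map CFG and both
function names are assumptions made for illustration:

```python
def converges_at(succ, start, join):
    """Return True if every path leaving `start` reaches `join`.

    `succ` maps a block name to the list of its successor block names.
    A path that hits a block without successors before reaching `join`
    has escaped the region, so the branch is not structured.
    """
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        if node == join or node in seen:
            continue
        seen.add(node)
        nexts = succ.get(node, [])
        if not nexts:
            return False
        stack.extend(nexts)
    return True


def ifbr_is_structured(succ, iftrue, iffalse, join):
    """Check the convergence guarantee for both arms of the proposed ifbr."""
    return (converges_at(succ, iftrue, join)
            and converges_at(succ, iffalse, join))


good = {"then": ["join"], "else": ["else2"], "else2": ["join"], "join": []}
bad = {"then": ["join"], "else": ["exit"], "exit": [], "join": []}
print(ifbr_is_structured(good, "then", "else", "join"))
print(ifbr_is_structured(bad, "then", "else", "join"))
```

A pass that wanted to keep the structured form would have to preserve this
invariant (or fall back to a plain branch) whenever it rewrites edges
inside either arm.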

* LLVM doesn't do modifiers, meaning that we can't do optimizations
like "clamp(x, 0.0, 1.0) => mov.sat x" and "clamp(x, 0.25, 1.0) =>
max.sat(x, .25)" in a generic fashion.
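
The two rewrites above can be sketched over a toy expression tree. The Var,
Const and Op classes and the "saturate" flag are hypothetical stand-ins for
an IR with modifiers, not NIR's real representation:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Var:
    name: str


@dataclass(frozen=True)
class Const:
    value: float


@dataclass(frozen=True)
class Op:
    name: str                # e.g. "clamp", "mov", "max"
    args: tuple
    saturate: bool = False   # the ".sat" output modifier


def fold_clamp(e):
    """Rewrite clamp(x, lo, 1.0) with 0 <= lo < 1 into a saturating op."""
    if not isinstance(e, Op):
        return e
    args = tuple(fold_clamp(a) for a in e.args)
    e = Op(e.name, args, e.saturate)
    if e.name == "clamp" and len(args) == 3:
        x, lo, hi = args
        if isinstance(lo, Const) and isinstance(hi, Const) and hi.value == 1.0:
            if lo.value == 0.0:
                # clamp(x, 0.0, 1.0) => mov.sat x
                return Op("mov", (x,), saturate=True)
            if 0.0 < lo.value < 1.0:
                # clamp(x, lo, 1.0) => max.sat(x, lo); valid because lo >= 0
                return Op("max", (x, lo), saturate=True)
    return e


print(fold_clamp(Op("clamp", (Var("x"), Const(0.0), Const(1.0)))))
print(fold_clamp(Op("clamp", (Var("x"), Const(0.25), Const(1.0)))))
```

The second rewrite only holds because the lower bound is non-negative, so
saturating after the max cannot change the lower clamp.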

The way to handle this with LLVM would be to add intrinsics to represent
the various modifiers and then fold them into instructions during
instruction selection.

IMHO this is a feature.  One of the things I don't like about NIR is
that it's still vec4-centric.  Most drivers are going to want something
else and different to each other, we cannot please all of them with one
single vector addressing model built into the core instruction set, so
I'd rather have modifiers, writemasks and swizzles represented as the
composition of separate instructions/intrinsics with simple and
well-defined semantics, which can be coalesced back into the real
instruction as Tom says (easy even if you don't use LLVM's instruction
selector as long as it's SSA form).

While NIR is vec4-centric, nothing's stopping you from splitting up
instructions and doing optimizations at the scalar level for scalar
ISAs - in fact, that's what I expect to happen. And for backends that
really do need to have swizzles and writemasks, coalescing these
things back into the original instruction is not at all trivial

It's a simple peephole optimization AFAICT:

val2 = alu-op(modifier(val1)) -> hardware-specific-extended-alu-op(val1)
val2 = shuffle(val2, alu-op(val1)) ->
    hardware-specific-alu-op-with-writemask(val2, val1)
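
A rough sketch of the first pattern on a flat SSA instruction list, where
modifiers start out as separate instructions and are folded into their
consumers. The Instr class, the "neg"/"abs" modifier names and the live_out
parameter are all assumptions for illustration, not any driver's real
encoding:

```python
from dataclasses import dataclass, field


@dataclass
class Instr:
    dest: str
    op: str
    srcs: list                                     # names of SSA values
    src_mods: dict = field(default_factory=dict)   # source index -> modifier


MODIFIERS = {"neg", "abs"}


def fold_modifiers(instrs, live_out=()):
    """Fold modifier instructions into consumers, then drop dead ones.

    Mutates the instructions in `instrs` and returns the surviving list.
    """
    defs = {i.dest: i for i in instrs}
    for ins in instrs:
        if ins.op in MODIFIERS:
            continue
        for idx, src in enumerate(ins.srcs):
            d = defs.get(src)
            if d is not None and d.op in MODIFIERS and idx not in ins.src_mods:
                # alu-op(modifier(val1)) -> extended-alu-op(val1)
                ins.src_mods[idx] = d.op
                ins.srcs[idx] = d.srcs[0]
    used = set(live_out) | {s for i in instrs for s in i.srcs}
    return [i for i in instrs
            if not (i.op in MODIFIERS and i.dest not in used)]


prog = [Instr("t", "neg", ["a"]), Instr("r", "add", ["t", "b"])]
for ins in fold_modifiers(prog, live_out=["r"]):
    print(ins)
```

Because every value has a single definition in SSA, looking up the producer
of a source is a dictionary access, which is what makes this kind of
peephole cheap.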

- in fact, going into and out of SSA without introducing extra copies
even in situations like:

foo.xyz = ...
... = foo
foo.x = ...

is a problem that hasn't been solved yet publicly (it seems doable,
but difficult).
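
To illustrate why the example above is awkward, here is a small sketch that
renames writemasked assignments into SSA form. The statement tuples and the
to_ssa helper are hypothetical; the point is only that a partial write
becomes a new value that depends on the previous version:

```python
def to_ssa(stmts):
    """Rename partial vector writes into SSA definitions.

    `stmts` is a list of ("write", var, mask) or ("read", var) tuples.
    Each write yields ("def", new_name, mask, prev_name), where prev_name
    is the version whose unwritten channels carry over (None if there is
    no earlier version or all four channels are written).
    """
    version = {}
    out = []
    for s in stmts:
        if s[0] == "write":
            _, var, mask = s
            old = version.get(var, 0)
            version[var] = old + 1
            prev = f"{var}_{old}" if old and mask != "xyzw" else None
            out.append(("def", f"{var}_{version[var]}", mask, prev))
        else:
            _, var = s
            out.append(("use", f"{var}_{version.get(var, 0)}"))
    return out


# foo.xyz = ... ; ... = foo ; foo.x = ...
for line in to_ssa([("write", "foo", "xyz"),
                    ("read", "foo"),
                    ("write", "foo", "x")]):
    print(line)
```

In this form foo_2 is a full vector built from foo_1 plus a new x channel.
A naive conversion out of SSA would copy foo_1 into foo_2 before writing x;
avoiding that copy means proving that both versions can share one register.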

This problem is orthogonal to the mechanism you use to represent
swizzles and writemasks AFAICT.  How could having these modifiers built
into the core ISA help you with transforming vector ops in and out of
SSA?

So while we might not need swizzles and writemasks for most backends,
for the few that do need it (like, for example, the i965 vec4 backend)
it will be very nice to have one common lowering pass that solves this
hard problem, which would be impossible to do without having swizzles
and writemasks in the IR.

I disagree.  It would be possible if the IR is extensible enough for
back-ends to be able to represent their exotic vector addressing modes
as driver-defined machine instructions in a way that generic
optimization passes can still deal with them.

And it's very likely that these backends, which probably aren't using
SSA due to the aforementioned difficulties, will also benefit from
having modifiers already folded for them - this is something that's
already a problem for i965 vec4 backend and that NIR will help a lot.

Well, I have the impression that much of the reason why the i965 vec4
backend has lagged behind so much in comparison with the fs backend is
precisely because it's so annoying to optimize vec4 code.  It seems
painful to me that you have this built into the core instruction set so
generic optimization passes will have to be explicitly aware of it.  I
wouldn't be surprised if the i965 vec4 benefited at least as much from
scalarizing the code, performing optimizations there, and re-vectorizing
afterwards.

[...]



_______________________________________________
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/mesa-dev



