And don't forget that explicit vec4 becomes immensely amusing once you add fp64/double to the problem.
OG.

On Wed, Aug 20, 2014 at 4:01 PM, Francisco Jerez <curroje...@riseup.net> wrote:
> Connor Abbott <cwabbo...@gmail.com> writes:
>
>> On Tue, Aug 19, 2014 at 11:33 PM, Francisco Jerez <curroje...@riseup.net>
>> wrote:
>>> Connor Abbott <cwabbo...@gmail.com> writes:
>>>
>>>> On Tue, Aug 19, 2014 at 11:40 AM, Francisco Jerez <curroje...@riseup.net>
>>>> wrote:
>>>>> Tom Stellard <t...@stellard.net> writes:
>>>>>
>>>>>> On Tue, Aug 19, 2014 at 11:04:59AM -0400, Connor Abbott wrote:
>>>>>>> On Mon, Aug 18, 2014 at 8:52 PM, Michel Dänzer <mic...@daenzer.net>
>>>>>>> wrote:
>>>>>>> > On 19.08.2014 01:28, Connor Abbott wrote:
>>>>>>> >> On Mon, Aug 18, 2014 at 4:32 AM, Michel Dänzer <mic...@daenzer.net>
>>>>>>> >> wrote:
>>>>>>> >>> On 16.08.2014 09:12, Connor Abbott wrote:
>>>>>>> >>>> I know what you might be thinking right now. "Wait, *another* IR?
>>>>>>> >>>> Don't we already have like 5 of those, not counting all the
>>>>>>> >>>> driver-specific ones? Isn't this stuff complicated enough already?"
>>>>>>> >>>> Well, there are some pretty good reasons to start afresh (again...).
>>>>>>> >>>> In the years we've been using GLSL IR, we've come to realize that,
>>>>>>> >>>> in fact, it's not what we want *at all* to do optimizations on.
>>>>>>> >>>
>>>>>>> >>> Did you evaluate using LLVM IR instead of inventing yet another one?
>>>>>>> >>>
>>>>>>> >>>
>>>>>>> >>> --
>>>>>>> >>> Earthling Michel Dänzer | http://www.amd.com
>>>>>>> >>> Libre software enthusiast | Mesa and X developer
>>>>>>> >>
>>>>>>> >> Yes. See
>>>>>>> >>
>>>>>>> >> http://lists.freedesktop.org/archives/mesa-dev/2014-February/053502.html
>>>>>>> >>
>>>>>>> >> and
>>>>>>> >>
>>>>>>> >> http://lists.freedesktop.org/archives/mesa-dev/2014-February/053522.html
>>>>>>> >
>>>>>>> > I know Ian can't deal with LLVM for some reason. I was wondering if
>>>>>>> > *you* evaluated it, and if so, why you rejected it.
>>>>>>> >
>>>>>>> >
>>>>>>> > --
>>>>>>> > Earthling Michel Dänzer | http://www.amd.com
>>>>>>> > Libre software enthusiast | Mesa and X developer
>>>>>>>
>>>>>>>
>>>>>>> Well, first of all, the fact that Ian and Ken don't want to use it
>>>>>>> means that any plan to use LLVM for the Intel driver is dead in the
>>>>>>> water anyways - you can translate NIR into LLVM if you want, but for
>>>>>>> i965 we want to share optimizations between our 2 backends (FS and
>>>>>>> vec4) that we can't do today in GLSL IR so this is what we want to use
>>>>>>> for that, and since nobody else does anything with the core GLSL
>>>>>>> compiler except when they have to, when we start moving things out of
>>>>>>> GLSL IR this will probably replace GLSL IR as the infrastructure that
>>>>>>> all Mesa drivers use. But with that in mind, here are a few reasons
>>>>>>> why we wouldn't want to use LLVM:
>>>>>>>
>>>>>>> * LLVM wasn't built to understand structured CFG's, meaning that you
>>>>>>> need to re-structurize it using a pass that's fragile and prone to
>>>>>>> break if some other pass "optimizes" the shader in a way that makes it
>>>>>>> non-structured (i.e. not expressible in terms of loops and if
>>>>>>> statements). This loss of information also means that passes that need
>>>>>>> to know things like, for example, the loop nesting depth need to do an
>>>>>>> analysis pass whereas with NIR you can just walk up the control flow
>>>>>>> tree and count the number of loops we hit.
>>>>>>>
>>>>>>
>>>>>> LLVM has a pass to structurize the CFG. We use it in the radeon
>>>>>> drivers, and it is run after all of the other LLVM optimizations which
>>>>>> have no concept of structured CFG. It's not bug free, but it works really
>>>>>> well even with all of the complex OpenCL kernels we throw at it.
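For readers following the loop-depth argument above, here is a minimal sketch of what "walk up the control flow tree and count the number of loops" could look like. It is not NIR's actual data structure or API: the `CFNode` class, its `kind` strings, and `loop_depth` are all hypothetical names made up for illustration, assuming only that each control-flow node keeps a pointer to its parent.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CFNode:
    """Hypothetical structured control-flow node (not NIR's real type)."""
    kind: str                          # "function", "loop", "if" or "block"
    parent: Optional["CFNode"] = None  # enclosing node, None at the root


def loop_depth(node: CFNode) -> int:
    """Count the loops that enclose `node` by following parent links."""
    depth = 0
    cur = node.parent
    while cur is not None:
        if cur.kind == "loop":
            depth += 1
        cur = cur.parent
    return depth


# function { loop { if { loop { block } } } }
func = CFNode("function")
outer = CFNode("loop", func)
branch = CFNode("if", outer)
inner = CFNode("loop", branch)
body = CFNode("block", inner)

print(loop_depth(body))  # 2
```

With an unstructured CFG the same number has to be recovered by a separate loop analysis first, which is the trade-off both sides of the thread are describing.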
>>>>>>
>>>>>> Your point about losing information when the CFG is de-structurized is
>>>>>> valid, but for things like loop depth, I'm not sure why we couldn't
>>>>>> write an LLVM analysis pass for this (if one doesn't already exist).
>>>>>>
>>>>>
>>>>> I don't think this is such a big deal either. At least the
>>>>> structurization pass used on newer AMD hardware isn't "fragile" in the
>>>>> way you seem to imply -- AFAIK (unlike the old AMDIL heuristic
>>>>> algorithm) it's guaranteed to give you a valid structurized output no
>>>>> matter what the previous optimization passes have done to the CFG,
>>>>> modulo bugs. I admit that the situation is nevertheless suboptimal.
>>>>> Ideally this information wouldn't get lost along the way. For the long
>>>>> term we may want to represent structured control flow directly in the IR
>>>>> as you say, I just don't see how reinventing the IR saves us any work if
>>>>> we could just fix the existing one.
>>>>
>>>> It seems to me that something like how we represent control flow is a
>>>> pretty fundamental part of the IR - it affects any optimization pass
>>>> that needs to do anything beyond adding and removing instructions. How
>>>> would you fix that, especially given that LLVM is primarily designed
>>>> for CPU's where you don't want to be restricted to structured control
>>>> flow at all? It seems like our goals (preserve the structure) conflict
>>>> with the way LLVM has been designed.
>>>>
>>> I think we can fix this by introducing new structured variants of the
>>> branch instruction in a way that doesn't alter the fundamental structure
>>> of the IR. E.g. an if branch could look like:
>>>
>>> ifbr i1 <cond>, label <iftrue>, label <iffalse>, label <join>
>>>
>>> Where both branches are guaranteed to converge at <join>.
>>> Sure, this will require fixing many assumptions, but on the one hand it's
>>> not immediately required (as we can address this problem for the time being
>>> using the same solution AMD uses) and on the other hand it's still less
>>> work than starting from scratch.
>>
>> I disagree with the "less work than starting from scratch" part,
>> especially since it involves modifying it in a pretty invasive way,
>> when we won't even need half of the things that it does for us. LLVM
>> just isn't a solution to everything - there is no one-size-fits-all
>> compiler.
>>
>
> *Shrug* That's quite a strong statement. Honestly I haven't ruled out
> the possibility of coming up with a decent IR by ourselves yet, but at
> this point I feel like improving the LLVM framework to make it more
> suitable for GPUs would be a much more promising use of my time than
> working on NIR -- Even if starting from scratch sounds like a lot more
> fun.
>
>>>
>>>>>
>>>>>>> * LLVM doesn't do modifiers, meaning that we can't do optimizations
>>>>>>> like "clamp(x, 0.0, 1.0) => mov.sat x" and "clamp(x, 0.25, 1.0) =>
>>>>>>> max.sat(x, .25)" in a generic fashion.
>>>>>>>
>>>>>>
>>>>>> The way to handle this with LLVM would be to add intrinsics to represent
>>>>>> the various modifiers and then fold them into instructions during
>>>>>> instruction selection.
>>>>>>
>>>>>
>>>>> IMHO this is a feature. One of the things I don't like about NIR is
>>>>> that it's still vec4-centric.
>>>>> Most drivers are going to want something
>>>>> else and different to each other, we cannot please all of them with one
>>>>> single vector addressing model built into the core instruction set, so
>>>>> I'd rather have modifiers, writemasks and swizzles represented as the
>>>>> composition of separate instructions/intrinsics with simple and
>>>>> well-defined semantics, which can be coalesced back into the real
>>>>> instruction as Tom says (easy even if you don't use LLVM's instruction
>>>>> selector as long as it's SSA form).
>>>>
>>>> While NIR is vec4-centric, nothing's stopping you from splitting up
>>>> instructions and doing optimizations at the scalar level for scalar
>>>> ISA's - in fact, that's what I expect to happen. And for backends that
>>>> really do need to have swizzles and writemasks, coalescing these
>>>> things back into the original instruction is not at all trivial
>>>
>>> It's a simple peephole optimization AFAICT:
>>>
>>> val2 = alu-op(modifier(val1)) -> hardware-specific-extended-alu-op(val)
>>> val2 = shuffle(val2, alu-op(val1)) ->
>>> hardware-specific-alu-op-with-writemask(val2, val1)
>>
>> No, it's not. Imagine something like:
>>
>> vec4 foo = ...
>> vec4 bar = ...
>> vec4 baz = vec4(foo.xy, bar.zw)
>> ... = foo
>> ... = bar
>> ... = baz
>>
>> where the vec4() is the shuffle instruction. In this case, you can't
>> eliminate the shuffle - you need to insert writemasked moves when you
>> come out of SSA:
>>
>> vec4 foo = ...
>> vec4 bar = ...
>> baz.xy = foo.xy
>> baz.zw = bar.zw
>>
>> This basically comes down to something analogous to a register
>> allocation problem, where in this case the scalar components that we
>> want to put into a single vec4 (foo, bar, and baz) can't fit - we need
>> to "spill" by inserting copies.
>> Then, once we've done this, we have to
>> convert it into a non-SSA form with registers, writemasks, and
>> swizzles - something that would be easy to do in the IR -> backend
>> translation, if it really were just a simple peephole, but in this
>> case it's not and so you either have to consult the result of your
>> analysis during the translation or have an IR that can represent
>> swizzles, writemasks, and non-SSA registers for you like NIR does. Of
>> course, LLVM will help with none of this because its vectorization
>> model is built around CPU vector processors like SSE, NEON, etc. and
>> so AFAIK it has no concept of per-component liveness, and even if it
>> did, this stuff is intimately tied to the out-of-SSA process itself so
>> we would basically have to write it from scratch anyways.
>>
>
> I think you keep mixing two unrelated problems:
> 1/ How we represent vector addressing, writemasks and modifiers in the
> core IR.
> 2/ How we bring vector operations back into non-SSA form.
>
> Re 1 you propose making the vec4 model a central part of the IR rather
> than using composition of simpler operations. Whatever we do, going
> from one representation to the other is a simple peephole, which I never
> meant would be a solution for 2.
>
> Re 2 I agree with you that it would ideally be taken care of by a shared
> transformation pass because of its complexity, but I disagree that a
> vec4-centric IR is required for this purpose, or even especially useful,
> because different hardware has wildly different vector models with
> different constraints and requiring a different representation, so I
> think ideally we would have some mechanism for back-ends to provide
> their own representation in the form of machine-specific instructions
> accompanied with some machine-specific logic.
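To make the foo/bar/baz example from the thread concrete, here is a rough sketch of only the last step: emitting the writemasked moves once some earlier analysis has already decided that the shuffle cannot be coalesced away. The hard part the thread argues about (per-component liveness and deciding when copies are needed) is deliberately not modelled. `lower_vec4` and its tuple-based input are hypothetical, not taken from NIR, LLVM, or any Mesa backend.

```python
COMPONENTS = "xyzw"


def lower_vec4(dest, sources):
    """Turn a vec4() shuffle into writemasked moves.

    `sources` lists one (register, component) pair per destination
    channel, in x, y, z, w order. Channels that read the same source
    register are merged into a single move with a writemask on the
    destination and a swizzle on the source.
    """
    groups = {}  # source register -> [(dest channel, source channel), ...]
    for dest_chan, (reg, src_chan) in zip(COMPONENTS, sources):
        groups.setdefault(reg, []).append((dest_chan, src_chan))

    moves = []
    for reg, chans in groups.items():
        mask = "".join(d for d, _ in chans)
        swizzle = "".join(s for _, s in chans)
        moves.append(f"{dest}.{mask} = {reg}.{swizzle}")
    return moves


# vec4 baz = vec4(foo.xy, bar.zw), with foo and bar still live afterwards
for move in lower_vec4("baz", [("foo", "x"), ("foo", "y"),
                               ("bar", "z"), ("bar", "w")]):
    print(move)
# baz.xy = foo.xy
# baz.zw = bar.zw
```

Whether this emission lives in a shared pass or in each backend is exactly the disagreement above; the sketch takes no side and only shows what the inserted copies look like.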
>
> _______________________________________________
> mesa-dev mailing list
> mesa-dev@lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/mesa-dev