On 08/20/2014 09:11 AM, Francisco Jerez wrote:
> Connor Abbott <cwabbo...@gmail.com> writes:
>
>> On Wed, Aug 20, 2014 at 7:01 AM, Francisco Jerez <curroje...@riseup.net> wrote:
>>> Connor Abbott <cwabbo...@gmail.com> writes:
>>>
>>>> On Tue, Aug 19, 2014 at 11:33 PM, Francisco Jerez <curroje...@riseup.net> wrote:
>>>>> Connor Abbott <cwabbo...@gmail.com> writes:
>>>>>
>>>>>> On Tue, Aug 19, 2014 at 11:40 AM, Francisco Jerez <curroje...@riseup.net> wrote:
>>>>>>> Tom Stellard <t...@stellard.net> writes:
>>>>>>>
>>>>>>>> On Tue, Aug 19, 2014 at 11:04:59AM -0400, Connor Abbott wrote:
>>>>>>>>> On Mon, Aug 18, 2014 at 8:52 PM, Michel Dänzer <mic...@daenzer.net> wrote:
>>>>>>>>>> On 19.08.2014 01:28, Connor Abbott wrote:
>>>>>>>>>>> On Mon, Aug 18, 2014 at 4:32 AM, Michel Dänzer <mic...@daenzer.net> wrote:
>>>>>>>>>>>> On 16.08.2014 09:12, Connor Abbott wrote:
>>>>>>>>>>>>> I know what you might be thinking right now. "Wait, *another* IR?
>>>>>>>>>>>>> Don't we already have like 5 of those, not counting all the
>>>>>>>>>>>>> driver-specific ones? Isn't this stuff complicated enough
>>>>>>>>>>>>> already?" Well, there are some pretty good reasons to start
>>>>>>>>>>>>> afresh (again...). In the years we've been using GLSL IR, we've
>>>>>>>>>>>>> come to realize that, in fact, it's not what we want *at all* to
>>>>>>>>>>>>> do optimizations on.
>>>>>>>>>>>>
>>>>>>>>>>>> Did you evaluate using LLVM IR instead of inventing yet another
>>>>>>>>>>>> one?
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> Earthling Michel Dänzer            |                  http://www.amd.com
>>>>>>>>>>>> Libre software enthusiast          |                Mesa and X developer
>>>>>>>>>>>
>>>>>>>>>>> Yes.
>>>>>>>>>>> See
>>>>>>>>>>>
>>>>>>>>>>> http://lists.freedesktop.org/archives/mesa-dev/2014-February/053502.html
>>>>>>>>>>>
>>>>>>>>>>> and
>>>>>>>>>>>
>>>>>>>>>>> http://lists.freedesktop.org/archives/mesa-dev/2014-February/053522.html
>>>>>>>>>>
>>>>>>>>>> I know Ian can't deal with LLVM for some reason. I was wondering if
>>>>>>>>>> *you* evaluated it, and if so, why you rejected it.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Earthling Michel Dänzer            |                  http://www.amd.com
>>>>>>>>>> Libre software enthusiast          |                Mesa and X developer
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Well, first of all, the fact that Ian and Ken don't want to use it
>>>>>>>>> means that any plan to use LLVM for the Intel driver is dead in the
>>>>>>>>> water anyways - you can translate NIR into LLVM if you want, but for
>>>>>>>>> i965 we want to share optimizations between our 2 backends (FS and
>>>>>>>>> vec4) that we can't do today in GLSL IR so this is what we want to use
>>>>>>>>> for that, and since nobody else does anything with the core GLSL
>>>>>>>>> compiler except when they have to, when we start moving things out of
>>>>>>>>> GLSL IR this will probably replace GLSL IR as the infrastructure that
>>>>>>>>> all Mesa drivers use. But with that in mind, here are a few reasons
>>>>>>>>> why we wouldn't want to use LLVM:
>>>>>>>>>
>>>>>>>>> * LLVM wasn't built to understand structured CFG's, meaning that you
>>>>>>>>> need to re-structurize it using a pass that's fragile and prone to
>>>>>>>>> break if some other pass "optimizes" the shader in a way that makes it
>>>>>>>>> non-structured (i.e. not expressible in terms of loops and if
>>>>>>>>> statements). This loss of information also means that passes that need
>>>>>>>>> to know things like, for example, the loop nesting depth need to do an
>>>>>>>>> analysis pass whereas with NIR you can just walk up the control flow
>>>>>>>>> tree and count the number of loops we hit.
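
For anyone skimming the thread, the loop-depth point above can be made concrete. This is only a minimal sketch in C, assuming a structured control-flow tree where every node keeps a pointer to its parent. The cf_kind, cf_node and loop_depth names are made up for illustration and are not NIR's actual types or API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical node kinds for a structured control-flow tree.
 * Illustrative only; these are not the real NIR definitions. */
enum cf_kind { CF_FUNCTION, CF_IF, CF_LOOP, CF_BLOCK };

struct cf_node {
    enum cf_kind kind;
    struct cf_node *parent; /* NULL at the root of the tree */
};

/* Count the loops that enclose `node` by walking the parent links up to
 * the root. No separate analysis pass is needed because the nesting is
 * stored directly in the tree. */
static unsigned loop_depth(const struct cf_node *node)
{
    unsigned depth = 0;

    assert(node != NULL);
    for (const struct cf_node *n = node->parent; n != NULL; n = n->parent) {
        if (n->kind == CF_LOOP)
            depth++;
    }
    return depth;
}
```

With an unstructured CFG the same answer has to come from a loop analysis (dominators and back edges) that is recomputed whenever a pass changes the graph.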
>>>>>>>>>
>>>>>>>>
>>>>>>>> LLVM has a pass to structurize the CFG. We use it in the radeon
>>>>>>>> drivers, and it is run after all of the other LLVM optimizations which
>>>>>>>> have no concept of structured CFG. It's not bug free, but it works
>>>>>>>> really well even with all of the complex OpenCL kernels we throw at it.
>>>>>>>>
>>>>>>>> Your point about losing information when the CFG is de-structurized is
>>>>>>>> valid, but for things like loop depth, I'm not sure why we couldn't
>>>>>>>> write an LLVM analysis pass for this (if one doesn't already exist).
>>>>>>>>
>>>>>>>
>>>>>>> I don't think this is such a big deal either. At least the
>>>>>>> structurization pass used on newer AMD hardware isn't "fragile" in the
>>>>>>> way you seem to imply -- AFAIK (unlike the old AMDIL heuristic
>>>>>>> algorithm) it's guaranteed to give you a valid structurized output no
>>>>>>> matter what the previous optimization passes have done to the CFG,
>>>>>>> modulo bugs. I admit that the situation is nevertheless suboptimal.
>>>>>>> Ideally this information wouldn't get lost along the way. For the long
>>>>>>> term we may want to represent structured control flow directly in the IR
>>>>>>> as you say, I just don't see how reinventing the IR saves us any work if
>>>>>>> we could just fix the existing one.
>>>>>>
>>>>>> It seems to me that something like how we represent control flow is a
>>>>>> pretty fundamental part of the IR - it affects any optimization pass
>>>>>> that needs to do anything beyond adding and removing instructions. How
>>>>>> would you fix that, especially given that LLVM is primarily designed
>>>>>> for CPU's where you don't want to be restricted to structured control
>>>>>> flow at all? It seems like our goals (preserve the structure) conflict
>>>>>> with the way LLVM has been designed.
>>>>>>
>>>>> I think we can fix this by introducing new structured variants of the
>>>>> branch instruction in a way that doesn't alter the fundamental structure
>>>>> of the IR. E.g. an if branch could look like:
>>>>>
>>>>> ifbr i1 <cond>, label <iftrue>, label <iffalse>, label <join>
>>>>>
>>>>> Where both branches are guaranteed to converge at <join>. Sure, this
>>>>> will require fixing many assumptions, but on the one hand it's not
>>>>> immediately required (as we can address this problem for the time being
>>>>> using the same solution AMD uses) and on the other hand it's still less
>>>>> work than starting from scratch.
>>>>
>>>> I disagree with the "less work than starting from scratch" part,
>>>> especially since it involves modifying it in a pretty invasive way,
>>>> when we won't even need half of the things that it does for us. LLVM
>>>> just isn't a solution to everything - there is no one-size-fits-all
>>>> compiler.
>>>>
>>>
>>> *Shrug* That's quite a strong statement. Honestly I haven't ruled out
>>> the possibility of coming up with a decent IR by ourselves yet, but at
>>> this point I feel like improving the LLVM framework to make it more
>>> suitable for GPUs would be a much more promising use of my time than
>>> working on NIR -- Even if starting from scratch sounds like a lot more
>>> fun.
>>>
>>>>>
>>>>>>>
>>>>>>>>> * LLVM doesn't do modifiers, meaning that we can't do optimizations
>>>>>>>>> like "clamp(x, 0.0, 1.0) => mov.sat x" and "clamp(x, 0.25, 1.0) =>
>>>>>>>>> max.sat(x, .25)" in a generic fashion.
>>>>>>>>>
>>>>>>>>
>>>>>>>> The way to handle this with LLVM would be to add intrinsics to
>>>>>>>> represent the various modifiers and then fold them into instructions
>>>>>>>> during instruction selection.
>>>>>>>>
>>>>>>>
>>>>>>> IMHO this is a feature. One of the things I don't like about NIR is
>>>>>>> that it's still vec4-centric.
>>>>>>> Most drivers are going to want something
>>>>>>> else and different to each other, we cannot please all of them with one
>>>>>>> single vector addressing model built into the core instruction set, so
>>>>>>> I'd rather have modifiers, writemasks and swizzles represented as the
>>>>>>> composition of separate instructions/intrinsics with simple and
>>>>>>> well-defined semantics, which can be coalesced back into the real
>>>>>>> instruction as Tom says (easy even if you don't use LLVM's instruction
>>>>>>> selector as long as it's SSA form).
>>>>>>
>>>>>> While NIR is vec4-centric, nothing's stopping you from splitting up
>>>>>> instructions and doing optimizations at the scalar level for scalar
>>>>>> ISA's - in fact, that's what I expect to happen. And for backends that
>>>>>> really do need to have swizzles and writemasks, coalescing these
>>>>>> things back into the original instruction is not at all trivial
>>>>>
>>>>> It's a simple peephole optimization AFAICT:
>>>>>
>>>>> val2 = alu-op(modifier(val1)) -> hardware-specific-extended-alu-op(val)
>>>>> val2 = shuffle(val2, alu-op(val1)) ->
>>>>>     hardware-specific-alu-op-with-writemask(val2, val1)
>>>>
>>>> No, it's not. Imagine something like:
>>>>
>>>> vec4 foo = ...
>>>> vec4 bar = ...
>>>> vec4 baz = vec4(foo.xy, bar.zw)
>>>> ... = foo
>>>> ... = bar
>>>> ... = baz
>>>>
>>>> where the vec4() is the shuffle instruction. In this case, you can't
>>>> eliminate the shuffle - you need to insert writemasked moves when you
>>>> come out of SSA:
>>>>
>>>> vec4 foo = ...
>>>> vec4 bar = ...
>>>> baz.xy = foo.xy
>>>> baz.zw = bar.zw
>>>>
>>>> This basically comes down to something analogous to a register
>>>> allocation problem, where in this case the scalar components that we
>>>> want to put into a single vec4 (foo, bar, and baz) can't fit - we need
>>>> to "spill" by inserting copies.
>>>> Then, once we've done this, we have to
>>>> convert it into a non-SSA form with registers, writemasks, and
>>>> swizzles - something that would be easy to do in the IR -> backend
>>>> translation, if it really were just a simple peephole, but in this
>>>> case it's not and so you either have to consult the result of your
>>>> analysis during the translation or have an IR that can represent
>>>> swizzles, writemasks, and non-SSA registers for you like NIR does. Of
>>>> course, LLVM will help with none of this because its vectorization
>>>> model is built around CPU vector processors like SSE, NEON, etc. and
>>>> so AFAIK it has no concept of per-component liveness, and even if it
>>>> did, this stuff is intimately tied to the out-of-SSA process itself so
>>>> we would basically have to write it from scratch anyways.
>>>>
>>>
>>> I think you keep mixing two unrelated problems:
>>> 1/ How we represent vector addressing, writemasks and modifiers in the
>>>    core IR.
>>> 2/ How we bring vector operations back into non-SSA form.
>>>
>>> Re 1 you propose making the vec4 model a central part of the IR rather
>>> than using composition of simpler operations. Whatever we do, going
>>> from one representation to the other is a simple peephole, which I never
>>> meant would be a solution for 2.
>>>
>>> Re 2 I agree with you that it would ideally be taken care of by a shared
>>> transformation pass because of its complexity, but I disagree that a
>>> vec4-centric IR is required for this purpose, or even especially useful,
>>> because different hardware has wildly different vector models with
>>> different constraints and requiring a different representation, so I
>>> think ideally we would have some mechanism for back-ends to provide
>>> their own representation in the form of machine-specific instructions
>>> accompanied with some machine-specific logic.
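
To illustrate the vec4(foo.xy, bar.zw) case from earlier in the thread, here is a rough C sketch of the copy-insertion step only: given a shuffle that could not be coalesced away, emit one writemasked move per distinct source. The shuffle_comp, masked_mov and lower_shuffle names are hypothetical and exist only for this example. The per-component liveness analysis that decides whether the copies are needed at all is the hard part being debated and is not shown.

```c
#include <assert.h>

/* Hypothetical description of one component of a 4-wide shuffle result:
 * which source value it reads and which component of that source. */
struct shuffle_comp {
    int src;      /* index of the source vec4, e.g. 0 = foo, 1 = bar */
    int src_comp; /* component of the source, 0..3 for x..w */
};

/* Hypothetical writemasked move: dst.<writemask> = src.<swizzle>. */
struct masked_mov {
    int src;
    unsigned writemask; /* bit i set means dst component i is written */
    int swizzle[4];     /* source component read for each dst component */
};

/* Emit one move per distinct source, merging all destination components
 * that read from it. Returns the number of moves written to `out`,
 * which is at most 4. */
static int lower_shuffle(const struct shuffle_comp comps[4],
                         struct masked_mov out[4])
{
    int count = 0;

    for (int c = 0; c < 4; c++) {
        int m;

        assert(comps[c].src_comp >= 0 && comps[c].src_comp < 4);

        /* Look for an already emitted move that reads the same source. */
        for (m = 0; m < count; m++) {
            if (out[m].src == comps[c].src)
                break;
        }
        if (m == count) {
            /* First use of this source: start a new move with an empty
             * writemask and an identity swizzle. */
            out[m].src = comps[c].src;
            out[m].writemask = 0;
            for (int i = 0; i < 4; i++)
                out[m].swizzle[i] = i;
            count++;
        }
        out[m].writemask |= 1u << c;
        out[m].swizzle[c] = comps[c].src_comp;
    }
    return count;
}
```

For the example in the thread this produces two moves, baz.xy = foo.xy and baz.zw = bar.zw. A backend without writemasks would need a different lowering, which is roughly the disagreement above.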
>>
>> I don't see why it's necessarily a bad idea to support the most
>> flexible vector addressing model and then have backends that don't
>> support it lower it to something they do support, or do their own
>> transformation pass instead of the standard one which will lower to
>> the normal model (full swizzling and writemasking).
>
> Vec4 is by no means the most flexible model (it's just flexible enough
> to be annoying to deal with IMHO). Just look at intel's SIMD4x2
> register addressing modes, you can do dozens of tricks with them that
> you cannot represent in terms of vec4 (using align1 vs align16 access
> modes, differing horizontal and vertical strides, three different modes
> of indirect addressing, bit-casting across components, etc.) -- It would
> be crazy IMHO to design the core IR around that (or around some sort of
> lowest common denominator of all the vector addressing models out there)
> as you can always express the same semantics as a combination of simpler
> blocks, and the backend needs some serious pattern-matching
> infrastructure anyway to make full use of the hardware flexibility.

I think this underscores the expectation that each backend will have its
own low-level IR and will perform additional optimizations on that
low-level IR.

>> And yes, it is certainly possible for backends to add their own
>> machine-specific opcodes and intrinsics - it's a lot easier than it
>> was with GLSL IR, as there's only one spot (nir_opcodes.h) that it has
>> to be added to.
>
> _______________________________________________
> mesa-dev mailing list
> mesa-dev@lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/mesa-dev