----- Original Message -----
> Am 27.02.2012 21:26, schrieb Jose Fonseca:
> >
> > ----- Original Message -----
> >> On Mon, Feb 20, 2012 at 01:50:43PM -0800, Jose Fonseca wrote:
> >>>
> >>> ----- Original Message -----
> >>>>
> >>>> ----- Original Message -----
> >>>>> On Sat, Feb 18, 2012 at 4:20 AM, Jose Fonseca
> >>>>> <jfons...@vmware.com> wrote:
> >>>>>> ----- Original Message -----
> >>>>>>> On Fri, Feb 17, 2012 at 9:46 PM, Jose Fonseca
> >>>>>>> <jfons...@vmware.com> wrote:
> >>>>>>>> Dave,
> >>>>>>>>
> >>>>>>>> Ideally there should be only one lp_build_mod(), which would
> >>>>>>>> invoke LLVMBuildSRem or LLVMBuildURem depending on the value
> >>>>>>>> of bld->type.sign. The point is that this allows the same
> >>>>>>>> code generation logic to seamlessly target any type without
> >>>>>>>> having to worry too much about which type it is targeting.
> >>>>>>>
> >>>>>>> Yeah, I agree with this for now, but I'm starting to think a
> >>>>>>> lot of this stuff is redundant now that I've looked at what
> >>>>>>> Tom has done.
> >>>>>>>
> >>>>>>> The thing is, TGSI doesn't have that many crazy options where
> >>>>>>> you are going to be targeting instructions at the wrong type,
> >>>>>>> and wrapping all the basic LLVM interfaces with an extra type
> >>>>>>> layer seems to me, long term, like a waste of time.
> >>>>>>
> >>>>>> So far llvmpipe's TGSI->LLVM IR translation has only been
> >>>>>> targeting floating-point SIMD instructions.
> >>>>>>
> >>>>>> But the truth is that many simple fragment shaders can be
> >>>>>> partially done with 8-bit and 16-bit SIMD integers, if values
> >>>>>> are represented as 8-bit and 16-bit unorms. The throughput for
> >>>>>> these will be much higher: not only can we squeeze in more
> >>>>>> elements, they take fewer cycles, and the hardware has several
> >>>>>> arithmetic units.
> >>>>>>
> >>>>>> The point of those lp_build_xxx functions is to handle this
> >>>>>> transparently. See, e.g., how lp_build_mul handles fixed
> >>>>>> point. Currently this is only used for blending, but the hope
> >>>>>> is to eventually use it in the TGSI translation of simple
> >>>>>> fragment shaders.
> >>>>>>
> >>>>>> Maybe that's not the case for desktop GPUs, but I also heard
> >>>>>> that some low-powered devices have shader engines with 8-bit
> >>>>>> unorms.
> >>>>>>
> >>>>>> But of course, not all opcodes can be done correctly, and
> >>>>>> URem/SRem might not be ones we care about.
> >>>>>>
> >>>>>>> I'm happy for now to finish the integer support in the same
> >>>>>>> style as the current code, but I think moving forward it
> >>>>>>> might be worth investigating a more direct instruction
> >>>>>>> emission scheme.
> >>>>>>
> >>>>>> If you want to invoke LLVMBuildURem/LLVMBuildSRem directly
> >>>>>> from the TGSI translation I'm fine with it. We can always
> >>>>>> generalize.
> >>>>>>
> >>>>>>> Perhaps Tom can also comment from his experience.
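For illustration, a single lp_build_mod() along the lines Jose describes might look roughly like the sketch below. This is not the actual gallivm code, just a sketch that assumes the build-context fields already referenced in the thread (bld->type.sign, a gallivm->builder handle); the floating-point branch is an extra assumption added for completeness.

#include <llvm-c/Core.h>

/* Hypothetical sketch only -- not the real gallivm implementation.
 * Assumes a struct lp_build_context with a gallivm->builder handle
 * and an lp_type carrying .floating and .sign flags, as referenced
 * elsewhere in this thread. */
LLVMValueRef
lp_build_mod(struct lp_build_context *bld,
             LLVMValueRef a,
             LLVMValueRef b)
{
   LLVMBuilderRef builder = bld->gallivm->builder;

   if (bld->type.floating)
      return LLVMBuildFRem(builder, a, b, "");   /* float remainder */
   else if (bld->type.sign)
      return LLVMBuildSRem(builder, a, b, "");   /* signed integer */
   else
      return LLVMBuildURem(builder, a, b, "");   /* unsigned integer */
}

Callers would then use lp_build_mod() uniformly and let bld->type decide which LLVM remainder instruction gets emitted.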
> >>>>>> BTW, Tom, I just now noticed that there are two action
> >>>>>> versions for add:
> >>>>>>
> >>>>>> /* TGSI_OPCODE_ADD (CPU Only) */
> >>>>>> static void
> >>>>>> add_emit_cpu(
> >>>>>>    const struct lp_build_tgsi_action * action,
> >>>>>>    struct lp_build_tgsi_context * bld_base,
> >>>>>>    struct lp_build_emit_data * emit_data)
> >>>>>> {
> >>>>>>    emit_data->output[emit_data->chan] =
> >>>>>>       lp_build_add(&bld_base->base,
> >>>>>>                    emit_data->args[0],
> >>>>>>                    emit_data->args[1]);
> >>>>>> }
> >>>>>>
> >>>>>> /* TGSI_OPCODE_ADD */
> >>>>>> static void
> >>>>>> add_emit(
> >>>>>>    const struct lp_build_tgsi_action * action,
> >>>>>>    struct lp_build_tgsi_context * bld_base,
> >>>>>>    struct lp_build_emit_data * emit_data)
> >>>>>> {
> >>>>>>    emit_data->output[emit_data->chan] = LLVMBuildFAdd(
> >>>>>>       bld_base->base.gallivm->builder,
> >>>>>>       emit_data->args[0],
> >>>>>>       emit_data->args[1], "");
> >>>>>> }
> >>>>>>
> >>>>>> Why is this necessary? lp_build_add will already call
> >>>>>> LLVMBuildFAdd internally as appropriate.
> >>>>>>
> >>>>>> Is this because some of the functions in lp_bld_arit.c will
> >>>>>> emit x86 intrinsics? If so, then a "no-x86-intrinsic" flag in
> >>>>>> the build context would achieve the same effect with less code
> >>>>>> duplication.
> >>>>>>
> >>>>>> If possible I'd prefer a single version of these actions. If
> >>>>>> not, then I'd prefer to have them split:
> >>>>>> lp_build_action_cpu.c and lp_build_action_gpu.c.
> >>>>>
> >>>>> Yes, this is why I split them up. I can add that flag and merge
> >>>>> the actions together.
> >>>>
> >>>> That would be nice. Thanks.
> >>>
> >>> Tom, actually I've been looking more at the code and thinking
> >>> about this, and I'm not so sure what's best anymore.
> >>>
> >>> I'd appreciate your honest answer: do you think the stuff in
> >>> lp_bld_arit.[ch] is of any use for GPUs in general (or AMD's in
> >>> particular), or is it just a hindrance?
> >>>
> >>> As I said before, for CPUs this abstraction is useful: it allows
> >>> converting TGSI (and other fixed-function state) into fixed-point
> >>> SIMD instructions, which yield the highest throughput on CPUs,
> >>> because LLVM's native types are not expressive enough for fixed
> >>> function, etc.
> >>>
> >>> But if this is useless for GPUs (i.e. if LLVM's native types are
> >>> sufficient), then we can make this abstraction a CPU-only thing.
> >>>
> >> I don't think the lp_bld_arit.c functions are really useful for
> >> GPUs, and I don't rely on any of them in the R600 backend. Also, I
> >> was looking through those functions again, and the problem is more
> >> than just x86 intrinsics. Some of them assume vector types, which
> >> I don't use at all.
> >
> > Does that mean that the R600 backend generates/consumes only scalar
> > expressions?
> R600 (HD2xxx) up to Evergreen/Northern Islands (HD6xxx except HD69xx)
> are VLIW5. So that's not exactly scalar, but it doesn't quite fit any
> SIMD vector model either (as you can have 5 different instructions
> per instruction slot). (Cayman, aka HD69xx, is VLIW4, and the new GCN
> chips, aka HD7xxx, indeed use a scalar model, as does NVIDIA.)
> The vectors as they are used by llvmpipe are of course there in GPUs
> too, but they are really mostly hidden (AMD chips generally have a
> logical SIMD width of 64 and NVIDIA 32 - AMD calls this the wavefront
> size and NVIDIA the warp size - but in any case you still emit
> scalar-looking instructions which are really implicit vectors).
> So I guess using explicit vectors isn't really helping matters.
> Maybe with Intel GPUs it would fit better, as they have a sort of
> configurable SIMD width with more control. No idea, though, whether
> it would actually be useful.
> Older chips certainly have some more (AoS) SIMD aspects to them, but
> the model doesn't quite fit either.
>
I see. Thanks for the explanation, Roland.

Jose
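As an aside on the earlier part of the thread: the single merged ADD action that Jose and Tom discuss would presumably reduce to the lp_build_add()-based version quoted above, with x86-intrinsic avoidance handled by a flag in the build context rather than by a separate action. This is only a rough sketch, not the actual patch; the "no x86 intrinsics" flag it relies on is hypothetical.

/* Rough sketch of the merged TGSI_OPCODE_ADD action, shared between
 * CPU and GPU code generation. Per the discussion above,
 * lp_build_add() already calls LLVMBuildFAdd internally as
 * appropriate, and a hypothetical "no x86 intrinsics" flag in the
 * build context would keep it from emitting CPU-specific intrinsics
 * on the GPU path. */
static void
add_emit(
   const struct lp_build_tgsi_action * action,
   struct lp_build_tgsi_context * bld_base,
   struct lp_build_emit_data * emit_data)
{
   emit_data->output[emit_data->chan] =
      lp_build_add(&bld_base->base,
                   emit_data->args[0],
                   emit_data->args[1]);
}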