On 11.07.2013 19:41, Jose Fonseca wrote:
>>> Please use lp_build_polynomial. It tries to avoid data dependency.
>>> Furthermore, if we start using FMA, then it's one less place to update.
>>
>> Ok. Are you sure it's worth avoiding data dependency at the cost of extra
>> instructions (the way I built the polynomial, it's 6 instructions, and with
>> lp_build_polynomial it would be 7)?
>
> I'm not sure for this particular polynomial order (you could benchmark). It
> did make a significant improvement for log2/exp2's polynomials at the time
> James did this.
>
> If it's not worth it, then lp_build_polynomial should do a straight
> polynomial for that order and lower. But lp_build_polynomial should still be
> used no matter what, the expectation being that lp_build_polynomial will
> emit the best code possible for any polynomial.

Yes, I guess for low-order polynomials it won't make much difference either
way. I couldn't measure any difference, and if you just look at the
instruction sequence it's easy to see why. The code I did initially had a
dependency chain of 3 muls and 3 adds (in clocks that's 3*5 for the muls plus
3*3 for the adds, so 24 clocks on SNB). The dependency-avoiding build doesn't
change the picture much: the chain now has 3 muls and 2 adds, which is 21
clocks (while another mul+add pair can be done in parallel).

If we'd use FMA, though, the straightforward sequence would definitely be
preferred, since it would be 3 FMAs (all dependent), whereas the
dependency-avoiding one would be 1 MUL + 3 FMAs with a dependency chain of
1 MUL + 2 FMAs. Since MULs and FMAs have the same latency, that's essentially
an extra mul for nothing. Still, that's a tiny fish to fry...

FWIW, for 2nd-degree polynomials the data-dependency-avoiding sequence is
always worse, as it's going to be mul/mul/add/mul/add, all dependent anyway,
whereas the straightforward sequence would just be mul/add/mul/add. No such
callers though.
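To make the two orderings concrete, here's a rough scalar C sketch (just for
illustration, not the actual gallivm code; the function and coefficient names
are made up) of a degree-3 polynomial evaluated the straightforward way
versus the dependency-avoiding way:

/* p(x) = c0 + c1*x + c2*x^2 + c3*x^3, coefficients are placeholders. */

/* Straightforward (Horner): 3 muls + 3 adds, every op depends on the
 * previous one -> 3*5 + 3*3 = 24 clocks on SNB. */
static float poly3_straight(float x, float c0, float c1, float c2, float c3)
{
   return ((c3 * x + c2) * x + c1) * x + c0;
}

/* Dependency-avoiding split (the kind of thing lp_build_polynomial
 * aims for): 4 muls + 3 adds total, but the even/odd halves are
 * independent, so the longest chain is 3 muls + 2 adds -> 21 clocks
 * on SNB. With FMA this would become 1 MUL + 3 FMAs instead of the
 * 3 FMAs of the version above, i.e. an extra mul for nothing. */
static float poly3_split(float x, float c0, float c1, float c2, float c3)
{
   float x2   = x * x;          /* extra mul                    */
   float even = c2 * x2 + c0;   /* independent of the odd half  */
   float odd  = c3 * x2 + c1;   /* independent of the even half */
   return odd * x + even;
}

(The real gallivm code of course operates on whole vectors; this is only
meant to show the instruction ordering and dependency chains.)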
>> I thought because r/g/b will be done in
>> parallel anyway it wouldn't be much of an issue. Didn't measure it, though.
>> I am actually not really sure if fma isn't already used; while this is
>> a non-conformant optimization, to optimize mul+add into fma, some
>> compilers do it by default anyway IIRC.
>
> If you want to add a new flag to lp_build_polynomial to force a
> straightforward polynomial expansion that's fine too.

No, as I can't tell the difference I'll skip that :-). I think most of the
time we really have no good idea whether llvm (or the cpu itself) has any
chance of scheduling around dependencies. Only for srgb->linear is there some
rough idea that it should probably be possible, because of the 3 channels
we're doing in parallel.

Roland