Am 11.07.2013 16:13, schrieb Brian Paul: > Just minor things below. Ok, I'll fix this (and the things Jose mentioned).
> > Reviewed-by: Brian Paul <bri...@vmware.com> > > On 07/10/2013 06:21 PM, srol...@vmware.com wrote: >> From: Roland Scheidegger <srol...@vmware.com> >> >> We had to disable fast rsqrt before because it wasn't precise enough etc. >> However in situations when we know we're not going to need more precision >> we can still use a fast rsqrt (which can be several times faster than >> the quite expensive sqrt). Hence introduce a new helper which does >> exactly >> that - it is probably not useful calling it in some situations if there's >> no fast rsqrt available so make it queryable if it's available too. >> --- >> src/gallium/auxiliary/gallivm/lp_bld_arit.c | 55 >> +++++++++++++++++++++++++++ >> src/gallium/auxiliary/gallivm/lp_bld_arit.h | 7 ++++ >> 2 files changed, 62 insertions(+) >> >> diff --git a/src/gallium/auxiliary/gallivm/lp_bld_arit.c >> b/src/gallium/auxiliary/gallivm/lp_bld_arit.c >> index 08aec79..2e84543 100644 >> --- a/src/gallium/auxiliary/gallivm/lp_bld_arit.c >> +++ b/src/gallium/auxiliary/gallivm/lp_bld_arit.c >> @@ -2361,6 +2361,61 @@ lp_build_rsqrt(struct lp_build_context *bld, >> return lp_build_rcp(bld, lp_build_sqrt(bld, a)); >> } >> >> +/** >> + * If there's a fast (inaccurate) rsqrt instruction available >> + * (caller may want to avoid to call rsqrt_fast if it's not available, >> + * i.e. for calculating x^0.5 it may do rsqrt_fast(x) * x but if >> + * unavailable it would result in sqrt/div/mul so obviously >> + * much better to just call sqrt, skipping both div and mul). >> + */ >> +boolean >> +lp_build_fast_rsqrt_available(struct lp_type type) { > > Opening brace on next line. > > >> + >> + assert(type.floating); >> + >> + if ((util_cpu_caps.has_sse && type.width == 32 && type.length == >> 4) || >> + (util_cpu_caps.has_avx && type.width == 32 && type.length == >> 8)) { >> + return true; >> + } >> + return false; >> +} >> + >> + >> +/** >> + * Generate 1/sqrt(a). >> + * Result is undefined for values < 0, infinity for +0. >> + * Precision is limited, only ~10 bits guaranteed >> + * (rsqrt 1.0 may not be 1.0, denorms may be flushed to 0). >> + */ >> +LLVMValueRef >> +lp_build_rsqrt_fast(struct lp_build_context *bld, >> + LLVMValueRef a) >> +{ >> + LLVMBuilderRef builder = bld->gallivm->builder; >> + const struct lp_type type = bld->type; >> + >> + assert(lp_check_value(type, a)); >> + >> + assert(type.floating); >> + >> + if ((util_cpu_caps.has_sse && type.width == 32 && type.length == >> 4) || >> + (util_cpu_caps.has_avx && type.width == 32 && type.length == >> 8)) { > > This looks like it could be "if (lp_build_fast_rsqrt_available(type)" > > >> + const char *intrinsic = NULL; >> + >> + if (type.length == 4) { >> + intrinsic = "llvm.x86.sse.rsqrt.ps"; >> + } >> + else { >> + intrinsic = "llvm.x86.avx.rsqrt.ps.256"; >> + } >> + return lp_build_intrinsic_unary(builder, intrinsic, >> bld->vec_type, a); >> + } >> + else { >> + debug_printf("%s: emulating fast rsqrt with rcp/sqrt\n", >> __FUNCTION__); > > Are you sure you want to leave the debug_printf() in? I left it in on purpose, I guess it's arguable - might want to revisit if there's really some caller which absolutely needs fast reciprocal square root. The problem is that we can't use it for the shader rsqrt - while these don't need full precision it's just too crappy and our attempts at fixing it up were not really all that succesful. So this leaves only internal code, and I suspect most of the time you could possibly avoid needing it. The emulation really is slow with sqrt/div - both operations are (on pretty much all cpus I believe) non-pipelined instructions taking quite some time (both about the same usually), both executed by divide unit. In that light it makes sense if callers try to avoid using it. But it can easily be changed if there's some caller really needing fast rsqrt (normalization is the obvious use case), so far all callers (other than linear_to_srgb) need the "normal" rsqrt. I actually initially wanted to change the interface of lp_build_rsqrt() instead to use a precision argument (or flag) instead. But that way it couldn't be part of easy unary operation testing in lp_test code, plus at least the linear_to_srgb code really wants to avoid calling it if it's not available so it seemed cleaner to just have a separate fast rsqrt (along with the capability to query if it's available). Roland > > >> + } >> + return lp_build_rcp(bld, lp_build_sqrt(bld, a)); >> +} >> + >> >> /** >> * Generate sin(a) using SSE2 >> diff --git a/src/gallium/auxiliary/gallivm/lp_bld_arit.h >> b/src/gallium/auxiliary/gallivm/lp_bld_arit.h >> index 966796c..d53e471 100644 >> --- a/src/gallium/auxiliary/gallivm/lp_bld_arit.h >> +++ b/src/gallium/auxiliary/gallivm/lp_bld_arit.h >> @@ -231,6 +231,13 @@ LLVMValueRef >> lp_build_rsqrt(struct lp_build_context *bld, >> LLVMValueRef a); >> >> +boolean >> +lp_build_fast_rsqrt_available(struct lp_type type); >> + >> +LLVMValueRef >> +lp_build_rsqrt_fast(struct lp_build_context *bld, >> + LLVMValueRef a); >> + >> LLVMValueRef >> lp_build_cos(struct lp_build_context *bld, >> LLVMValueRef a); >> _______________________________________________ mesa-dev mailing list mesa-dev@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/mesa-dev