On Thu, Jul 16, 2015 at 8:49 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> Meh. I don't like the assumption that non-GCC compilers will be smart
> enough to optimize away the useless-to-them if() tests this adds.
> Please refactor that so that there is exactly 0 new code when the
> intrinsic doesn't exist.

I imagined that there was some value in copying the GCC intrinsic's behavior, and actually evaluating the "addr" expression even in the event of no platform support. On reflection, I suppose that that isn't actually a particularly useful property for Postgres. There will only ever be a handful of callers.

The attached revision does not rely on such optimization occurring on platforms that lack __builtin_prefetch(). This allowed me to decouple availability from actual use, in the style of posix_fadvise(), so that one can manually disable memory prefetching within pg_config_manual.h.

Clang is compatible with the __builtin_prefetch() intrinsic, FWIW. I'm not sure if it's worth trying to make the wrapper portable across a variety of supported compilers. If we were to attempt it, we would not be the first. I note that ICC's memref_control has an identical interface to __builtin_prefetch().

-- 
Peter Geoghegan

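A minimal standalone sketch of the decoupling described above: availability is detected in one place and use is switched on in another. The SKETCH_* macro names and sketch_sum() are hypothetical, and the compiler-macro probe only stands in for the configure test. The patch itself uses HAVE__BUILTIN_PREFETCH, USE_MEM_PREFETCH and pg_rfetch().

```c
#include <assert.h>
#include <stddef.h>

/*
 * Assumption: stand-in for the configure-time check.  The real patch
 * derives HAVE__BUILTIN_PREFETCH from a compile test, not from these
 * predefined compiler macros.
 */
#if defined(__GNUC__) || defined(__clang__)
#define SKETCH_HAVE_BUILTIN_PREFETCH 1
#endif

/*
 * Manual switch, analogous to the block added to pg_config_manual.h.
 * Remove this block to disable prefetching even where it is available.
 */
#ifdef SKETCH_HAVE_BUILTIN_PREFETCH
#define SKETCH_USE_MEM_PREFETCH
#endif

/* The wrapper exists only when prefetching is actually in use. */
#ifdef SKETCH_USE_MEM_PREFETCH
#define sketch_rfetch(addr) __builtin_prefetch((addr), 0, 0)
#endif

/* Sum the ints pointed to by ptrs[0..n-1], hinting three elements ahead. */
long
sketch_sum(int *const *ptrs, size_t n)
{
	long		total = 0;
	size_t		i;

	assert(ptrs != NULL || n == 0);

	for (i = 0; i < n; i++)
	{
		/* The whole hint, bounds test included, is compiled out if unused. */
#ifdef SKETCH_USE_MEM_PREFETCH
		if (i + 3 < n)
			sketch_rfetch(ptrs[i + 3]);
#endif
		total += *ptrs[i];
	}
	return total;
}
```

Because the call site is guarded rather than the macro being defined as a no-op, a build without the intrinsic contains no prefetch-related code at all, and nothing depends on the compiler eliminating a dead if() test.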
From 357e3b65c4f1511746a529e41fe180f0c7668d70 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <peter.geoghega...@gmail.com>
Date: Sun, 12 Jul 2015 13:14:01 -0700
Subject: [PATCH] Prefetch from memtuples array in tuplesort

Testing shows that prefetching the "tuple proper" of a slightly later
SortTuple in the memtuples array during each of many sequential,
in-logical-order SortTuple fetches speeds up various sorting intense
operations considerably. For example, B-Tree index builds are
accelerated as leaf pages are created from the memtuples array.
(i.e. The operation following actually "performing" the sort, but
before a tuplesort_end() call is made as a B-Tree spool is destroyed.)

Similarly, ordered set aggregates (all cases except the datumsort case
with a pass-by-value type), and regular heap tuplesorts benefit to about
the same degree. The optimization is only used when sorts fit in
memory, though.

Also, prefetch a few places ahead within the analogous "fetching" point
in tuplestore.c. This appears to offer similar benefits in certain
cases. For example, queries involving large common table expressions
significantly benefit.
---
 config/c-compiler.m4                | 17 +++++++++++++++++
 configure                           | 30 ++++++++++++++++++++++++++++++
 configure.in                        |  1 +
 src/backend/utils/sort/tuplesort.c  | 20 ++++++++++++++++++++
 src/backend/utils/sort/tuplestore.c | 13 +++++++++++++
 src/include/c.h                     | 14 ++++++++++++++
 src/include/pg_config.h.in          |  3 +++
 src/include/pg_config.h.win32       |  3 +++
 src/include/pg_config_manual.h      | 10 ++++++++++
 9 files changed, 111 insertions(+)

diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index 050bfa5..5776201 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -287,6 +287,23 @@ fi])# PGAC_C_BUILTIN_UNREACHABLE
 
 
 
+# PGAC_C_BUILTIN_PREFETCH
+# -------------------------
+# Check if the C compiler understands __builtin_prefetch(),
+# and define HAVE__BUILTIN_PREFETCH if so.
+AC_DEFUN([PGAC_C_BUILTIN_PREFETCH],
+[AC_CACHE_CHECK(for __builtin_prefetch, pgac_cv__builtin_prefetch,
+[AC_COMPILE_IFELSE([AC_LANG_PROGRAM([],
+[int i = 0;__builtin_prefetch(&i, 0, 3);])],
+[pgac_cv__builtin_prefetch=yes],
+[pgac_cv__builtin_prefetch=no])])
+if test x"$pgac_cv__builtin_prefetch" = xyes ; then
+AC_DEFINE(HAVE__BUILTIN_PREFETCH, 1,
+          [Define to 1 if your compiler understands __builtin_prefetch.])
+fi])# PGAC_C_BUILTIN_PREFETCH
+
+
+
 # PGAC_C_VA_ARGS
 # --------------
 # Check if the C compiler understands C99-style variadic macros,
diff --git a/configure b/configure
index 2176d65..5fdd6bd 100755
--- a/configure
+++ b/configure
@@ -11283,6 +11283,36 @@ if test x"$pgac_cv__builtin_unreachable" = xyes ; then
 $as_echo "#define HAVE__BUILTIN_UNREACHABLE 1" >>confdefs.h
 
 fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __builtin_prefetch" >&5
+$as_echo_n "checking for __builtin_prefetch... " >&6; }
+if ${pgac_cv__builtin_prefetch+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+
+int
+main ()
+{
+int i = 0;__builtin_prefetch(&i, 0, 3);
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_c_try_compile "$LINENO"; then :
+  pgac_cv__builtin_prefetch=yes
+else
+  pgac_cv__builtin_prefetch=no
+fi
+rm -f core conftest.err conftest.$ac_objext conftest.$ac_ext
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv__builtin_prefetch" >&5
+$as_echo "$pgac_cv__builtin_prefetch" >&6; }
+if test x"$pgac_cv__builtin_prefetch" = xyes ; then
+
+$as_echo "#define HAVE__BUILTIN_PREFETCH 1" >>confdefs.h
+
+fi
 { $as_echo "$as_me:${as_lineno-$LINENO}: checking for __VA_ARGS__" >&5
 $as_echo_n "checking for __VA_ARGS__... " >&6; }
 if ${pgac_cv__va_args+:} false; then :
diff --git a/configure.in b/configure.in
index ece5e22..26c2ffe 100644
--- a/configure.in
+++ b/configure.in
@@ -1319,6 +1319,7 @@ PGAC_C_TYPES_COMPATIBLE
 PGAC_C_BUILTIN_BSWAP32
 PGAC_C_BUILTIN_CONSTANT_P
 PGAC_C_BUILTIN_UNREACHABLE
+PGAC_C_BUILTIN_PREFETCH
 PGAC_C_VA_ARGS
 PGAC_STRUCT_TIMEZONE
 PGAC_UNION_SEMUN
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 435041a..8c8c0e0 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -1663,6 +1663,26 @@ tuplesort_gettuple_common(Tuplesortstate *state, bool forward,
 				if (state->current < state->memtupcount)
 				{
 					*stup = state->memtuples[state->current++];
+
+					/*
+					 * Perform memory prefetch of "tuple proper" of the
+					 * SortTuple that's three places ahead of current
+					 * (which is returned to caller). Testing shows that
+					 * this significantly boosts the performance for
+					 * TSS_SORTEDINMEM "forward" callers by hiding memory
+					 * latency behind their processing of returned tuples.
+					 *
+					 * Don't do this for pass-by-value datum sorts; even
+					 * though hinting a NULL address does not affect
+					 * correctness, it would have a noticeable overhead
+					 * here.
+					 */
+#ifdef USE_MEM_PREFETCH
+					if (stup->tuple != NULL &&
+						state->current + 2 < state->memtupcount)
+						pg_rfetch(state->memtuples[state->current + 2].tuple);
+#endif
+
 					return true;
 				}
 				state->eof_reached = true;
diff --git a/src/backend/utils/sort/tuplestore.c b/src/backend/utils/sort/tuplestore.c
index 627b281..3cc12a9 100644
--- a/src/backend/utils/sort/tuplestore.c
+++ b/src/backend/utils/sort/tuplestore.c
@@ -895,6 +895,19 @@ tuplestore_gettuple(Tuplestorestate *state, bool forward,
 					return NULL;
 				if (readptr->current < state->memtupcount)
 				{
+					/*
+					 * Perform memory prefetch of tuple that's three places
+					 * ahead of current (which is returned to caller).
+					 * Testing shows that this significantly boosts the
+					 * performance for TSS_INMEM "forward" callers by
+					 * hiding memory latency behind their processing of
+					 * returned tuples.
+					 */
+#ifdef USE_MEM_PREFETCH
+					if (readptr->current + 3 < state->memtupcount)
+						pg_rfetch(state->memtuples[readptr->current + 3]);
+#endif
+
 					/* We have another tuple, so return it */
 					return state->memtuples[readptr->current++];
 				}
diff --git a/src/include/c.h b/src/include/c.h
index 92c5202..835b98e 100644
--- a/src/include/c.h
+++ b/src/include/c.h
@@ -919,6 +919,20 @@ typedef NameData *Name;
 #define pg_unreachable() abort()
 #endif
 
+/*
+ * Prefetch support -- Support memory prefetching hints on some platforms.
+ *
+ * pg_rfetch() is specialized for the case where an array is accessed
+ * sequentially, and we can prefetch a pointer within the next element (or an
+ * even later element) in order to hide memory latency. This case involves
+ * prefetching addresses with low temporal locality. Note that it's rather
+ * difficult to get any kind of speedup using pg_rfetch(); any use of the
+ * intrinsic should be carefully tested. Also note that it's okay to pass it
+ * an invalid or NULL address, although it's best avoided.
+ */
+#if defined(USE_MEM_PREFETCH)
+#define pg_rfetch(addr) __builtin_prefetch((addr), 0, 0)
+#endif
 
 /*
  * Function inlining support -- Allow modules to define functions that may be
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 5688f75..6aad69b 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -669,6 +669,9 @@
 /* Define to 1 if your compiler understands __builtin_constant_p. */
 #undef HAVE__BUILTIN_CONSTANT_P
 
+/* Define to 1 if your compiler understands __builtin_prefetch. */
+#undef HAVE__BUILTIN_PREFETCH
+
 /* Define to 1 if your compiler understands __builtin_types_compatible_p. */
 #undef HAVE__BUILTIN_TYPES_COMPATIBLE_P
 
diff --git a/src/include/pg_config.h.win32 b/src/include/pg_config.h.win32
index 22bbb91..3e729c5 100644
--- a/src/include/pg_config.h.win32
+++ b/src/include/pg_config.h.win32
@@ -523,6 +523,9 @@
 /* Define to 1 if your compiler understands __builtin_constant_p. */
 /* #undef HAVE__BUILTIN_CONSTANT_P */
 
+/* Define to 1 if your compiler understands __builtin_prefetch. */
+#undef HAVE__BUILTIN_PREFETCH
+
 /* Define to 1 if your compiler understands __builtin_types_compatible_p. */
 /* #undef HAVE__BUILTIN_TYPES_COMPATIBLE_P */
 
diff --git a/src/include/pg_config_manual.h b/src/include/pg_config_manual.h
index e278fa0..4c7b1d5 100644
--- a/src/include/pg_config_manual.h
+++ b/src/include/pg_config_manual.h
@@ -153,6 +153,16 @@
 #endif
 
 /*
+ * USE_MEM_PREFETCH controls whether Postgres will attempt to use memory
+ * prefetching. Usually the automatic configure tests are sufficient, but
+ * it's conceivable that using prefetching is counter-productive on some
+ * platforms. If necessary you can remove the #define here.
+ */
+#ifdef HAVE__BUILTIN_PREFETCH
+#define USE_MEM_PREFETCH
+#endif
+
+/*
  * USE_SSL code should be compiled only when compiling with an SSL
  * implementation. (Currently, only OpenSSL is supported, but we might add
  * more implementations in the future.)
-- 
1.9.1
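
The tuplesort.c hunk hints at index current + 2 while its comment says "three places ahead". Both hold because current has already been advanced past the element being returned. A simplified sketch of that fetch step follows. SketchTuple, SketchState, sketch_gettuple() and the SKETCH_/sketch_ macros are hypothetical stand-ins rather than the real tuplesort structures, and the compiler-macro test is an assumption replacing the configure check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, cut-down stand-in for SortTuple. */
typedef struct SketchTuple
{
	void	   *tuple;			/* out-of-line "tuple proper", or NULL */
	long		datum1;			/* inline leading key */
} SketchTuple;

/* Hypothetical, cut-down stand-in for the in-memory sort state. */
typedef struct SketchState
{
	SketchTuple *memtuples;
	size_t		memtupcount;
	size_t		current;		/* index of the next element to return */
} SketchState;

/* Assumption: probe the compiler directly instead of running configure. */
#if defined(__GNUC__) || defined(__clang__)
#define SKETCH_USE_MEM_PREFETCH
#define sketch_rfetch(addr) __builtin_prefetch((addr), 0, 0)
#endif

/* Copy the next element into *stup; return false once all are consumed. */
bool
sketch_gettuple(SketchState *state, SketchTuple *stup)
{
	assert(state != NULL && stup != NULL);

	if (state->current >= state->memtupcount)
		return false;

	/* Returns element k and leaves current == k + 1. */
	*stup = state->memtuples[state->current++];

#ifdef SKETCH_USE_MEM_PREFETCH
	/*
	 * current + 2 == k + 3, i.e. three places ahead of the element just
	 * returned.  Skip the hint when there is no out-of-line tuple.
	 */
	if (stup->tuple != NULL &&
		state->current + 2 < state->memtupcount)
		sketch_rfetch(state->memtuples[state->current + 2].tuple);
#endif

	return true;
}
```

The tuplestore.c hunk reaches the same distance with current + 3, since there the hint is issued before current is incremented.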