On Fri, Apr 29, 2011 at 6:23 PM, Xinliang David Li <davi...@google.com> wrote: > Here is the background for this feature: > > 1) People relies on function multi-version to explore hw features and > squeeze performance, but there is no standard ways of doing so, either > a) using indirect function calls with function pointers set at program > initialization; b) using manual dispatch at each callsite; b) using > features like IFUNC. The dispatch mechanism needs to be promoted to > the language level and becomes the first class citizen;
You are not doing that, you are inventing a new (crude) GCC extension. > 2) Most importantly, the above non standard approaches block > interprocedural optimizations such as inlining. With the introduction > of buitlin_dispatch and the clear semantics, compiler can do more > aggressive optimization. I don't think so, but the previous mail lacked detail on why the new scheme would be better. > Multi-way dispatch will be good, but most of the use cases we have > seen is 2-way -- a fast path + a default path. > > Who are the targeted consumer of the feature? > > 1) For developers who like to MV function manually; > 2) For user directed function cloning > > e.g, > int foo (...) __attribute__((clone ("sse")): > > 3) Automatic function cloning determined by compiler. > > >> >> I am not sure that combining the function choice and its invocation >> to a single builtin is good. First, you use variadic arguments for >> the actual function arguments which will cause vararg promotions >> to apply and thus it will be impossible to dispatch to functions >> taking single precision float values (and dependent on ABI details >> many more similar issues). Second, you restrict yourself to >> only two versions of a function - that looks quite limited and this >> restriction is not easy to lift with your proposed interface. > > See above. Multi-way dispatch can be added similarly. Not with the specified syntax. So you'd end up with _two_ language extensions. That's bad (and unacceptable, IMHO). > >> >> That's a nice idea, but why not simply process all functions with >> a const attribute and no arguments this way? IMHO >> >> int testsse (void) __attribute__((const)); >> >> is as good and avoids the new attribute completely. The pass would >> also benefit other uses of this idiom (giving a way to have global >> dynamic initializers in C). The functions may not be strictly 'const' >> in a way the compiler autodetects this attribute but it presents >> exactly the promises to the compiler that allow this optimization. >> >> Thus, please split this piece out into a separate patch and use >> the const attribute. > > > Sounds good. > >> >>> What happens with cloning, -fclone-hot-version-paths ? >>> ----------------------------------------------------- >> >> Now, here you lost me somewhat, because I didn't look into the >> patch details and I am missing an example on how the lowered >> IL looks before that cloning. So for now I suppose this >> -fclone-hot-version-paths >> is to expose direct calls to the IL. If you would lower __builtin_dispatch >> to a style like >> >> int sel = selector (); >> switch (sel) >> { >> case 0: >> fn = popcnt; >> break; >> case 1: >> fn = popcnt_sse4; >> break; >> } >> res = (*fn) (25); >> >> then rather than a new pass specialized for __builtin_dispatch existing >> optimization passes that are able to do tracing like VRP and DOM >> via jump-threading or the tracer pass should be able to do this >> optimization for you. If they do not use FDO in a good way it is better >> to improve them for this case instead of writing a new pass. > > What you describe may not work > > 1) the function selection may happen in a different function; Well, sure. I propose to have the above as lowered form. If people deliberately obfuscate code it's their fault. Your scheme simply makes that obfuscation impossible (or less likely possible). So 1) is not a valid argument. > 2) Compiler will need to hoist the selection into the program > initializer to avoid overhead ? Isn't that the point of the "const" function call optimization which I said was a good thing anyway? So, after that it would be int sel = some_global_static; ... > As an example of why dispatch hoisting and call chain cloning is needed: > > void foo(); > void bar(); > > void doit_v1(); > void doit_v2(); > bool check_v () __attribute__((const)); > > int test(); > > > void bar() > { > .... > for (.....) > { > foo(); > .... > } > } > > void foo() > { > ... > for (...) > { > __builtin_dispatch(check_v, doit_v1, doit_v2,...); > ... > } > } > > > int test () > { > .. > bar (); > } > > > The feature testing and dispatch is embedded in a 2-deep loop nest > crossing function boundaries. The call paths test --> bar --> foo > needs to be cloned. This is done by hoisting dispatch up the call > chain -- it ends up : > > > void bar_v1() > { > .... > for (..) > { > foo_v1 (); > } > .. > } > > void bar_v2 () > { > ... > for (..) > { > foo_v2(); > } > .. > } > > void foo_v1 () > { > .. > for () > { > doit_v1() > } > } > > void foo_v2 () > { > .. > for () > { > doit_v2() > } > } > > int test () > { > __builtin_dispatch(check_v, bar_v1, bar_v2); > .. > > } So - and why is this not a good optimization when instead of __builtin_dispatch there is int sel = some_global_const_var; switch (sel) { case 1: fn = bar_v1; break; case 2: fn = bar_v2; break; } (*fn)(); ? My point is that such optimization is completely independent of that dispatch thing. The above could as well be a selection to use different input arrays, one causing alias analysis issues and one not. Thus, a __builtin_dispatch centric optimization pass is the wrong way to go. Now, with FDO I'd expect the foo is inlined into bar (because foo is deemed hot), then you only need to deal with loop unswitching, which should be easy to drive from FDO. Richard. > > > Thanks, > > David > > > >> >> Note that we already have special FDO support for indirect to direct >> call promotion, so that might work already. >> >> Thus, with all this, __builtin_dispatch would be more like syntactic >> sugar to avoid writing above switch statement yourself. If you restrict >> that sugar to a constant number of candidates it can be replaced by >> a macro (or several ones, specialized for the number of candidates). >> >> Richard. >> >>> For the call graph that looks like this before cloning : >>> >>> func_cold ----> func_hot ----> findOnes ----> IsPopCntSupported ----> popcnt >>> | >>> -----> no_popcnt >>> >>> where popcnt and no_popcnt are the multi-versioned functions, then after >>> cloning : >>> >>> func_cold ---->IsPopCntSupported ----> func_hot_clone0 ----> >>> findOnes_clone0 ----> popcnt >>> | >>> ----> func_hot_clone1 ----> findOnes_clone1 >>> ----> no_popcnt >>> >>> >>> >>> >>> Flags : -fclone-hot-version-paths does function unswitching via cloning. >>> --param=num-mversn-clones=<num> allows to specify the number of >>> functions that should be cloned. >>> --param=mversn-clone-depth=<num> allows to specify the length of >>> the call graph path that should be cloned. num = 0 implies only >>> leaf node that contains the __builtin_dispatch statement must be >>> cloned. >>> ============================================================================================ >>> >>> Google ref 45947, 45986 >>> >>> 2011-04-28 Sriraman Tallam <tmsri...@google.com> >>> >>> * c-family/c-common.c (revision 173122) >>> (handle_version_selector_attribute): New function. >>> (c_common_attribute_table): New attribute "version_selector". >>> * tree-pass.h (revision 173122) >>> (pass_tree_convert_builtin_dispatch): New pass. >>> (pass_ipa_multiversion_dispatch): New pass. >>> * testsuite/gcc.dg/mversn7.c (revision 0): New test case. >>> * testsuite/gcc.dg/mversn4.c (revision 0): New test case. >>> * testsuite/gcc.dg/mversn4.h (revision 0): New test case. >>> * testsuite/gcc.dg/mversn4a.c (revision 0): New test case. >>> * testsuite/gcc.dg/torture/mversn1.c (revision 0): New test case. >>> * testsuite/gcc.dg/mversn2.c (revision 0): New test case. >>> * testsuite/gcc.dg/mversn6.c (revision 0): New test case. >>> * testsuite/gcc.dg/mversn3.c (revision 0): New test case. >>> * testsuite/g++.dg/mversn8.C (revision 0): New test case. >>> * testsuite/g++.dg/mversn10a.C (revision 0): New test case. >>> * testsuite/g++.dg/mversn14a.C (revision 0): New test case. >>> * testsuite/g++.dg/tree-prof/mversn13.C (revision 0): New test case. >>> * testsuite/g++.dg/tree-prof/mversn15.C (revision 0): New test case. >>> * testsuite/g++.dg/tree-prof/mversn15a.C (revision 0): New >>> test case. >>> * testsuite/g++.dg/mversn9.C (revision 0): New test case. >>> * testsuite/g++.dg/mversn10.C (revision 0): New test case. >>> * testsuite/g++.dg/mversn12.C (revision 0): New test case. >>> * testsuite/g++.dg/mversn14.C (revision 0): New test case. >>> * testsuite/g++.dg/mversn16.C (revision 0): New test case. >>> * testsuite/g++.dg/torture/mversn11.C (revision 0): New test case. >>> * testsuite/g++.dg/torture/mversn5.C (revision 0): New test case. >>> * testsuite/g++.dg/torture/mversn5.h (revision 0): New test case. >>> * testsuite/g++.dg/torture/mversn5a.C (revision 0): New test case. >>> * builtin-types.def (revision 173122) (BT_PTR_FN_INT): New >>> pointer type. >>> (BT_FN_INT_PTR_FN_INT_PTR_PTR_VAR): New function type for >>> __builtin_dispatch. >>> * builtins.def (revision 173122) (BUILT_IN_DISPATCH): New builtin to >>> support multi-version calls. >>> * mversn-dispatch.c (revision 0): New file. >>> * timevar.def (revision 173122) (TV_MVERSN_DISPATCH): New time var. >>> * common.opt (revision 173122) (fclone-hot-version-paths): New >>> flag. >>> * Makefile.in (revision 173122) (mversn-dispatch.o): New rule. >>> * passes.c (revision 173122) (init_optimization_passes): Add >>> the new >>> multi-version and dispatch passes to the pass list. >>> * params.def (revision 173122) (PARAM_NUMBER_OF_MVERSN_CLONES): >>> Define. >>> (PARAM_MVERSN_CLONE_CGRAPH_DEPTH): Define. >>> * doc/invoke.texi (revision 173122) (mversn-clone-depth): >>> Document. >>> (num-mversn-clones): Document. >>> (fclone-hot-version-paths): Document. >> >