On Thu, Jun 04, 2015 at 02:34:40PM -0700, Andi Kleen wrote:
> The compiler has much more information than the headers.
>
> - It can do alias analysis, so to avoid needing to handle overlap
>   and similar.
It could, but it could also export that information, which would
benefit third parties.

> - It can (sometimes) determine alignment, which is important
>   information for tuning.

In the general case yes, but here it is useless. In most functions the
input is 16-byte aligned in less than 10% of calls, so you shouldn't
add a cold branch to handle aligned data.

Also, as I mentioned in bug reports before, gcc currently doesn't
handle alignment well, so it doesn't optimize the following to zero for
an aligned pointer:

  align = ((uintptr_t) x) % 16;

If it did, you wouldn't need to go through gcc at all; a header could
simply check the alignment with

  __builtin_constant_p (((uintptr_t) x) % 16) && ((uintptr_t) x) % 16 == 0

(a sketch of such a header is at the end of this mail).

> - With profile feedback it can use value histograms to determine the
>   best code.
>

The problem is that histograms are not enough, as I mentioned before.
For profiling you need to measure data that is actually useful, which
differs per function and should be done in userspace.

To generate the best code you need to know things like what percentage
of the accessed cache lines live in L1, L2 and L3, in order to select
the correct memset. On Ivy Bridge I measured that using rep stosq for
memset (x, 0, 4096) is 20% slower than a libcall for L1-cache-resident
data, while 50% faster for data outside the cache. How do you teach a
compiler that? (A sketch of the measurement is at the end of this
mail.) Switch to 16-byte blocks on these pages to see the graphs:

http://kam.mff.cuni.cz/~ondra/benchmark_string/i7_ivy_bridge/memset_profile/results_rand/result.html
http://kam.mff.cuni.cz/~ondra/benchmark_string/i7_ivy_bridge/memset_profile/results_rand_nocache/result.html

Likewise, on memcpy I measured that rte_memcpy is faster on copies of
L1-cache-resident data. That isn't very useful, as you cannot have many
8kb input and output buffers that are both in L1 cache. The reason is
that it uses a 256-byte loop; that advantage becomes nil for
L2-resident data and a problem for L3-resident data, where it is
slower:

http://kam.mff.cuni.cz/~ondra/benchmark_string/haswell/memcpy_profile/results_rand/result.html
http://kam.mff.cuni.cz/~ondra/benchmark_string/haswell/memcpy_profile/results_rand_L2/result.html
http://kam.mff.cuni.cz/~ondra/benchmark_string/haswell/memcpy_profile/results_rand_L3/result.html

Likewise, for strcmp & co. you need to know the probability
distribution of where the first mismatching byte occurs, and depending
on that first do 0-4 bytewise checks, possibly followed by 8-byte
checks, and only then a libcall (a sketch is at the end of this mail).

> It may not use all of this today, but it could.
>
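
For illustration, a minimal sketch of the header check I mean;
__memset_aligned16 is a hypothetical aligned variant, only the
__builtin_constant_p test is the point. When gcc cannot prove the
alignment the builtin folds to 0 and the whole branch disappears, so
the generic path pays nothing:

  #include <stdint.h>
  #include <string.h>

  /* Hypothetical variant that may assume 16-byte alignment of s.  */
  extern void *__memset_aligned16 (void *s, int c, size_t n);

  /* Inside a function-like macro the name memset is not expanded
     recursively, so the second branch calls the real function.  */
  #define memset(x, c, n) \
    ((__builtin_constant_p (((uintptr_t) (x)) % 16) \
      && ((uintptr_t) (x)) % 16 == 0) \
     ? __memset_aligned16 ((x), (c), (n)) \
     : memset ((x), (c), (n)))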
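
A sketch of the kind of measurement behind the rep stosq numbers above
(x86-64 gcc only; the loop counts and buffer size are my choices, the
benchmarks linked above do this more carefully):

  #include <stdint.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>

  /* memset (x, 0, 4096) done as rep stosq: 512 quadwords.  */
  static void
  zero4k_stosq (void *x)
  {
    void *d = x;
    size_t n = 4096 / 8;
    asm volatile ("rep stosq" : "+D" (d), "+c" (n) : "a" (0UL) : "memory");
  }

  static uint64_t
  rdtsc (void)
  {
    uint32_t lo, hi;
    asm volatile ("rdtsc" : "=a" (lo), "=d" (hi));
    return ((uint64_t) hi << 32) | lo;
  }

  int
  main (void)
  {
    size_t big = 512UL * 1024 * 1024;   /* larger than L3 */
    char *buf = malloc (big);
    uint64_t t;
    int i;

    memset (buf, 1, big);               /* prefault the pages */

    /* L1 resident: hit the same 4k page every iteration.  */
    t = rdtsc ();
    for (i = 0; i < 100000; i++)
      zero4k_stosq (buf);
    printf ("stosq,   L1:   %llu\n", (unsigned long long) (rdtsc () - t));

    t = rdtsc ();
    for (i = 0; i < 100000; i++)
      memset (buf, 0, 4096);
    printf ("libcall, L1:   %llu\n", (unsigned long long) (rdtsc () - t));

    /* Cold: each iteration writes a page not touched in this loop.  */
    t = rdtsc ();
    for (i = 0; i < 100000; i++)
      zero4k_stosq (buf + (size_t) i * 4096);
    printf ("stosq,   cold: %llu\n", (unsigned long long) (rdtsc () - t));

    memset (buf, 1, big);               /* restore a similar cache state */
    t = rdtsc ();
    for (i = 0; i < 100000; i++)
      memset (buf + (size_t) i * 4096, 0, 4096);
    printf ("libcall, cold: %llu\n", (unsigned long long) (rdtsc () - t));

    free (buf);
    return 0;
  }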
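
And a sketch of the strcmp direction; the cutoff of four bytewise
checks is purely illustrative and would be picked from the measured
distribution of first-mismatch positions:

  #include <string.h>

  /* Handle the common short-mismatch case inline, then do a libcall
     for the rare long common prefix.  */
  static inline int
  strcmp_inline (const char *a, const char *b)
  {
    int i;
    for (i = 0; i < 4; i++)
      {
        if (a[i] != b[i])
          return (unsigned char) a[i] - (unsigned char) b[i];
        if (a[i] == 0)
          return 0;
      }
    return strcmp (a + 4, b + 4);
  }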