On Thu, Jun 04, 2015 at 02:34:40PM -0700, Andi Kleen wrote:
> The compiler has much more information than the headers.
>
> - It can do alias analysis, so to avoid needing to handle overlap
>   and similar.
It could, but it could also export that information, which would
benefit third parties.

> - It can (sometimes) determine alignment, which is important
>   information for tuning.

In the general case yes, but here it is useless. In most functions the
input is 16-byte aligned in less than 10% of calls, so you shouldn't
add a cold branch to handle aligned data.

Also, as I mentioned in bug reports before, gcc currently doesn't
handle alignment well, so it doesn't optimize the following to zero for
an aligned pointer:

  align = ((uintptr_t) x) % 16;

If it did, you wouldn't need to go through gcc at all; a header could
simply check the alignment with

  __builtin_constant_p (((uintptr_t) x) % 16) && ((uintptr_t) x) % 16 == 0

(a sketch of such a header is at the end of this mail).

> - With profile feedback it can use value histograms to determine the
>   best code.
>

The problem is that histograms are not enough, as I mentioned before.
For profiling you need to measure data that is actually useful, which
differs per function and should be done in userspace.

To generate the best code you need to know things like what percentage
of the accessed cache lines live in L1, L2 and L3, in order to select
the correct memset. On Ivy Bridge I measured that using rep stosq for
memset (x, 0, 4096) is 20% slower than a libcall for L1-cache-resident
data, while 50% faster for data outside the cache. How do you teach a
compiler that? (A sketch of the measurement is at the end of this
mail.) Switch to 16-byte blocks on these pages to see the graphs:

http://kam.mff.cuni.cz/~ondra/benchmark_string/i7_ivy_bridge/memset_profile/results_rand/result.html
http://kam.mff.cuni.cz/~ondra/benchmark_string/i7_ivy_bridge/memset_profile/results_rand_nocache/result.html

Likewise, on memcpy I measured that rte_memcpy is faster on copies of
L1-cache-resident data. That isn't very useful, as you cannot have many
8kb input and output buffers that are both in L1 cache. The reason is
that it uses a 256-byte loop; that advantage becomes nil for
L2-resident data and a problem for L3-resident data, where it is
slower:

http://kam.mff.cuni.cz/~ondra/benchmark_string/haswell/memcpy_profile/results_rand/result.html
http://kam.mff.cuni.cz/~ondra/benchmark_string/haswell/memcpy_profile/results_rand_L2/result.html
http://kam.mff.cuni.cz/~ondra/benchmark_string/haswell/memcpy_profile/results_rand_L3/result.html

Likewise, for strcmp & co. you need to know the probability
distribution of where the first mismatching byte occurs, and depending
on that first do 0-4 bytewise checks, possibly followed by 8-byte
checks, and only then a libcall (a sketch is at the end of this mail).

> It may not use all of this today, but it could.
>
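
For illustration, a minimal sketch of the header check I mean;
__memset_aligned16 is a hypothetical aligned variant, only the
__builtin_constant_p test is the point. When gcc cannot prove the
alignment the builtin folds to 0 and the whole branch disappears, so
the generic path pays nothing:

  #include <stdint.h>
  #include <string.h>

  /* Hypothetical variant that may assume 16-byte alignment of s.  */
  extern void *__memset_aligned16 (void *s, int c, size_t n);

  /* Inside a function-like macro the name memset is not expanded
     recursively, so the second branch calls the real function.  */
  #define memset(x, c, n) \
    ((__builtin_constant_p (((uintptr_t) (x)) % 16) \
      && ((uintptr_t) (x)) % 16 == 0) \
     ? __memset_aligned16 ((x), (c), (n)) \
     : memset ((x), (c), (n)))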
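
A sketch of the kind of measurement behind the rep stosq numbers above
(x86-64 gcc only; the loop counts and buffer size are my choices, the
benchmarks linked above do this more carefully):

  #include <stdint.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>

  /* memset (x, 0, 4096) done as rep stosq: 512 quadwords.  */
  static void
  zero4k_stosq (void *x)
  {
    void *d = x;
    size_t n = 4096 / 8;
    asm volatile ("rep stosq" : "+D" (d), "+c" (n) : "a" (0UL) : "memory");
  }

  static uint64_t
  rdtsc (void)
  {
    uint32_t lo, hi;
    asm volatile ("rdtsc" : "=a" (lo), "=d" (hi));
    return ((uint64_t) hi << 32) | lo;
  }

  int
  main (void)
  {
    size_t big = 512UL * 1024 * 1024;   /* larger than L3 */
    char *buf = malloc (big);
    uint64_t t;
    int i;

    memset (buf, 1, big);               /* prefault the pages */

    /* L1 resident: hit the same 4k page every iteration.  */
    t = rdtsc ();
    for (i = 0; i < 100000; i++)
      zero4k_stosq (buf);
    printf ("stosq,   L1:   %llu\n", (unsigned long long) (rdtsc () - t));

    t = rdtsc ();
    for (i = 0; i < 100000; i++)
      memset (buf, 0, 4096);
    printf ("libcall, L1:   %llu\n", (unsigned long long) (rdtsc () - t));

    /* Cold: each iteration writes a page not touched in this loop.  */
    t = rdtsc ();
    for (i = 0; i < 100000; i++)
      zero4k_stosq (buf + (size_t) i * 4096);
    printf ("stosq,   cold: %llu\n", (unsigned long long) (rdtsc () - t));

    memset (buf, 1, big);               /* restore a similar cache state */
    t = rdtsc ();
    for (i = 0; i < 100000; i++)
      memset (buf + (size_t) i * 4096, 0, 4096);
    printf ("libcall, cold: %llu\n", (unsigned long long) (rdtsc () - t));

    free (buf);
    return 0;
  }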
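
And a sketch of the strcmp direction; the cutoff of four bytewise
checks is purely illustrative and would be picked from the measured
distribution of first-mismatch positions:

  #include <string.h>

  /* Handle the common short-mismatch case inline, then do a libcall
     for the rare long common prefix.  */
  static inline int
  strcmp_inline (const char *a, const char *b)
  {
    int i;
    for (i = 0; i < 4; i++)
      {
        if (a[i] != b[i])
          return (unsigned char) a[i] - (unsigned char) b[i];
        if (a[i] == 0)
          return 0;
      }
    return strcmp (a + 4, b + 4);
  }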