[Bug tree-optimization/50246] New: SRA: Writes to class members are not combined

2011-08-30 Thread justin at fathomdb dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50246

 Bug #: 50246
   Summary: SRA: Writes to class members are not combined
Classification: Unclassified
   Product: gcc
   Version: unknown
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
AssignedTo: unassig...@gcc.gnu.org
ReportedBy: jus...@fathomdb.com


Created attachment 25147
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=25147
test program

This bug report is modeled on the very similar Bug 36318; the issue is still
present (at least with "gcc version 4.6.1 (Debian 4.6.1-4)", uname -r
"3.0.0-1-amd64").

In this test file:

class P {
public:
char s1; char s2; 
P(char i) : s1(i), s2(i) {}
};

void f(char j, P* p) {
*p = P(j);
}

The constructor's writes to the two fields of P should be combined, resulting
in a single store instead of two stores.

When compiled with -O3 (or -Os), two stores are produced:
cc -c -O3 -g test.cc -o test.o
objdump -dS test.o

...
 <_Z1fcP1P>:
char s1; char s2; 
P(char i) : s1(i), s2(i) {}
};

void f(char j, P* p) {
*p = P(j);
   0:   40 88 3e        mov    %dil,(%rsi)
   3:   40 88 7e 01     mov    %dil,0x1(%rsi)
}
   7:   c3              retq


-fno-tree-sra fixes the issue:
cc -c -O3 -fno-tree-sra -g test.cc -o test.o
objdump -dS test.o

...
 <_Z1fcP1P>:
class P {
public:
char s1; char s2; 
P(char i) : s1(i), s2(i) {}
   0:   31 c0           xor    %eax,%eax
   2:   48 89 fa        mov    %rdi,%rdx
   5:   40 88 f8        mov    %dil,%al
   8:   88 d4           mov    %dl,%ah
};

void f(char j, P* p) {
*p = P(j);
   a:   66 89 06        mov    %ax,(%rsi)
}
   d:   c3              retq

I'm not sure that this test case is actually valid, in that it's not
necessarily obvious that the single store is better (given that the code is
bigger).  This is my attempt at a highly reduced test case for a much more
severe real-world problem I encountered: with class P { uint16_t a; uint8_t b; },
calling std::vector::push_back results in a 16-bit write, an 8-bit write, and
then a 32-bit read at the same address, which becomes a serious performance
hotspot, I believe because the CPU can't figure out the memory dependencies.
Doing manual bit-packing into a uint32_t (with the same memory layout)
dramatically improves performance there.  If this test isn't valid, let me know
why not and I'll try to find a reduction that is!


[Bug libstdc++/50257] New: unordered_map slow initialization due to huge __prime_list

2011-08-31 Thread justin at fathomdb dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50257

 Bug #: 50257
   Summary: unordered_map slow initialization due to huge
__prime_list
Classification: Unclassified
   Product: gcc
   Version: 4.6.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: libstdc++
AssignedTo: unassig...@gcc.gnu.org
ReportedBy: jus...@fathomdb.com


The constructor for std::unordered_map calls _Prime_rehash_policy::_M_next_bkt
to determine the initial size of the hashtable.  That method computes the size
by looking for the next largest number in a large list of primes (in
src/hashtable-aux.cc), which it does using lower_bound.  However, this list is
very big - I counted it as 310 elements of 8 bytes each on a 64 bit machine, or
~ 2.4KB, accessed in a way (lower_bound) that probably causes lots of cache
misses.  This lower_bound call comes up as a performance hotspot on my
profiles.

I think that:

1) the initial granularity is probably overkill.  The initial entries are:
 2ul, 3ul, 5ul, 7ul, 11ul, 13ul, 17ul, 19ul, 23ul, 29ul, 31ul,
37ul, 41ul, 43ul, 47ul, 53ul, 59ul, 61ul, 67ul, 71ul, 73ul, 79ul,
83ul, 89ul, 97ul, 103ul, 109ul, 113ul, 127ul, 137ul, 139ul, 149ul,
157ul, 167ul, 179ul, 193ul, 199ul, 211ul, 227ul, 241ul, 257ul,...
Overall performance would probably be better/comparable with e.g. 11ul, 31ul,
127ul, 241ul.

2) we should special-case the first values, or (at the very least) the value
when size is specified in the unordered_map constructor (I think that's 10).

I also found this prime list, which looks a lot more reasonable in terms of the
density: http://gcc.gnu.org/ml/gcc-patches/2009-05/msg01293.html

One of the problems is that this list of primes serves three purposes:
1) Initial allocation when no size is specified (for which it is slow)
2) Initial allocation when the user is specifying the size (when we probably
want a size as close as possible to the target number)
3) Resizing (when again we want a size as close as possible to the size we've
computed)

I think that this is probably a good list for purpose #2 and #3, but it is
pretty painful for #1.  Unfortunately #1 is probably the most common use
case, I would think.  An alternative fix would be to have two different
_Prime_rehash_policy strategies: one for people that know the size of their
hashtable, and one for people that just want sensible defaults.  I don't think
_Prime_rehash_policy is a good match for the second group, and I suspect the
second group is the majority.

Is this the right place to raise this, or would the mailing list be better?


[Bug libstdc++/50257] [C++0x] unordered_map slow initialization due to huge __prime_list

2011-09-01 Thread justin at fathomdb dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50257

--- Comment #5 from Justin SB  2011-09-01 17:06:04 
UTC ---
Created attachment 25167
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=25167
Test case

Test case that stresses unordered_map construction


[Bug libstdc++/50257] [C++0x] unordered_map slow initialization due to huge __prime_list

2011-09-01 Thread justin at fathomdb dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50257

--- Comment #6 from Justin SB  2011-09-01 17:33:32 
UTC ---
Thanks so much for the help.  I created a test case (attached) and tried out
the patch (the second version); it is a big improvement.

Here are the timings:
g++ -g -O3 --std=c++0x test.cc

Without patch:
user    0m13.201s

With patch:
user    0m10.237s

---

The special casing is obviously a bit of a hack, so I wondered if we could do
better by doing a linear scan if the value is less than a threshold (which
would have a much nicer memory pattern).  It was a small improvement, but
nowhere near what the special-casing of the value 10 does:

With linear scan for small n:
user    0m12.561s

I therefore believe the real win is because the special case allows the
compiler to optimize away the entire lookup in __prime_list, because the
function is inlined and the value of n is known.

---

There is a comment in hashtable_policy:

  // XXX This is a hack.  There's no good reason for any of
  // _Prime_rehash_policy's member functions to be inline.

However, I think that (with the patch) this isn't true any more.  There's a big
performance win by avoiding touching __prime_list, so at least the special case
should remain inline.

---

Regarding your malloc comment: I'm using tcmalloc, which is a bit faster than
the default allocator; that's probably why the lookup showed up.  The patch
still knocks the same ~3 seconds off the time:
g++ -g -O3 --std=c++0x test.cc -ltcmalloc

Without patch, with -ltcmalloc
user    0m10.173s

With patch, with -ltcmalloc
user    0m7.288s


---

In terms of the (non-cumulative) CPU cycles spent in the lookup vs. new/delete:

With -ltcmalloc:
Without patch: 44% iter() / 48% memory
With patch: 34% iter() / 63% memory

With stock malloc:
Without patch: 38% iter() / 58% memory
With patch: 18% iter() / 79% memory

So memory is indeed a big cost, but since I was using tcmalloc I was able to
see that the constructor was surprisingly expensive (spending about as much
time in the constructor as in allocation or deallocation).

---

TLDR: The patch works for me, and I think it's a significant win, because it
allows the compiler to optimize __prime_list away entirely.


[Bug libstdc++/50257] [C++0x] unordered_map slow initialization due to huge __prime_list

2011-09-05 Thread justin at fathomdb dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50257

--- Comment #9 from Justin SB  2011-09-06 04:04:12 
UTC ---
I think I personally prefer the original patches for readability, but I guess
it all depends on what gets optimized away.  But that caused me to think: how
about relying on constexpr?  We should be able to precompute the lookup for any
constant, while having the original runtime behaviour (no extra checks) for any
non-constant.

I tried the code below, and while it seems to be correct and to satisfy the
constexpr rules, it doesn't get optimized away to a constant with gcc 4.6 (and
I'm ashamed to say I can't get the git version of gcc to build at the moment).

If this could be made to work, I think this could be a great approach, in that
we're then not trading off the constant/non-constant cases against each other.

(Optimizing small non-constexpr values as you've done in this latest patch
might also be helpful for performance, but I don't know if that really happens
much in practice; I suspect if a caller goes to the trouble of specifying a
small size they know that the hash is very unlikely to grow beyond that size)

---

#include <iostream>

using namespace std;

constexpr int primes[] = { 2, 3, 5, 7, 11, 13 };

constexpr int primes_lower_bound(int __first, int __last, int __val)
{
  return (__first == __last)
    ? primes[__first]
    : (primes[__first + ((__last - __first) >> 1)] < __val)
      ? primes_lower_bound(__first + ((__last - __first) >> 1) + 1, __last, __val)
      : primes_lower_bound(__first, __first + ((__last - __first) >> 1), __val);
}

static_assert(primes_lower_bound(0, 6, 10) == 11, "Sanity check error");

constexpr int next_prime(int k) {
  return primes_lower_bound(0, 6, k);
}

int main() {
  cout << next_prime(10) << endl;
  return 0;
}


[Bug libstdc++/50257] [C++0x] unordered_map slow initialization due to huge __prime_list

2011-09-06 Thread justin at fathomdb dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50257

--- Comment #15 from Justin SB  2011-09-06 15:55:31 
UTC ---
That patch works for me - thanks;  gcc optimizes away the array lookup.

I do still think that constexpr could be helpful here: every lookup for an
explicit (or default) unordered_map size would then be optimized away (not just
those <= 11), but constexpr is still new to me so I'm not sure of the
behavioural nuances.  I do think it should definitely be optimized away in my
test case, but that's a separate issue.

Thanks again for fixing.