[Bug tree-optimization/50246] New: SRA: Writes to class members are not combined
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50246

             Bug #: 50246
           Summary: SRA: Writes to class members are not combined
    Classification: Unclassified
           Product: gcc
           Version: unknown
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
        AssignedTo: unassig...@gcc.gnu.org
        ReportedBy: jus...@fathomdb.com

Created attachment 25147
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=25147
test program

I'm modeling this bug report on the very similar Bug 36318; the issue is
still present (at least with "gcc version 4.6.1 (Debian 4.6.1-4)",
uname -r "3.0.0-1-amd64").

In this test file:

class P {
public:
  char s1;
  char s2;
  P(char i) : s1(i), s2(i) {}
};

void f(char j, P* p) {
  *p = P(j);
}

the constructor's writes to the two fields of P should be combined,
resulting in a single store instead of two. When compiled with -O3 (or
-Os), two stores are produced:

cc -c -O3 -g test.cc -o test.o
objdump -dS test.o
...
<_Z1fcP1P>:
  char s1;
  char s2;
  P(char i) : s1(i), s2(i) {}
};

void f(char j, P* p) {
  *p = P(j);
   0:   40 88 3e          mov    %dil,(%rsi)
   3:   40 88 7e 01       mov    %dil,0x1(%rsi)
}
   7:   c3                retq

-fno-tree-sra fixes the issue:

cc -c -O3 -fno-tree-sra -g test.cc -o test.o
objdump -dS test.o
...
<_Z1fcP1P>:
class P {
public:
  char s1;
  char s2;
  P(char i) : s1(i), s2(i) {}
   0:   31 c0             xor    %eax,%eax
   2:   48 89 fa          mov    %rdi,%rdx
   5:   40 88 f8          mov    %dil,%al
   8:   88 d4             mov    %dl,%ah
};

void f(char j, P* p) {
  *p = P(j);
   a:   66 89 06          mov    %ax,(%rsi)
}
   d:   c3                retq

I'm not sure this test case is actually valid, in that it isn't obvious
that the single combined store is better (the code is bigger). It is my
attempt at a highly reduced test case for a much more severe real-world
problem I encountered: with class P { uint16_t a; uint8_t b; }, calling
std::vector::push_back results in a 16-bit write, an 8-bit write, and then
a 32-bit read of the same address, which becomes a serious performance
hotspot, I believe because the CPU can't figure out the memory
dependencies. Doing manual bit-packing into a uint32_t (with the same
memory layout) dramatically improves the performance there.

If this test isn't valid, let me know why not and I'll try to find a
reduction that is valid!
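For reference, the manual bit-packing workaround looks roughly like this
(a minimal sketch only: the PackedP name and its accessors are made up for
illustration, and the bit layout assumes a little-endian target so that it
matches the original struct's bytes):

#include <stdint.h>

// Original layout: writes come out as a 16-bit store plus an 8-bit store,
// which a later 32-bit read of the same address then depends on.
struct P {
    uint16_t a;
    uint8_t  b;
};

// Workaround: funnel both fields through a single uint32_t so each write
// is one 32-bit store (same bytes as P on a little-endian target).
struct PackedP {
    uint32_t bits;

    void set(uint16_t a, uint8_t b) {
        bits = static_cast<uint32_t>(a) | (static_cast<uint32_t>(b) << 16);
    }
    uint16_t a() const { return static_cast<uint16_t>(bits); }
    uint8_t  b() const { return static_cast<uint8_t>(bits >> 16); }
};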
[Bug libstdc++/50257] New: unordered_map slow initialization due to huge __prime_list
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50257

             Bug #: 50257
           Summary: unordered_map slow initialization due to huge
                    __prime_list
    Classification: Unclassified
           Product: gcc
           Version: 4.6.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: libstdc++
        AssignedTo: unassig...@gcc.gnu.org
        ReportedBy: jus...@fathomdb.com

The constructor for std::unordered_map calls
_Prime_rehash_policy::_M_next_bkt to determine the initial size of the
hashtable. That method computes the size by looking for the next largest
number in a large list of primes (in src/hashtable-aux.cc), which it does
using lower_bound. However, this list is very big - I counted 310 elements
of 8 bytes each on a 64-bit machine, or ~2.4KB, accessed in a pattern
(lower_bound) that probably causes lots of cache misses. This lower_bound
call shows up as a performance hotspot in my profiles.

I think that:

1) The initial granularity is probably overkill. The initial entries are:

2ul, 3ul, 5ul, 7ul, 11ul, 13ul, 17ul, 19ul, 23ul, 29ul, 31ul, 37ul, 41ul,
43ul, 47ul, 53ul, 59ul, 61ul, 67ul, 71ul, 73ul, 79ul, 83ul, 89ul, 97ul,
103ul, 109ul, 113ul, 127ul, 137ul, 139ul, 149ul, 157ul, 167ul, 179ul,
193ul, 199ul, 211ul, 227ul, 241ul, 257ul, ...

Overall performance would probably be better or comparable with e.g. 11ul,
31ul, 127ul, 241ul.

2) We should special-case the first values, or (at the very least) the
value used when no size is specified in the unordered_map constructor (I
think that's 10).

I also found this prime list, which looks a lot more reasonable in terms of
density: http://gcc.gnu.org/ml/gcc-patches/2009-05/msg01293.html

One of the problems is that this list of primes serves three purposes:

1) Initial allocation when no size is specified (for which it is slow)
2) Initial allocation when the user specifies the size (when we probably
   want a size as close as possible to the target number)
3) Resizing (when again we want a size as close as possible to the size
   we've computed)

I think this is probably a good list for purposes #2 and #3, but it is
pretty painful for #1. Unfortunately, #1 is probably the most common use
case, I would think.

An alternative fix would be to have two different _Prime_rehash_policy
strategies: one for people who know the size of their hashtable, and one
for people who just want sensible defaults. I don't think
_Prime_rehash_policy is a good match for the second group, and I suspect
the second group is the majority.

Is this the right place to raise this, or would the mailing list be better?
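For concreteness, the lookup is essentially just a lower_bound over a
sorted prime table; a rough standalone sketch of what _M_next_bkt amounts
to (this is not the libstdc++ source, and the table here is heavily
truncated):

#include <algorithm>

// Heavily truncated stand-in for the ~310-entry __prime_list.
static const unsigned long primes[] = {
    2ul, 3ul, 5ul, 7ul, 11ul, 13ul, 17ul, 19ul, 23ul, 29ul, 31ul, 37ul,
    41ul, 43ul, 47ul, 53ul, 59ul, 61ul, 67ul, 71ul, 73ul, 79ul, 83ul
};

// Roughly what _M_next_bkt does: binary-search the table for the first
// prime >= n.  Every default construction pays for this search, and with
// the real ~2.4KB table the probes touch several cache lines.
unsigned long next_bucket_count(unsigned long n)
{
    const unsigned long* end = primes + sizeof(primes) / sizeof(primes[0]);
    const unsigned long* p = std::lower_bound(primes, end, n);
    return p == end ? *(end - 1) : *p;
}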
[Bug libstdc++/50257] [C++0x] unordered_map slow initialization due to huge __prime_list
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50257

--- Comment #5 from Justin SB 2011-09-01 17:06:04 UTC ---
Created attachment 25167
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=25167
Test case

Test case that stresses unordered_map construction.
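(The attachment itself isn't inlined here; the benchmark is along the
following lines - a hypothetical reconstruction for illustration, not the
attached file - built with "g++ -g -O3 --std=c++0x":)

#include <unordered_map>

// Construct (and barely use) a large number of unordered_maps so that the
// per-construction cost of _M_next_bkt dominates the profile.
int main()
{
    unsigned long total = 0;
    for (int i = 0; i < 10000000; ++i) {
        std::unordered_map<int, int> m;
        m[i] = i;
        total += m.size();
    }
    return total == 0;  // use the result so the loop isn't optimized away
}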
[Bug libstdc++/50257] [C++0x] unordered_map slow initialization due to huge __prime_list
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50257

--- Comment #6 from Justin SB 2011-09-01 17:33:32 UTC ---
Thanks so much for the help. I created a test case (attached) and tried out
the patch (the second version); it is a big improvement. Here are the
timings:

g++ -g -O3 --std=c++0x test.cc

Without patch:
user    0m13.201s

With patch:
user    0m10.237s

---

The special-casing is obviously a bit of a hack, so I wondered if we could
do better by doing a linear scan when the value is less than a threshold
(which would have a much nicer memory access pattern). It was a small
improvement, but nowhere near what the special-casing of the value 10 does:

With linear scan for small n:
user    0m12.561s

I therefore believe the real win is that the special case allows the
compiler to optimize away the entire lookup in __prime_list, because the
function is inlined and the value of n is known.

---

There is a comment in hashtable_policy.h:

  // XXX This is a hack.  There's no good reason for any of
  // _Prime_rehash_policy's member functions to be inline.

However, I think that (with the patch) this isn't true any more. There is a
big performance win from avoiding touching __prime_list, so at least the
special case should remain inline.

---

Regarding your malloc comment: I'm using tcmalloc, which is a bit faster
than the default allocator, which is probably why the lookup showed up in
my profiles; the patch still knocks the same ~3 seconds off the time:

g++ -g -O3 --std=c++0x test.cc -ltcmalloc

Without patch, with -ltcmalloc:
user    0m10.173s

With patch, with -ltcmalloc:
user    0m7.288s

---

In terms of the (non-cumulative) CPU cycles spent in the lookup vs.
new/delete:

With -ltcmalloc:
  Without patch: 44% iter() / 48% memory
  With patch:    34% iter() / 63% memory

With stock malloc:
  Without patch: 38% iter() / 58% memory
  With patch:    18% iter() / 79% memory

So memory is indeed a big cost, but because I was using tcmalloc I was able
to see that the constructor was surprisingly expensive (about as much time
was spent in the constructor as in allocation or deallocation).

---

TL;DR: The patch works for me, and I think it's a significant win, because
it allows the compiler to optimize __prime_list away entirely.
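---

To make the comparison concrete, the two shapes I'm contrasting look
roughly like the sketch below (an illustration of the idea only, not the
actual patch and not my exact test code; the table is a tiny stand-in for
__prime_list):

#include <algorithm>

// Tiny stand-in for __prime_list; the last entry plays the role of the
// real table's large final prime.
static const unsigned long primes[] = {
    2ul, 3ul, 5ul, 7ul, 11ul, 13ul, 17ul, 19ul, 23ul, 29ul, 31ul, 37ul,
    4294967291ul
};
static const unsigned long* const primes_end =
    primes + sizeof(primes) / sizeof(primes[0]);

// Special-case variant: for small requests (including the library default
// of 10) answer without touching the table, so that once this is inlined
// with a constant argument the whole lookup folds away to a constant.
inline unsigned long next_bkt_special(unsigned long n)
{
    if (n <= 11)
        return 11;  // may over-allocate slightly for n < 11, never too small
    return *std::lower_bound(primes, primes_end - 1, n);
}

// Linear-scan variant: friendlier memory access pattern for small n, but
// it still reads the table at run time, hence the much smaller win.
inline unsigned long next_bkt_linear(unsigned long n)
{
    if (n <= 257) {
        const unsigned long* p = primes;
        while (*p < n)
            ++p;
        return *p;
    }
    return *std::lower_bound(primes, primes_end - 1, n);
}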
[Bug libstdc++/50257] [C++0x] unordered_map slow initialization due to huge __prime_list
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50257

--- Comment #9 from Justin SB 2011-09-06 04:04:12 UTC ---
I think I personally prefer the original patches for readability, but I
guess it all depends on what gets optimized away.

That caused me to think: how about relying on constexpr? We should be able
to precompute the lookup for any constant argument while keeping the
original runtime behaviour (no extra checks) for any non-constant one. I
tried the code below, and while it seems to be correct and to satisfy the
constexpr rules, it doesn't get optimized away to a constant with gcc 4.6
(and I'm ashamed to say I can't get the git version of gcc to build at the
moment).

If this could be made to work, I think it could be a great approach, in
that we're then not trading off the constant and non-constant cases against
each other. (Optimizing small non-constexpr values as you've done in this
latest patch might also help performance, but I don't know if that really
happens much in practice; I suspect that if a caller goes to the trouble of
specifying a small size, they know the hash is very unlikely to grow beyond
that size.)

---

#include <iostream>

using namespace std;

constexpr int primes[] = { 2, 3, 5, 7, 11, 13 };

constexpr int primes_lower_bound(int __first, int __last, int __val)
{
  return (__first == __last)
    ? primes[__first]
    : (primes[__first + ((__last - __first) >> 1)] < __val)
        ? primes_lower_bound(__first + ((__last - __first) >> 1) + 1,
                             __last, __val)
        : primes_lower_bound(__first,
                             __first + ((__last - __first) >> 1), __val);
}

static_assert(primes_lower_bound(0, 6, 10) == 11, "Sanity check error");

constexpr int next_prime(int k)
{
  return primes_lower_bound(0, 6, k);
}

int main()
{
  cout << next_prime(10) << endl;
  return 0;
}
[Bug libstdc++/50257] [C++0x] unordered_map slow initialization due to huge __prime_list
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50257

--- Comment #15 from Justin SB 2011-09-06 15:55:31 UTC ---
That patch works for me - thanks; gcc optimizes away the array lookup.

I do still think that constexpr could be helpful here: every lookup for an
explicit (or default) unordered_map size would then be optimized away (not
just those <= 11), but constexpr is still new to me, so I'm not sure of the
behavioural nuances. I do think the lookup should definitely be optimized
away in my test case, but that's a separate issue.

Thanks again for fixing.