Re: [C++ PATCH] Speed up inplace_merge algorithm & fix inefficient logic(PR libstdc++/83938)

François Dumont Tue, 16 Jul 2019 09:40:43 -0700

Hi

I eventually spent much more time working on the inplace_mergeperformance bench.


    And the results do not confirm the theory:

Before patch:

inplace_merge.cc bench 1 / 1 memory 243r 227u 17s 1216mem 5pfinplace_merge.cc bench 1 / 4 memory 297r 278u 18s 480mem 0pfinplace_merge.cc bench 1 /64 memory 373r 354u 18s 480mem 0pfinplace_merge.cc bench 0 / 1 memory 12r 11u 0s 480mem 0pf


After the patch to reduce memory allocation:

inplace_merge.cc bench 1 / 1 memory 245r 227u 18s 1216mem 0pfinplace_merge.cc bench 1 / 4 memory 292r 273u 19s 480mem 0pfinplace_merge.cc bench 1 /64 memory 373r 356u 17s 480mem 0pfinplace_merge.cc bench 0 / 1 memory 11r 11u 0s 480mem 0pf


With the __len1 > __len2 condition change:

inplace_merge.cc bench 1 / 1 memory 244r 225u 20s 1216mem 0pfinplace_merge.cc bench 1 / 4 memory 379r 361u 17s 480mem 0pfinplace_merge.cc bench 1 /64 memory 640r 625u 16s 480mem 0pfinplace_merge.cc bench 0 / 1 memory 11r 11u 0s 480mem 0pf

When there is no memory limit the results are identical of course.Otherwise as soon as memory is limited performance starts to decreasewith the condition change on __len1 vs __len2.

Could you give the bench you use to demonstrate the enhancement ? I alsowonder why your patch doesn't change consistently the same condition in__merge_without_buffer ?

For the moment I'd like to propose the attached patch that is to say thereduction on the amount of allocated memory and the new/modified benches.

Note that as soon as I forbid any memory allocation I also reduce thesize of the range to merge cause the implementation rely on recursivityand so could end-up in a stack overflow. Maybe I need to check forsimulator targets like several other tests ? Unless simulators do notrun the performance tests ?

Regarding this stack overflow issue, is there some routine to find outhow many levels of function calls can be added before reaching the stackoverflow ? I could perhaps call __builtin_alloca and check the resultbut that smells. If I could find out this we could fallback on aniterative approach to complete the merge.


    PR libstdc++/83938
    * include/bits/stl_algo.h:
    (__inplace_merge): Take temporary buffer length from smallest range.
    (__stable_sort): Limit temporary buffer length.
    * include/bits/stl_tempbuf.h (get_temporary_buffer): Change __len
    computation in the loop.
    * testsuite/25_algorithms/inplace_merge/1.cc (test3): Test all possible
    pivot positions.
    * testsuite/performance/25_algorithms/inplace_merge.cc: New.
    * testsuite/performance/25_algorithms/stable_sort.cc: Rework to allow
    execution with different memory constraints.

Ok to commit ?

François

On 6/9/19 4:27 PM, François Dumont wrote:

On 12/21/18 9:57 PM, Jonathan Wakely wrote:
On 29/10/18 07:06 +0100, François Dumont wrote:
Hi

    Some feedback regarding this patch ?
Sorry this got missed, please resubmit during stage 1.

You haven't CC'd the original patch author (chang jc) to give them a
chance to comment on your proposed changes to the patch.

The attached PDF on PR libstdc++/83938 has extensive discussion of the
performance issue, but I don't see any for your version. Some
performance benchmarks for your version would be helpful.
Here is this patch again.
This time it is much closer to John's original one, I just kept mychange on the size of the temporary buffer which doesn't need to be aslarge as it is currently allocated, especially with John's patch.
The performance tests are showing the good improvements, attached arethe before/after. Surprisingly I do not see any change regardingallocated memory, I guess measures are not accurate enough. Howeverworking on them also show that current implementation is fragile. Ifyou reduce the allowed allocated memory too much you end up with astack overflow because of the recursive implementation.
I already have a patch to keep on trying to allocate memory as long asabove a given threshold rather than 0 but I am also working on makingimplementation non-recursive. We'll see if even a buffer of size 1 isbetter than no buffer at all then.
    PR libstdc++/83938
    * include/bits/stl_algo.h:
(__merge_adaptive): When buffer too small consider smallest rangefirst
    and cut based on buffer size.
    (__inplace_merge): Take temporary buffer length from smallest range.
    (__stable_sort): Limit temporary buffer length.
    * include/bits/stl_tempbuf.h (get_temporary_buffer): Change function
    to reduce requested buffer length on allocation failure.
    * testsuite/performance/25_algorithms/inplace_merge.cc: New.
    * testsuite/performance/25_algorithms/stable_sort.cc: Rework to allow
    execution with different level of memory.

François

diff --git a/libstdc++-v3/include/bits/stl_algo.h b/libstdc++-v3/include/bits/stl_algo.h
index ec651e2cc45..b396437443f 100644
--- a/libstdc++-v3/include/bits/stl_algo.h
+++ b/libstdc++-v3/include/bits/stl_algo.h
@@ -2530,7 +2530,7 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
       const _DistanceType __len2 = std::distance(__middle, __last);
 
       typedef _Temporary_buffer<_BidirectionalIterator, _ValueType> _TmpBuf;
-      _TmpBuf __buf(__first, __len1 + __len2);
+      _TmpBuf __buf(__first, std::min(__len1, __len2));
 
       if (__buf.begin() == 0)
 	std::__merge_without_buffer
@@ -2738,6 +2738,7 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
 	  std::__merge_sort_with_buffer(__first, __middle, __buffer, __comp);
 	  std::__merge_sort_with_buffer(__middle, __last, __buffer, __comp);
 	}
+
       std::__merge_adaptive(__first, __middle, __last,
 			    _Distance(__middle - __first),
 			    _Distance(__last - __middle),
@@ -5023,8 +5024,11 @@ _GLIBCXX_BEGIN_NAMESPACE_ALGO
       typedef typename iterator_traits<_RandomAccessIterator>::difference_type
 	_DistanceType;
 
+      if (__first == __last)
+	return;
+
       typedef _Temporary_buffer<_RandomAccessIterator, _ValueType> _TmpBuf;
-      _TmpBuf __buf(__first, std::distance(__first, __last));
+      _TmpBuf __buf(__first, (__last - __first + 1) / 2);
 
       if (__buf.begin() == 0)
 	std::__inplace_stable_sort(__first, __last, __comp);
diff --git a/libstdc++-v3/include/bits/stl_tempbuf.h b/libstdc++-v3/include/bits/stl_tempbuf.h
index fa78ee53eb4..b6ad9ee6a46 100644
--- a/libstdc++-v3/include/bits/stl_tempbuf.h
+++ b/libstdc++-v3/include/bits/stl_tempbuf.h
@@ -95,7 +95,7 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
 							std::nothrow));
 	  if (__tmp != 0)
 	    return std::pair<_Tp*, ptrdiff_t>(__tmp, __len);
-	  __len /= 2;
+	  __len = __len == 1 ? 0 : ((__len + 1) / 2);
 	}
       return std::pair<_Tp*, ptrdiff_t>(static_cast<_Tp*>(0), 0);
     }
diff --git a/libstdc++-v3/testsuite/25_algorithms/inplace_merge/1.cc b/libstdc++-v3/testsuite/25_algorithms/inplace_merge/1.cc
index 0b0c9c59640..9c393ce8fe3 100644
--- a/libstdc++-v3/testsuite/25_algorithms/inplace_merge/1.cc
+++ b/libstdc++-v3/testsuite/25_algorithms/inplace_merge/1.cc
@@ -28,7 +28,7 @@ using std::inplace_merge;
 
 typedef test_container<int, bidirectional_iterator_wrapper> container;
 
-void 
+void
 test1()
 {
   int array[] = { 1 };
@@ -39,7 +39,7 @@ test1()
   inplace_merge(con2.begin(), con2.begin(), con2.end());
 }
 
-void 
+void
 test2()
 {
   int array[] = { 0, 2, 4, 1, 3, 5 };
@@ -60,30 +60,34 @@ struct S
   { return a < _s.a; }
 };
 
-void 
+void
 test3()
 {
   S s[8];
-  s[0].a = 0;
-  s[1].a = 1;
-  s[2].a = 2;
-  s[3].a = 3;
-  s[4].a = 0;
-  s[5].a = 1;
-  s[6].a = 2;
-  s[7].a = 3;
-
-  s[0].b = 0;
-  s[1].b = 1;
-  s[2].b = 2;
-  s[3].b = 3;
-  s[4].b = 4;
-  s[5].b = 5;
-  s[6].b = 6;
-  s[7].b = 7;
-
-  inplace_merge(s, s + 4, s + 8);
-  VERIFY( s[0].b == 0 && s[1].b == 4 && s[2].b == 1 && s[3].b == 5 );
+  for (int pivot_idx = 0; pivot_idx < 8; ++pivot_idx)
+    {
+      int bval = 0;
+      for (int i = 0; i != pivot_idx; ++i)
+	{
+	  s[i].a = i;
+	  s[i].b = bval++;
+	}
+
+      for (int i = pivot_idx; i != 8; ++i)
+	{
+	  s[i].a = i - pivot_idx;
+	  s[i].b = bval++;
+	}
+
+      inplace_merge(s, s + pivot_idx, s + 8);
+
+      for (int i = 1; i < 8; ++i)
+	{
+	  VERIFY( !(s[i] < s[i - 1]) );
+	  if (s[i - 1].a == s[i].a)
+	    VERIFY( s[i - 1].b < s[i].b );
+	}
+    }
 }
 
 int 
diff --git a/libstdc++-v3/testsuite/performance/25_algorithms/inplace_merge.cc b/libstdc++-v3/testsuite/performance/25_algorithms/inplace_merge.cc
new file mode 100644
index 00000000000..99328b6ea4c
--- /dev/null
+++ b/libstdc++-v3/testsuite/performance/25_algorithms/inplace_merge.cc
@@ -0,0 +1,289 @@
+// Copyright (C) 2019 Free Software Foundation, Inc.
+//
+// This file is part of the GNU ISO C++ Library.  This library is free
+// software; you can redistribute it and/or modify it under the
+// terms of the GNU General Public License as published by the
+// Free Software Foundation; either version 3, or (at your option)
+// any later version.
+
+// This library is distributed in the hope that it will be useful,
+// but WITHOUT ANY WARRANTY; without even the implied warranty of
+// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+// GNU General Public License for more details.
+
+// You should have received a copy of the GNU General Public License along
+// with this library; see the file COPYING3.  If not see
+// <http://www.gnu.org/licenses/>.
+
+#include <vector>
+#include <algorithm>
+#include <cmath>
+
+#include <testsuite_new_operators.h>
+#include <testsuite_performance.h>
+
+const int max_size = 10000000;
+const int small_size = 200000;
+const int front_pivot_idx = 10;
+int middle_pivot_idx = max_size / 2;
+int back_pivot_idx = max_size - front_pivot_idx;
+
+void bench(int mem_threshold, int pivot_index,
+	   std::vector<int> revv,
+	   std::vector<int> fwdv,
+	   std::vector<int> wstv,
+	   std::vector<int> rndv)
+{
+  using namespace __gnu_test;
+
+  time_counter time;
+  resource_counter resource;
+
+  set_new_limit(mem_threshold);
+
+  start_counters(time, resource);
+  std::inplace_merge(revv.begin(), revv.begin() + pivot_index, revv.end());
+  stop_counters(time, resource);
+
+  set_new_limit(~size_t(0));
+
+  report_performance(__FILE__, "reverse", time, resource);
+  clear_counters(time, resource);
+
+  set_new_limit(mem_threshold);
+
+  start_counters(time, resource);
+  std::inplace_merge(fwdv.begin(), fwdv.begin() + pivot_index, fwdv.end());
+  stop_counters(time, resource);
+
+  set_new_limit(~size_t(0));
+
+  report_performance(__FILE__, "forward", time, resource);
+  clear_counters(time, resource);
+
+  set_new_limit(mem_threshold);
+
+  start_counters(time, resource);
+  std::inplace_merge(wstv.begin(), wstv.begin() + pivot_index, wstv.end());
+  stop_counters(time, resource);
+
+  set_new_limit(~size_t(0));
+
+  report_performance(__FILE__, "worst", time, resource);
+  clear_counters(time, resource);
+
+  set_new_limit(mem_threshold);
+
+  start_counters(time, resource);
+  std::inplace_merge(rndv.begin(), rndv.begin() + pivot_index, rndv.end());
+  stop_counters(time, resource);
+
+  set_new_limit(~size_t(0));
+  report_performance(__FILE__, "random", time, resource);
+}
+
+void mem_bench(double mem_ratio,
+	       const std::vector<int>& front_revv,
+	       const std::vector<int>& middle_revv,
+	       const std::vector<int>& back_revv,
+	       const std::vector<int>& fwdv,
+	       const std::vector<int>& front_wstv,
+	       const std::vector<int>& middle_wstv,
+	       const std::vector<int>& back_wstv,
+	       const std::vector<int>& front_rndv,
+	       const std::vector<int>& middle_rndv,
+	       const std::vector<int>& back_rndv)
+{
+  using namespace __gnu_test;
+
+  time_counter time;
+  resource_counter resource;
+
+  int max_mem = (int)std::ceil(front_pivot_idx * mem_ratio) * sizeof(int);
+  start_counters(time, resource);
+  bench(max_mem, front_pivot_idx, front_revv, fwdv, front_wstv, front_rndv);
+  stop_counters(time, resource);
+  report_performance(__FILE__, "front pivot", time, resource);
+  clear_counters(time, resource);
+
+  max_mem = (int)std::ceil(middle_pivot_idx * mem_ratio) * sizeof(int);
+  start_counters(time, resource);
+  bench(max_mem, middle_pivot_idx, middle_revv, fwdv, middle_wstv, middle_rndv);
+  stop_counters(time, resource);
+  report_performance(__FILE__, "middle pivot", time, resource);
+  clear_counters(time, resource);
+
+  max_mem = (int)std::ceil(front_pivot_idx * mem_ratio) * sizeof(int);
+  start_counters(time, resource);
+  bench(max_mem, back_pivot_idx, back_revv, fwdv, back_wstv, back_rndv);
+  stop_counters(time, resource);
+  report_performance(__FILE__, "back pivot", time, resource);
+}
+
+void init_reverse(std::vector<int>& v, size_t pivot_index)
+{
+  int val = 0;
+  for (size_t i = pivot_index; i != v.size(); ++i)
+    v[i] = val++;
+  for (size_t i = 0; i != pivot_index; ++i)
+    v[i] = val++;
+}
+
+void init_forward(std::vector<int>& v)
+{
+  int val = 0;
+  for (size_t i = 0; i != v.size(); ++i)
+    v[i] = val++;
+}
+
+void init_worst(std::vector<int>& v, size_t pivot_index)
+{
+  int val = 0;
+  if (pivot_index > v.size() / 2)
+    {
+      for (size_t i = 0; i != pivot_index; val += 2, ++i)
+	v[i] = val;
+      val = 1;
+    }
+  else
+    {
+      for (size_t i = pivot_index; i != v.size(); val += 2, ++i)
+	v[i] = val;
+      ++val;
+    }
+
+  if (pivot_index > v.size() / 2)
+    for (size_t i = pivot_index; i != v.size(); val += 2, ++i)
+      v[i] = val;
+  else
+    for (size_t i = 0; i != pivot_index; val += 2, ++i)
+      v[i] = val;
+}
+
+void init_random(std::vector<int>& v)
+{
+  // a simple pseudo-random series which does not rely on rand() and friends
+  v[0] = 0;
+  for (size_t i = 1; i != v.size(); ++i)
+    v[i] = (v[i-1] + 110211473) * 745988807;
+}
+
+void reduce_size(std::vector<int>& front_v,
+		 std::vector<int>& middle_v,
+		 std::vector<int>& back_v)
+{
+  front_v.resize(small_size);
+  middle_v.erase(middle_v.begin() + small_size / 2,
+		 middle_v.begin() + max_size / 2);
+  middle_v.resize(small_size);
+  back_v.erase(back_v.begin(), back_v.end() - small_size);
+}
+
+int main()
+{
+  using namespace __gnu_test;
+
+  // No constraint to build vectors.
+  set_new_limit(~size_t(0));
+
+  std::vector<int> front_revv(max_size);
+  init_reverse(front_revv, front_pivot_idx);
+
+  std::vector<int> middle_revv(max_size);
+  init_reverse(middle_revv, middle_pivot_idx);
+
+  std::vector<int> back_revv(max_size);
+  init_reverse(back_revv, back_pivot_idx);
+
+  std::vector<int> fwdv(max_size);
+  init_forward(fwdv);
+
+  std::vector<int> front_wstv(max_size);
+  init_worst(front_wstv, front_pivot_idx);
+
+  std::vector<int> middle_wstv(max_size);
+  init_worst(middle_wstv, middle_pivot_idx);
+
+  std::vector<int> back_wstv(max_size);
+  init_worst(back_wstv, back_pivot_idx);
+
+  std::vector<int> front_rndv(max_size);
+  init_random(front_rndv);
+  std::vector<int> middle_rndv(front_rndv);
+  std::vector<int> back_rndv(front_rndv);
+
+  sort(front_rndv.begin(), front_rndv.begin() + front_pivot_idx);
+  sort(front_rndv.begin() + front_pivot_idx, front_rndv.end());
+
+  sort(middle_rndv.begin(), middle_rndv.begin() + middle_pivot_idx);
+  sort(middle_rndv.begin() + middle_pivot_idx, middle_rndv.end());
+
+  sort(back_rndv.begin(), back_rndv.begin() + back_pivot_idx);
+  sort(back_rndv.begin() + back_pivot_idx, back_rndv.end());
+
+  time_counter time;
+  resource_counter resource;
+
+  start_counters(time, resource);
+
+  // No limit.
+  mem_bench(1.0,
+	    front_revv, middle_revv, back_revv,
+	    fwdv,
+	    front_wstv, middle_wstv, back_wstv,
+	    front_rndv, middle_rndv, back_rndv);
+
+  stop_counters(time, resource);
+
+  report_performance(__FILE__, "bench 1 / 1 memory", time, resource);
+  clear_counters(time, resource);
+
+  start_counters(time, resource);
+
+  // Limit to the fourth.
+  mem_bench(1.0 / 4,
+	    front_revv, middle_revv, back_revv,
+	    fwdv,
+	    front_wstv, middle_wstv, back_wstv,
+	    front_rndv, middle_rndv, back_rndv);
+
+  stop_counters(time, resource);
+
+  report_performance(__FILE__, "bench 1 / 4 memory", time, resource);
+  clear_counters(time, resource);
+
+  start_counters(time, resource);
+
+  // Really limit allocation.
+  mem_bench(1.0 / 64,
+	    front_revv, middle_revv, back_revv,
+	    fwdv,
+	    front_wstv, middle_wstv, back_wstv,
+	    front_rndv, middle_rndv, back_rndv);
+
+  stop_counters(time, resource);
+
+  report_performance(__FILE__, "bench 1 /64 memory", time, resource);
+  clear_counters(time, resource);
+
+  middle_pivot_idx = small_size / 2;
+  back_pivot_idx = small_size - front_pivot_idx;
+  reduce_size(front_revv, middle_revv, back_revv);
+  fwdv.resize(small_size);
+  reduce_size(front_wstv, middle_wstv, back_wstv);
+  reduce_size(front_rndv, middle_rndv, back_rndv);
+
+  start_counters(time, resource);
+
+  // No memory.
+  mem_bench(0.0,
+	    front_revv, middle_revv, back_revv,
+	    fwdv,
+	    front_wstv, middle_wstv, back_wstv,
+	    front_rndv, middle_rndv, back_rndv);
+
+  stop_counters(time, resource);
+
+  report_performance(__FILE__, "bench 0 / 1 memory", time, resource);
+  return 0;
+}
diff --git a/libstdc++-v3/testsuite/performance/25_algorithms/stable_sort.cc b/libstdc++-v3/testsuite/performance/25_algorithms/stable_sort.cc
index e79e0a7f6b2..fa961d6eb35 100644
--- a/libstdc++-v3/testsuite/performance/25_algorithms/stable_sort.cc
+++ b/libstdc++-v3/testsuite/performance/25_algorithms/stable_sort.cc
@@ -17,49 +17,107 @@
 
 #include <vector>
 #include <algorithm>
+
+#include <testsuite_new_operators.h>
 #include <testsuite_performance.h>
 
-int main()
+const int max_size = 10000000;
+const int small_size = 200000;
+
+void bench(size_t mem_threshold,
+	   std::vector<int> revv,
+	   std::vector<int> fwdv,
+	   std::vector<int> rndv)
 {
   using namespace __gnu_test;
 
   time_counter time;
   resource_counter resource;
 
-  const int max_size = 10000000;
-
-  std::vector<int> v(max_size);
-
-  for (int i = 0; i < max_size; ++i)
-    v[i] = -i;
+  set_new_limit(mem_threshold);
 
   start_counters(time, resource);
-  std::stable_sort(v.begin(), v.end());
+  std::stable_sort(revv.begin(), revv.end());
   stop_counters(time, resource);
 
+  set_new_limit(~size_t(0));
   report_performance(__FILE__, "reverse", time, resource);
   clear_counters(time, resource);
 
-  for (int i = 0; i < max_size; ++i)
-    v[i] = i;
+  set_new_limit(mem_threshold);
 
   start_counters(time, resource);
-  std::stable_sort(v.begin(), v.end());
+  std::stable_sort(fwdv.begin(), fwdv.end());
   stop_counters(time, resource);
 
+  set_new_limit(~size_t(0));
   report_performance(__FILE__, "forwards", time, resource);
   clear_counters(time, resource);
 
-  // a simple psuedo-random series which does not rely on rand() and friends
-  v[0] = 0;
+  start_counters(time, resource);
+  std::stable_sort(rndv.begin(), rndv.end());
+  stop_counters(time, resource);
+
+  set_new_limit(~size_t(0));
+  report_performance(__FILE__, "random", time, resource);
+}
+
+int main()
+{
+  using namespace __gnu_test;
+
+  // No memory constraint.
+  set_new_limit(~size_t(0));
+
+  std::vector<int> revv(max_size);
+  for (int i = 0; i < max_size; ++i)
+    revv[i] = -i;
+
+  std::vector<int> fwdv(max_size);
+  for (int i = 0; i < max_size; ++i)
+    fwdv[i] = i;
+
+  // a simple pseudo-random series which does not rely on rand() and friends
+  std::vector<int> rndv(max_size);
+  rndv[0] = 0;
   for (int i = 1; i < max_size; ++i)
-    v[i] = (v[i-1] + 110211473) * 745988807;
+    rndv[i] = (rndv[i-1] + 110211473) * 745988807;
+
+  time_counter time;
+  resource_counter resource;
 
   start_counters(time, resource);
-  std::stable_sort(v.begin(), v.end());
+  bench(~size_t(0), revv, fwdv, rndv);
   stop_counters(time, resource);
 
-  report_performance(__FILE__, "random", time, resource);
+  report_performance(__FILE__, "bench 1 / 1 memory", time, resource);
+  clear_counters(time, resource);
+
+  start_counters(time, resource);
+  // Limit to fourth the expected size of the sorted array.
+  bench(max_size * sizeof(int) / 4, revv, fwdv, rndv);
+  stop_counters(time, resource);
+
+  report_performance(__FILE__, "bench 1 / 4 memory", time, resource);
+  clear_counters(time, resource);
+
+  start_counters(time, resource);
+  // Limit to 1/64 of range size.
+  bench(max_size * sizeof(int) / 64, revv, fwdv, rndv);
+  stop_counters(time, resource);
+
+  report_performance(__FILE__, "bench 1 /64 memory", time, resource);
+  clear_counters(time, resource);
+
+  revv.resize(small_size);
+  fwdv.resize(small_size);
+  rndv.resize(small_size);
+
+  start_counters(time, resource);
+  // Forbid any allocation.
+  bench(0, revv, fwdv, rndv);
+  stop_counters(time, resource);
 
+  report_performance(__FILE__, "bench 0 / 1 memory", time, resource);
   return 0;
 }

Re: [C++ PATCH] Speed up inplace_merge algorithm & fix inefficient logic(PR libstdc++/83938)

Reply via email to