Re: [C++ PATCH] Speed up inplace_merge algorithm & fix inefficient logic(PR libstdc++/83938)

François Dumont Tue, 24 Jul 2018 03:22:54 -0700

Hi

    Any chance to review this patch ?


François


On 06/06/2018 18:39, François Dumont wrote:

Hi

I review and rework this proposal. I noticed that the same idea tolimit buffer size within inplace_merge also apply to stable_sort.

I also change the decision when buffer is too small to considerthe buffer size rather than going through successive cuts of theoriginal ranges. This way we won't cut the range more than necessary.Note that if you prefer this second part could be committed seperatly.


    PR libstdc++/83938
    * include/bits/stl_algo.h:
    (__stable_partition_adaptive): When buffer to small cut range using
    buffer size.
    (__inplace_merge): Take temporary buffer length from smallest range.

(__merge_adaptive): When buffer too small consider smallest rangefirst

    and cut based on buffer size.
    (__stable_sort_adaptive): When buffer too small cut based on buffer
    size.
    (__stable_sort): Limit temporary buffer length.
    * include/bits/stl_tempbuf.h (get_temporary_buffer): Change function
    to reduce requested buffer length on allocation failure.

Tested under Linux x86_64.

Ok to commit ?

François


On 25/01/2018 23:37, chang jc wrote:

Hi:

1. The __len = (__len + 1) / 2; is as you suggested, need to modify as
__len = (__len == 1) ? 0 : ((__len + 1) / 2);

2. The coding gain has been shown  PR c++/83938; I re-post here




   21
   20
   19
   18
   17
   16


   0.471136
   0.625695
   0.767262
   0.907461
   1.04838
   1.19508


   0.340845
   0.48651
   0.639139
   0.770133
   0.898454
   1.04632

it means when Merge [0, 4325376, 16777216); A is a sorted integer with
4325376 & B with 12451840 elements, total with 16M integers

The proposed method has the speed up under given buffer size =, ex
2^16, 2^17, ... 2^21 in unit of sizeof(int), for example, 2^16 means
given sizof(int)*64K bytes.

3. As your suggestion, _TmpBuf __buf should be rewrite.

4. It represents a fact that the intuitive idea to split from larger
part is wrong.

For example, if you have an input sorted array A & B, A has 8 integers
& B has 24 integers. Given tmp buffer whose capacity = 4 integers.

Current it tries to split from B, right?

Then we have:

A1 | A2  B1 | B2

B1 & B2 has 12 integers each, right?

Current algorithm selects pivot as 13th integer from B, right?

If the corresponding upper bound of A is 6th integer.

Then it split in

A1 = 5 | A2 = 3 | B1 = 12 | B2 = 12

After rotate, we have two arrays to merge

[A1 = 5 | B1 = 12]  & [A2 = 3 | B2 = 12]

Great, [A2 = 3 | B2 = 12] can use tmp buffer to merge.

Sadly, [A1 = 5 | B1 = 12] CANNOT.

So we do rotate again, split & merge the two split arrays from [A1 = 5
| B1 = 12] again.


But wait, if we always split from the smaller one instead of larger one.

After rotate, it promises two split arrays both containceiling[small/2].


Since tmp buffer also allocate by starting from sizeof(small) &
recursively downgrade by ceiling[small/2^(# of fail allocate)].

It means the allocated tmp buffer promises to be sufficient at the
level of (# of fail allocate).

Instead, you can see if split from large at level (# of fail allocate)
several split array still CANNOT use  tmp buffer to do buffered merge.


As you know, buffered merge is far speed then (split, rotate, and
merge two sub-arrays) (PR c++/83938 gives the profiling figures),

the way should provide speedup.


Thanks.










On 24/01/2018 18:23, François Dumont wrote:

Hi

It sounds like a very sensitive change to make but nothing worthfigures.

Do you have any bench showing the importance of the gain ?

     At least the memory usage optimization is obvious.

On 19/01/2018 10:43, chang jc wrote:

Current std::inplace_merge() suffers from performance issue byinefficient


logic under limited memory,

It leads to performance downgrade.

Please help to review it.

Index: include/bits/stl_algo.h
===================================================================
--- include/bits/stl_algo.h    (revision 256871)
+++ include/bits/stl_algo.h    (working copy)
@@ -2437,7 +2437,7 @@
         _BidirectionalIterator __second_cut = __middle;
         _Distance __len11 = 0;
         _Distance __len22 = 0;
-      if (__len1 > __len2)
+      if (__len1 < __len2)
           {
             __len11 = __len1 / 2;
             std::advance(__first_cut, __len11);
@@ -2539,9 +2539,15 @@
         const _DistanceType __len1 = std::distance(__first, __middle);
         const _DistanceType __len2 = std::distance(__middle, __last);

+

         typedef _Temporary_buffer<_BidirectionalIterator, _ValueType>
_TmpBuf;

-      _TmpBuf __buf(__first, __last);
-
+      _BidirectionalIterator __start, __end;
+      if (__len1 < __len2) {
+    __start = __first; __end = __middle;
+      } else {
+    __start = __middle; __end = __last;
+      }
+      _TmpBuf __buf(__start, ___end);

Note another optimization, we could introduce a _Temporary_buffer<>constructor

in order to write:

_TmpBuf __buf(std::min(__len1, __len2), __first);

So that std::distance do not need to be called again.

         if (__buf.begin() == 0)
       std::__merge_without_buffer
         (__first, __middle, __last, __len1, __len2, __comp);
Index: include/bits/stl_tempbuf.h
===================================================================
--- include/bits/stl_tempbuf.h    (revision 256871)
+++ include/bits/stl_tempbuf.h    (working copy)
@@ -95,7 +95,7 @@
                               std::nothrow));
         if (__tmp != 0)
           return std::pair<_Tp*, ptrdiff_t>(__tmp, __len);
-      __len /= 2;
+      __len = (__len + 1) / 2;

This part is problematic, if __len is 1 and allocation fails it willloop

forever.

It doesn't seem really necessary for your patch.


2018-01-20 4:05 GMT+08:00 Jason Merrill<ja...@redhat.com>:

This is a libstdc++ bug and patch, not the C++ front end.  So I'm
adding the libstdc++ list to CC.

On Fri, Jan 19, 2018 at 3:02 AM, chang jc<r97922...@gmail.com>  wrote:

Current std::inplace_merge() suffers from performance issue by

inefficient

logic under limited memory,

It leads to performance downgrade.

Please help to review it.

Index: include/bits/stl_algo.h
===================================================================
--- include/bits/stl_algo.h     (revision 256871)
+++ include/bits/stl_algo.h     (working copy)
@@ -2437,7 +2437,7 @@
           _BidirectionalIterator __second_cut = __middle;
           _Distance __len11 = 0;
           _Distance __len22 = 0;
-         if (__len1 > __len2)
+         if (__len1 < __len2)
             {
               __len11 = __len1 / 2;
               std::advance(__first_cut, __len11);
@@ -2539,9 +2539,15 @@
        const _DistanceType __len1 = std::distance(__first, __middle);
        const _DistanceType __len2 = std::distance(__middle, __last);

+
        typedef _Temporary_buffer<_BidirectionalIterator, _ValueType>

_TmpBuf;

-      _TmpBuf __buf(__first, __last);
-
+      _BidirectionalIterator __start, __end;
+      if (__len1 < __len2) {
+       __start = __first; __end = __middle;
+      } else {
+       __start = __middle; __end = __last;
+      }
+      _TmpBuf __buf(__start, ___end);
        if (__buf.begin() == 0)
         std::__merge_without_buffer
           (__first, __middle, __last, __len1, __len2, __comp);
Index: include/bits/stl_tempbuf.h
===================================================================
--- include/bits/stl_tempbuf.h  (revision 256871)
+++ include/bits/stl_tempbuf.h  (working copy)
@@ -95,7 +95,7 @@
std::nothrow));
           if (__tmp != 0)
             return std::pair<_Tp*, ptrdiff_t>(__tmp, __len);
-         __len /= 2;
+         __len = (__len + 1) / 2;
         }
        return std::pair<_Tp*, ptrdiff_t>(static_cast<_Tp*>(0), 0);
      }




Thanks

Re: [C++ PATCH] Speed up inplace_merge algorithm & fix inefficient logic(PR libstdc++/83938)

Reply via email to