Em ter., 13 de jul. de 2021 às 09:44, Ranier Vilela <ranier...@gmail.com>
escreveu:

> Em ter., 13 de jul. de 2021 às 09:24, David Rowley <dgrowle...@gmail.com>
> escreveu:
>
>> On Wed, 14 Jul 2021 at 00:06, Ranier Vilela <ranier...@gmail.com> wrote:
>> >
>> > Em ter., 13 de jul. de 2021 às 04:19, Ronan Dunklau <
>> ronan.dunk...@aiven.io> escreveu:
>> >> I would be
>> >> surprised the check adds that much to the whole execution though.
>> >
>> > I think this branch is a misprediction.
>>
>> It could be.  I wondered that myself when I saw Ronan's results were
>> better than mine for 2,4 and 7.  However, I think Ronan had quite a
>> bit of noise in his results as there's no reason for the speedup in
>> tests 2,4 and 7.
>
>
>> > In most cases is it not datumSort?
>>
>> who knows.  Maybe someone's workload always requires the datum sort.
>>
>> > That's why I would like to use unlikely.
>>
>> We really only use unlikely() in cases where we want to move code out
>> of line to a cold area because it's really never executed under normal
>> circumstances. We tend to do that for ERROR cases as we don't ever
>> really want to optimise for errors. We also sometimes do it when some
>> function has a branch to initialise something during the first call.
>> The case in question here does not fit for either of those two cases.
>>
> Hum, I understand the usage cases now.
> Thanks for the hint.
>
>
>>
>> > IMO all the tests should all be to verify past behavior first.
>>
>> I'm not quire sure what you mean there.
>>
> I'm saying we could help the branch by keeping the same testing logic as
> before and not reversing it.
> Attached is a version to demonstrate this, I don't pretend to be v7.
>
> I couldn't find a good phrase to the contrary:
> "are we *not* using the single value optimization ?"
>
> I don't have time to take the tests right now.
>
Finally I had time to benchmark (David's benchsort.sh)

ubuntu 64 bits (20.04) 8gb ram SSD 256GB.
Table with the best results of each.


         HEAD            v6            v7             v7b         v6 vs
master               v7 vs v6            v7b vs v6
Test1 288,149636 449,018541 469,757169 550,48505 155,83% 104,62% 122,60%
Test2 94,766955 95,451406 94,556249 94,718982 100,72% 99,06% 99,23%
Test3 190,521319 260,279802 259,597067 278,115296 136,61% 99,74% 106,85%
Test4 78,779344 78,253455 78,114068 77,941482 99,33% 99,82% 99,60%
Test5 131,362614 142,662223 136,436347 149,639041 108,60% 95,64% 104,89%
Test6 112,884298 124,181671 115,528328 127,58497 110,01% 93,03% 102,74%
Test7 69,308587 68,643067 66,10195 69,087544 99,04% 96,30% 100,65%
Test8 243,674171 364,681142 371,928453 419,259703 149,66% 101,99% 114,97%

I have no idea why v7 failed with test6?
v6 slowdown with test4 and test7.
v7b slowdown with test2 and test4, in relation with v7.

If field struct datumSort is not absolutely necessary, I think that v7 will
be better.
Attached the patchs and file results.
Benchmarks datumSort:

1) head

a)
Test1
tps = 286.571700 (without initial connection time)
tps = 285.650960 (without initial connection time)
tps = 287.069988 (without initial connection time)
Test2
tps = 94.565228 (without initial connection time)
tps = 94.690370 (without initial connection time)
tps = 94.741829 (without initial connection time)
Test3
tps = 190.581533 (without initial connection time)
tps = 190.405579 (without initial connection time)
tps = 190.222865 (without initial connection time)
Test4
tps = 78.001599 (without initial connection time)
tps = 78.163404 (without initial connection time)
tps = 78.325974 (without initial connection time)
Test5
tps = 130.276589 (without initial connection time)
tps = 128.854218 (without initial connection time)
tps = 129.062907 (without initial connection time)
Test6
tps = 112.629915 (without initial connection time)
tps = 113.179220 (without initial connection time)
tps = 112.395572 (without initial connection time)
Test7
tps = 69.109489 (without initial connection time)
tps = 68.973755 (without initial connection time)
tps = 69.204028 (without initial connection time)
Test8
tps = 242.364808 (without initial connection time)
tps = 242.448441 (without initial connection time)
tps = 243.191950 (without initial connection time)


b)
Test1
tps = 288.149636 (without initial connection time)
tps = 287.761581 (without initial connection time)
tps = 286.865262 (without initial connection time)
Test2
tps = 94.766955 (without initial connection time)
tps = 94.361680 (without initial connection time)
tps = 94.636953 (without initial connection time)
Test3
tps = 190.394402 (without initial connection time)
tps = 190.521319 (without initial connection time)
tps = 190.446202 (without initial connection time)
Test4
tps = 78.096603 (without initial connection time)
tps = 78.576600 (without initial connection time)
tps = 78.779344 (without initial connection time)
Test5
tps = 131.131458 (without initial connection time)
tps = 131.344875 (without initial connection time)
tps = 131.362614 (without initial connection time)
Test6
tps = 112.884298 (without initial connection time)
tps = 112.268374 (without initial connection time)
tps = 112.257387 (without initial connection time)
Test7
tps = 68.873394 (without initial connection time)
tps = 68.960796 (without initial connection time)
tps = 69.308587 (without initial connection time)
Test8
tps = 243.005346 (without initial connection time)
tps = 242.541572 (without initial connection time)
tps = 243.674171 (without initial connection time)


2) v6 Ronan

a)
Test1
tps = 448.694509 (without initial connection time)
tps = 450.955287 (without initial connection time)
tps = 455.387480 (without initial connection time)
Test2
tps = 95.451406 (without initial connection time)
tps = 92.435683 (without initial connection time)
tps = 92.848692 (without initial connection time)
Test3
tps = 251.961882 (without initial connection time)
tps = 256.355680 (without initial connection time)
tps = 257.377248 (without initial connection time)
Test4
tps = 77.674259 (without initial connection time)
tps = 75.916193 (without initial connection time)
tps = 75.825782 (without initial connection time)
Test5
tps = 141.437637 (without initial connection time)
tps = 138.757166 (without initial connection time)
tps = 137.212253 (without initial connection time)
Test6
tps = 118.803730 (without initial connection time)
tps = 118.716715 (without initial connection time)
tps = 122.598958 (without initial connection time)
Test7
tps = 67.565001 (without initial connection time)
tps = 68.623917 (without initial connection time)
tps = 68.643067 (without initial connection time)
Test8
tps = 360.928838 (without initial connection time)
tps = 359.803634 (without initial connection time)
tps = 359.263230 (without initial connection time)


b)
Test1
tps = 446.230255 (without initial connection time)
tps = 448.362860 (without initial connection time)
tps = 449.018541 (without initial connection time)
Test2
tps = 93.237736 (without initial connection time)
tps = 93.002567 (without initial connection time)
tps = 94.246374 (without initial connection time)
Test3
tps = 260.279802 (without initial connection time)
tps = 257.502369 (without initial connection time)
tps = 257.472013 (without initial connection time)
Test4
tps = 78.253455 (without initial connection time)
tps = 77.615251 (without initial connection time)
tps = 77.059059 (without initial connection time)
Test5
tps = 142.931388 (without initial connection time)
tps = 142.662223 (without initial connection time)
tps = 142.591687 (without initial connection time)
Test6
tps = 123.884998 (without initial connection time)
tps = 124.181671 (without initial connection time)
tps = 123.813298 (without initial connection time)
Test7
tps = 67.932645 (without initial connection time)
tps = 68.517033 (without initial connection time)
tps = 68.358240 (without initial connection time)
Test8
tps = 362.861868 (without initial connection time)
tps = 364.303790 (without initial connection time)
tps = 364.681142 (without initial connection time)


3) v7 Ranier

a)
Test1
tps = 462.386836 (without initial connection time)
tps = 460.693244 (without initial connection time)
tps = 450.156405 (without initial connection time)
Test2
tps = 94.116273 (without initial connection time)
tps = 93.091603 (without initial connection time)
tps = 94.223999 (without initial connection time)
Test3
tps = 258.833053 (without initial connection time)
tps = 258.709934 (without initial connection time)
tps = 259.417303 (without initial connection time)
Test4
tps = 77.191825 (without initial connection time)
tps = 77.193516 (without initial connection time)
tps = 77.274461 (without initial connection time)
Test5
tps = 135.309499 (without initial connection time)
tps = 135.687557 (without initial connection time)
tps = 136.339779 (without initial connection time)
Test6
tps = 114.572720 (without initial connection time)
tps = 114.966016 (without initial connection time)
tps = 113.882849 (without initial connection time)
Test7
tps = 65.772566 (without initial connection time)
tps = 65.578058 (without initial connection time)
tps = 65.192073 (without initial connection time)
Test8
tps = 365.611005 (without initial connection time)
tps = 370.881668 (without initial connection time)
tps = 366.280728 (without initial connection time)

b)
Test1
tps = 469.757169 (without initial connection time)
tps = 462.451933 (without initial connection time)
tps = 463.899274 (without initial connection time)
Test2
tps = 94.452783 (without initial connection time)
tps = 93.936865 (without initial connection time)
tps = 94.556249 (without initial connection time)
Test3
tps = 259.352856 (without initial connection time)
tps = 259.597067 (without initial connection time)
tps = 259.577832 (without initial connection time)
Test4
tps = 77.942899 (without initial connection time)
tps = 77.916822 (without initial connection time)
tps = 78.114068 (without initial connection time)
Test5
tps = 136.326734 (without initial connection time)
tps = 136.436347 (without initial connection time)
tps = 136.255118 (without initial connection time)
Test6
tps = 115.141428 (without initial connection time)
tps = 115.162296 (without initial connection time)
tps = 115.528328 (without initial connection time)
Test7
tps = 66.101950 (without initial connection time)
tps = 66.056330 (without initial connection time)
tps = 65.835264 (without initial connection time)
Test8
tps = 371.207387 (without initial connection time)
tps = 369.605322 (without initial connection time)
tps = 371.928453 (without initial connection time)


4) v7b Ranier

a)
Test1
tps = 549.913612 (without initial connection time)
tps = 549.519411 (without initial connection time)
tps = 550.071784 (without initial connection time)
Test2
tps = 94.718982 (without initial connection time)
tps = 94.711507 (without initial connection time)
tps = 94.057730 (without initial connection time)
Test3
tps = 278.115296 (without initial connection time)
tps = 277.356331 (without initial connection time)
tps = 275.831997 (without initial connection time)
Test4
tps = 77.882305 (without initial connection time)
tps = 77.605122 (without initial connection time)
tps = 77.750030 (without initial connection time)
Test5
tps = 149.398715 (without initial connection time)
tps = 149.509539 (without initial connection time)
tps = 149.623641 (without initial connection time)
Test6
tps = 127.584970 (without initial connection time)
tps = 126.837393 (without initial connection time)
tps = 126.852430 (without initial connection time)
Test7
tps = 69.087544 (without initial connection time)
tps = 68.850765 (without initial connection time)
tps = 68.919791 (without initial connection time)
Test8
tps = 417.642265 (without initial connection time)
tps = 417.989324 (without initial connection time)
tps = 419.259703 (without initial connection time)

b)
Test1
tps = 550.485050 (without initial connection time)
tps = 548.557763 (without initial connection time)
tps = 548.962527 (without initial connection time)
Test2
tps = 93.842972 (without initial connection time)
tps = 93.711892 (without initial connection time)
tps = 94.248700 (without initial connection time)
Test3
tps = 276.055432 (without initial connection time)
tps = 276.074098 (without initial connection time)
tps = 275.530722 (without initial connection time)
Test4
tps = 77.873266 (without initial connection time)
tps = 77.850013 (without initial connection time)
tps = 77.941482 (without initial connection time)
Test5
tps = 149.639041 (without initial connection time)
tps = 149.372680 (without initial connection time)
tps = 149.188990 (without initial connection time)
Test6
tps = 126.791277 (without initial connection time)
tps = 126.652518 (without initial connection time)
tps = 126.652341 (without initial connection time)
Test7
tps = 68.944679 (without initial connection time)
tps = 68.968537 (without initial connection time)
tps = 68.760827 (without initial connection time)
Test8
tps = 418.834847 (without initial connection time)
tps = 417.495899 (without initial connection time)
tps = 417.558709 (without initial connection time)

diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
index b99027e0d7..e1adecafaa 100644
--- a/src/backend/executor/nodeSort.c
+++ b/src/backend/executor/nodeSort.c
@@ -29,6 +29,10 @@
  *		which saves the results in a temporary file or memory. After the
  *		initial call, returns a tuple from the file with each call.
  *
+ *		The tuplesort can either occur on the whole tuple (this is the nominal
+ *		case) or, when the input / output tuple consists of only one attribute,
+ *		we switch to the tuplesort_*_datum API, optimized for that specific case.
+ *
  *		Conditions:
  *		  -- none.
  *
@@ -86,7 +90,13 @@ ExecSort(PlanState *pstate)
 		outerNode = outerPlanState(node);
 		tupDesc = ExecGetResultType(outerNode);
 
-		tuplesortstate = tuplesort_begin_heap(tupDesc,
+		/*
+		 * Switch to the tuplesort_*_datum interface when we have only one
+		 * key, as it is much more efficient especially when the type is
+		 * pass-by-value.
+		 */
+		if (tupDesc->natts != 1)
+			tuplesortstate = tuplesort_begin_heap(tupDesc,
 											  plannode->numCols,
 											  plannode->sortColIdx,
 											  plannode->sortOperators,
@@ -95,23 +105,49 @@ ExecSort(PlanState *pstate)
 											  work_mem,
 											  NULL,
 											  node->randomAccess);
+		else
+		{
+			node->tupleSort = false;
+			tuplesortstate = tuplesort_begin_datum(TupleDescAttr(tupDesc, 0)->atttypid,
+												   plannode->sortOperators[0],
+												   plannode->collations[0],
+												   plannode->nullsFirst[0],
+												   work_mem,
+												   NULL,
+												   node->randomAccess);
+		}
+
 		if (node->bounded)
 			tuplesort_set_bound(tuplesortstate, node->bound);
 		node->tuplesortstate = (void *) tuplesortstate;
 
 		/*
-		 * Scan the subplan and feed all the tuples to tuplesort.
++		 * Scan the subplan and feed all the tuples to tuplesort, using either
++		 * the _putdatum or _puttupleslot API as appropriate.
 		 */
 
-		for (;;)
-		{
-			slot = ExecProcNode(outerNode);
-
-			if (TupIsNull(slot))
-				break;
-
-			tuplesort_puttupleslot(tuplesortstate, slot);
-		}
+		if (node->tupleSort)
+			for (;;)
+			{
+				slot = ExecProcNode(outerNode);
+
+				if (TupIsNull(slot))
+					break;
+
+				tuplesort_puttupleslot(tuplesortstate, slot);
+			}
+		else
+			for (;;)
+			{
+				slot = ExecProcNode(outerNode);
+
+				if (TupIsNull(slot))
+					break;
+				slot_getsomeattrs(slot, 1);
+				tuplesort_putdatum(tuplesortstate,
+								   slot->tts_values[0],
+								   slot->tts_isnull[0]);
+			}
 
 		/*
 		 * Complete the sort.
@@ -150,9 +186,18 @@ ExecSort(PlanState *pstate)
 	 * next fetch from the tuplesort.
 	 */
 	slot = node->ss.ps.ps_ResultTupleSlot;
-	(void) tuplesort_gettupleslot(tuplesortstate,
+	if (node->tupleSort)
+		(void) tuplesort_gettupleslot(tuplesortstate,
 								  ScanDirectionIsForward(dir),
 								  false, slot, NULL);
+	else
+	{
+		ExecClearTuple(slot);
+		if (tuplesort_getdatum(tuplesortstate, ScanDirectionIsForward(dir),
+							   &(slot->tts_values[0]), &(slot->tts_isnull[0]), NULL))
+			ExecStoreVirtualTuple(slot);
+	}
+
 	return slot;
 }
 
@@ -191,6 +236,7 @@ ExecInitSort(Sort *node, EState *estate, int eflags)
 	sortstate->bounded = false;
 	sortstate->sort_Done = false;
 	sortstate->tuplesortstate = NULL;
+	sortstate->tupleSort = true;
 
 	/*
 	 * Miscellaneous initialization
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 0ec5509e7e..4c41ab80b1 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -2151,6 +2151,7 @@ typedef struct SortState
 	int64		bound_Done;		/* value of bound we did the sort with */
 	void	   *tuplesortstate; /* private state of tuplesort.c */
 	bool		am_worker;		/* are we a worker? */
+	bool		tupleSort;  /* are we *not* using the single value optimization ? */	
 	SharedSortInfo *shared_info;	/* one entry per worker */
 } SortState;
 
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
index b99027e0d7..7be852273e 100644
--- a/src/backend/executor/nodeSort.c
+++ b/src/backend/executor/nodeSort.c
@@ -29,6 +29,10 @@
  *		which saves the results in a temporary file or memory. After the
  *		initial call, returns a tuple from the file with each call.
  *
+ *		The tuplesort can either occur on the whole tuple (this is the nominal
+ *		case) or, when the input / output tuple consists of only one attribute,
+ *		we switch to the tuplesort_*_datum API, optimized for that specific case.
+ *
  *		Conditions:
  *		  -- none.
  *
@@ -44,6 +48,7 @@ ExecSort(PlanState *pstate)
 	ScanDirection dir;
 	Tuplesortstate *tuplesortstate;
 	TupleTableSlot *slot;
+	bool tupleSort = true;
 
 	CHECK_FOR_INTERRUPTS();
 
@@ -86,7 +91,13 @@ ExecSort(PlanState *pstate)
 		outerNode = outerPlanState(node);
 		tupDesc = ExecGetResultType(outerNode);
 
-		tuplesortstate = tuplesort_begin_heap(tupDesc,
+		/*
+		 * Switch to the tuplesort_*_datum interface when we have only one
+		 * key, as it is much more efficient especially when the type is
+		 * pass-by-value.
+		 */
+		if (tupDesc->natts != 1)
+			tuplesortstate = tuplesort_begin_heap(tupDesc,
 											  plannode->numCols,
 											  plannode->sortColIdx,
 											  plannode->sortOperators,
@@ -95,23 +106,49 @@ ExecSort(PlanState *pstate)
 											  work_mem,
 											  NULL,
 											  node->randomAccess);
+		else
+		{
+			tupleSort = false;
+			tuplesortstate = tuplesort_begin_datum(TupleDescAttr(tupDesc, 0)->atttypid,
+												   plannode->sortOperators[0],
+												   plannode->collations[0],
+												   plannode->nullsFirst[0],
+												   work_mem,
+												   NULL,
+												   node->randomAccess);
+		}
+
 		if (node->bounded)
 			tuplesort_set_bound(tuplesortstate, node->bound);
 		node->tuplesortstate = (void *) tuplesortstate;
 
 		/*
-		 * Scan the subplan and feed all the tuples to tuplesort.
++		 * Scan the subplan and feed all the tuples to tuplesort, using either
++		 * the _putdatum or _puttupleslot API as appropriate.
 		 */
 
-		for (;;)
-		{
-			slot = ExecProcNode(outerNode);
-
-			if (TupIsNull(slot))
-				break;
-
-			tuplesort_puttupleslot(tuplesortstate, slot);
-		}
+		if (tupleSort)
+			for (;;)
+			{
+				slot = ExecProcNode(outerNode);
+
+				if (TupIsNull(slot))
+					break;
+
+				tuplesort_puttupleslot(tuplesortstate, slot);
+			}
+		else
+			for (;;)
+			{
+				slot = ExecProcNode(outerNode);
+
+				if (TupIsNull(slot))
+					break;
+				slot_getsomeattrs(slot, 1);
+				tuplesort_putdatum(tuplesortstate,
+								   slot->tts_values[0],
+								   slot->tts_isnull[0]);
+			}
 
 		/*
 		 * Complete the sort.
@@ -150,9 +187,18 @@ ExecSort(PlanState *pstate)
 	 * next fetch from the tuplesort.
 	 */
 	slot = node->ss.ps.ps_ResultTupleSlot;
-	(void) tuplesort_gettupleslot(tuplesortstate,
+	if (tupleSort)
+		(void) tuplesort_gettupleslot(tuplesortstate,
 								  ScanDirectionIsForward(dir),
 								  false, slot, NULL);
+	else
+	{
+		ExecClearTuple(slot);
+		if (tuplesort_getdatum(tuplesortstate, ScanDirectionIsForward(dir),
+							   &(slot->tts_values[0]), &(slot->tts_isnull[0]), NULL))
+			ExecStoreVirtualTuple(slot);
+	}
+
 	return slot;
 }
 

Reply via email to