Hi,

I've not yet been involved in postgresql's development process, so here's a
first. Please find attached a patch for improving the BT binary search
speeds, with accompanying performance data.

v1-0001 adapts _bt_compare and _bt_binsrch* to make use of static
properties of L&Y-style nbtree indexes to speed up finding an initial
position in the nbtee.

I alter the logic from _bt_compare to accept a 'start-comparing-at-column'
argument, and to also return which column the comparison result came from.
This allows us to keep track of which columns we've already checked for
equality, and when getting deeper into the index this allows us to skip
checking the already equal key columns.

Specifically, when binary-searching through a page, we keep track of which
column was checked for the high-key, and which for the low-key. The first
column of these that is not yet equal will be used for the next comparison.
Any columns before this column must be equal, as both the high- and the
low-keys in that page have already been validated as having an equal
prefix. The column number is then passed on through the insertion key,
allowing the optimization to be used in the climbing of the nbtree index.

v1-0001 will be especially performant in nbtree indexes with a relatively
low initial cardinality and high compare cost. My own performance testing
was done on a laptop (4c/8t), after first getting it warm with other
benchmarks & compilations, so the results are a bit unstable.

While testing this I also noticed that there were a lot of pipeline stalls
around the arguments and results of _bt_compare in the hot loops of
_bt_binsrch* where the new functionality would be used, so I've moved the
core logic of _bt_compare to a static inline function _bt_compare_inline,
which helped the code to go from a 2% TPS decrease for single-column
indexes to ~ 8% TPS increase for an insert-only workload, and for 3-column
text all-collated indexes a TPS increase of 20% on my system. This also
allowed me to keep the signature of _bt_compare the same as it was,
limiting the scope of the changes to only the named functions.

Finally, this could be a great start on prefix truncation for btree
indexes, though that is _not_ the goal of this patch. This patch skips, but
does not truncate, the common prefix.

Kind regards,

Matthias van de Meent

P.S. One more thing I noticed in benchmarking is that a significant part of
the costs of non-default collations is in looking up the collation twice in
the collation cache. My guess from flame graphs is that there could be a
large throughput increase for short strings if the collation lookup from
lc_collate_is_c() in varstr_cmp could be reused in the subsequent call to
pg_newlocale_from_collation()
System:
 Laptop (XPS 13 2018, i7-8550U 4c/8t, 16GB)

Compile configuration:
  ./configure CFLAGS= -fno-omit-frame-pointer -ggdb -O2

Postgres configuration:
  default postgresql config files.

All tests were run after getting the system hot with 15 minutes
  of other multi-thread benchmarking (~50% all-core load), and all
  tests had perf running attached to one of the backends for 900
  seconds of the 930s benchmark

PGBench command:
  pgbench postgres --random-seed=101 -P 10 -r -T 930 -c 10 -M prepared -s 100 
-f bench_multicolumn.sql

-----

\d+ pgbench_history

                                        Table "public.pgbench_history"
 Column |            Type             | Collation | Nullable | Default | 
Storage  | Stats target | Description
--------+-----------------------------+-----------+----------+---------+----------+--------------+-------------
 tid    | integer                     |           |          |         | plain  
  |              |
 bid    | integer                     |           |          |         | plain  
  |              |
 aid    | integer                     |           |          |         | plain  
  |              |
 delta  | integer                     |           |          |         | plain  
  |              |
 mtime  | timestamp without time zone |           |          |         | plain  
  |              |
 filler | character(22)               |           |          |         | 
extended |              |
Indexes:
    "pgbench_history_aid_idx" btree (aid)
    "pgbench_history_aid_idx1" btree (aid)
    "pgbench_history_aid_idx2" btree (aid)
    "pgbench_history_aid_idx3" btree (aid)
    "pgbench_history_aid_idx4" btree (aid)
Access method: heap

\d+ pgbench_history_2

                                       Table "public.pgbench_history_2"
 Column |            Type             | Collation | Nullable | Default | 
Storage  | Stats target | Description
--------+-----------------------------+-----------+----------+---------+----------+--------------+-------------
 tid    | integer                     |           |          |         | plain  
  |              |
 bid    | integer                     |           |          |         | plain  
  |              |
 aid    | integer                     |           |          |         | plain  
  |              |
 delta  | integer                     |           |          |         | plain  
  |              |
 mtime  | timestamp without time zone |           |          |         | plain  
  |              |
 filler | character(22)               |           |          |         | 
extended |              |
Indexes:
    "pgbench_history_2_bid_tid_aid_idx" btree ((bid::text) COLLATE 
"en_GB.utf8", (tid::text) COLLATE "en_GB.utf8", (aid::text) COLLATE 
"en_GB.utf8")
    "pgbench_history_2_bid_tid_aid_idx1" btree ((bid::text) COLLATE 
"en_GB.utf8", (tid::text) COLLATE "en_GB.utf8", (aid::text) COLLATE 
"en_GB.utf8")
    "pgbench_history_2_bid_tid_aid_idx2" btree ((bid::text) COLLATE 
"en_GB.utf8", (tid::text) COLLATE "en_GB.utf8", (aid::text) COLLATE 
"en_GB.utf8")
Access method: heap

-----

# cat bench.sql

\set aid random(1, 100000 * :scale)
\set bid random(1, 1 * :scale)
\set tid random(1, 10 * :scale)
\set delta random(-5000, 5000)

\set aid2 random(1, 100000 * :scale)
\set bid2 random(1, 1 * :scale)
\set tid2 random(1, 10 * :scale)
\set delta2 random(-5000, 5000)

\set aid3 random(1, 100000 * :scale)
\set bid3 random(1, 1 * :scale)
\set tid3 random(1, 10 * :scale)
\set delta3 random(-5000, 5000)

\set aid4 random(1, 100000 * :scale)
\set bid4 random(1, 1 * :scale)
\set tid4 random(1, 10 * :scale)
\set delta4 random(-5000, 5000)

\set aidsel random(1, 100000 * :scale)
\set bidsel random(1, 1 * :scale)
\set tidsel random(1, 10 * :scale)

BEGIN;
INSERT INTO pgbench_history (tid, bid, aid, delta, mtime)
SELECT tid.val, bid.val, aid.val, delta.val, CURRENT_TIMESTAMP
    FROM unnest(ARRAY[ :tid, :tid2, :tid3, :tid4 ]::int[]) AS tid(val)
    CROSS JOIN unnest(ARRAY[ :aid, :aid2, :aid3, :aid4 ]::int[]) AS aid(val)
    CROSS JOIN unnest(ARRAY[ :bid, :bid2, :bid3, :bid4 ]::int[]) AS bid(val)
    CROSS JOIN unnest(ARRAY[ :delta, :delta2, :delta3, :delta4 ]::int[]) AS 
delta(val)
ORDER BY 1,2,3,4;
END;

-----

# cat bench_multicolumn.sql

\set aid random(1, 100000 * :scale)
\set bid random(1, 1 * :scale)
\set tid random(1, 10 * :scale)
\set delta random(-5000, 5000)

\set aid2 random(1, 100000 * :scale)
\set bid2 random(1, 1 * :scale)
\set tid2 random(1, 10 * :scale)
\set delta2 random(-5000, 5000)

\set aid3 random(1, 100000 * :scale)
\set bid3 random(1, 1 * :scale)
\set tid3 random(1, 10 * :scale)
\set delta3 random(-5000, 5000)

\set aid4 random(1, 100000 * :scale)
\set bid4 random(1, 1 * :scale)
\set tid4 random(1, 10 * :scale)
\set delta4 random(-5000, 5000)

\set aidsel random(1, 100000 * :scale)
\set bidsel random(1, 1 * :scale)
\set tidsel random(1, 10 * :scale)

BEGIN;
INSERT INTO pgbench_history_2 (tid, bid, aid, delta, mtime)
SELECT tid.val, bid.val, aid.val, delta.val, CURRENT_TIMESTAMP
    FROM unnest(ARRAY[ :tid, :tid2, :tid3, :tid4 ]::int[]) AS tid(val)
    CROSS JOIN unnest(ARRAY[ :aid, :aid2, :aid3, :aid4 ]::int[]) AS aid(val)
    CROSS JOIN unnest(ARRAY[ :bid, :bid2, :bid3, :bid4 ]::int[]) AS bid(val)
    CROSS JOIN unnest(ARRAY[ :delta, :delta2, :delta3, :delta4 ]::int[]) AS 
delta(val)
ORDER BY 1,2,3,4;
END;

------

patched, 930 s, bench.sql
pgbench: setting random seed to 101
starting vacuum...end.
progress: 10.0 s, 341.0 tps, lat 29.089 ms stddev 30.810
progress: 20.0 s, 319.8 tps, lat 31.171 ms stddev 10.435
progress: 30.0 s, 247.4 tps, lat 38.007 ms stddev 18.556
progress: 40.0 s, 264.2 tps, lat 39.939 ms stddev 76.057
progress: 50.0 s, 227.6 tps, lat 43.968 ms stddev 28.792
progress: 60.0 s, 301.9 tps, lat 33.063 ms stddev 40.975
progress: 70.0 s, 229.6 tps, lat 43.487 ms stddev 108.210
progress: 80.0 s, 335.9 tps, lat 29.665 ms stddev 24.645
progress: 90.0 s, 172.6 tps, lat 57.101 ms stddev 143.803
progress: 100.0 s, 271.4 tps, lat 37.329 ms stddev 101.465
progress: 110.0 s, 223.8 tps, lat 44.595 ms stddev 54.963
progress: 120.0 s, 297.0 tps, lat 33.564 ms stddev 11.235
progress: 130.0 s, 190.0 tps, lat 52.578 ms stddev 84.232
progress: 140.0 s, 287.3 tps, lat 34.508 ms stddev 22.400
progress: 150.0 s, 244.5 tps, lat 41.034 ms stddev 59.983
progress: 160.0 s, 144.1 tps, lat 51.613 ms stddev 104.385
progress: 170.0 s, 253.4 tps, lat 49.536 ms stddev 182.602
progress: 180.0 s, 129.6 tps, lat 67.945 ms stddev 115.509
progress: 190.0 s, 219.4 tps, lat 50.908 ms stddev 171.248
progress: 200.0 s, 217.8 tps, lat 45.827 ms stddev 41.240
progress: 210.0 s, 275.8 tps, lat 36.161 ms stddev 13.430
progress: 220.0 s, 126.5 tps, lat 79.032 ms stddev 212.278
progress: 230.0 s, 239.1 tps, lat 41.366 ms stddev 32.036
progress: 240.0 s, 227.8 tps, lat 44.042 ms stddev 59.363
progress: 250.0 s, 152.7 tps, lat 62.596 ms stddev 118.120
progress: 260.0 s, 262.3 tps, lat 39.728 ms stddev 34.226
progress: 270.0 s, 208.1 tps, lat 47.917 ms stddev 76.345
progress: 280.0 s, 181.4 tps, lat 54.180 ms stddev 78.126
progress: 290.0 s, 222.6 tps, lat 45.635 ms stddev 87.984
progress: 300.0 s, 173.2 tps, lat 57.669 ms stddev 100.865
progress: 310.0 s, 244.9 tps, lat 40.697 ms stddev 11.285
progress: 320.0 s, 146.3 tps, lat 68.471 ms stddev 256.766
progress: 330.0 s, 203.7 tps, lat 47.753 ms stddev 41.044
progress: 340.0 s, 159.0 tps, lat 64.396 ms stddev 217.094
progress: 350.0 s, 136.2 tps, lat 73.353 ms stddev 227.920
progress: 360.0 s, 215.2 tps, lat 46.403 ms stddev 46.676
progress: 370.0 s, 144.9 tps, lat 68.682 ms stddev 122.000
progress: 380.0 s, 247.4 tps, lat 40.382 ms stddev 41.599
progress: 390.0 s, 200.9 tps, lat 49.744 ms stddev 98.925
progress: 400.0 s, 185.9 tps, lat 50.971 ms stddev 73.078
progress: 410.0 s, 210.1 tps, lat 49.879 ms stddev 104.948
progress: 420.0 s, 152.3 tps, lat 65.604 ms stddev 118.946
progress: 430.0 s, 255.7 tps, lat 39.041 ms stddev 11.417
progress: 440.0 s, 158.8 tps, lat 62.968 ms stddev 121.171
progress: 450.0 s, 168.5 tps, lat 57.488 ms stddev 84.297
progress: 460.0 s, 191.6 tps, lat 53.493 ms stddev 124.024
progress: 470.0 s, 116.5 tps, lat 79.623 ms stddev 277.348
progress: 480.0 s, 233.9 tps, lat 45.812 ms stddev 86.531
progress: 490.0 s, 138.4 tps, lat 72.205 ms stddev 142.392
progress: 500.0 s, 186.6 tps, lat 53.620 ms stddev 129.372
progress: 510.0 s, 200.3 tps, lat 49.802 ms stddev 46.425
progress: 520.0 s, 172.8 tps, lat 55.253 ms stddev 82.574
progress: 530.0 s, 153.3 tps, lat 67.972 ms stddev 168.677
progress: 540.0 s, 165.7 tps, lat 60.225 ms stddev 90.945
progress: 550.0 s, 261.4 tps, lat 38.208 ms stddev 14.819
progress: 560.0 s, 118.0 tps, lat 84.580 ms stddev 164.559
progress: 570.0 s, 218.8 tps, lat 45.462 ms stddev 49.899
progress: 580.0 s, 198.9 tps, lat 50.508 ms stddev 89.160
progress: 590.0 s, 125.4 tps, lat 79.599 ms stddev 149.763
progress: 600.0 s, 232.0 tps, lat 42.878 ms stddev 18.589
progress: 610.0 s, 142.0 tps, lat 70.308 ms stddev 149.538
progress: 620.0 s, 211.2 tps, lat 44.240 ms stddev 52.681
progress: 630.0 s, 181.0 tps, lat 58.747 ms stddev 193.505
progress: 640.0 s, 121.0 tps, lat 79.193 ms stddev 271.643
progress: 650.0 s, 168.6 tps, lat 61.646 ms stddev 168.953
progress: 660.0 s, 130.9 tps, lat 75.944 ms stddev 219.201
progress: 670.0 s, 196.8 tps, lat 50.921 ms stddev 55.397
progress: 680.0 s, 118.3 tps, lat 84.445 ms stddev 154.030
progress: 690.0 s, 238.6 tps, lat 41.867 ms stddev 32.984
progress: 700.0 s, 139.4 tps, lat 71.821 ms stddev 124.981
progress: 710.0 s, 202.1 tps, lat 48.222 ms stddev 24.230
progress: 720.0 s, 127.5 tps, lat 80.099 ms stddev 215.664
progress: 730.0 s, 172.4 tps, lat 56.097 ms stddev 54.861
progress: 740.0 s, 165.8 tps, lat 62.131 ms stddev 167.580
progress: 750.0 s, 133.3 tps, lat 73.846 ms stddev 215.322
progress: 760.0 s, 189.3 tps, lat 53.416 ms stddev 65.091
progress: 770.0 s, 115.6 tps, lat 86.455 ms stddev 168.831
progress: 780.0 s, 172.2 tps, lat 58.051 ms stddev 76.124
progress: 790.0 s, 146.3 tps, lat 68.235 ms stddev 137.139
progress: 800.0 s, 224.1 tps, lat 44.547 ms stddev 16.364
progress: 810.0 s, 126.2 tps, lat 79.209 ms stddev 229.841
progress: 820.0 s, 205.1 tps, lat 46.008 ms stddev 10.014
progress: 830.0 s, 127.7 tps, lat 81.160 ms stddev 175.965
progress: 840.0 s, 140.6 tps, lat 72.159 ms stddev 204.774
progress: 850.0 s, 183.8 tps, lat 54.245 ms stddev 42.740
progress: 860.0 s, 143.2 tps, lat 68.757 ms stddev 106.735
progress: 870.0 s, 154.1 tps, lat 65.654 ms stddev 158.173
progress: 880.0 s, 137.5 tps, lat 70.474 ms stddev 110.565
progress: 890.0 s, 184.7 tps, lat 55.848 ms stddev 87.179
progress: 900.0 s, 115.7 tps, lat 86.003 ms stddev 299.908
progress: 910.0 s, 194.3 tps, lat 51.392 ms stddev 13.637
progress: 920.0 s, 108.3 tps, lat 92.466 ms stddev 209.959
progress: 930.0 s, 168.1 tps, lat 47.101 ms stddev 36.065
transaction type: bench.sql
scaling factor: 100
query mode: prepared
number of clients: 10
number of threads: 1
duration: 930 s
number of transactions actually processed: 178177
latency average = 52.129 ms
latency stddev = 114.687 ms
tps = 191.518158 (including connections establishing)
tps = 191.518738 (excluding connections establishing)
statement latencies in milliseconds:
         0.003  \set aid random(1, 100000 * :scale)
         0.001  \set bid random(1, 1 * :scale)
         0.001  \set tid random(1, 10 * :scale)
         0.001  \set delta random(-5000, 5000)
         0.001  \set aid2 random(1, 100000 * :scale)
         0.001  \set bid2 random(1, 1 * :scale)
         0.001  \set tid2 random(1, 10 * :scale)
         0.001  \set delta2 random(-5000, 5000)
         0.001  \set aid3 random(1, 100000 * :scale)
         0.001  \set bid3 random(1, 1 * :scale)
         0.001  \set tid3 random(1, 10 * :scale)
         0.001  \set delta3 random(-5000, 5000)
         0.001  \set aid4 random(1, 100000 * :scale)
         0.001  \set bid4 random(1, 1 * :scale)
         0.001  \set tid4 random(1, 10 * :scale)
         0.001  \set delta4 random(-5000, 5000)
         0.001  \set aidsel random(1, 100000 * :scale)
         0.001  \set bidsel random(1, 1 * :scale)
         0.001  \set tidsel random(1, 10 * :scale)
         0.356  BEGIN;
        17.408  INSERT INTO pgbench_history (tid, bid, aid, delta, mtime)
        34.344  END;

-----

master, 930 s, bench.sql
pgbench: setting random seed to 101
starting vacuum...end.
progress: 10.0 s, 312.5 tps, lat 31.717 ms stddev 14.195
progress: 20.0 s, 262.3 tps, lat 37.993 ms stddev 11.230
progress: 30.0 s, 272.7 tps, lat 36.714 ms stddev 23.265
progress: 40.0 s, 263.0 tps, lat 37.879 ms stddev 65.930
progress: 50.0 s, 224.8 tps, lat 44.470 ms stddev 86.014
progress: 60.0 s, 286.2 tps, lat 34.900 ms stddev 13.527
progress: 70.0 s, 165.2 tps, lat 60.233 ms stddev 171.420
progress: 80.0 s, 248.7 tps, lat 40.269 ms stddev 36.533
progress: 90.0 s, 190.4 tps, lat 52.448 ms stddev 41.025
progress: 100.0 s, 115.8 tps, lat 85.946 ms stddev 293.640
progress: 110.0 s, 286.4 tps, lat 34.936 ms stddev 14.323
progress: 120.0 s, 160.8 tps, lat 62.172 ms stddev 92.608
progress: 130.0 s, 215.8 tps, lat 46.146 ms stddev 131.800
progress: 140.0 s, 182.3 tps, lat 54.950 ms stddev 71.712
progress: 150.0 s, 256.6 tps, lat 38.815 ms stddev 11.481
progress: 160.0 s, 127.3 tps, lat 72.627 ms stddev 275.988
progress: 170.0 s, 191.9 tps, lat 55.868 ms stddev 185.480
progress: 180.0 s, 214.0 tps, lat 45.943 ms stddev 39.222
progress: 190.0 s, 160.0 tps, lat 63.257 ms stddev 130.757
progress: 200.0 s, 219.6 tps, lat 43.744 ms stddev 28.191
progress: 210.0 s, 181.0 tps, lat 57.440 ms stddev 99.531
progress: 220.0 s, 151.7 tps, lat 65.776 ms stddev 213.306
progress: 230.0 s, 159.7 tps, lat 62.497 ms stddev 136.030
progress: 240.0 s, 238.4 tps, lat 39.954 ms stddev 11.079
progress: 250.0 s, 132.1 tps, lat 79.017 ms stddev 143.541
progress: 260.0 s, 226.1 tps, lat 43.954 ms stddev 35.270
progress: 270.0 s, 204.8 tps, lat 49.023 ms stddev 69.802
progress: 280.0 s, 147.9 tps, lat 55.598 ms stddev 74.641
progress: 290.0 s, 206.6 tps, lat 56.805 ms stddev 187.694
progress: 300.0 s, 143.2 tps, lat 67.343 ms stddev 118.650
progress: 310.0 s, 214.7 tps, lat 43.773 ms stddev 35.842
progress: 320.0 s, 164.4 tps, lat 66.286 ms stddev 180.354
progress: 330.0 s, 218.8 tps, lat 45.652 ms stddev 12.087
progress: 340.0 s, 84.8 tps, lat 117.904 ms stddev 273.770
progress: 350.0 s, 184.5 tps, lat 54.002 ms stddev 138.251
progress: 360.0 s, 137.6 tps, lat 72.837 ms stddev 132.434
progress: 370.0 s, 226.8 tps, lat 44.034 ms stddev 13.589
progress: 380.0 s, 129.8 tps, lat 76.705 ms stddev 184.436
progress: 390.0 s, 211.2 tps, lat 47.189 ms stddev 11.393
progress: 400.0 s, 105.4 tps, lat 94.978 ms stddev 204.029
progress: 410.0 s, 192.0 tps, lat 51.845 ms stddev 79.272
progress: 420.0 s, 153.4 tps, lat 65.260 ms stddev 117.110
progress: 430.0 s, 221.9 tps, lat 45.092 ms stddev 12.958
progress: 440.0 s, 133.4 tps, lat 74.657 ms stddev 248.591
progress: 450.0 s, 197.3 tps, lat 50.618 ms stddev 37.545
progress: 460.0 s, 127.0 tps, lat 78.603 ms stddev 178.486
progress: 470.0 s, 175.3 tps, lat 57.002 ms stddev 162.394
progress: 480.0 s, 147.5 tps, lat 67.667 ms stddev 141.490
progress: 490.0 s, 228.5 tps, lat 43.423 ms stddev 11.420
progress: 500.0 s, 126.2 tps, lat 79.533 ms stddev 198.813
progress: 510.0 s, 188.5 tps, lat 51.785 ms stddev 60.370
progress: 520.0 s, 167.0 tps, lat 61.005 ms stddev 119.650
progress: 530.0 s, 133.8 tps, lat 74.877 ms stddev 230.466
progress: 540.0 s, 210.7 tps, lat 47.299 ms stddev 31.189
progress: 550.0 s, 114.6 tps, lat 80.473 ms stddev 167.048
progress: 560.0 s, 185.0 tps, lat 58.132 ms stddev 152.718
progress: 570.0 s, 160.7 tps, lat 62.098 ms stddev 77.087
progress: 580.0 s, 204.3 tps, lat 48.962 ms stddev 18.831
progress: 590.0 s, 101.2 tps, lat 98.628 ms stddev 290.047
progress: 600.0 s, 201.8 tps, lat 49.564 ms stddev 14.297
progress: 610.0 s, 114.3 tps, lat 87.298 ms stddev 188.860
progress: 620.0 s, 194.8 tps, lat 47.937 ms stddev 85.735
progress: 630.0 s, 165.2 tps, lat 64.231 ms stddev 140.820
progress: 640.0 s, 182.4 tps, lat 54.844 ms stddev 76.489
progress: 650.0 s, 69.6 tps, lat 143.706 ms stddev 303.705
progress: 660.0 s, 210.7 tps, lat 47.316 ms stddev 14.321
progress: 670.0 s, 108.9 tps, lat 91.912 ms stddev 222.168
progress: 680.0 s, 206.9 tps, lat 47.959 ms stddev 12.506
progress: 690.0 s, 135.4 tps, lat 74.120 ms stddev 246.657
progress: 700.0 s, 202.9 tps, lat 46.287 ms stddev 11.305
progress: 710.0 s, 119.2 tps, lat 88.569 ms stddev 186.168
progress: 720.0 s, 181.4 tps, lat 54.055 ms stddev 120.903
progress: 730.0 s, 175.1 tps, lat 58.069 ms stddev 71.869
progress: 740.0 s, 131.0 tps, lat 73.536 ms stddev 152.709
progress: 750.0 s, 216.3 tps, lat 47.885 ms stddev 82.389
progress: 760.0 s, 128.8 tps, lat 77.422 ms stddev 166.961
progress: 770.0 s, 228.6 tps, lat 43.780 ms stddev 19.923
progress: 780.0 s, 131.7 tps, lat 75.922 ms stddev 205.226
progress: 790.0 s, 192.8 tps, lat 50.344 ms stddev 52.140
progress: 800.0 s, 177.0 tps, lat 57.879 ms stddev 152.870
progress: 810.0 s, 153.3 tps, lat 65.338 ms stddev 239.970
progress: 820.0 s, 231.9 tps, lat 43.010 ms stddev 12.916
progress: 830.0 s, 124.9 tps, lat 79.828 ms stddev 169.748
progress: 840.0 s, 168.6 tps, lat 59.181 ms stddev 139.497
progress: 850.0 s, 181.9 tps, lat 54.637 ms stddev 63.153
progress: 860.0 s, 154.8 tps, lat 62.182 ms stddev 76.714
progress: 870.0 s, 130.3 tps, lat 79.718 ms stddev 248.145
progress: 880.0 s, 160.3 tps, lat 54.626 ms stddev 57.105
progress: 890.0 s, 112.6 tps, lat 99.791 ms stddev 287.384
progress: 900.0 s, 109.1 tps, lat 91.646 ms stddev 318.205
progress: 910.0 s, 135.5 tps, lat 73.656 ms stddev 138.860
progress: 920.0 s, 216.3 tps, lat 46.173 ms stddev 11.084
progress: 930.0 s, 92.5 tps, lat 108.177 ms stddev 219.292
transaction type: bench.sql
scaling factor: 100
query mode: prepared
number of clients: 10
number of threads: 1
duration: 930 s
number of transactions actually processed: 164419
latency average = 56.479 ms
latency stddev = 127.463 ms
tps = 176.784451 (including connections establishing)
tps = 176.785074 (excluding connections establishing)
statement latencies in milliseconds:
         0.003  \set aid random(1, 100000 * :scale)
         0.001  \set bid random(1, 1 * :scale)
         0.001  \set tid random(1, 10 * :scale)
         0.001  \set delta random(-5000, 5000)
         0.001  \set aid2 random(1, 100000 * :scale)
         0.001  \set bid2 random(1, 1 * :scale)
         0.001  \set tid2 random(1, 10 * :scale)
         0.001  \set delta2 random(-5000, 5000)
         0.001  \set aid3 random(1, 100000 * :scale)
         0.001  \set bid3 random(1, 1 * :scale)
         0.001  \set tid3 random(1, 10 * :scale)
         0.001  \set delta3 random(-5000, 5000)
         0.001  \set aid4 random(1, 100000 * :scale)
         0.001  \set bid4 random(1, 1 * :scale)
         0.001  \set tid4 random(1, 10 * :scale)
         0.001  \set delta4 random(-5000, 5000)
         0.001  \set aidsel random(1, 100000 * :scale)
         0.001  \set bidsel random(1, 1 * :scale)
         0.001  \set tidsel random(1, 10 * :scale)
         0.361  BEGIN;
        18.019  INSERT INTO pgbench_history (tid, bid, aid, delta, mtime)
        38.078  END;


-----

master, 930s, bench_multicolumn.sql
pgbench: setting random seed to 101
starting vacuum...end.
progress: 10.0 s, 430.8 tps, lat 23.133 ms stddev 8.663
progress: 20.0 s, 396.7 tps, lat 25.159 ms stddev 9.469
progress: 30.0 s, 268.1 tps, lat 37.273 ms stddev 19.749
progress: 40.0 s, 213.6 tps, lat 46.806 ms stddev 119.588
progress: 50.0 s, 266.8 tps, lat 36.114 ms stddev 22.595
progress: 60.0 s, 190.5 tps, lat 53.182 ms stddev 112.206
progress: 70.0 s, 224.8 tps, lat 45.354 ms stddev 190.966
progress: 80.0 s, 267.4 tps, lat 37.350 ms stddev 18.173
progress: 90.0 s, 90.9 tps, lat 110.253 ms stddev 375.539
progress: 100.0 s, 236.9 tps, lat 42.115 ms stddev 213.869
progress: 110.0 s, 247.8 tps, lat 40.288 ms stddev 43.067
progress: 120.0 s, 163.6 tps, lat 61.188 ms stddev 169.101
progress: 130.0 s, 199.6 tps, lat 32.540 ms stddev 19.470
progress: 140.0 s, 115.6 tps, lat 107.340 ms stddev 352.879
progress: 150.0 s, 252.4 tps, lat 43.892 ms stddev 127.978
progress: 160.0 s, 150.2 tps, lat 49.491 ms stddev 21.708
progress: 170.0 s, 149.9 tps, lat 83.520 ms stddev 418.274
progress: 180.0 s, 225.5 tps, lat 39.943 ms stddev 20.760
progress: 190.0 s, 100.5 tps, lat 101.205 ms stddev 280.655
progress: 200.0 s, 207.0 tps, lat 52.350 ms stddev 131.781
progress: 210.0 s, 105.1 tps, lat 94.512 ms stddev 243.827
progress: 220.0 s, 219.0 tps, lat 43.176 ms stddev 21.080
progress: 230.0 s, 105.2 tps, lat 100.327 ms stddev 357.633
^[OS^[  ~^[progress: 240.0 s, 233.4 tps, lat 42.653 ms stddev 41.913
progress: 250.0 s, 123.6 tps, lat 81.513 ms stddev 219.729
progress: 260.0 s, 142.1 tps, lat 70.376 ms stddev 200.171
progress: 270.0 s, 159.8 tps, lat 62.572 ms stddev 142.659
progress: 280.0 s, 148.8 tps, lat 63.117 ms stddev 116.285
progress: 290.0 s, 175.1 tps, lat 60.219 ms stddev 182.777
progress: 300.0 s, 95.2 tps, lat 104.471 ms stddev 230.961
progress: 310.0 s, 213.3 tps, lat 47.123 ms stddev 26.251
progress: 320.0 s, 70.4 tps, lat 142.142 ms stddev 468.454
progress: 330.0 s, 192.0 tps, lat 51.311 ms stddev 38.811
progress: 340.0 s, 89.6 tps, lat 113.461 ms stddev 311.851
progress: 350.0 s, 186.9 tps, lat 52.970 ms stddev 107.756
progress: 360.0 s, 149.4 tps, lat 67.484 ms stddev 173.089
progress: 370.0 s, 101.3 tps, lat 98.464 ms stddev 363.240
progress: 380.0 s, 208.9 tps, lat 47.840 ms stddev 17.286
progress: 390.0 s, 129.3 tps, lat 77.385 ms stddev 251.631
progress: 400.0 s, 123.8 tps, lat 76.163 ms stddev 168.368
progress: 410.0 s, 167.0 tps, lat 62.836 ms stddev 121.803
progress: 420.0 s, 112.4 tps, lat 89.379 ms stddev 209.648
progress: 430.0 s, 165.4 tps, lat 52.842 ms stddev 21.411
progress: 440.0 s, 127.3 tps, lat 88.522 ms stddev 281.637
progress: 450.0 s, 94.0 tps, lat 98.969 ms stddev 214.125
progress: 460.0 s, 196.7 tps, lat 54.223 ms stddev 115.539
progress: 470.0 s, 108.8 tps, lat 91.794 ms stddev 319.129
progress: 480.0 s, 122.1 tps, lat 80.696 ms stddev 196.234
progress: 490.0 s, 170.5 tps, lat 59.599 ms stddev 122.153
progress: 500.0 s, 99.2 tps, lat 100.730 ms stddev 277.295
progress: 510.0 s, 148.5 tps, lat 63.083 ms stddev 129.138
progress: 520.0 s, 149.8 tps, lat 70.758 ms stddev 246.407
progress: 530.0 s, 88.5 tps, lat 113.259 ms stddev 332.573
progress: 540.0 s, 170.2 tps, lat 56.910 ms stddev 107.577
progress: 550.0 s, 135.0 tps, lat 76.030 ms stddev 210.734
progress: 560.0 s, 79.9 tps, lat 125.229 ms stddev 244.893
progress: 570.0 s, 181.1 tps, lat 52.076 ms stddev 23.068
progress: 580.0 s, 126.9 tps, lat 83.354 ms stddev 220.131
progress: 590.0 s, 92.7 tps, lat 107.679 ms stddev 237.332
progress: 600.0 s, 172.7 tps, lat 57.274 ms stddev 84.826
progress: 610.0 s, 145.2 tps, lat 69.548 ms stddev 149.931
progress: 620.0 s, 90.6 tps, lat 110.333 ms stddev 251.717
progress: 630.0 s, 147.6 tps, lat 60.933 ms stddev 92.366
progress: 640.0 s, 187.2 tps, lat 58.374 ms stddev 152.668
progress: 650.0 s, 187.9 tps, lat 53.002 ms stddev 110.319
progress: 660.0 s, 145.5 tps, lat 69.447 ms stddev 162.522
progress: 670.0 s, 131.2 tps, lat 76.227 ms stddev 200.570
progress: 680.0 s, 121.4 tps, lat 81.743 ms stddev 160.592
progress: 690.0 s, 155.2 tps, lat 52.632 ms stddev 70.630
progress: 700.0 s, 181.6 tps, lat 62.099 ms stddev 158.982
progress: 710.0 s, 156.5 tps, lat 65.605 ms stddev 150.245
progress: 720.0 s, 141.2 tps, lat 73.262 ms stddev 164.604
progress: 730.0 s, 120.1 tps, lat 82.681 ms stddev 140.803
progress: 740.0 s, 175.7 tps, lat 57.051 ms stddev 114.034
progress: 750.0 s, 147.6 tps, lat 67.595 ms stddev 147.916
progress: 760.0 s, 169.0 tps, lat 59.360 ms stddev 103.024
progress: 770.0 s, 156.3 tps, lat 56.774 ms stddev 64.596
progress: 780.0 s, 173.8 tps, lat 60.229 ms stddev 145.519
progress: 790.0 s, 171.7 tps, lat 60.409 ms stddev 139.820
progress: 800.0 s, 171.4 tps, lat 56.860 ms stddev 101.722
progress: 810.0 s, 136.4 tps, lat 72.898 ms stddev 193.728
progress: 820.0 s, 157.6 tps, lat 66.915 ms stddev 136.713
progress: 830.0 s, 197.3 tps, lat 50.840 ms stddev 111.424
progress: 840.0 s, 120.8 tps, lat 82.673 ms stddev 270.600
progress: 850.0 s, 151.8 tps, lat 65.834 ms stddev 132.691
progress: 860.0 s, 184.0 tps, lat 54.286 ms stddev 106.034
progress: 870.0 s, 132.6 tps, lat 75.403 ms stddev 175.190
progress: 880.0 s, 155.6 tps, lat 64.321 ms stddev 144.043
progress: 890.0 s, 138.1 tps, lat 72.183 ms stddev 149.395
progress: 900.0 s, 146.5 tps, lat 68.217 ms stddev 129.466
progress: 910.0 s, 155.5 tps, lat 59.979 ms stddev 69.778
progress: 920.0 s, 109.1 tps, lat 97.855 ms stddev 169.143
progress: 930.0 s, 115.1 tps, lat 86.888 ms stddev 157.241
transaction type: bench_multicolumn.sql
scaling factor: 100
query mode: prepared
number of clients: 10
number of threads: 1
duration: 930 s
number of transactions actually processed: 150577
latency average = 61.729 ms
latency stddev = 165.584 ms
tps = 161.902177 (including connections establishing)
tps = 161.902494 (excluding connections establishing)
statement latencies in milliseconds:
         0.002  \set aid random(1, 100000 * :scale)
         0.001  \set bid random(1, 1 * :scale)
         0.001  \set tid random(1, 10 * :scale)
         0.001  \set delta random(-5000, 5000)
         0.001  \set aid2 random(1, 100000 * :scale)
         0.001  \set bid2 random(1, 1 * :scale)
         0.001  \set tid2 random(1, 10 * :scale)
         0.001  \set delta2 random(-5000, 5000)
         0.000  \set aid3 random(1, 100000 * :scale)
         0.001  \set bid3 random(1, 1 * :scale)
         0.001  \set tid3 random(1, 10 * :scale)
         0.001  \set delta3 random(-5000, 5000)
         0.001  \set aid4 random(1, 100000 * :scale)
         0.001  \set bid4 random(1, 1 * :scale)
         0.001  \set tid4 random(1, 10 * :scale)
         0.001  \set delta4 random(-5000, 5000)
         0.001  \set aidsel random(1, 100000 * :scale)
         0.001  \set bidsel random(1, 1 * :scale)
         0.000  \set tidsel random(1, 10 * :scale)
         0.256  BEGIN;
        23.732  INSERT INTO pgbench_history_2 (tid, bid, aid, delta, mtime)
        37.728  END;

patched, 930 s, bench_multicolumn.sql
pgbench: setting random seed to 101
starting vacuum...end.
progress: 10.0 s, 245.9 tps, lat 40.426 ms stddev 13.338
progress: 20.0 s, 244.9 tps, lat 40.778 ms stddev 11.499
progress: 30.0 s, 220.2 tps, lat 44.843 ms stddev 13.181
progress: 40.0 s, 176.9 tps, lat 57.154 ms stddev 39.953
progress: 50.0 s, 213.6 tps, lat 46.762 ms stddev 13.346
progress: 60.0 s, 488.2 tps, lat 20.475 ms stddev 17.779
progress: 70.0 s, 470.3 tps, lat 21.239 ms stddev 24.805
progress: 80.0 s, 387.4 tps, lat 25.815 ms stddev 34.607
progress: 90.0 s, 400.3 tps, lat 24.957 ms stddev 23.723
progress: 100.0 s, 366.5 tps, lat 27.286 ms stddev 66.979
progress: 110.0 s, 367.3 tps, lat 27.176 ms stddev 28.625
progress: 120.0 s, 338.6 tps, lat 29.514 ms stddev 45.125
progress: 130.0 s, 304.8 tps, lat 32.780 ms stddev 61.692
progress: 140.0 s, 269.0 tps, lat 37.148 ms stddev 73.160
progress: 150.0 s, 263.4 tps, lat 37.251 ms stddev 79.585
progress: 160.0 s, 299.4 tps, lat 32.048 ms stddev 47.968
progress: 170.0 s, 261.5 tps, lat 40.426 ms stddev 84.604
progress: 180.0 s, 223.9 tps, lat 44.454 ms stddev 100.991
progress: 190.0 s, 169.4 tps, lat 59.277 ms stddev 237.757
progress: 200.0 s, 242.9 tps, lat 41.131 ms stddev 75.598
progress: 210.0 s, 297.7 tps, lat 33.600 ms stddev 61.503
progress: 220.0 s, 265.3 tps, lat 37.649 ms stddev 100.443
progress: 230.0 s, 263.1 tps, lat 37.962 ms stddev 77.040
progress: 240.0 s, 223.7 tps, lat 44.671 ms stddev 96.937
progress: 250.0 s, 208.9 tps, lat 47.850 ms stddev 163.998
progress: 260.0 s, 153.5 tps, lat 54.204 ms stddev 125.948
progress: 270.0 s, 225.4 tps, lat 49.184 ms stddev 139.220
progress: 280.0 s, 223.1 tps, lat 46.299 ms stddev 110.193
progress: 290.0 s, 249.5 tps, lat 40.970 ms stddev 91.191
progress: 300.0 s, 251.4 tps, lat 39.775 ms stddev 80.429
progress: 310.0 s, 213.1 tps, lat 46.846 ms stddev 105.240
progress: 320.0 s, 197.4 tps, lat 50.748 ms stddev 111.180
progress: 330.0 s, 165.0 tps, lat 60.244 ms stddev 151.391
progress: 340.0 s, 195.7 tps, lat 51.197 ms stddev 109.990
progress: 350.0 s, 199.4 tps, lat 48.199 ms stddev 83.645
progress: 360.0 s, 134.4 tps, lat 77.168 ms stddev 257.218
progress: 370.0 s, 109.1 tps, lat 91.737 ms stddev 239.082
progress: 380.0 s, 159.5 tps, lat 57.498 ms stddev 129.741
progress: 390.0 s, 193.0 tps, lat 54.895 ms stddev 116.293
progress: 400.0 s, 156.1 tps, lat 65.559 ms stddev 135.288
progress: 410.0 s, 210.0 tps, lat 47.507 ms stddev 93.252
progress: 420.0 s, 174.9 tps, lat 57.200 ms stddev 165.627
progress: 430.0 s, 165.3 tps, lat 60.328 ms stddev 135.303
progress: 440.0 s, 114.1 tps, lat 87.840 ms stddev 190.481
progress: 450.0 s, 138.3 tps, lat 59.793 ms stddev 102.914
progress: 460.0 s, 182.2 tps, lat 63.602 ms stddev 197.301
progress: 470.0 s, 138.0 tps, lat 65.635 ms stddev 200.784
progress: 480.0 s, 170.1 tps, lat 64.948 ms stddev 173.358
progress: 490.0 s, 169.6 tps, lat 58.762 ms stddev 139.054
progress: 500.0 s, 135.5 tps, lat 72.901 ms stddev 213.248
progress: 510.0 s, 148.8 tps, lat 68.151 ms stddev 155.461
progress: 520.0 s, 156.1 tps, lat 63.845 ms stddev 147.884
progress: 530.0 s, 173.6 tps, lat 47.442 ms stddev 64.936
progress: 540.0 s, 180.7 tps, lat 65.152 ms stddev 207.246
progress: 550.0 s, 179.4 tps, lat 55.658 ms stddev 105.423
progress: 560.0 s, 173.8 tps, lat 57.615 ms stddev 94.866
progress: 570.0 s, 179.6 tps, lat 55.548 ms stddev 130.517
progress: 580.0 s, 181.0 tps, lat 55.044 ms stddev 126.109
progress: 590.0 s, 161.5 tps, lat 62.217 ms stddev 136.538
progress: 600.0 s, 161.3 tps, lat 51.697 ms stddev 87.159
progress: 610.0 s, 167.6 tps, lat 64.527 ms stddev 199.725
progress: 620.0 s, 150.8 tps, lat 71.611 ms stddev 256.386
progress: 630.0 s, 150.7 tps, lat 66.379 ms stddev 178.892
progress: 640.0 s, 139.3 tps, lat 71.740 ms stddev 215.406
progress: 650.0 s, 162.0 tps, lat 61.436 ms stddev 214.835
progress: 660.0 s, 169.1 tps, lat 59.385 ms stddev 136.375
progress: 670.0 s, 174.4 tps, lat 57.328 ms stddev 142.646
progress: 680.0 s, 163.9 tps, lat 60.961 ms stddev 147.686
progress: 690.0 s, 168.2 tps, lat 59.408 ms stddev 113.007
progress: 700.0 s, 218.1 tps, lat 43.138 ms stddev 63.753
progress: 710.0 s, 165.2 tps, lat 61.004 ms stddev 124.893
progress: 720.0 s, 163.4 tps, lat 60.585 ms stddev 142.039
progress: 730.0 s, 172.2 tps, lat 61.205 ms stddev 158.038
progress: 740.0 s, 158.2 tps, lat 63.365 ms stddev 151.284
progress: 750.0 s, 154.9 tps, lat 64.682 ms stddev 150.947
progress: 760.0 s, 168.5 tps, lat 59.308 ms stddev 181.120
progress: 770.0 s, 145.6 tps, lat 68.708 ms stddev 203.380
progress: 780.0 s, 153.4 tps, lat 59.480 ms stddev 95.686
progress: 790.0 s, 157.3 tps, lat 68.311 ms stddev 176.298
progress: 800.0 s, 155.9 tps, lat 63.335 ms stddev 122.997
progress: 810.0 s, 136.9 tps, lat 74.575 ms stddev 186.505
progress: 820.0 s, 152.7 tps, lat 62.206 ms stddev 166.385
progress: 830.0 s, 143.3 tps, lat 73.374 ms stddev 152.323
progress: 840.0 s, 185.8 tps, lat 53.480 ms stddev 133.932
progress: 850.0 s, 154.1 tps, lat 61.599 ms stddev 192.194
progress: 860.0 s, 144.2 tps, lat 73.218 ms stddev 199.194
progress: 870.0 s, 159.3 tps, lat 60.034 ms stddev 133.204
progress: 880.0 s, 151.1 tps, lat 64.337 ms stddev 147.333
progress: 890.0 s, 140.8 tps, lat 68.930 ms stddev 172.148
progress: 900.0 s, 159.7 tps, lat 68.639 ms stddev 159.429
progress: 910.0 s, 149.5 tps, lat 66.574 ms stddev 169.599
progress: 920.0 s, 141.4 tps, lat 71.256 ms stddev 179.652
progress: 930.0 s, 111.0 tps, lat 89.830 ms stddev 177.565
transaction type: bench_multicolumn.sql
scaling factor: 100
query mode: prepared
number of clients: 10
number of threads: 1
duration: 930 s
number of transactions actually processed: 185220
latency average = 50.178 ms
latency stddev = 127.798 ms
tps = 199.157086 (including connections establishing)
tps = 199.157644 (excluding connections establishing)
statement latencies in milliseconds:
         0.002  \set aid random(1, 100000 * :scale)
         0.001  \set bid random(1, 1 * :scale)
         0.001  \set tid random(1, 10 * :scale)
         0.001  \set delta random(-5000, 5000)
         0.001  \set aid2 random(1, 100000 * :scale)
         0.001  \set bid2 random(1, 1 * :scale)
         0.001  \set tid2 random(1, 10 * :scale)
         0.001  \set delta2 random(-5000, 5000)
         0.001  \set aid3 random(1, 100000 * :scale)
         0.000  \set bid3 random(1, 1 * :scale)
         0.001  \set tid3 random(1, 10 * :scale)
         0.001  \set delta3 random(-5000, 5000)
         0.001  \set aid4 random(1, 100000 * :scale)
         0.001  \set bid4 random(1, 1 * :scale)
         0.001  \set tid4 random(1, 10 * :scale)
         0.001  \set delta4 random(-5000, 5000)
         0.001  \set aidsel random(1, 100000 * :scale)
         0.001  \set bidsel random(1, 1 * :scale)
         0.000  \set tidsel random(1, 10 * :scale)
         0.237  BEGIN;
        20.808  INSERT INTO pgbench_history_2 (tid, bid, aid, delta, mtime)
        29.121  END;
From 0fc050033563f5bef074401197d3b38425478857 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekew...@gmail.com>
Date: Wed, 2 Sep 2020 11:51:12 +0200
Subject: [PATCH] Update _bt_binsrch* to account for prefix column equality

This improves the search speed for multi-column search
 indexes, as the prefixes of hikey and lokey do not need
 to be compared when high key and low key both have N
 columns common with the searchkey.

Signed-off-by: Matthias van de Meent <boekew...@gmail.com>
---
 src/backend/access/nbtree/nbtinsert.c |   9 ++
 src/backend/access/nbtree/nbtsearch.c | 164 +++++++++++++++++++-------
 src/backend/access/nbtree/nbtutils.c  |   1 +
 src/include/access/nbtree.h           |  20 ++++
 4 files changed, 153 insertions(+), 41 deletions(-)

diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index d36f7557c8..2bf6115040 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -208,6 +208,7 @@ search:
 			/* start over... */
 			if (stack)
 				_bt_freestack(stack);
+			itup_key->searchcolneq = 1;
 			goto search;
 		}
 
@@ -402,6 +403,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 	bool		inposting = false;
 	bool		prevalldead = true;
 	int			curposti = 0;
+	int			searchcolneq = itup_key->searchcolneq;
 
 	/* Assume unique until we find a duplicate */
 	*is_unique = true;
@@ -559,6 +561,8 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 						if (nbuf != InvalidBuffer)
 							_bt_relbuf(rel, nbuf);
 						*is_unique = false;
+						/* Reset the search state to the state as we were called */
+						itup_key->searchcolneq = searchcolneq;
 						return InvalidTransactionId;
 					}
 
@@ -577,6 +581,8 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 						*speculativeToken = SnapshotDirty.speculativeToken;
 						/* Caller releases lock on buf immediately */
 						insertstate->bounds_valid = false;
+						/* Reset the search state to the state as we were called */
+						itup_key->searchcolneq = searchcolneq;
 						return xwait;
 					}
 
@@ -751,6 +757,9 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 	if (nbuf != InvalidBuffer)
 		_bt_relbuf(rel, nbuf);
 
+	/* Reset the search state to the state as we were called */
+	itup_key->searchcolneq = searchcolneq;
+
 	return InvalidTransactionId;
 }
 
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 1e628a33d7..aefc8733fd 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -45,6 +45,7 @@ static bool _bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno,
 static Buffer _bt_walk_left(Relation rel, Buffer buf, Snapshot snapshot);
 static bool _bt_endpoint(IndexScanDesc scan, ScanDirection dir);
 static inline void _bt_initialize_more_data(BTScanOpaque so, ScanDirection dir);
+static inline BTCompareResult _bt_compare_inline(Relation rel, BTScanInsert key, Page page, OffsetNumber offnum, int scanStartCol);
 
 
 /*
@@ -348,8 +349,10 @@ _bt_binsrch(Relation rel,
 	BTPageOpaque opaque;
 	OffsetNumber low,
 				high;
-	int32		result,
-				cmpval;
+	int32		cmpval;
+	int			higheqcol,
+				loweqcol;
+	BTCompareResult result;
 
 	page = BufferGetPage(buf);
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -388,20 +391,33 @@ _bt_binsrch(Relation rel,
 
 	cmpval = key->nextkey ? 0 : 1;	/* select comparison value */
 
+	higheqcol = key->searchcolneq;
+	loweqcol = key->searchcolneq;
+
 	while (high > low)
 	{
 		OffsetNumber mid = low + ((high - low) / 2);
 
 		/* We have low <= mid < high, so mid points at a real slot */
 
-		result = _bt_compare(rel, key, page, mid);
+		result = _bt_compare_inline(rel, key, page, mid, Min(higheqcol, loweqcol));
 
-		if (result >= cmpval)
+		Assert(result.compareResult == _bt_compare(rel, key, page, mid));
+
+		if (result.compareResult >= cmpval)
+		{
 			low = mid + 1;
+			loweqcol = result.comparedColumn;
+		}
 		else
+		{
 			high = mid;
+			higheqcol = result.comparedColumn;
+		}
 	}
 
+	key->searchcolneq = Min(loweqcol, higheqcol);
+
 	/*
 	 * At this point we have high == low, but be careful: they could point
 	 * past the last slot on the page.
@@ -452,8 +468,10 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
 	OffsetNumber low,
 				high,
 				stricthigh;
-	int32		result,
-				cmpval;
+	int32		cmpval;
+	int			higheqcol,
+				loweqcol;
+	BTCompareResult result;
 
 	page = BufferGetPage(insertstate->buf);
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -500,6 +518,8 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
 	stricthigh = high;			/* high initially strictly higher */
 
 	cmpval = 1;					/* !nextkey comparison value */
+	loweqcol = key->searchcolneq;
+	higheqcol = key->searchcolneq;
 
 	while (high > low)
 	{
@@ -507,14 +527,20 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
 
 		/* We have low <= mid < high, so mid points at a real slot */
 
-		result = _bt_compare(rel, key, page, mid);
+		result = _bt_compare_inline(rel, key, page, mid, Min(loweqcol, higheqcol));
+		Assert(result.compareResult == _bt_compare(rel, key, page, mid));
 
-		if (result >= cmpval)
+		if (result.compareResult >= cmpval)
+		{
 			low = mid + 1;
+			loweqcol = result.comparedColumn;
+		}
 		else
 		{
 			high = mid;
-			if (result != 0)
+			higheqcol = result.comparedColumn;
+
+			if (result.compareResult != 0)
 				stricthigh = high;
 		}
 
@@ -525,10 +551,12 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
 		 * posting list when postingoff is set.  This should happen
 		 * infrequently.
 		 */
-		if (unlikely(result == 0 && key->scantid != NULL))
+		if (unlikely(result.compareResult == 0 && key->scantid != NULL))
 			insertstate->postingoff = _bt_binsrch_posting(key, page, mid);
 	}
 
+	key->searchcolneq = Min(loweqcol, higheqcol);
+
 	/*
 	 * On a leaf page, a binary search always returns the first key >= scan
 	 * key (at least in !nextkey case), which could be the last slot + 1. This
@@ -623,6 +651,36 @@ _bt_binsrch_posting(BTScanInsert key, Page page, OffsetNumber offnum)
  *			 0 if scankey == tuple at offnum;
  *			>0 if scankey > tuple at offnum.
  *
+ * See _bt_compare_inline() for more specific information.
+ *----------
+ */
+int32
+_bt_compare(Relation rel,
+			BTScanInsert key,
+			Page page,
+			OffsetNumber offnum)
+{
+	BTCompareResult result = _bt_compare_inline(rel, key, page, offnum, 1);
+	return result.compareResult;
+}
+
+/*----------
+ *	_bt_compare_inline() -- Compare insertion-type scankey to tuple on a page
+ *
+ *	page/offnum: location of btree item to be compared to.
+ *	scanStartCol: first column of key to start comparing. 1-based
+ *
+ *		This routine returns:
+ *
+ *		BTCompareResult.compareResult:
+ *			<0 if scankey < tuple at offnum;
+ *			 0 if scankey == tuple at offnum;
+ *			>0 if scankey > tuple at offnum.
+ *		BTCompareResult.comparedColumn:
+ *			if all columns are equal, the number of indexed
+ *			columns in the scankey + 1, otherwise the 1-indexed column
+ *			number that was compared to get the non-0 result.
+ *
  * NULLs in the keys are treated as sortable values.  Therefore
  * "equality" does not necessarily mean that the item should be returned
  * to the caller as a matching key.  Similarly, an insertion scankey
@@ -640,11 +698,12 @@ _bt_binsrch_posting(BTScanInsert key, Page page, OffsetNumber offnum)
  * key.  See backend/access/nbtree/README for details.
  *----------
  */
-int32
-_bt_compare(Relation rel,
-			BTScanInsert key,
-			Page page,
-			OffsetNumber offnum)
+inline static BTCompareResult
+_bt_compare_inline(Relation rel,
+				   BTScanInsert key,
+				   Page page,
+				   OffsetNumber offnum,
+				   int scanStartCol)
 {
 	TupleDesc	itupdesc = RelationGetDescr(rel);
 	BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -653,18 +712,23 @@ _bt_compare(Relation rel,
 	ScanKey		scankey;
 	int			ncmpkey;
 	int			ntupatts;
-	int32		result;
+	BTCompareResult result;
 
 	Assert(_bt_check_natts(rel, key->heapkeyspace, page, offnum));
 	Assert(key->keysz <= IndexRelationGetNumberOfKeyAttributes(rel));
 	Assert(key->heapkeyspace || key->scantid == NULL);
+	Assert(scanStartCol >= 1);
 
 	/*
 	 * Force result ">" if target item is first data item on an internal page
 	 * --- see NOTE above.
 	 */
 	if (!P_ISLEAF(opaque) && offnum == P_FIRSTDATAKEY(opaque))
-		return 1;
+	{
+		result.compareResult = 1;
+		result.comparedColumn = scanStartCol;
+		return result;
+	}
 
 	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
 	ntupatts = BTreeTupleGetNAtts(itup, rel);
@@ -684,8 +748,12 @@ _bt_compare(Relation rel,
 	ncmpkey = Min(ntupatts, key->keysz);
 	Assert(key->heapkeyspace || ncmpkey == key->keysz);
 	Assert(!BTreeTupleIsPosting(itup) || key->allequalimage);
-	scankey = key->scankeys;
-	for (int i = 1; i <= ncmpkey; i++)
+	Assert(scanStartCol <= ncmpkey + 1);
+
+	scankey = key->scankeys + (scanStartCol - 1);
+	result.comparedColumn = scanStartCol;
+
+	for (; result.comparedColumn <= ncmpkey; result.comparedColumn++)
 	{
 		Datum		datum;
 		bool		isNull;
@@ -695,18 +763,18 @@ _bt_compare(Relation rel,
 		if (scankey->sk_flags & SK_ISNULL)	/* key is NULL */
 		{
 			if (isNull)
-				result = 0;		/* NULL "=" NULL */
+				result.compareResult = 0;		/* NULL "=" NULL */
 			else if (scankey->sk_flags & SK_BT_NULLS_FIRST)
-				result = -1;	/* NULL "<" NOT_NULL */
+				result.compareResult = -1;	/* NULL "<" NOT_NULL */
 			else
-				result = 1;		/* NULL ">" NOT_NULL */
+				result.compareResult = 1;		/* NULL ">" NOT_NULL */
 		}
 		else if (isNull)		/* key is NOT_NULL and item is NULL */
 		{
 			if (scankey->sk_flags & SK_BT_NULLS_FIRST)
-				result = 1;		/* NOT_NULL ">" NULL */
+				result.compareResult = 1;		/* NOT_NULL ">" NULL */
 			else
-				result = -1;	/* NOT_NULL "<" NULL */
+				result.compareResult = -1;	/* NOT_NULL "<" NULL */
 		}
 		else
 		{
@@ -718,17 +786,17 @@ _bt_compare(Relation rel,
 			 * to flip the sign of the comparison result.  (Unless it's a DESC
 			 * column, in which case we *don't* flip the sign.)
 			 */
-			result = DatumGetInt32(FunctionCall2Coll(&scankey->sk_func,
-													 scankey->sk_collation,
-													 datum,
-													 scankey->sk_argument));
+			result.compareResult = DatumGetInt32(FunctionCall2Coll(&scankey->sk_func,
+																   scankey->sk_collation,
+																   datum,
+																   scankey->sk_argument));
 
 			if (!(scankey->sk_flags & SK_BT_DESC))
-				INVERT_COMPARE_RESULT(result);
+				INVERT_COMPARE_RESULT(result.compareResult);
 		}
 
 		/* if the keys are unequal, return the difference */
-		if (result != 0)
+		if (result.compareResult != 0)
 			return result;
 
 		scankey++;
@@ -744,7 +812,10 @@ _bt_compare(Relation rel,
 	 * necessary.
 	 */
 	if (key->keysz > ntupatts)
-		return 1;
+	{
+		result.compareResult = 1;
+		return result;
+	}
 
 	/*
 	 * Use the heap TID attribute and scantid to try to break the tie.  The
@@ -789,10 +860,14 @@ _bt_compare(Relation rel,
 		 */
 		if (key->heapkeyspace && !key->pivotsearch &&
 			key->keysz == ntupatts && heapTid == NULL)
-			return 1;
+		{
+			result.compareResult = 1;
+			return result;
+		}
 
 		/* All provided scankey arguments found to be equal */
-		return 0;
+		result.compareResult = 0;
+		return result;
 	}
 
 	/*
@@ -801,7 +876,10 @@ _bt_compare(Relation rel,
 	 */
 	Assert(key->keysz == IndexRelationGetNumberOfKeyAttributes(rel));
 	if (heapTid == NULL)
-		return 1;
+	{
+		result.compareResult = 1;
+		return result;
+	}
 
 	/*
 	 * Scankey must be treated as equal to a posting list tuple if its scantid
@@ -810,18 +888,19 @@ _bt_compare(Relation rel,
 	 * with scantid.
 	 */
 	Assert(ntupatts >= IndexRelationGetNumberOfKeyAttributes(rel));
-	result = ItemPointerCompare(key->scantid, heapTid);
-	if (result <= 0 || !BTreeTupleIsPosting(itup))
+	result.compareResult = ItemPointerCompare(key->scantid, heapTid);
+	if (result.compareResult <= 0 || !BTreeTupleIsPosting(itup))
 		return result;
 	else
 	{
-		result = ItemPointerCompare(key->scantid,
-									BTreeTupleGetMaxHeapTID(itup));
-		if (result > 0)
-			return 1;
+		result.compareResult = ItemPointerCompare(key->scantid,
+												  BTreeTupleGetMaxHeapTID(itup));
+		if (result.compareResult > 0)
+			return result;
 	}
 
-	return 0;
+	result.compareResult = 0;
+	return result;
 }
 
 /*
@@ -1340,6 +1419,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	inskey.anynullkeys = false; /* unused */
 	inskey.nextkey = nextkey;
 	inskey.pivotsearch = false;
+	inskey.searchcolneq = 1;
 	inskey.scantid = NULL;
 	inskey.keysz = keysCount;
 
@@ -1375,6 +1455,8 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 
 	_bt_initialize_more_data(so, dir);
 
+	/* reset the column search position for the following statement */
+	inskey.searchcolneq = 1;
 	/* position to the precise item on the page */
 	offnum = _bt_binsrch(rel, &inskey, buf);
 
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 81589b9056..5c058ca9a8 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -122,6 +122,7 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
 	key->anynullkeys = false;	/* initial assumption */
 	key->nextkey = false;
 	key->pivotsearch = false;
+	key->searchcolneq = 1;
 	key->keysz = Min(indnkeyatts, tupnatts);
 	key->scantid = key->heapkeyspace && itup ?
 		BTreeTupleGetHeapTID(itup) : NULL;
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 65d9698b89..9aebff05c0 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -639,6 +639,12 @@ typedef BTStackData *BTStack;
  * using a scankey built from a leaf page's high key.  Most callers set this
  * to false.
  *
+ * scancolneq contains an integer that is the first column number of the
+ * current index scan location that has not yet proven to equal the search
+ * key's columns. This helps by not repeatedly having to compare columns
+ * that we know are equal.
+ * This is incremented by _bt_search and _bt_search_insert
+ *
  * scantid is the heap TID that is used as a final tiebreaker attribute.  It
  * is set to NULL when index scan doesn't need to find a position for a
  * specific physical tuple.  Must be set when inserting new tuples into
@@ -662,6 +668,7 @@ typedef struct BTScanInsertData
 	bool		nextkey;
 	bool		pivotsearch;
 	ItemPointer scantid;		/* tiebreaker for scankeys */
+	int			searchcolneq;	/* column that is not yet equal */
 	int			keysz;			/* Size of scankeys array */
 	ScanKeyData scankeys[INDEX_MAX_KEYS];	/* Must appear last */
 } BTScanInsertData;
@@ -945,6 +952,19 @@ typedef struct BTScanOpaqueData
 
 typedef BTScanOpaqueData *BTScanOpaque;
 
+/*
+ * When two tuples are compared, we get a comparison result for one of the
+ * columns. While the result itself is the most interesting, the column
+ * can be used to further optimize searching in the index.
+ *
+ * See also: _bt_compare() and _bt_compare_inline()
+ */
+typedef struct BTCompareResult
+{
+	int32		compareResult; /* the value of the comparison between the first non-equal columns */
+	int			comparedColumn; /* the column number of the first non-equal column */
+} BTCompareResult;
+
 /*
  * We use some private sk_flags bits in preprocessed scan keys.  We're allowed
  * to use bits 16-31 (see skey.h).  The uppermost bits are copied from the
-- 
2.20.1

Reply via email to