On Tue, May 19, 2020 at 09:15:40PM -0700, Jeff Davis wrote:
On Tue, 2020-05-19 at 19:53 +0200, Tomas Vondra wrote:

And if there a way to pre-allocate larger chunks? Presumably we could
assign the blocks to tape in larger chunks (e.g. 128kB, i.e. 16 x
8kB)
instead of just single block. I haven't seen anything like that in
tape.c, though ...

It turned out to be simple (at least a POC) so I threw together a
patch. I just added a 32-element array of block numbers to each tape.
When we need a new block, we retrieve a block number from that array;
or if it's empty, we fill it by calling ltsGetFreeBlock() 32 times.

I reproduced the problem on a smaller scale (330M groups, ~30GB of
memory on a 16GB box). Work_mem=64MB. The query is a simple distinct.

Unpatched master:
  Sort: 250s
  HashAgg: 310s
Patched master:
  Sort: 245s
  HashAgg: 262s

That's a nice improvement for such a simple patch. We can tweak the
number of blocks to preallocate, or do other things like double from a
small number up to a maximum. Also, a proper patch would probably
release the blocks back as free when the tape was rewound.

As long as the number of block numbers to preallocate is not too large,
I don't think we need to change the API. It seems fine for sort to do
the same thing, even though there's not any benefit.


I gave it a try on the machine with temp tablespace on SSD, and I can
confirm it improves performance. I've tried with different work_mem
values and I've also increased the number of pre-allocated blocks to 64
and 128 blocks, and the numbers look like this:

master

               sort      hash
   ----------------------------
      4MB       335      1331
    128MB       220      1208


patched (32)

               sort      hash
   ----------------------------
      4MB       344       685
    128MB       217       641


patched (64)

                sort      hash
   ----------------------------
      4MB       329       545
    128MB       214       493

patched (128)

               sort      hash
   ----------------------------
      4MB       331       478
    128MB       222       434


I agree that's pretty nice. I wonder how far would we need to go before
reaching a plateau. I'll try this on the other machine with temporary
tablespace on SATA, but that'll take longer.

The I/O pattern changed significantly - it's not visible on the charts,
so I'm not attaching them. But the statistics of block sizes and "gaps"
are pretty clear.


size of I/O requests
--------------------

a) master

type | bytes | count | pct ------+---------+---------+--------
     RA   |    8192 | 2905948 |  95.83
     RA   |   24576 |   63470 |   2.09
     RA   |   16384 |   40155 |   1.32
     W    |    8192 |  149295 |  52.85
     W    |   16384 |   51781 |  18.33
     W    |   24576 |   22247 |   7.88
     W    | 1310720 |   15493 |   5.48
     W    |   32768 |   11856 |   4.20

b) patched, 32 blocks

     type |  bytes  | count  |  pct
    ------+---------+--------+--------
     RA   |  131072 | 247686 |  41.75
     RA   |    8192 |  95746 |  16.14
     RA   |   16384 |  82314 |  13.87
     RA   |   32768 |  82146 |  13.85
     RA   |   65536 |  82126 |  13.84
     W    | 1310720 |  16815 |  52.19
     W    |  262144 |   3628 |  11.26
     W    |  524288 |   2791 |   8.66

c) patched, 64 blocks

     type |  bytes  | count  |  pct
    ------+---------+--------+--------
     RA   |  131072 | 213556 |  56.18
     RA   |    8192 |  47663 |  12.54
     RA   |   16384 |  39358 |  10.35
     RA   |   32768 |  39308 |  10.34
     RA   |   65536 |  39304 |  10.34
     W    | 1310720 |  18132 |  65.27
     W    |  524288 |   3722 |  13.40
     W    |  262144 |    581 |   2.09
     W    | 1048576 |    405 |   1.46
     W    |    8192 |    284 |   1.02

d) patched, 128 blocks

     type |  bytes  | count  |  pct
    ------+---------+--------+--------
     RA   |  131072 | 200816 |  70.93
     RA   |    8192 |  23640 |   8.35
     RA   |   16384 |  19324 |   6.83
     RA   |   32768 |  19279 |   6.81
     RA   |   65536 |  19273 |   6.81
     W    | 1310720 |  18000 |  65.91
     W    |  524288 |   2074 |   7.59
     W    | 1048576 |    660 |   2.42
     W    |    8192 |    409 |   1.50
     W    |  786432 |    354 |   1.30

Clearly, the I/O requests are much larger - both reads and writes
shifted from 8kB to much larger ones, and the larger the number of
blocks the more significant the shift is. This means the workload is
getting more "sequential" and the write combining / read-ahead becomes
more effective.


deltas between I/O requests
---------------------------

I'll only show reads to save space, it's about the same for writes.

a) master

type | block_delta | count | pct ------+-------------+--------+-------
     RA   |         256 | 569237 | 18.77
     RA   |         240 | 475182 | 15.67
     RA   |         272 | 437260 | 14.42
     RA   |         224 | 328604 | 10.84
     RA   |         288 | 293628 |  9.68
     RA   |         208 | 199530 |  6.58
     RA   |         304 | 181695 |  5.99
     RA   |         192 | 109472 |  3.61
     RA   |         320 | 105211 |  3.47
     RA   |         336 |  57423 |  1.89

b) patched, 32 blocks

type | block_delta | count | pct ------+-------------+--------+-------
     RA   |         256 | 165071 | 27.82
     RA   |          32 |  82129 | 13.84
     RA   |          64 |  82122 | 13.84
     RA   |         128 |  82077 | 13.83
     RA   |          16 |  82042 | 13.83
     RA   |        7440 |  45168 |  7.61
     RA   |        7952 |   9838 |  1.66

c) patched, 64 blocks

     type | block_delta | count  |  pct
    ------+-------------+--------+-------
     RA   |         256 | 173737 | 45.70
     RA   |          32 |  39301 | 10.34
     RA   |          64 |  39299 | 10.34
     RA   |         128 |  39291 | 10.34
     RA   |          16 |  39250 | 10.32
     RA   |       15120 |  21202 |  5.58
     RA   |       15376 |   4448 |  1.17

d) patched, 128 blocks

     type | block_delta | count  |  pct
    ------+-------------+--------+-------
     RA   |         256 | 180955 | 63.91
     RA   |          32 |  19274 |  6.81
     RA   |          64 |  19273 |  6.81
     RA   |         128 |  19264 |  6.80
     RA   |          16 |  19203 |  6.78
     RA   |       30480 |   9835 |  3.47

The way I understand it, this needs to be interpreted together with
block size stats - in a perfectly sequential workload the two stats
would match. For master that's clearly not the case - the most common
read request size is 8kB, but the most common delta is 128kB (256
sectors, which is the read-ahead for the SSD device). The patched
results are much closer, mostly thanks to switching to 128kB reads.


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Reply via email to