On 4/21/22 10:14, David Rowley wrote:
> On Wed, 20 Apr 2022 at 14:56, Bruce Momjian <br...@momjian.us> wrote:
>> NVMe devices have a maximum queue length of 64k:
>
>> Should we increase its maximum to 64k? Backpatched? (SATA has a
>> maximum queue length of 256.)
>
> I have a machine here with 1 x PCIe 3.0 NVMe SSD and also 1 x PCIe 4.0
> NVMe SSD. I ran a few tests to see how different values of
> effective_io_concurrency would affect performance. I tried to come up
> with a query that did little enough CPU processing to ensure that I/O
> was the clear bottleneck.
>
> The test was with a 128GB table on a machine with 64GB of RAM. I
> padded the tuples out so there were 4 per page so that the aggregation
> didn't have much work to do.
>
> The query I ran was: explain (analyze, buffers, timing off) select
> count(p) from r where a = 1;
>
> Here's what I saw:
>
> NVME PCIe 3.0 (Samsung 970 Evo 1TB)
> e_i_c query_time_ms
> 0 88627.221
> 1 652915.192
> 5 271536.054
> 10 141168.986
> 100 67340.026
> 1000 70686.596
> 10000 70027.938
> 100000 70106.661
>
> Saw a max of 991 MB/sec in iotop
>
> NVME PCIe 4.0 (Samsung 980 Pro 1TB)
> e_i_c query_time_ms
> 0 59306.960
> 1 956170.704
> 5 237879.121
> 10 135004.111
> 100 55662.030
> 1000 51513.717
> 10000 59807.824
> 100000 53443.291
>
> Saw a max of 1126 MB/sec in iotop
>
> I'm not pretending that this is the best query and table size to show
> it, but at least this test shows that there's not much to gain by
> prefetching further. I imagine going further than we need to is
> likely to have negative consequences due to populating the kernel page
> cache with buffers that won't be used for a while. I also imagine
> going too far out likely increases the risk that buffers we've
> prefetched are evicted before they're used.
>
Not sure.
I don't think the risk of polluting the cache is very high, because the
1k buffers is 8MB and 64k would be 512MB. That's significant, but likely
just a tiny fraction of the available memory in machines with NVME.
Sure, there may be multiple sessions doing prefetch, but chances the
sessions touch the same data etc.
> This does also highlight that an effective_io_concurrency of 1 (the
> default) is pretty terrible in this test. The bitmap contained every
> 2nd page. I imagine that would break normal page prefetching by the
> kernel. If that's true, then it does not explain why e_i_c = 0 was so
> fast.
>
Yeah, this default is clearly pretty unfortunate. I think the problem is
that async request is not free, i.e. prefetching means
async request + read
and the prefetch trick is in assuming that
cost(async request) << cost(read)
and moving the read to a background thread. But the NVMe make reads
cheaper, so the amount of work moved to the background thread gets
lower, while the cost of the async request remains roughly the same.
Which means the difference (benefit) decreases over time.
Also, recent NVMe devices (like Intel Optane) aim to require lower queue
depths, so although the NVMe spec supports 64k queues and 64k commands
per queue, that does not mean you need to use that many requests to get
good performance.
As for the strange behavior with e_i_c=0, I think this can be explained
by how NVMe work internally. A simplified model of NVMe device is "slow"
flash with a DRAM cache, and AFAIK the data is not read from flash into
DRAM in 8kB pages but larger chunks. So even if there's no explicit OS
readahead, the device may still cache larger chunks in the DRAM buffer.
> I've attached the test setup that I did. I'm open to modifying the
> test and running again if someone has an idea that might show benefits
> to larger values for effective_io_concurrency.
>
I think it'd be interesting to test different / less regular patterns,
not just every 2nd page etc.
The other idea I had while looking at batching a while back, is that we
should batch the prefetches. The current logic interleaves prefetches
with other work - prefetch one page, process one page, ... But once
reading a page gets sufficiently fast, this means the queues never get
deep enough for optimizations. So maybe we should think about batching
the prefetches, in some way. Unfortunately posix_fadvise does not allow
batching of requests, but we can at least stop interleaving the requests.
The attached patch is a trivial version that waits until we're at least
32 pages behind the target, and then prefetches all of them. Maybe give
it a try? (This pretty much disables prefetching for e_i_c below 32, but
for an experimental patch that's enough.)
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
index f6fe07ad703..550d0f387c1 100644
--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -472,6 +472,11 @@ BitmapPrefetch(BitmapHeapScanState *node, TableScanDesc scan)
if (prefetch_iterator)
{
+
+ /* wait until we can prefetch a batch of buffers */
+ if (node->prefetch_pages > node->prefetch_target - 32)
+ return;
+
while (node->prefetch_pages < node->prefetch_target)
{
TBMIterateResult *tbmpre = tbm_iterate(prefetch_iterator);
@@ -517,7 +522,16 @@ BitmapPrefetch(BitmapHeapScanState *node, TableScanDesc scan)
if (prefetch_iterator)
{
- while (1)
+ bool have_batch = false;
+
+ SpinLockAcquire(&pstate->mutex);
+ if (pstate->prefetch_pages < pstate->prefetch_target - 32)
+ {
+ have_batch = true;
+ }
+ SpinLockRelease(&pstate->mutex);
+
+ while (have_batch)
{
TBMIterateResult *tbmpre;
bool do_prefetch = false;