On 6/7/22 15:29, Jakub Wartak wrote:
> Hi Tomas,
>
>>> I have a machine here with 1 x PCIe 3.0 NVMe SSD and also 1 x PCIe 4.0
>>> NVMe SSD. I ran a few tests to see how different values of
>>> effective_io_concurrency would affect performance. I tried to come up
>>> with a query that did little enough CPU processing to ensure that I/O
>>> was the clear bottleneck.
>>>
>>> The test was with a 128GB table on a machine with 64GB of RAM. I
>>> padded the tuples out so there were 4 per page so that the aggregation
>>> didn't have much work to do.
>>>
>>> The query I ran was: explain (analyze, buffers, timing off) select
>>> count(p) from r where a = 1;
>
>> The other idea I had while looking at batching a while back, is that we
>> should batch the prefetches. The current logic interleaves prefetches
>> with other work - prefetch one page, process one page, ... But once
>> reading a page gets sufficiently fast, this means the queues never get
>> deep enough for optimizations. So maybe we should think about batching
>> the prefetches, in some way. Unfortunately posix_fadvise does not allow
>> batching of requests, but we can at least stop interleaving the requests.
>
> .. for now it doesn't, but IORING_OP_FADVISE is on the long-term horizon.
>

Interesting! Will take time to get into real systems, though.
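FWIW, once that's available, batching the advice should be mostly a matter
of stuffing a bunch of IORING_OP_FADVISE requests into the submission queue
and submitting them in one go. A minimal (untested) sketch using liburing,
assuming a 5.6+ kernel - the fd, block size and block list are just
placeholders, nothing from the patch:

#include <fcntl.h>
#include <liburing.h>

#define BLCKSZ 8192

static void
prefetch_blocks_batched(int fd, const unsigned *blocks, int nblocks)
{
    struct io_uring ring;

    if (io_uring_queue_init(nblocks, &ring, 0) < 0)
        return;                 /* fall back to plain posix_fadvise() */

    for (int i = 0; i < nblocks; i++)
    {
        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);

        if (sqe == NULL)
            break;

        io_uring_prep_fadvise(sqe, fd, (__u64) blocks[i] * BLCKSZ,
                              BLCKSZ, POSIX_FADV_WILLNEED);
    }

    /* the whole batch goes down in a single syscall */
    io_uring_submit(&ring);

    /* WILLNEED is advisory, so don't bother waiting for completions */
    io_uring_queue_exit(&ring);
}

That'd be one syscall per batch instead of one fadvise() per page, on top
of actually letting the device see a deeper queue.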
>> The attached patch is a trivial version that waits until we're at least
>> 32 pages behind the target, and then prefetches all of them. Maybe give
>> it a try? (This pretty much disables prefetching for e_i_c below 32, but
>> for an experimental patch that's enough.)
>
> I've tried it at e_i_c=10 initially on David's setup.sql, and most defaults
> s_b=128MB, dbsize=8kb but with forced disabled parallel query (for easier
> inspection with strace just to be sure//so please don't compare times).
>
> run:
> a) master (e_i_c=10) 181760ms, 185680ms, 185384ms @ ~ 340MB/s and 44k IOPS
> (~122k IOPS practical max here for libaio)
> b) patched(e_i_c=10) 237774ms, 236326ms, ..as you stated it disabled
> prefetching, fadvise() not occurring
> c) patched(e_i_c=128) 90430ms, 88354ms, 85446ms, 78475ms, 74983ms, 81432ms
> (mean=83186ms +/- 5947ms) @ ~570MB/s and 75k IOPS (it even peaked for a
> second on ~122k)
> d) master (e_i_c=128) 116865ms, 101178ms, 89529ms, 95024ms, 89942ms, 99939ms
> (mean=98746ms +/- 10118ms) @ ~510MB/s and 65k IOPS (rare peaks to 90..100k
> IOPS)
>
> ~16% benefit sounds good (help me understand: L1i cache?). Maybe it is worth
> throwing that patch onto more advanced / complete performance test farm too?
> (although it's only for bitmap heap scans)
>
> run a: looked interleaved as you said:
> fadvise64(160, 1064157184, 8192, POSIX_FADV_WILLNEED) = 0
> pread64(160, "@\0\0\0\200\303/_\0\0\4\0(\0\200\0\0 \4 \0\0\0\0
> \230\300\17@\220\300\17"..., 8192, 1064009728) = 8192
> fadvise64(160, 1064173568, 8192, POSIX_FADV_WILLNEED) = 0
> pread64(160, "@\0\0\0\0\0040_\0\0\4\0(\0\200\0\0 \4 \0\0\0\0
> \230\300\17@\220\300\17"..., 8192, 1064026112) = 8192
> [..]
>
> BTW: interesting note, for run b, the avgrq-sz from extended iostat jumps is
> flipping between 16(*512=8kB) to ~256(*512=~128kB!) as if kernel was doing
> some own prefetching heuristics on and off in cycles, while when calling
> e_i_c/fadvise() is in action then it seems to be always 8kB requests. So with
> disabled fadvise() one IMHO might have problems deterministically
> benchmarking short queries as kernel voodoo might be happening (?)
>

Yes, the kernel certainly does its own read-ahead, which works pretty well
for sequential patterns. What does blockdev --getra /dev/... say? (The value
is in 512-byte sectors, so the common default of 256 would mean 128kB of
read-ahead, which would match the ~128kB requests you're seeing in run b.)

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
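PS: For anyone else following along, the experimental patch conceptually
turns the interleaved pattern from the strace above into something like the
loop below - issuing the accumulated fadvise() calls in chunks rather than
one per page. Just a sketch of the general idea, not the actual patch;
read_page() and the constants are made up:

#include <fcntl.h>
#include <unistd.h>

#define BLCKSZ          8192
#define PREFETCH_BATCH  32

/* stand-in for reading and processing one heap page (no error handling) */
static void
read_page(int fd, unsigned blkno)
{
    char        buf[BLCKSZ];

    pread(fd, buf, BLCKSZ, (off_t) blkno * BLCKSZ);
}

static void
scan_batched(int fd, const unsigned *blocks, int nblocks)
{
    int         prefetched = 0;     /* blocks[0 .. prefetched-1] already advised */

    for (int i = 0; i < nblocks; i++)
    {
        /* once the advice falls behind the reads, catch up in one go */
        if (prefetched <= i)
        {
            int         stop = i + PREFETCH_BATCH;

            if (stop > nblocks)
                stop = nblocks;

            for (int j = prefetched; j < stop; j++)
                posix_fadvise(fd, (off_t) blocks[j] * BLCKSZ, BLCKSZ,
                              POSIX_FADV_WILLNEED);

            prefetched = stop;
        }

        read_page(fd, blocks[i]);
    }
}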