On 6/7/22 11:46, Jakub Wartak wrote:
> Hi Tomas,
> 
>> Well, there's plenty of charts in the github repositories, including the 
>> charts I
>> think you're asking for:
> 
> Thanks.
> 
>> I also wonder how this is related to the filesystem page size - in all
>> the benchmarks I did I used the default (4k), but maybe it'd behave
>> differently if the filesystem page matched the data page.
> 
> That may be it - using fio on raw NVMe device (without fs/VFS at all) shows:
> 
> [root@x libaio-raw]# grep -r -e 'write:' -e 'read :' *
> nvme/randread/128/1k/1.txt:  read : io=7721.9MB, bw=131783KB/s, iops=131783, 
> runt= 60001msec [b]
> nvme/randread/128/2k/1.txt:  read : io=15468MB, bw=263991KB/s, iops=131995, 
> runt= 60001msec [b] 
> nvme/randread/128/4k/1.txt:  read : io=30142MB, bw=514408KB/s, iops=128602, 
> runt= 60001msec [b]
> nvme/randread/128/8k/1.txt:  read : io=56698MB, bw=967635KB/s, iops=120954, 
> runt= 60001msec
> nvme/randwrite/128/1k/1.txt:  write: io=4140.9MB, bw=70242KB/s, iops=70241, 
> runt= 60366msec [a]
> nvme/randwrite/128/2k/1.txt:  write: io=8271.5MB, bw=141161KB/s, iops=70580, 
> runt= 60002msec [a]
> nvme/randwrite/128/4k/1.txt:  write: io=16543MB, bw=281164KB/s, iops=70291, 
> runt= 60248msec
> nvme/randwrite/128/8k/1.txt:  write: io=22924MB, bw=390930KB/s, iops=48866, 
> runt= 60047msec
> 
> So, I've found two interesting things while playing with raw vs ext4:
> a) I get ~70k randwrite IOPS at 1k, 2k and 4k alike without ext4 (so,
> as expected, this was the impact of ext4's default 4kB fs block size
> you were thinking about, with fio's 1kB I/O hitting ext4's 4kB blocks)

Right. Interesting - so for randread we get a consistent +30% speedup on
the raw device at all page sizes, while for randwrite it's about 1.0x at
4K. The really puzzling thing is why the filesystem is so much slower for
smaller pages. I mean, why would writing 1K run at 1/3 the rate of
writing 4K? Why would a filesystem have such an effect?
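
If it really is the 4kB filesystem block getting in the way of sub-block
writes, then matching the filesystem block size to the I/O size should
make the penalty disappear. A minimal sketch of that check (assuming fio
with libaio and a scratch ext4 filesystem on the same NVMe device; the
device name and mount point below are placeholders, and mkfs destroys
data):

  # rebuild ext4 with 1kB blocks instead of the 4kB default
  mkfs.ext4 -b 1024 /dev/nvme0n1
  mount /dev/nvme0n1 /mnt/nvme

  # 1kB random writes through the filesystem, direct I/O, same queue depth
  fio --name=randwrite-1k --directory=/mnt/nvme --size=16g \
      --rw=randwrite --bs=1k --iodepth=128 --ioengine=libaio \
      --direct=1 --runtime=60 --time_based

If the 1kB randwrite result then lands near the ~70k IOPS seen on the raw
device, the 4kB filesystem block (and the read-modify-write it implies for
sub-block direct writes) is a plausible culprit; if not, something else is
going on.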

> b) Another thing you could include in the testing: I've noticed a
> couple of times that single-threaded fio might be the limiting factor
> (numjobs=1 by default), so I tried numjobs=2,group_reporting=1 and got
> the output below on ext4 defaults, even while dropping caches (echo 3)
> on each loop iteration -- something I cannot explain (an ext4 direct
> I/O caching effect? how is that even possible? reproduced several
> times, even with numjobs=1). The point being 206643 1kB IOPS @ ext4
> direct-io > 131783 1kB IOPS @ raw, which smells like some caching
> effect, because for randwrite it does not happen. I've triple-checked
> with iostat -x... it cannot be any internal device cache, as with
> direct I/O that doesn't happen:
> 
> [root@x libaio-ext4]# grep -r -e 'write:' -e 'read :' *
> nvme/randread/128/1k/1.txt:  read : io=12108MB, bw=206644KB/s, iops=206643, 
> runt= 60001msec [b]
> nvme/randread/128/2k/1.txt:  read : io=18821MB, bw=321210KB/s, iops=160604, 
> runt= 60001msec [b]
> nvme/randread/128/4k/1.txt:  read : io=36985MB, bw=631208KB/s, iops=157802, 
> runt= 60001msec [b]
> nvme/randread/128/8k/1.txt:  read : io=57364MB, bw=976923KB/s, iops=122115, 
> runt= 60128msec
> nvme/randwrite/128/1k/1.txt:  write: io=1036.2MB, bw=17683KB/s, iops=17683, 
> runt= 60001msec [a, as before]
> nvme/randwrite/128/2k/1.txt:  write: io=2023.2MB, bw=34528KB/s, iops=17263, 
> runt= 60001msec [a, as before]
> nvme/randwrite/128/4k/1.txt:  write: io=16667MB, bw=282977KB/s, iops=70744, 
> runt= 60311msec [reproduced benefit, as per earlier email]
> nvme/randwrite/128/8k/1.txt:  write: io=22997MB, bw=391839KB/s, iops=48979, 
> runt= 60099msec
> 

No idea what might be causing this. BTW, so you're not using direct I/O
to access the raw device? Or am I just misreading this?
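
Just to rule out the obvious, the run being described should look roughly
like this (a sketch only - assuming libaio and test files on the ext4
mount, with placeholder paths), with the cache dropped before each
iteration and iostat -x running alongside:

  # drop page cache, dentries and inodes before each iteration
  echo 3 > /proc/sys/vm/drop_caches

  # 1kB random reads on ext4, direct I/O, two jobs reporting as one
  fio --name=randread-1k --directory=/mnt/nvme --size=16g \
      --rw=randread --bs=1k --iodepth=128 --ioengine=libaio \
      --direct=1 --numjobs=2 --group_reporting \
      --runtime=60 --time_based

  # in another terminal: compare r/s and rrqm/s against fio's IOPS
  iostat -x 1

If the device-level r/s stays well below fio's reported IOPS (or rrqm/s
is high), the extra 1kB reads are being merged or absorbed somewhere
above the device rather than served from a cache.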

>>> One way or another it would be very nice to be able to select the
>>> tradeoff using initdb(1) without the need to recompile, which then
>>> begs for some initdb --calibrate /mnt/nvme (effective_io_concurrency,
>>> DB page size, ...). Do you envision any plans for this, or do we
>>> still need to gather more info on exactly why this happens? (perf
>>> reports?)
>>>
>>
>> Not sure I follow. Plans for what? Something that calibrates cost parameters?
>> That might be useful, but that's a rather separate issue from what's 
>> discussed
>> here - page size, which needs to happen before initdb (at least with how 
>> things
>> work currently).
> [..]
> 
> Sorry, I was getting too far ahead and assumed you guys were talking
> very long term.
> 

Np, I think that'd be a useful tool, but it seems more like a completely
separate discussion.
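
For context, with how things work currently the page size can't even be
chosen at initdb time - it is fixed when the server binaries are built.
Roughly (a sketch; the data directory path is a placeholder):

  # table block size is a build-time option, in kilobytes (default 8)
  ./configure --with-blocksize=16
  make && make install

  # only then does initdb lay out a cluster using that block size
  initdb -D /path/to/data

So any initdb --calibrate that also picks the page size would first need
the page size to become an initdb-time choice.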


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

