On 6/8/22 16:15, Jakub Wartak wrote:
> Hi, got some answers!
> 
> TL;DR: for fio it would make sense to use many stress files (instead of 1)
> and likewise set numjobs ~ number of VCPUs, to avoid various pitfalls.
> 
>>>> The really
>>>> puzzling thing is why is the filesystem so much slower for smaller
>>>> pages. I mean, why would writing 1K be 1/3 of writing 4K?
>>>> Why would a filesystem have such effect?
>>>
>>> Ha! I don't care at this point as 1 or 2kB seems too small to handle
>>> many real world scenarios ;)
> [..]
>> Independently of that, it seems like an interesting behavior and it might 
>> tell us
>> something about how to optimize for larger pages.
> 
> OK, curiosity won:
> 
> With randwrite on ext4 directio using 4kb the avgqu-sz reaches ~90-100 (close 
> to fio's 128 queue depth?) and I'm getting ~70k IOPS [with maxdepth=128]
> With randwrite on ext4 directio using 1kb the avgqu-sz is just 0.7 and I'm 
> getting just ~17-22k IOPS [with maxdepth=128] -> conclusion: something is
> being locked, preventing the queue from building up
> With randwrite on ext4 directio using 4kb the avgqu-sz reaches ~2.3 (so
> something is queued) and I'm also getting ~70k IOPS with the minimal possible
> maxdepth=4 -> conclusion: I just need to split the lock contention by 4.
> 
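BTW, just so I can try to reproduce this on my end: I assume the single-file
runs were roughly equivalent to something like this (a guess on my part; the
path, size and runtime are placeholders, and I'm mapping your "maxdepth" to
fio's iodepth):

  # path/size/runtime below are placeholders, not your actual setup
  fio --name=randwrite-1file --filename=/mnt/ext4/testfile --size=32G \
      --rw=randwrite --bs=1k --direct=1 --ioengine=libaio --iodepth=128 \
      --numjobs=1 --runtime=60 --time_based --group_reporting

with --bs=4k for the other case. Please correct me if your job differed.
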
> The 1kB (slow) profile's top function is aio_write() -> .... ->
> iov_iter_get_pages() -> internal_get_user_pages_fast(), and there are sadly
> plenty of "lock" keywords inside {related to the memory manager, padding to
> full page size, inode locking}. One can also find some articles / commits
> related to it [1], which honestly didn't leave a good feeling, as fio is
> using just 1 file (even though I'm on kernel 5.10.x). So I've switched to 4x
> files and numjobs=4 and easily got 60k IOPS; contention solved, whatever it
> was :) So I would assume PostgreSQL (with its splitting of data files on 1GB
> boundaries by default and its multiprocess architecture) should be
> relatively safe from such ext4 inode(?)/mm(?) contention even with the
> smallest 1kb block sizes on Direct I/O some day.
> 

Interesting. So what parameter values would you suggest?

FWIW some of the tests I did were on xfs, so I wonder if that might be
hitting similar/other bottlenecks.
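
If I understand the 4-files + numjobs=4 fix correctly, for my next runs I'd
try something along these lines (just a sketch of what I think you mean; the
directory, size and iodepth-per-job are placeholders):

  # directory/size/iodepth below are placeholders; each job gets its own file
  fio --name=randwrite-split --directory=/mnt/ext4 --nrfiles=1 --size=8G \
      --rw=randwrite --bs=1k --direct=1 --ioengine=libaio --iodepth=32 \
      --numjobs=4 --runtime=60 --time_based --group_reporting

i.e. each job writes to its own file, so the per-inode locking should get
spread over 4 inodes. Does that match what you ran?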

> [1] - https://www.phoronix.com/scan.php?page=news_item&px=EXT4-DIO-Faster-DBs
> 
>>> Both scenarios (raw and fs) have had direct=1 set. I just cannot understand
>>> how having direct I/O enabled (which disables caching) achieves better read
>>> IOPS on ext4 than on raw device... isn't it a contradiction?
>>>
>>
>> Thanks for the clarification. Not sure what might be causing this. Did you 
>> use the
>> same parameters (e.g. iodepth) in both cases?
> 
> Explanation: it's CPU scheduler migrations mixing up the performance results
> during the fio runs (as you have in your framework). Various VCPUs seem to
> have varying max IOPS characteristics (sic!) and the CPU scheduler seems to
> be unaware of it. This happens at least with 1kB and 4kB blocksizes. Also
> notice that some VCPUs [XXXX marker] don't reach 100% CPU yet achieve almost
> twice the result, while cores 0 and 3 do reach 100% and lack the CPU power
> to perform more. The only thing I don't get is that it doesn't make sense
> given the extended lscpu output (but maybe it's AWS Xen mixing up the real
> CPU mappings, who knows).

Uh, that's strange. I haven't seen anything like that, but I'm running
on physical HW and not AWS, so it's either that or maybe I just didn't
do the same test.
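
If it really is scheduler migrations, I guess pinning the fio jobs would make
it visible on my side as well; something like this is what I'd try (a sketch,
the CPU list and other parameters are arbitrary):

  # CPU list and path/size below are arbitrary placeholders
  fio --name=randwrite-pinned --filename=/mnt/ext4/testfile --size=32G \
      --rw=randwrite --bs=4k --direct=1 --ioengine=libaio --iodepth=32 \
      --numjobs=4 --cpus_allowed=0-3 --cpus_allowed_policy=split \
      --runtime=60 --time_based --group_reporting

with cpus_allowed_policy=split giving each job its own core, so per-core IOPS
differences (if any) should show up in the per-job results.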


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

