On 6/8/22 16:15, Jakub Wartak wrote:
> Hi, got some answers!
>
> TL;DR for fio it would make sense to use many stress files (instead of 1)
> and the same for numjobs ~ VCPUs, to avoid various pitfalls.
>
>>>> The really
>>>> puzzling thing is why the filesystem is so much slower for smaller
>>>> pages. I mean, why would writing 1K be 1/3 of writing 4K?
>>>> Why would a filesystem have such an effect?
>>>
>>> Ha! I don't care at this point as 1 or 2kB seems too small to handle
>>> many real world scenarios ;)
> [..]
>> Independently of that, it seems like an interesting behavior and it might
>> tell us something about how to optimize for larger pages.
>
> OK, curiosity won:
>
> With randwrite on ext4 directio using 4kB, the avgqu-sz reaches ~90-100
> (close to fio's 128 queue depth?) and I'm getting ~70k IOPS [with
> maxdepth=128].
> With randwrite on ext4 directio using 1kB, the avgqu-sz is just 0.7 and I'm
> getting just ~17-22k IOPS [with maxdepth=128] -> conclusion: something is
> being locked, preventing the queue from building up.
> With randwrite on ext4 directio using 4kB, the avgqu-sz reaches ~2.3 (so
> something is queued) and I'm also getting ~70k IOPS with the minimal
> possible maxdepth=4 -> conclusion: I just need to split the lock contention
> by 4.
>
> The 1kB (slow) profile's top function is aio_write() -> .... ->
> iov_iter_get_pages() -> internal_get_user_pages_fast(), and there's sadly
> plenty of "lock" keywords inside {related to the memory manager, padding to
> full page size, inode locking}. One can also find some articles / commits
> related to it [1], which didn't leave a good impression to be honest, as fio
> was using just 1 file (even though I'm on kernel 5.10.x). So I switched to
> 4 files and numjobs=4 and easily got 60k IOPS, contention solved, whatever
> it was :) So I would assume PostgreSQL (with its splitting of data files on
> 1GB boundaries by default and its multiprocess architecture) should be
> relatively safe from such ext4 inode(?)/mm(?) contention even with the
> smallest 1kB block sizes under direct I/O some day.

Interesting. So what parameter values would you suggest?

FWIW some of the tests I did were on xfs, so I wonder if that might be
hitting similar/other bottlenecks.
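Just to make sure I'm picturing the same setup (several files, one per job,
and numjobs matching vCPUs), I'd guess the invocation looks roughly like the
sketch below; the directory, size and runtime are placeholders of mine, not
your actual values:

    # rough sketch of the multi-file / multi-job variant; each job writes
    # its own file, so numjobs=4 means 4 files; directory, size and runtime
    # are guesses, not the values actually used in the tests
    fio --name=randwrite-1k --directory=/mnt/ext4 \
        --rw=randwrite --bs=1k --direct=1 \
        --ioengine=libaio --iodepth=128 \
        --numjobs=4 --size=4g \
        --time_based --runtime=60 --group_reporting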
> [1] - https://www.phoronix.com/scan.php?page=news_item&px=EXT4-DIO-Faster-DBs
>
>>> Both scenarios (raw and fs) had direct=1 set. I just cannot understand
>>> how having direct I/O enabled (which disables caching) achieves better
>>> read IOPS on ext4 than on the raw device... isn't that a contradiction?
>>
>> Thanks for the clarification. Not sure what might be causing this. Did you
>> use the same parameters (e.g. iodepth) in both cases?
>
> Explanation: it's the CPU scheduler migrations mixing up the performance
> results during the fio runs (as you have in your framework). Various VCPUs
> seem to have varying max IOPS characteristics (sic!) and the CPU scheduler
> seems to be unaware of it. This happens at least with 1kB and 4kB block
> sizes. Also notice that some VCPUs [XXXX marker] don't reach 100% CPU yet
> achieve almost twice the result, while cores 0 and 3 do reach 100% and lack
> the CPU power to perform more. The only thing I don't get is that it doesn't
> make sense from the extended lscpu output (but maybe it's AWS Xen mixing up
> the real CPU mappings, who knows).

Uh, that's strange. I haven't seen anything like that, but I'm running on
physical HW and not AWS, so it's either that, or maybe I just didn't run the
same test.


regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company