On Tue, May 26, 2020 at 10:59 AM Tomas Vondra <tomas.von...@2ndquadrant.com> wrote:
> On Mon, May 25, 2020 at 12:49:45PM -0700, Jeff Davis wrote:
> >Do you think the difference in IO patterns is due to a difference in
> >handling reads vs. writes in the kernel? Or do you think that 128
> >blocks is not enough to amortize the cost of a seek for that device?
>
> I don't know. I kinda imagined it was due to the workers interfering
> with each other, but that should affect the sort the same way, right?
> I don't have any data to support this, at the moment - I can repeat
> the iosnoop tests and analyze the data, of course.

About the reads vs writes question: I know that reading and writing
two interleaved sequential "streams" through the same fd confuses the
read-ahead/write-behind heuristics on FreeBSD UFS (I mean: w(1),
r(42), w(2), r(43), w(3), r(44), ...) so the performance is terrible
on spinning media. Andrew Gierth reported that as a problem for
sequential scans that are also writing back hint bits, and vacuum.
However, in a quick test on a Linux 4.19 XFS system, using a program
to generate interleaving read and write streams 1MB apart, I could see
that it was still happily generating larger clustered I/Os. I have no
clue for other operating systems. That said, even on Linux, reads and
writes still have to compete for scant IOPS on slow-seek media (albeit
hopefully in larger clustered I/Os)...

Jumping over large interleaving chunks with no prefetching from other
tapes *must* produce stalls though... and if you crank up the read
ahead size to be a decent percentage of the contiguous chunk size, I
guess you must also waste I/O bandwidth on unwanted data past the end
of each chunk, no?

In an off-list chat with Jeff about whether Hash Join should use
logtape.c for its partitions too, the first thought I had was that to
be competitive with separate files, perhaps you'd need to write out a
list of block ranges for each tape (rather than just next pointers on
each block), so that you have the visibility required to control
prefetching explicitly. I guess that would be a bit like the list of
physical extents that Linux commands like filefrag(8) and xfs_bmap(8)
can show you for regular files. (Other thoughts included worrying
about how to make it allocate and stream blocks in parallel queries,
...!?#$)
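
To make the block-range idea a bit more concrete, here is a rough
Python sketch. It is not logtape.c code, and every name in it
(blocks_to_extents, prefetch_requests, max_blocks, the example block
numbers) is made up for illustration. It assumes we already know a
tape's block numbers in read order, collapses them into (start,
length) extents, and then cuts prefetch requests that never run past
the end of an extent, which is the wasted-bandwidth problem mentioned
above.

```python
from typing import Iterable, Iterator, List, Tuple

# (first block number, number of contiguous blocks); a hypothetical
# representation, not what logtape.c stores today.
Extent = Tuple[int, int]


def blocks_to_extents(blocks: Iterable[int]) -> List[Extent]:
    """Collapse a tape's block numbers, in read order, into contiguous runs."""
    extents: List[Extent] = []
    for blkno in blocks:
        if extents and blkno == extents[-1][0] + extents[-1][1]:
            # Block directly follows the current run: extend it.
            start, length = extents[-1]
            extents[-1] = (start, length + 1)
        else:
            # Gap (or first block): start a new run.
            extents.append((blkno, 1))
    return extents


def prefetch_requests(extents: List[Extent], max_blocks: int) -> Iterator[Extent]:
    """Yield (start, nblocks) requests that never cross an extent boundary."""
    if max_blocks < 1:
        raise ValueError("max_blocks must be at least 1")
    for start, length in extents:
        offset = 0
        while offset < length:
            n = min(max_blocks, length - offset)
            yield (start + offset, n)
            offset += n


# Hypothetical tape whose blocks landed in two separate chunks of the
# shared temp file.
tape_blocks = [0, 1, 2, 3, 128, 129, 130]
tape_extents = blocks_to_extents(tape_blocks)
requests = list(prefetch_requests(tape_extents, max_blocks=3))
```

With a list like that in hand, each request could presumably be turned
into something like a posix_fadvise(POSIX_FADV_WILLNEED) call for the
matching byte range, issued ahead of the reader, rather than relying
on kernel read-ahead heuristics to guess where the next chunk of the
tape begins. Whether that would actually be competitive with separate
files is exactly the open question.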