On Tue, 2020-05-19 at 17:12 +0200, Tomas Vondra wrote: > I think there are two related problem - with costing and with > excessive > I/O due to using logical tapes.
Thank you for the detailed analysis. I am still digesting this information. > This kinda makes me question whether logical tapes are the right tool > for hashagg. I've read the explanation in logtape.c why it's about > the > same amount of I/O as using separate files, but IMO that only really > works for I/O patters similar to merge sort - the more I think about > this, the more I'm convinced we should just do what hashjoin is > doing. Fundamentally, sort writes sequentially and reads randomly; while HashAgg writes randomly and reads sequentially. If the random writes of HashAgg end up fragmented too much on disk, then clearly the sequential reads are not so sequential anyway. The only way to avoid fragmentation on disk is to preallocate for the tape/file. BufFile (relying more on the OS) would probably do a better job of preallocating the disk space in a useful way; whereas logtape.c makes it easier to manage buffers and the overall number of files created (thereby allowing higher fanout of partitions). We have a number of possibilities here: 1. Improve costing to reflect that HashAgg is creating more random IOs than Sort. 2. Reduce the partition fanout in the hopes that the OS does a better job with readahead. 3. Switch back to BufFile, in which case we probably need to reduce the fanout for other reasons. 4. Change logtape.c to allow preallocation or to write in larger blocks. 5. Change BufFile to allow more control over buffer usage, and switch to that. #1 or #2 are the least invasive, and I think we can get a satisfactory solution by combining those. I saw good results with the high fanout and low work_mem when there is still a lot of system memory. That's a nice benefit, but perhaps it's safer to use a lower fanout (which will lead to recursion) until we get a better handle on the IO patterns. Perhaps you can try recompiling with a lower max partitions and rerun the query? How much would we have to lower it for either the cost to approach reality or the OS readahead to become effective? Regards, Jeff Davis