On Thu, 23 Feb 2023 at 02:10, Ronan Dunklau <ronan.dunk...@aiven.io> wrote:
> I haven't looked too deeply into it, but it seems reasonable that the whole
> sort would cost less than individual sorts on partitions + incremental
> sorts, except when the whole sort would spill to disk much more than the
> incremental ones. I find it quite difficult to reason about what that
> threshold should be, but I managed to find a case which could fit in a test:
Thanks for coming up with that test case. It's a little disappointing to see that so many rows had to be added to get the plan to change, and I wonder if it's really worth testing this particular case. ~1800 rows is more significant than I'd hoped, and the buildfarm has a few dinosaurs that would likely see a noticeable slowdown from that.

What's on my mind now is whether turning 1 Sort into N Sorts is a good idea from a work_mem standpoint. I see that we don't do tuplesort_end() until executor shutdown, which means we could end up using 1 x work_mem per Sort node. I idly wondered if we couldn't do tuplesort_end() after spitting out the final tuple when EXEC_FLAG_REWIND is not set, but that would still mean we could use N work_mems when EXEC_FLAG_REWIND *is* set. We only have visibility of that flag during execution, so we can't make a decision at plan time based on it.

I'm not quite sure if I'm being overly concerned here or not. All it would take to get a sort per partition today is a suitable index on just 1 of the partitions, as in the sketch below. So this isn't exactly a new problem; it's just making an old problem perhaps a little more likely.

The problem also exists for things like partition-wise joins with Hash and Merge joins, though partition-wise joins are disabled by default.
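For illustration, here's a minimal sketch of that existing behaviour (the table and column names are made up):

CREATE TABLE pt (a int, b int) PARTITION BY LIST (a);
CREATE TABLE pt1 PARTITION OF pt FOR VALUES IN (1);
CREATE TABLE pt2 PARTITION OF pt FOR VALUES IN (2);
CREATE TABLE pt3 PARTITION OF pt FOR VALUES IN (3);
CREATE INDEX ON pt1 (b); -- a suitable index on just 1 partition

-- depending on costs, the planner can choose a MergeAppend that uses
-- the index for pt1 and puts a separate Sort node under each of pt2
-- and pt3, i.e. one Sort per unindexed partition
EXPLAIN (COSTS OFF) SELECT * FROM pt ORDER BY b;

David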