On Fri, May 17, 2019 at 4:39 AM Tomas Vondra <tomas.von...@2ndquadrant.com> wrote:
> I think this is a step in the right direction, but as I said on the other
> thread(s), I think we should not disable growth forever and recheck once
> in a while. Otherwise we'll end up in a sad situation with non-uniform data
> sets, as pointed out by Hubert Zhang in [1]. It's probably even truer with
> this less strict logic, using 95% as a threshold (instead of 100%).
>
> I kinda like the idea of increasing the spaceAllowed value. Essentially,
> if we decide adding batches would be pointless, increasing the memory
> budget is the only thing we can do anyway.
But that's not OK, we need to fix THAT.

> The problem however is that we only really look at a single bit - it may
> be that doubling the batches would not help, but doing it twice would
> actually reduce the memory usage. For example, assume there are 2 distinct
> values in the batch, with hash values (in binary)

Yes, that's a good point, and not a case that we should ignore.  But if
we had a decent fall-back strategy that respected work_mem, we wouldn't
care so much if we get it wrong in a corner case.  I'm arguing that we
should use Grace partitioning as our primary partitioning strategy, but
fall back to looping (or possibly sort-merging) for the current batch if
Grace doesn't seem to be working.

You'll always be able to find cases where, if you'd just tried one more
round, Grace would have worked, but that seems acceptable to me, because
getting it wrong doesn't melt your computer, it just probably takes
longer.  Or maybe it doesn't.  How much longer would it take to loop
twice?  Erm, twice as long, and each loop makes actual progress, unlike
extra speculative Grace partition expansions, which apply not just to
the current batch but to all batches, might not actually work, and which
you *have* to abandon at some point.

The more I think about it, the more I think that a loop-based escape
valve (toy sketch at the end of this mail), though unpalatably
quadratic, is probably OK, because we're in a sink-or-swim situation at
this point, and our budget is work_mem, not work_time.  I'm concerned
that we're trying to find ways to treat the symptoms, allowing us to
exceed work_mem but maybe not by so much, instead of focusing on the
fundamental problem, which is that we don't yet have an algorithm that
is guaranteed to respect work_mem.

Admittedly I don't have a patch, just a bunch of handwaving.  One reason
I haven't attempted to write it is that, although I know how to do the
non-parallel version using a BufFile full of match bits kept in sync
with the tuples for outer joins, I haven't figured out how to do it for
parallel-aware hash join, because then each loop over the outer batch
could see different tuples in each participant.  You could use the match
bit in the HashJoinTuple header instead, but then you'd have to write
all the tuples out again, which is more IO than I want to do.  I'll
probably start another thread about that.

--
Thomas Munro
https://enterprisedb.com
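
PS. To make the looping idea a bit more concrete, here's a toy,
self-contained C sketch (purely illustrative, not executor code; the
table sizes, the CHUNK_MAX constant and all names are invented for the
example).  The oversized inner batch is hashed one chunk at a time, the
whole outer batch is rescanned once per chunk, and a per-outer-tuple
match flag survives across loops, standing in for the BufFile of match
bits mentioned above, so that unmatched outer tuples can be emitted
correctly after the final chunk:

/*
 * Toy sketch of the loop-based fallback for one oversized hash join
 * batch (illustrative only; nothing here is PostgreSQL code).  The
 * inner batch is hashed one CHUNK_MAX-sized chunk at a time, the whole
 * outer batch is rescanned per chunk, and outer_matched[] remembers
 * matches across loops so LEFT OUTER rows come out right at the end.
 */
#include <stdio.h>
#include <stdbool.h>
#include <string.h>

#define NBUCKETS  8     /* tiny hash table, for illustration only */
#define CHUNK_MAX 3     /* pretend work_mem holds only 3 inner tuples */
#define NOUTER    4

typedef struct InnerTuple
{
    int     key;
    int     payload;
    int     next;       /* bucket chain within the current chunk */
} InnerTuple;

/*
 * The inner batch has so few distinct keys that adding Grace batches
 * (splitting on more hash bits) would never make it fit in work_mem.
 */
static InnerTuple inner[] = {
    {1, 100, -1}, {1, 101, -1}, {1, 102, -1},
    {1, 103, -1}, {2, 200, -1}, {2, 201, -1},
};
static int outer_keys[NOUTER] = {1, 2, 3, 4};

int
main(void)
{
    int     ninner = sizeof(inner) / sizeof(inner[0]);
    bool    outer_matched[NOUTER] = {false};

    for (int start = 0; start < ninner; start += CHUNK_MAX)
    {
        int     end = start + CHUNK_MAX;
        int     buckets[NBUCKETS];

        if (end > ninner)
            end = ninner;

        /* Build a hash table over just this chunk of the inner batch. */
        memset(buckets, -1, sizeof(buckets));
        for (int i = start; i < end; i++)
        {
            int     b = inner[i].key % NBUCKETS;

            inner[i].next = buckets[b];
            buckets[b] = i;
        }

        /* Probe with the entire outer batch (the quadratic part). */
        for (int o = 0; o < NOUTER; o++)
        {
            int     b = outer_keys[o] % NBUCKETS;

            for (int i = buckets[b]; i >= 0; i = inner[i].next)
            {
                if (inner[i].key == outer_keys[o])
                {
                    printf("join: outer %d, inner payload %d\n",
                           outer_keys[o], inner[i].payload);
                    outer_matched[o] = true;    /* survives across chunks */
                }
            }
        }
    }

    /* LEFT OUTER: only after the last loop do we know what never matched. */
    for (int o = 0; o < NOUTER; o++)
    {
        if (!outer_matched[o])
            printf("join: outer %d, inner NULL\n", outer_keys[o]);
    }

    return 0;
}

In the parallel-aware case it's exactly the equivalent of
outer_matched[] that I don't know how to do cheaply, since each
participant could see a different subset of the outer batch on each
loop.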