On Tue, May 07, 2019 at 04:28:36PM +1200, Thomas Munro wrote:
>On Tue, May 7, 2019 at 3:15 PM Tomas Vondra
><tomas.von...@2ndquadrant.com> wrote:
>> On Tue, May 07, 2019 at 01:48:40PM +1200, Thomas Munro wrote:
>> >Seems expensive for large numbers of slices -- you need to join the
>> >outer batch against each inner slice.
>>
>> Nope, that's not how it works. It's the array of batches that gets
>> sliced, not the batches themselves.
>Sorry, I read only the description and not the code, and got confused
>about that. So, I see three separate but related problems:
>
>A. Broken escape valve: sometimes we generate a huge number of
>batches while trying to split up many duplicates, because of the
>presence of other more uniformly distributed keys. We could fix that
>with (say) a 95% rule.
>
>B. Lack of good alternative execution strategy when the escape valve
>is triggered. A batch cannot be split effectively, but cannot fit in
>work_mem, so for now we decide to ignore work_mem.
>
>C. Unmetered explosion of batches and thus BufFiles, probably usually
>caused by problem A, but theoretically also due to a real need for
>partitions.
>
Right. I don't think a single solution addressing all those issues exists.
It's more likely we need multiple improvements.
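To make the "95% rule" for problem A concrete, here's a rough sketch of
the check I'd imagine running after each batch split (all names are
hypothetical, this is not the actual nodeHash.c logic):

#include <stdbool.h>

#define SKEW_DISABLE_THRESHOLD 0.95    /* relaxed from the current 0%/100% rule */

/*
 * After splitting a batch in two, decide whether repartitioning actually
 * helped. If (say) 95% of the tuples ended up on one side, the batch is
 * dominated by duplicate keys and further splits are futile, so nbatch
 * growth should be disabled.
 */
static bool
split_was_effective(long ntuples_kept, long ntuples_moved)
{
    long    total = ntuples_kept + ntuples_moved;
    long    larger = (ntuples_kept > ntuples_moved) ? ntuples_kept
                                                    : ntuples_moved;

    if (total == 0)
        return true;            /* nothing to judge yet */

    return ((double) larger / total) < SKEW_DISABLE_THRESHOLD;
}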
>> >But I wonder how we'd deal with outer joins, as Tom Lane asked in
>> >another thread:
>> >
>> >https://www.postgresql.org/message-id/12185.1488932980%40sss.pgh.pa.us
>>
>> That seems unrelated - we slice the array of batches, to keep the memory
>> needed for BufFiles under control. The hash table remains intact, so
>> there's no issue with outer joins.
>Right, sorry, my confusion. I thought you were describing
>https://en.wikipedia.org/wiki/Block_nested_loop. (I actually think we
>can make that work for left outer joins without too much fuss by
>writing out a stream of match bits to a new temporary file. Googling,
>I see that MySQL originally didn't support BNL for outer joins and
>then added some match flag propagation thing recently.)
>
Possibly. I'm not against implementing that, although I don't have a very
good idea what the benefits of BNL joins are (performance-wise). In any
case, I think it's entirely unrelated to hash joins.
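That said, the match-bit trick for left outer joins seems simple enough
in principle. A self-contained toy sketch of BNL with match bits (all
names and data made up; a real implementation would keep the bits in a
temporary file, not an in-memory array):

#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define NOUTER 8
#define NINNER 6
#define CHUNK  3                /* inner tuples that "fit in work_mem" */

int
main(void)
{
    int     outer[NOUTER] = {1, 2, 3, 4, 5, 6, 7, 8};
    int     inner[NINNER] = {2, 4, 4, 6, 8, 10};
    bool    matched[NOUTER];    /* stand-in for the match-bit temp file */

    memset(matched, 0, sizeof(matched));

    /* one pass over the outer side per inner chunk */
    for (int lo = 0; lo < NINNER; lo += CHUNK)
    {
        int     hi = (lo + CHUNK < NINNER) ? lo + CHUNK : NINNER;

        for (int o = 0; o < NOUTER; o++)
            for (int i = lo; i < hi; i++)
                if (outer[o] == inner[i])
                {
                    printf("match: %d\n", outer[o]);
                    matched[o] = true;  /* OR the match bit across chunks */
                }
    }

    /* after the last chunk, NULL-extend outer tuples that never matched */
    for (int o = 0; o < NOUTER; o++)
        if (!matched[o])
            printf("outer %d -> NULL\n", outer[o]);

    return 0;
}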
>> I agree we should relax the 0%/100% split condition, and disable the
>> growth sooner. But I think we should also re-evaluate that decision
>> after a while - the data set may be correlated in some way, in which
>> case we may disable the growth prematurely. It may not reduce memory
>> usage now, but it may help in the future.
>>
>> It's already an issue, but it would be even more likely if we disabled
>> growth e.g. with just 5%/95% splits.
>>
>> FWIW I believe this is a mostly orthogonal issue to what's discussed in
>> this thread.
>But isn't problem A the root cause of problem C, in most cases? There
>must also be "genuine" cases of problem C that would occur even if we
>fix that, of course: someone has small work_mem, and data that can be
>effectively partitioned to fit it, but it just takes a huge number of
>partitions to do it. So that we don't behave badly in those cases, I
>agree with you 100%: we should fix the memory accounting to count
>BufFile overheads as you are proposing, and then I guess ideally
>switch to our alternative strategy (BNL or sort-merge or ...) when we
>see that BufFiles are wasting too much work_mem and it's time to try
>something else. It seems you don't actually have one of those cases
>here, though?
>
Maybe. Or maybe not. I don't have enough data to make such judgements
about the causes in general. We have one query from pgsql-performance.
There might be more, but IMO that's probably a biased data set.

But even that reported query is actually not a case of A causing C.
The outer side of the hash join was significantly underestimated (34619
rows estimated vs. 113478127 actual) due to highly-correlated conditions.

And in that case it's trivial to cause an nbatch explosion even with
perfect data sets with no duplicates (so no escape valve failure).
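And the arithmetic of why unmetered BufFiles are so painful is simple:
each open BufFile keeps one BLCKSZ (8 kB) buffer, and the join needs one
file per batch on each side. A toy calculation (the loop bounds are
arbitrary, names made up):

#include <stdio.h>

#define BUFFILE_BUFFER_SIZE 8192    /* BufFile keeps one BLCKSZ buffer per file */

int
main(void)
{
    long long   work_mem = 4LL * 1024 * 1024;   /* 4MB, the default */

    for (long long nbatch = 32; nbatch <= 262144; nbatch *= 8)
    {
        /* one BufFile per batch on each side (inner + outer) */
        long long   overhead = 2 * nbatch * BUFFILE_BUFFER_SIZE;

        printf("nbatch = %7lld  BufFile overhead = %8lld kB  (work_mem = %lld kB)\n",
               nbatch, overhead / 1024, work_mem / 1024);
    }
    return 0;
}

With the default 4MB work_mem, a quarter-million batches already mean
~4GB in file buffers alone - three orders of magnitude over the limit.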
>I think we should fix problem A. Then handle problem C by accounting
>for BufFiles, and figure out a way to switch to our alternative
>strategy (currently: ignore work_mem), when we think that creating
>more BufFiles will be futile (not sure exactly what the rule there
>should be). And then work on fixing B properly with a good strategy.
>
>Here's a straw-man idea: we could adopt BNL, and then entirely remove
>our repartitioning code. If the planner's number of partitions turns
>out to be not enough, we'll just handle it using BNL loops.
>
Yeah, something like that.
I think we can fix A by relaxing the escape valve condition, and then
rechecking it once in a while. So we fill work_mem, realize it didn't
actually reduce the batch size significantly, and disable nbatch growth.
But at the same time we increase the threshold to 2x work_mem, and after
reaching it we "consider" an nbatch increase. That is, we walk the batch
and see how many tuples would move if we increased nbatch (that should be
fairly cheap) - if it helps, great, enable growth and split the batch. If
not, double the threshold again. Rinse and repeat.
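The "how many tuples would move" check only needs the hash values we
already store with the tuples. A sketch (hypothetical struct; for
simplicity this assumes batch numbers come straight from the low bits of
the hash, which is not exactly what ExecHashGetBucketAndBatch does):

#include <stdbool.h>
#include <stdint.h>

typedef struct SketchTuple      /* stand-in for a stored hash join tuple */
{
    uint32_t    hashvalue;
} SketchTuple;

/*
 * Walk the in-memory tuples of the current batch and count how many
 * would move if nbatch doubled: with low-bit batch numbers, a tuple
 * moves iff the next (so far unused) hash bit is set.
 */
static bool
batch_split_would_help(const SketchTuple *tuples, int ntuples, int log2_nbatch)
{
    int     nmove = 0;

    for (int i = 0; i < ntuples; i++)
    {
        /* the bit that decides old vs. new batch after doubling nbatch */
        if (tuples[i].hashvalue & ((uint32_t) 1 << log2_nbatch))
            nmove++;
    }

    /* re-enable growth only if the split moves (and keeps) enough tuples */
    return (nmove >= ntuples / 20) && (ntuples - nmove >= ntuples / 20);
}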
For C, I think we can use either of the two approaches I proposed. I like
the second option better, as it actually enforces work_mem. The first
option kinda helped with A too, although in a different way, and I think
the solution I outlined in the previous paragraph will work better.
No opinion regarding the switch to BNL, at the moment.
Switching to some other algorithm during execution moves the goal posts
to the next galaxy, I'm afraid.
>The main problem I'm aware of with sort-merge join is: not all that is
>hashable is sortable. So BNL is actually the only solution I'm aware
>of for problem B that doesn't involve changing a fundamental thing
>about PostgreSQL's data type requirements.
>
Sure, each of those algorithms has limitations. But I think that's mostly
irrelevant to the main issue - switching between algorithms mid-execution.
At that point some of the tuples might have already been sent to the
other nodes, and I have no idea how to "resume" the tuple stream short of
buffering everything locally until the join completes. And that would be
rather terrible, I guess.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services