Re: Consider parallel for lateral subqueries with limit

James Coleman Mon, 19 Sep 2022 12:59:00 -0700

On Tue, Mar 1, 2022 at 5:35 PM Tom Lane <t...@sss.pgh.pa.us> wrote:
>
> But more generally, I don't think you've addressed the fundamental
> concern, which is that a query involving Limit is potentially
> nondeterministic (if it lacks a fully-deterministic ORDER BY),
> so that different workers could get different answers from it if
> they're using a plan type that permits that to happen.  (See the
> original discussion that led to 75f9c4ca5, at [1].)  I do not see
> how a lateral dependency removes that hazard.
> ...
> > In this case we are already conceivably getting different
> > results for each execution of the subquery "Limit(Y)" even if we're
> > not running those executions across multiple workers.
>
> That seems to be about the same argument Andres made initially
> in the old thread, but we soon shot that down as not being the
> level of guarantee we want to provide.  There's nothing in the
> SQL standard that says that
>     select * from events where account in
>         (select account from events
>          where data->>'page' = 'success.html' limit 3);
> (the original problem query) shall execute the sub-query
> only once, but people expect it to act that way.


I understand that particular case being a bit of a gotcha (i.e., what
we would naturally expect exceeds what the spec says).

But in the case where there's correlation via LATERAL we already don't
guarantee unique executions for a given set of params into the lateral
subquery execution, right? For example, suppose we have:

  select *
  from foo
  left join lateral (
    select n
    from bar
    where bar.a = foo.a
    limit 1
  ) on true

and suppose that foo.a (in the table scan) returns these values in
order: 1, 2, 1. In that case we'll execute the lateral subquery 3
separate times rather than attempting to order the results of foo.a
such that we can re-execute the subquery only when the param changes
to a new unique value (and we definitely don't cache the results to
guarantee unique subquery executions).

So, assuming that we can guarantee we're talking about the proper
lateral reference (not something unrelated as you pointed out earlier)
then I don't believe we're actually changing even implicit guarantees
about how the query executes. I think that probably true both in
theory (I claim that once someone declares a lateral join they're
definitionally expecting a subquery execution per outer row) and in
practice (e.g., your hypothesis about synchronized seq scans would
apply here also to subsequent executions of the subquery).

Is there still something I'm missing about the concern you have?

> If you want to improve this area, my feeling is that it'd be
> better to look into what was speculated about in the old
> thread: LIMIT doesn't create nondeterminism if the query has
> an ORDER BY that imposes a unique row ordering, ie
> order-by-primary-key.  We didn't have planner infrastructure
> that would allow checking that cheaply in 2018, but maybe
> there is some now?

Assuming what I argued earlier holds true for lateral [at minimum when
correlated] subquery execution, then I believe your suggestion here is
orthogonal and would expand the use cases even more. For example, if
we were able to guarantee a unique result set (including order), then
we could allow parallelizing subqueries even if they're not lateral
and correlated.

James Coleman

Re: Consider parallel for lateral subqueries with limit

Reply via email to