On Tue, Mar 1, 2022 at 5:35 PM Tom Lane <t...@sss.pgh.pa.us> wrote: > > But more generally, I don't think you've addressed the fundamental > concern, which is that a query involving Limit is potentially > nondeterministic (if it lacks a fully-deterministic ORDER BY), > so that different workers could get different answers from it if > they're using a plan type that permits that to happen. (See the > original discussion that led to 75f9c4ca5, at [1].) I do not see > how a lateral dependency removes that hazard. > ... > > In this case we are already conceivably getting different > > results for each execution of the subquery "Limit(Y)" even if we're > > not running those executions across multiple workers. > > That seems to be about the same argument Andres made initially > in the old thread, but we soon shot that down as not being the > level of guarantee we want to provide. There's nothing in the > SQL standard that says that > select * from events where account in > (select account from events > where data->>'page' = 'success.html' limit 3); > (the original problem query) shall execute the sub-query > only once, but people expect it to act that way.
I understand that particular case being a bit of a gotcha (i.e., what we would naturally expect exceeds what the spec says). But in the case where there's correlation via LATERAL we already don't guarantee unique executions for a given set of params into the lateral subquery execution, right? For example, suppose we have: select * from foo left join lateral ( select n from bar where bar.a = foo.a limit 1 ) on true and suppose that foo.a (in the table scan) returns these values in order: 1, 2, 1. In that case we'll execute the lateral subquery 3 separate times rather than attempting to order the results of foo.a such that we can re-execute the subquery only when the param changes to a new unique value (and we definitely don't cache the results to guarantee unique subquery executions). So, assuming that we can guarantee we're talking about the proper lateral reference (not something unrelated as you pointed out earlier) then I don't believe we're actually changing even implicit guarantees about how the query executes. I think that probably true both in theory (I claim that once someone declares a lateral join they're definitionally expecting a subquery execution per outer row) and in practice (e.g., your hypothesis about synchronized seq scans would apply here also to subsequent executions of the subquery). Is there still something I'm missing about the concern you have? > If you want to improve this area, my feeling is that it'd be > better to look into what was speculated about in the old > thread: LIMIT doesn't create nondeterminism if the query has > an ORDER BY that imposes a unique row ordering, ie > order-by-primary-key. We didn't have planner infrastructure > that would allow checking that cheaply in 2018, but maybe > there is some now? Assuming what I argued earlier holds true for lateral [at minimum when correlated] subquery execution, then I believe your suggestion here is orthogonal and would expand the use cases even more. For example, if we were able to guarantee a unique result set (including order), then we could allow parallelizing subqueries even if they're not lateral and correlated. James Coleman