Thanks, both! That's a very interesting thread. I was confident this was a subject that had been discussed--just wasn't sure where--so thank you for forwarding.
I guess the big-picture summary is that NOT IN's definition introduces complexity (the nature of which I now understand better) that is usually unwarranted by the question the querier is asking. So NOT EXISTS will almost always be preferable when a subquery is involved, unless the behavior around NULL values is specifically desired.

On Fri, Nov 9, 2018 at 8:45 AM Merlin Moncure <mmonc...@gmail.com> wrote:
> On Thu, Nov 8, 2018 at 3:12 PM David Rowley
> <david.row...@2ndquadrant.com> wrote:
> >
> > On 9 November 2018 at 08:35, Lincoln Swaine-Moore
> > <lswainemo...@gmail.com> wrote:
> > > My primary question is: why is this approach only possible (for data too
> > > large for memory) when using NOT EXISTS, and not when using NOT IN?
> > >
> > > I understand that there is a slight difference in the meaning of the two
> > > expressions, in that NOT IN will produce NULL if there are any NULL values
> > > in the right hand side (in this case there are none, and the queries should
> > > return the same COUNT). But if anything, I would expect that to improve
> > > performance of the NOT IN operation, since a single pass through that data
> > > should reveal if there are any NULL values, at which point that information
> > > could be used to short-circuit. So I am a bit baffled.
> >
> > The problem is that the planner makes the plan and would have to know
> > beforehand that no NULLs could exist on either side of the join.
>
> Yeah, the core issue is the SQL rules that define NOT IN behaves as:
>
> postgres=# select 1 not in (select 2);
>  ?column?
> ──────────
>  t
> (1 row)
>
> postgres=# select 1 not in (select 2 union all select null);
>  ?column?
> ──────────
>
> (1 row)
>
> There's a certain logic to it but it's a death sentence for performance.
>
> merlin

--
Lincoln Swaine-Moore
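For anyone following along, the NULL behavior Merlin demonstrates in psql can be reproduced end to end with any engine that implements standard three-valued logic. Here is a minimal sketch using Python's sqlite3; the tables a and b and their contents are invented for illustration:

```python
# Demonstrates why NOT IN and NOT EXISTS diverge when the subquery
# side contains a NULL. Uses sqlite3 purely as a convenient stand-in.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE a (x INTEGER)")
cur.execute("CREATE TABLE b (x INTEGER)")
cur.executemany("INSERT INTO a VALUES (?)", [(1,), (2,)])
cur.executemany("INSERT INTO b VALUES (?)", [(2,), (None,)])

# NOT IN: "x NOT IN (2, NULL)" expands to x <> 2 AND x <> NULL;
# the second comparison is NULL, so the whole predicate can never
# be true, and no rows qualify.
not_in = cur.execute(
    "SELECT count(*) FROM a WHERE x NOT IN (SELECT x FROM b)"
).fetchone()[0]

# NOT EXISTS: the NULL row simply never satisfies b.x = a.x, so
# the row with x = 1 still counts as absent from b.
not_exists = cur.execute(
    "SELECT count(*) FROM a WHERE NOT EXISTS "
    "(SELECT 1 FROM b WHERE b.x = a.x)"
).fetchone()[0]

print(not_in, not_exists)  # prints "0 1"
```

Dropping the NULL row from b makes both counts agree, which is exactly why the planner would need to prove no NULLs exist before treating the two forms as equivalent.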