Try perhaps something along these lines: ``` SELECT t.id, i.insight_id FROM taxpayers AS t JOIN LATERAL ( SELECT x.id AS insight_id FROM insights AS x WHERE x.taxpayer_id = t.id AND x.year IS NOT NULL ORDER BY year DESC LIMIT 1 ) AS i ON true WHERE t.id IN (?, ?) ```
If you don't have millions of ? in that IN clause, then that might be faster. On Fri, Aug 30, 2024 at 1:36 PM James Brown <ja...@instrumentl.com> wrote: > Hello: > > I'm attempting to figure out whether an optimizer behavior I'm seeing is a > PostgreSQL bug or expected behavior. The scenario: > > I have two tables: one named taxpayers which has a goodish number of > columns an an integer PK id, and one named insights, which has > a taxpayer_id foreign key to taxpayers, a year, and (again) a lot of other > columns. There's an index on insights (taxpayer_id, year DESC). I'm > executing the following SQL: > > ``` > SELECT taxpayers.id, insight_id > FROM taxpayers > JOIN ( > WITH ordered_insights AS ( > SELECT taxpayer_id, id, RANK() OVER (PARTITION BY taxpayer_id > ORDER BY year DESC) AS rank > FROM insights > WHERE year IS NOT NULL > ) > SELECT taxpayer_id, id AS insight_id > FROM ordered_insights > WHERE rank = 1 > ) latest_insights ON latest_insights.taxpayer_id = taxpayers.id > WHERE taxpayers.id IN (?, ?) > ``` > > (this is simplified example; the real code has the subselect in a view so > that it can execute this kind of join from an ORM; it also joins quite a > few tables downstream after this) > > If there's only a single value in the IN clause, the EXPLAIN plan looks > great: > > Nested Loop (cost=0.86..53.30 rows=1 width=16) > -> Index Only Scan using taxpayers_pkey on taxpayers (cost=0.43..8.45 > rows=1 width=8) > Index Cond: (id = 650974) > -> Subquery Scan on ordered_insights (cost=0.43..44.83 rows=1 width=16) > Filter: (ordered_insights.rank = 1) > -> WindowAgg (cost=0.43..44.71 rows=10 width=28) > Run Condition: (rank() OVER (?) <= 1) > -> Index Scan using index_insights_on_taxpayer_id_year_desc > on insights (cost=0.43..44.53 rows=10 width=20) > Index Cond: ((taxpayer_id = 650974) AND (year IS NOT > NULL)) > (9 rows) > > However, if there are multiple rows in the IN clause, the optimizer > decides to execute the subselect against the entire giant table, and it is > not great: > > Hash Join (cost=2611586.97..2800201.15 rows=1 width=16) > Hash Cond: (ordered_insights.taxpayer_id = taxpayers.id) > -> Subquery Scan on ordered_insights (cost=2611570.10..2799818.65 > rows=28961 width=16) > Filter: (ordered_insights.rank = 1) > -> WindowAgg (cost=2611570.10..2727415.36 rows=5792263 width=28) > Run Condition: (rank() OVER (?) <= 1) > -> Sort (cost=2611570.10..2626050.76 rows=5792263 width=20) > Sort Key: insights.taxpayer_id, insights.year DESC > -> Seq Scan on insights (cost=0.00..1723354.01 > rows=5792263 width=20) > Filter: (year IS NOT NULL) > -> Hash (cost=16.85..16.85 rows=2 width=8) > -> Index Only Scan using taxpayers_pkey on taxpayers > (cost=0.43..16.85 rows=2 width=8) > Index Cond: (id = ANY ('{650974,243848}'::bigint[])) > > If I add in a second repetitive WHERE clause, it goes back to being happy, > but that feels a bit like a hack: > > # EXPLAIN SELECT taxpayers.id, insight_id > FROM taxpayers > JOIN ( > WITH ordered_insights AS ( > SELECT taxpayer_id, id, RANK() OVER (PARTITION BY taxpayer_id > ORDER BY year DESC) AS rank > FROM insights > WHERE year IS NOT NULL > ) > SELECT taxpayer_id, id AS insight_id > FROM ordered_insights > WHERE rank = 1 > ) latest_insights ON latest_insights.taxpayer_id = taxpayers.id > WHERE taxpayers.id IN (650974, 243848) AND latest_insights.taxpayer_id IN > (650974, 243848); > QUERY PLAN > > > ----------------------------------------------------------------------------------------------------------------------------- > Nested Loop (cost=0.86..110.57 rows=1 width=16) > Join Filter: (taxpayers.id = ordered_insights.taxpayer_id) > -> Subquery Scan on ordered_insights (cost=0.43..93.69 rows=1 width=16) > Filter: (ordered_insights.rank = 1) > -> WindowAgg (cost=0.43..93.42 rows=21 width=28) > Run Condition: (rank() OVER (?) <= 1) > -> Index Scan using index_insights_on_taxpayer_id_year_desc > on insights (cost=0.43..93.06 rows=21 width=20) > Index Cond: ((taxpayer_id = ANY > ('{650974,243848}'::bigint[])) AND (year IS NOT NULL)) > -> Index Only Scan using taxpayers_pkey on taxpayers (cost=0.43..16.85 > rows=2 width=8) > Index Cond: (id = ANY ('{650974,243848}'::bigint[])) > > This feels like a bug to me, but maybe I'm missing something obvious. I > don't really get why the optimizer wouldn't be able to infer the second > condition given that I'm doing a join on a non-nullable integer column (so > there's no NaN nonsense to worry about), but maybe I'm missing something > obvious. > > I've reproduced this on PostgreSQL 15.7 and 17beta3. > > Thanks for any insights y'all can provide! > > -- > James Brown >