Try perhaps something along these lines:

```
SELECT t.id, i.insight_id
  FROM taxpayers AS t
  JOIN LATERAL (
           SELECT x.id AS insight_id
             FROM insights AS x
            WHERE x.taxpayer_id = t.id
              AND x.year IS NOT NULL
            ORDER BY year DESC
            LIMIT 1
       ) AS i ON true
 WHERE t.id IN (?, ?)
```

If you don't have millions of ? in that IN clause, then that might be
faster.

On Fri, Aug 30, 2024 at 1:36 PM James Brown <ja...@instrumentl.com> wrote:

> Hello:
>
> I'm attempting to figure out whether an optimizer behavior I'm seeing is a
> PostgreSQL bug or expected behavior. The scenario:
>
> I have two tables: one named taxpayers which has a goodish number of
> columns an an integer PK id, and one named insights, which has
> a taxpayer_id foreign key to taxpayers, a year, and (again) a lot of other
> columns. There's an index on insights (taxpayer_id, year DESC). I'm
> executing the following SQL:
>
> ```
> SELECT taxpayers.id, insight_id
>    FROM taxpayers
>    JOIN (
>        WITH ordered_insights AS (
>            SELECT taxpayer_id, id, RANK() OVER (PARTITION BY taxpayer_id
> ORDER BY year DESC) AS rank
>            FROM insights
>            WHERE year IS NOT NULL
>        )
>        SELECT taxpayer_id, id AS insight_id
>        FROM ordered_insights
>        WHERE rank = 1
>    ) latest_insights ON latest_insights.taxpayer_id = taxpayers.id
> WHERE taxpayers.id IN (?, ?)
> ```
>
> (this is simplified example; the real code has the subselect in a view so
> that it can execute this kind of join from an ORM; it also joins quite a
> few tables downstream after this)
>
> If there's only a single value in the IN clause, the EXPLAIN plan looks
> great:
>
> Nested Loop  (cost=0.86..53.30 rows=1 width=16)
>   ->  Index Only Scan using taxpayers_pkey on taxpayers  (cost=0.43..8.45
> rows=1 width=8)
>         Index Cond: (id = 650974)
>   ->  Subquery Scan on ordered_insights  (cost=0.43..44.83 rows=1 width=16)
>         Filter: (ordered_insights.rank = 1)
>         ->  WindowAgg  (cost=0.43..44.71 rows=10 width=28)
>               Run Condition: (rank() OVER (?) <= 1)
>               ->  Index Scan using index_insights_on_taxpayer_id_year_desc
> on insights  (cost=0.43..44.53 rows=10 width=20)
>                     Index Cond: ((taxpayer_id = 650974) AND (year IS NOT
> NULL))
> (9 rows)
>
> However, if there are multiple rows in the IN clause, the optimizer
> decides to execute the subselect against the entire giant table, and it is
> not great:
>
> Hash Join  (cost=2611586.97..2800201.15 rows=1 width=16)
>   Hash Cond: (ordered_insights.taxpayer_id = taxpayers.id)
>   ->  Subquery Scan on ordered_insights  (cost=2611570.10..2799818.65
> rows=28961 width=16)
>         Filter: (ordered_insights.rank = 1)
>         ->  WindowAgg  (cost=2611570.10..2727415.36 rows=5792263 width=28)
>               Run Condition: (rank() OVER (?) <= 1)
>               ->  Sort  (cost=2611570.10..2626050.76 rows=5792263 width=20)
>                     Sort Key: insights.taxpayer_id, insights.year DESC
>                     ->  Seq Scan on insights  (cost=0.00..1723354.01
> rows=5792263 width=20)
>                           Filter: (year IS NOT NULL)
>   ->  Hash  (cost=16.85..16.85 rows=2 width=8)
>         ->  Index Only Scan using taxpayers_pkey on taxpayers
>  (cost=0.43..16.85 rows=2 width=8)
>               Index Cond: (id = ANY ('{650974,243848}'::bigint[]))
>
> If I add in a second repetitive WHERE clause, it goes back to being happy,
> but that feels a bit like a hack:
>
> # EXPLAIN SELECT taxpayers.id, insight_id
>    FROM taxpayers
>    JOIN (
>        WITH ordered_insights AS (
>            SELECT taxpayer_id, id, RANK() OVER (PARTITION BY taxpayer_id
> ORDER BY year DESC) AS rank
>            FROM insights
>            WHERE year IS NOT NULL
>        )
>        SELECT taxpayer_id, id AS insight_id
>        FROM ordered_insights
>        WHERE rank = 1
>    ) latest_insights ON latest_insights.taxpayer_id = taxpayers.id
> WHERE taxpayers.id IN (650974, 243848) AND latest_insights.taxpayer_id IN
> (650974, 243848);
>                                                         QUERY PLAN
>
>
> -----------------------------------------------------------------------------------------------------------------------------
> Nested Loop  (cost=0.86..110.57 rows=1 width=16)
>   Join Filter: (taxpayers.id = ordered_insights.taxpayer_id)
>   ->  Subquery Scan on ordered_insights  (cost=0.43..93.69 rows=1 width=16)
>         Filter: (ordered_insights.rank = 1)
>         ->  WindowAgg  (cost=0.43..93.42 rows=21 width=28)
>               Run Condition: (rank() OVER (?) <= 1)
>               ->  Index Scan using index_insights_on_taxpayer_id_year_desc
> on insights  (cost=0.43..93.06 rows=21 width=20)
>                     Index Cond: ((taxpayer_id = ANY
> ('{650974,243848}'::bigint[])) AND (year IS NOT NULL))
>   ->  Index Only Scan using taxpayers_pkey on taxpayers  (cost=0.43..16.85
> rows=2 width=8)
>         Index Cond: (id = ANY ('{650974,243848}'::bigint[]))
>
> This feels like a bug to me, but maybe I'm missing something obvious. I
> don't really get why the optimizer wouldn't be able to infer the second
> condition given that I'm doing a join on a non-nullable integer column (so
> there's no NaN nonsense to worry about), but maybe I'm missing something
> obvious.
>
> I've reproduced this on PostgreSQL 15.7 and 17beta3.
>
> Thanks for any insights y'all can provide!
>
> --
> James Brown
>

Reply via email to