postgres chooses objectively wrong index

Merlin Moncure Tue, 17 Mar 2026 14:01:48 -0700

I've been maintaining an airflow style orchestrator in pl/pgsql, and it's
revealed a performance issue I just can't solve.  There is a table, task,
which may normally contain billions of rows, but only a tiny portion is
interesting for specific reasons—a common pattern in task-type systems.


CREATE TABLE async.task
(
  task_id BIGSERIAL PRIMARY KEY,
  target TEXT REFERENCES async.target ON UPDATE CASCADE ON DELETE CASCADE,
  priority INT DEFAULT 0,
  entered TIMESTAMPTZ DEFAULT clock_timestamp(),
  consumed TIMESTAMPTZ,
  processed TIMESTAMPTZ,
  yielded TIMESTAMPTZ,
  times_up TIMESTAMPTZ,
  concurrency_pool TEXT
);

CREATE OR REPLACE FUNCTION async.task_execution_state(t async.task)
  RETURNS async.task_execution_state_t AS
$$
  SELECT
    CASE
      WHEN t.processed IS NOT NULL THEN 'FINISHED'
      WHEN t.consumed IS NULL AND t.yielded IS NULL THEN 'READY'
      WHEN t.yielded IS NOT NULL THEN 'YIELDED'
      WHEN t.consumed IS NOT NULL AND t.yielded IS NULL THEN 'RUNNING'
    END::async.task_execution_state_t;
$$ LANGUAGE SQL IMMUTABLE;

"processed NOT NULL" defines the 'needle', let's say typically <0.01%.   Of
those cases, a few patterns need defense from a performance standpoint.
Naturally, partial indexes are used because we don't want to index the
entire table.

/* supports fetching eligible tasks */
CREATE INDEX ON async.task(concurrency_pool, priority, entered)
WHERE async.task_execution_state(task) = 'READY';

/* look up expired tasks.  Times up qual is to prevent index being used for
 * any other purpose.
 */
CREATE INDEX ON async.task(times_up)
  WHERE
    async.task_execution_state(task) IN('READY', 'RUNNING', 'YIELDED')
    AND times_up IS NOT NULL;

/* supports cleaning up dead tasks on startup and other needs for
 * processing unfinished tasks.
 */
CREATE INDEX ON async.task(task_id)
  WHERE async.task_execution_state(task) IN('READY', 'RUNNING', 'YIELDED');

These indexes support queries called in a tight loop, for example:

SELECT *
FRROM async.task
WHERE
  async.task_execution_state(task.*) =
'READY'::async.task_execution_state_t
  AND concurrency_pool = 'xyz'
ORDER BY priority, entered
LIMIT 10;

Usually, we get a plan that looks like this:

 Limit  (cost=0.38..39.74 rows=10 width=563) (actual time=0.054..0.054
rows=0 loops=1)
   ->  Index Scan using task_concurrency_pool_priority_entered_idx on task
 (cost=0.38..705.08 rows=179 width=563) (actual time=0.053..0.053 rows=0
loops=1)
         Index Cond: (concurrency_pool = 'xyz'::text)
 Planning Time: 0.234 ms
 Execution Time: 0.072 ms

Let's note that the partial index predicate exactly matches the where
clause, and that the index from left to right matches in terms of equality
and ordering.  No sorting is required, and the results are excellent.  The
final costing here is IMNSHO very high: 39.74, and I believe that is the
fundamental issue.

Sometimes, based on a certain data distribution, we get results like this:

Limit  (cost=25.75..25.78 rows=10 width=563) (actual time=8.909..8.911
rows=0 loops=1)
  ->  Sort  (cost=25.75..26.20 rows=179 width=563) (actual
time=8.908..8.909 rows=0 loops=1)
        Sort Key: priority, entered
        Sort Method: quicksort  Memory: 25kB
        ->  Bitmap Heap Scan on task  (cost=9.10..21.89 rows=179 width=563)
(actual time=8.902..8.903 rows=0 loops=1)
              Recheck Cond: ((async.task_execution_state(task.*) = ANY
('{READY,RUNNING,YIELDED}'::async.task_execution_state_t[])) AND
(concurrency_pool = 'xyz'::text) AND (async.task_execution_state(task.*) =
'READY'::async.task_execution_state_t))
              ->  BitmapAnd  (cost=9.10..9.10 rows=3 width=0) (actual
time=8.883..8.883 rows=0 loops=1)
                    ->  Bitmap Index Scan on task_task_id_idx
 (cost=0.00..4.38 rows=575191 width=0) (actual time=8.828..8.828 rows=16
loops=1)
                    ->  Bitmap Index Scan on
task_concurrency_pool_priority_entered_idx  (cost=0.00..4.38 rows=179
width=0) (actual time=0.053..0.053 rows=0 loops=1)
                          Index Cond: (concurrency_pool = 'xyz'::text)
Planning Time: 0.262 ms
Execution Time: 8.946 ms

In this case, we get an explicit sort and other unnecessary work for a 100x
degradation in runtime.  My basic issue is that I do not believe any data
distribution that allows plan #2 to beat plan #1, given the more specific
predicate and index order matching result order.  I suspect this is a very
long standing issue concerning insufficient weight given to partial
indexes, predicate matching, and possibly index-supported sorting.  I've
dealt with some variant of this problem for many years.

Sometimes, there can be even worse plans, running into 10-20 seconds, for a
~ 10 order of magnitude miss.  I can manage this at the query level by:
* turning off various planner directives, heap scan, etc
* adding faux columns to the table to support forcing index selection in
particular cases (CREATE INDEX ON foo WHERE this_case IS NULL....SELECT *
FROM foo WHERE this_case IS NULL...)

I'm wondering if there are other tricks that might apply here, for example,
multi column index statistics...curious if anyone has thoughts on that.

Any suggestions?

merlin

postgres chooses objectively wrong index

Reply via email to