On Thu, 2020-12-17 at 12:21 -0500, Craig McIlwee wrote:
> Our application uses a queue-like table to assign tasks to users and this has
> worked well for us for a few years. Now we are in the process of adding some
> restrictions to which tasks a user can
> work on and that is based on an attribute of each task that does not change
> for the task's lifespan. Users may have access to work on one or more or
> types of tasks. To improve query time when
> finding the set of tasks that we assign, we are introducing partitioning into
> our task queue table. When assigning tasks, we issue an update statement to
> mark the tasks as reserved using a subquery
> that orders the tasks by age. With the introduction of partitioning, we are
> seeing that the update statement affects more rows than expected. An example
> query is:
>
> ---
> update task_parent
> set reserved = true
> from (
> select id
> from task_parent
> where reserved = false
> and task_type = 1 or task_type = 2
> order by task_timestamp
> limit 50
> for update skip locked) as sub
> where sub.id = task_parent.id
> returning task_parent.id
> ---
>
> In the statement above, we have a subquery to limit the number of tasks to 50
> yet the update statement sometimes returns more than 50 records. I have
> narrowed this down to a small, reproducible
> example shown below. The first time I run the update statement I get ~65
> records, then typically ~53 the next few runs, and then it starts
> consistently giving me 50 records after that. Then if I
> bump the limit to 100 I will get more than 100 initially and after several
> executions it starts to settle into always giving the expected 100.
>
> Below is the full setup that can be used to reproduce what I'm seeing. It
> was initially observed on PostgreSQL 11.8 but I can also reproduce it on 13.0.
>
> ---
> create table task_parent (
> id bigint not null,
> task_type smallint not null,
> reserved boolean not null,
> task_timestamp timestamp not null
> ) partition by list (task_type);
>
> create table task_child_1
> partition of task_parent for values in (1);
>
> create table task_child_2
> partition of task_parent for values in (2);
>
> insert into task_parent
> select
> generate_series(1, 500000),
> case when random() < 0.5 then 1 else 2 end,
> false,
> now() - (random() * '1 day'::interval);
>
> create index task_parent_task_time_idx
> on task_parent (task_timestamp);
>
> update task_parent
> set reserved = true
> from (
> select id
> from task_parent
> where reserved = false
> and task_type = 1 or task_type = 2
> order by task_timestamp
> limit 50
> for update skip locked) as sub
> where sub.id = task_parent.id
> returning task_parent.id;
> ---
>
> A couple of interesting observations:
> 1) If I remove the order by clause I always get the expected number of results
> 2) If I rewrite the query to use a CTE for the task IDs instead of a subquery
> then I always get the expected number of results
>
> At its surface, this seems like it could be a bug but maybe there is
> something about this usage pattern that is known/expected to cause this
> behavior. So that's the question - is this a bug that
> should be reported to pgsql-bugs, or is this expected and if so, why?
Yes, this must be a bug:
EXPLAIN (COSTS OFF) update task_parent
set reserved = true
from (
select id
from task_parent
where reserved = false
and task_type = 1 or task_type = 2
order by task_timestamp
limit 50
for update skip locked) as sub
where sub.id = task_parent.id
returning task_parent.id;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
Update on task_parent
Update on task_child_1 task_parent_1
Update on task_child_2 task_parent_2
-> Hash Join
Hash Cond: (task_parent_1.id = sub.id)
-> Seq Scan on task_child_1 task_parent_1
-> Hash
-> Subquery Scan on sub
-> Limit
-> LockRows
-> Merge Append
Sort Key: task_parent_3.task_timestamp
-> Index Scan using
task_child_1_task_timestamp_idx on task_child_1 task_parent_4
Filter: (((NOT reserved) AND
(task_type = 1)) OR (task_type = 2))
-> Index Scan using
task_child_2_task_timestamp_idx on task_child_2 task_parent_5
Filter: (((NOT reserved) AND
(task_type = 1)) OR (task_type = 2))
-> Hash Join
Hash Cond: (task_parent_2.id = sub_1.id)
-> Seq Scan on task_child_2 task_parent_2
-> Hash
-> Subquery Scan on sub_1
-> Limit
-> LockRows
-> Merge Append
Sort Key: task_parent_6.task_timestamp
-> Index Scan using
task_child_1_task_timestamp_idx on task_child_1 task_parent_7
Filter: (((NOT reserved) AND
(task_type = 1)) OR (task_type = 2))
-> Index Scan using
task_child_2_task_timestamp_idx on task_child_2 task_parent_8
Filter: (((NOT reserved) AND
(task_type = 1)) OR (task_type = 2))
(29 rows)
The subquery is executed twice, and the two executions obviously don't
return the same results. I am at a loss for an explanation ...
Yours,
Laurenz Albe
--
Cybertec | https://www.cybertec-postgresql.com