On Thu, 2020-12-17 at 12:21 -0500, Craig McIlwee wrote: > Our application uses a queue-like table to assign tasks to users and this has > worked well for us for a few years. Now we are in the process of adding some > restrictions to which tasks a user can > work on and that is based on an attribute of each task that does not change > for the task's lifespan. Users may have access to work on one or more or > types of tasks. To improve query time when > finding the set of tasks that we assign, we are introducing partitioning into > our task queue table. When assigning tasks, we issue an update statement to > mark the tasks as reserved using a subquery > that orders the tasks by age. With the introduction of partitioning, we are > seeing that the update statement affects more rows than expected. An example > query is: > > --- > update task_parent > set reserved = true > from ( > select id > from task_parent > where reserved = false > and task_type = 1 or task_type = 2 > order by task_timestamp > limit 50 > for update skip locked) as sub > where sub.id = task_parent.id > returning task_parent.id > --- > > In the statement above, we have a subquery to limit the number of tasks to 50 > yet the update statement sometimes returns more than 50 records. I have > narrowed this down to a small, reproducible > example shown below. The first time I run the update statement I get ~65 > records, then typically ~53 the next few runs, and then it starts > consistently giving me 50 records after that. Then if I > bump the limit to 100 I will get more than 100 initially and after several > executions it starts to settle into always giving the expected 100. > > Below is the full setup that can be used to reproduce what I'm seeing. It > was initially observed on PostgreSQL 11.8 but I can also reproduce it on 13.0. > > --- > create table task_parent ( > id bigint not null, > task_type smallint not null, > reserved boolean not null, > task_timestamp timestamp not null > ) partition by list (task_type); > > create table task_child_1 > partition of task_parent for values in (1); > > create table task_child_2 > partition of task_parent for values in (2); > > insert into task_parent > select > generate_series(1, 500000), > case when random() < 0.5 then 1 else 2 end, > false, > now() - (random() * '1 day'::interval); > > create index task_parent_task_time_idx > on task_parent (task_timestamp); > > update task_parent > set reserved = true > from ( > select id > from task_parent > where reserved = false > and task_type = 1 or task_type = 2 > order by task_timestamp > limit 50 > for update skip locked) as sub > where sub.id = task_parent.id > returning task_parent.id; > --- > > A couple of interesting observations: > 1) If I remove the order by clause I always get the expected number of results > 2) If I rewrite the query to use a CTE for the task IDs instead of a subquery > then I always get the expected number of results > > At its surface, this seems like it could be a bug but maybe there is > something about this usage pattern that is known/expected to cause this > behavior. So that's the question - is this a bug that > should be reported to pgsql-bugs, or is this expected and if so, why?
Yes, this must be a bug: EXPLAIN (COSTS OFF) update task_parent set reserved = true from ( select id from task_parent where reserved = false and task_type = 1 or task_type = 2 order by task_timestamp limit 50 for update skip locked) as sub where sub.id = task_parent.id returning task_parent.id; QUERY PLAN -------------------------------------------------------------------------------------------------------------------------- Update on task_parent Update on task_child_1 task_parent_1 Update on task_child_2 task_parent_2 -> Hash Join Hash Cond: (task_parent_1.id = sub.id) -> Seq Scan on task_child_1 task_parent_1 -> Hash -> Subquery Scan on sub -> Limit -> LockRows -> Merge Append Sort Key: task_parent_3.task_timestamp -> Index Scan using task_child_1_task_timestamp_idx on task_child_1 task_parent_4 Filter: (((NOT reserved) AND (task_type = 1)) OR (task_type = 2)) -> Index Scan using task_child_2_task_timestamp_idx on task_child_2 task_parent_5 Filter: (((NOT reserved) AND (task_type = 1)) OR (task_type = 2)) -> Hash Join Hash Cond: (task_parent_2.id = sub_1.id) -> Seq Scan on task_child_2 task_parent_2 -> Hash -> Subquery Scan on sub_1 -> Limit -> LockRows -> Merge Append Sort Key: task_parent_6.task_timestamp -> Index Scan using task_child_1_task_timestamp_idx on task_child_1 task_parent_7 Filter: (((NOT reserved) AND (task_type = 1)) OR (task_type = 2)) -> Index Scan using task_child_2_task_timestamp_idx on task_child_2 task_parent_8 Filter: (((NOT reserved) AND (task_type = 1)) OR (task_type = 2)) (29 rows) The subquery is executed twice, and the two executions obviously don't return the same results. I am at a loss for an explanation ... Yours, Laurenz Albe -- Cybertec | https://www.cybertec-postgresql.com