On Thu, 2020-12-17 at 12:21 -0500, Craig McIlwee wrote:
> Our application uses a queue-like table to assign tasks to users and this has 
> worked well for us for a few years.  Now we are in the process of adding some 
> restrictions on which tasks a user can
> work on, based on an attribute of each task that does not change 
> for the task's lifespan.  Users may have access to work on one or more 
> types of tasks.  To improve query time when
> finding the set of tasks that we assign, we are introducing partitioning into 
> our task queue table.  When assigning tasks, we issue an update statement to 
> mark the tasks as reserved using a subquery
> that orders the tasks by age.  With the introduction of partitioning, we are 
> seeing that the update statement affects more rows than expected.  An example 
> query is:
> 
> ---
> update task_parent
> set reserved = true
> from (
>   select id
>   from task_parent
>   where reserved = false
>     and task_type = 1 or task_type = 2
>   order by task_timestamp
>   limit 50
>   for update skip locked) as sub
> where sub.id = task_parent.id
> returning task_parent.id
> ---  
> 
> In the statement above, we have a subquery to limit the number of tasks to 50 
> yet the update statement sometimes returns more than 50 records.  I have 
> narrowed this down to a small, reproducible
> example shown below.  The first time I run the update statement I get ~65 
> records, then typically ~53 the next few runs, and then it starts 
> consistently giving me 50 records after that.  Then if I
> bump the limit to 100 I will get more than 100 initially and after several 
> executions it starts to settle into always giving the expected 100.
> 
> Below is the full setup that can be used to reproduce what I'm seeing.  It 
> was initially observed on PostgreSQL 11.8 but I can also reproduce it on 13.0.
> 
> ---
> create table task_parent (
>   id bigint not null,
>   task_type smallint not null,
>   reserved boolean not null,
>   task_timestamp timestamp not null
> ) partition by list (task_type);
> 
> create table task_child_1
> partition of task_parent for values in (1);
> 
> create table task_child_2
> partition of task_parent for values in (2);
> 
> insert into task_parent
> select
>   generate_series(1, 500000),
>   case when random() < 0.5 then 1 else 2 end,
>   false,
>   now() - (random() * '1 day'::interval);
>   
> create index task_parent_task_time_idx
> on task_parent (task_timestamp);
> 
> update task_parent
> set reserved = true
> from (
>   select id
>   from task_parent
>   where reserved = false
>     and task_type = 1 or task_type = 2
>   order by task_timestamp
>   limit 50
>   for update skip locked) as sub
> where sub.id = task_parent.id
> returning task_parent.id;
> ---
> 
> A couple of interesting observations:
> 1) If I remove the order by clause I always get the expected number of results
> 2) If I rewrite the query to use a CTE for the task IDs instead of a subquery 
> then I always get the expected number of results
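> 
> For reference, the CTE variant that always returns the expected count looks 
> roughly like this (same predicate and ordering, just moved into a CTE):
> 
> ---
> with sub as (
>   select id
>   from task_parent
>   where reserved = false
>     and task_type = 1 or task_type = 2
>   order by task_timestamp
>   limit 50
>   for update skip locked)
> update task_parent
> set reserved = true
> from sub
> where sub.id = task_parent.id
> returning task_parent.id;
> ---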
> 
> On the surface, this seems like it could be a bug, but maybe there is 
> something about this usage pattern that is known/expected to cause this 
> behavior.  So that's the question - is this a bug that
> should be reported to pgsql-bugs, or is this expected behavior, and if so, why?

Yes, this must be a bug:

EXPLAIN (COSTS OFF) update task_parent
set reserved = true
from (
  select id
  from task_parent
  where reserved = false
    and task_type = 1 or task_type = 2
  order by task_timestamp
  limit 50
  for update skip locked) as sub
where sub.id = task_parent.id
returning task_parent.id;

                                  QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
 Update on task_parent
   Update on task_child_1 task_parent_1
   Update on task_child_2 task_parent_2
   ->  Hash Join
         Hash Cond: (task_parent_1.id = sub.id)
         ->  Seq Scan on task_child_1 task_parent_1
         ->  Hash
               ->  Subquery Scan on sub
                     ->  Limit
                           ->  LockRows
                                 ->  Merge Append
                                       Sort Key: task_parent_3.task_timestamp
                                        ->  Index Scan using task_child_1_task_timestamp_idx on task_child_1 task_parent_4
                                              Filter: (((NOT reserved) AND (task_type = 1)) OR (task_type = 2))
                                        ->  Index Scan using task_child_2_task_timestamp_idx on task_child_2 task_parent_5
                                              Filter: (((NOT reserved) AND (task_type = 1)) OR (task_type = 2))
   ->  Hash Join
         Hash Cond: (task_parent_2.id = sub_1.id)
         ->  Seq Scan on task_child_2 task_parent_2
         ->  Hash
               ->  Subquery Scan on sub_1
                     ->  Limit
                           ->  LockRows
                                 ->  Merge Append
                                       Sort Key: task_parent_6.task_timestamp
                                        ->  Index Scan using task_child_1_task_timestamp_idx on task_child_1 task_parent_7
                                              Filter: (((NOT reserved) AND (task_type = 1)) OR (task_type = 2))
                                        ->  Index Scan using task_child_2_task_timestamp_idx on task_child_2 task_parent_8
                                              Filter: (((NOT reserved) AND (task_type = 1)) OR (task_type = 2))
(29 rows)

The subquery is executed twice, and the two executions obviously don't
return the same results.  I am at a loss for an explanation ...
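
That would also be consistent with your observation 2: a materialized CTE is
evaluated only once, so the LockRows node runs exactly once.  On v11 CTEs are
always materialized; on v12 and later you may want to spell it out with the
MATERIALIZED keyword to be safe.  Roughly (untested):

EXPLAIN (COSTS OFF)
with sub as materialized (
  select id
  from task_parent
  where reserved = false
    and task_type = 1 or task_type = 2
  order by task_timestamp
  limit 50
  for update skip locked)
update task_parent
set reserved = true
from sub
where sub.id = task_parent.id
returning task_parent.id;

If the theory holds, that plan should show a single LockRows node feeding a
CTE Scan in each per-partition join, rather than two independent LockRows.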

Yours,
Laurenz Albe
-- 
Cybertec | https://www.cybertec-postgresql.com


