Would (task_type in (1,2)) make any logical difference?
On 12/18/20 6:11 AM, Craig McIlwee wrote:
Despite looking at this query on and off for a couple of days, it wasn't
until seeing it in Laurenz's reply that I noticed a logical issue with the
query that changes things a bit. There should be parentheses around the
task_type predicates, otherwise you end up getting reserved rows in the
result set. After looking at it more, I see that when the original query
settles into consistently returning the expected number of results, they
happen to be the exact same results each time. Correcting the logical
error in the query solves that, and now it consistently gives too many
results, regardless of the number of executions.
Corrected query (but still not working properly):
update task_parent
set reserved = true
from (
select id
from task_parent
where reserved = false
and (task_type = 1 or task_type = 2)
order by task_timestamp
limit 50
for update skip locked) as sub
where sub.id = task_parent.id
returning *;
Yes, this must be a bug:
EXPLAIN (COSTS OFF) update task_parent
set reserved = true
from (
select id
from task_parent
where reserved = false
and task_type = 1 or task_type = 2
order by task_timestamp
limit 50
for update skip locked) as sub
where sub.id = task_parent.id
returning task_parent.id;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
 Update on task_parent
   Update on task_child_1 task_parent_1
   Update on task_child_2 task_parent_2
   ->  Hash Join
         Hash Cond: (task_parent_1.id = sub.id)
         ->  Seq Scan on task_child_1 task_parent_1
         ->  Hash
               ->  Subquery Scan on sub
                     ->  Limit
                           ->  LockRows
                                 ->  Merge Append
                                       Sort Key: task_parent_3.task_timestamp
                                       ->  Index Scan using task_child_1_task_timestamp_idx on task_child_1 task_parent_4
                                             Filter: (((NOT reserved) AND (task_type = 1)) OR (task_type = 2))
                                       ->  Index Scan using task_child_2_task_timestamp_idx on task_child_2 task_parent_5
                                             Filter: (((NOT reserved) AND (task_type = 1)) OR (task_type = 2))
   ->  Hash Join
         Hash Cond: (task_parent_2.id = sub_1.id)
         ->  Seq Scan on task_child_2 task_parent_2
         ->  Hash
               ->  Subquery Scan on sub_1
                     ->  Limit
                           ->  LockRows
                                 ->  Merge Append
                                       Sort Key: task_parent_6.task_timestamp
                                       ->  Index Scan using task_child_1_task_timestamp_idx on task_child_1 task_parent_7
                                             Filter: (((NOT reserved) AND (task_type = 1)) OR (task_type = 2))
                                       ->  Index Scan using task_child_2_task_timestamp_idx on task_child_2 task_parent_8
                                             Filter: (((NOT reserved) AND (task_type = 1)) OR (task_type = 2))
(29 rows)
The subquery is executed twice, and the two executions obviously don't
return the same results. I am at a loss for an explanation ...
I did notice that the subquery is executed twice, and this is the
difference between the subquery and CTE versions: the CTE is only executed
once, and then the CTE result is used in the joins against the child
tables. Beyond that, I was unable to rationalize what was going on. One
thing I don't know is whether the subquery should be executed more than
once. If yes, then like you said, I would think that both executions
should give the same results.
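For reference, the CTE rewrite being described would look roughly like this (a sketch, not the exact statement from the thread; on PostgreSQL 12+ you could also write "with sub as materialized (...)" to make the single evaluation explicit, while on 11 a CTE is always materialized):

```sql
with sub as (
    select id
    from task_parent
    where reserved = false
    and (task_type = 1 or task_type = 2)
    order by task_timestamp
    limit 50
    for update skip locked
)
update task_parent
set reserved = true
from sub
where sub.id = task_parent.id
returning task_parent.id;
```

Because the CTE result is computed once and reused in the join against each partition, every child table sees the same set of 50 locked ids.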
Another interesting thing I noticed is that multiple executions give tasks
with overlapping time ranges. Given two executions, the task_timestamp
ordering should give the oldest 50 first and then the next oldest 50 after
that. In reality, I see that the second execution has tasks that are older
than the newest timestamp of the first execution. Wrapping the update
statement in a CTE and then picking out the min/max shows this:
with updated as (
update task_parent
set reserved = true
from (
select id, task_type
from task_parent
where reserved = false
and (task_type = 1 or task_type = 2)
order by task_timestamp
limit 50
for update skip locked) as sub
where sub.id = task_parent.id
and sub.task_type = task_parent.task_type
returning *
)
select min(task_timestamp), max(task_timestamp)
from updated;
Execution 1:
min | max
----------------------------+----------------------------
2020-12-17 11:44:51.192119 | 2020-12-17 11:45:03.881409
Execution 2:
min | max
----------------------------+----------------------------
2020-12-17 11:44:59.943108 | 2020-12-17 11:45:14.273185
The min of execution 2 is older than the max of execution 1, which should not happen.
Craig
On Fri, Dec 18, 2020 at 4:06 AM Laurenz Albe <laurenz.a...@cybertec.at
<mailto:laurenz.a...@cybertec.at>> wrote:
On Thu, 2020-12-17 at 12:21 -0500, Craig McIlwee wrote:
> Our application uses a queue-like table to assign tasks to users, and
> this has worked well for us for a few years. Now we are in the process
> of adding some restrictions to which tasks a user can work on, and that
> is based on an attribute of each task that does not change for the
> task's lifespan. Users may have access to work on one or more types of
> tasks. To improve query time when finding the set of tasks that we
> assign, we are introducing partitioning into our task queue table. When
> assigning tasks, we issue an update statement to mark the tasks as
> reserved, using a subquery that orders the tasks by age. With the
> introduction of partitioning, we are seeing that the update statement
> affects more rows than expected. An example query is:
>
> ---
> update task_parent
> set reserved = true
> from (
> select id
> from task_parent
> where reserved = false
> and task_type = 1 or task_type = 2
> order by task_timestamp
> limit 50
> for update skip locked) as sub
> where sub.id = task_parent.id
> returning task_parent.id
> ---
>
> In the statement above, we have a subquery to limit the number of tasks
> to 50, yet the update statement sometimes returns more than 50 records.
> I have narrowed this down to a small, reproducible example shown below.
> The first time I run the update statement I get ~65 records, then
> typically ~53 the next few runs, and then it starts consistently giving
> me 50 records after that. Then if I bump the limit to 100, I will get
> more than 100 initially, and after several executions it settles into
> always giving the expected 100.
>
> Below is the full setup that can be used to reproduce what I'm seeing.
> It was initially observed on PostgreSQL 11.8, but I can also reproduce
> it on 13.0.
>
> ---
> create table task_parent (
> id bigint not null,
> task_type smallint not null,
> reserved boolean not null,
> task_timestamp timestamp not null
> ) partition by list (task_type);
>
> create table task_child_1
> partition of task_parent for values in (1);
>
> create table task_child_2
> partition of task_parent for values in (2);
>
> insert into task_parent
> select
> generate_series(1, 500000),
> case when random() < 0.5 then 1 else 2 end,
> false,
> now() - (random() * '1 day'::interval);
>
> create index task_parent_task_time_idx
> on task_parent (task_timestamp);
>
> update task_parent
> set reserved = true
> from (
> select id
> from task_parent
> where reserved = false
> and task_type = 1 or task_type = 2
> order by task_timestamp
> limit 50
> for update skip locked) as sub
> where sub.id = task_parent.id
> returning task_parent.id;
> ---
>
> A couple of interesting observations:
> 1) If I remove the order by clause, I always get the expected number of
> results
> 2) If I rewrite the query to use a CTE for the task IDs instead of a
> subquery, then I always get the expected number of results
>
> On its surface, this seems like it could be a bug, but maybe there is
> something about this usage pattern that is known/expected to cause this
> behavior. So that's the question - is this a bug that should be reported
> to pgsql-bugs, or is this expected, and if so, why?
Yours,
Laurenz Albe
--
Cybertec | https://www.cybertec-postgresql.com
--
Angular momentum makes the world go 'round.