Hi Gunther & List,
I think I am hitting an extremely similar issue, and everything points in
the same direction: a potential problem with SKIP LOCKED on partitioned
tables.
For background: I had a queue table on v9.6 with fairly high volume (>50M
items, growing by 1M+ daily).
Processing the queue with FOR UPDATE SKIP LOCKED was reliable, but the
traffic volume on v9.6 and the fact that v12 is current led me to migrate
to v12 and switch to a partitioned table.
The queue has distinct categories of items, so the table is
list-partitioned on category. Processing an item in one category updates
it to the next logical category (i.e. the row migrates to another
partition once it is processed).
Within each category there can be tens of millions of rows, so each list
partition is hash-subpartitioned as well. I don't think this is the
issue, but I mention it for completeness.
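For illustration, a minimal sketch of the kind of layout I mean (table
and column names here are hypothetical, not my actual DDL):

    CREATE TABLE queue (
        id       bigserial,
        category text NOT NULL,
        payload  text
    ) PARTITION BY LIST (category);

    -- one list partition per category, each hash-subpartitioned on id
    CREATE TABLE queue_pending PARTITION OF queue
        FOR VALUES IN ('pending') PARTITION BY HASH (id);
    CREATE TABLE queue_pending_0 PARTITION OF queue_pending
        FOR VALUES WITH (MODULUS 4, REMAINDER 0);
    -- ... remainders 1-3, and the same structure for 'processed', etc.
    CREATE TABLE queue_processed PARTITION OF queue
        FOR VALUES IN ('processed') PARTITION BY HASH (id);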
Now, when processing the queue, there are regular transaction aborts
with "tuple to be locked was already moved to another partition due to
concurrent update".
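For concreteness, the processing step is roughly a claim-and-advance
statement like this (same hypothetical names as above); the change of
category is what makes the claimed rows migrate to another partition,
and it is the locking SELECT in a concurrent session that aborts with
the error above:

    BEGIN;

    WITH claimed AS (
        SELECT id
          FROM queue
         WHERE category = 'pending'
         LIMIT 100
           FOR UPDATE SKIP LOCKED
    )
    UPDATE queue q
       SET category = 'processed'    -- moves the rows across partitions
      FROM claimed
     WHERE q.id = claimed.id
       AND q.category = 'pending'    -- allows partition pruning
    RETURNING q.id;

    COMMIT;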
From everything I can trace, it really does look like this is caused by
rows that should have been skipped because they were already locked and
being processed by a different worker.
I tried switching FOR UPDATE to FOR KEY SHARE, and that created a cascade
of deadlock aborts, so it was worse for my situation.
For now, I roll back and repeat the SELECT ... FOR UPDATE SKIP LOCKED
until it succeeds, which it eventually does.
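In my case the retry lives in the application, but the same loop can be
sketched inside the database with PL/pgSQL, assuming the error arrives as
SQLSTATE 40001 / serialization_failure (that mapping is an assumption on
my part; check your logs):

    DO $$
    DECLARE
        claimed_ids bigint[];
    BEGIN
        LOOP
            BEGIN
                -- try to claim a batch; aggregate the locked ids
                SELECT array_agg(id) INTO claimed_ids
                  FROM (SELECT id
                          FROM queue
                         WHERE category = 'pending'
                         LIMIT 100
                           FOR UPDATE SKIP LOCKED) AS batch;
                EXIT;  -- claim succeeded (or the queue is empty)
            EXCEPTION WHEN serialization_failure THEN
                NULL;  -- a row moved partitions under us; loop and retry
            END;
        END LOOP;
        -- process claimed_ids here before the transaction ends
    END $$;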
However, it really feels like these rows should simply have been skipped
by PostgreSQL, without the rollback/retry loop.
So, am I missing something / doing it wrong? Or could there be a
potential issue that needs to be raised?
Thanks
Jim
On 30-Jun.-2020 12:10, Gunther Schadow wrote:
Hi all,
A long time ago, with your help, I devised a task queuing system that
uses SELECT ... FOR UPDATE SKIP LOCKED so that many parallel workers can
find tasks in the queue. It uses a partitioned table where the hot part
of the queue is short, so the query for a job is quick, and the SKIP
LOCKED locking makes sure that a job is only assigned to one worker.
This works pretty well for me, except that when we run many workers we
see a lot of these failures occurring:
"tuple to be locked was already moved to another partition due to
concurrent update"
This would not exactly look like a bug, because the message says "to be
locked", so at least it's not allowing two workers to lock the same
tuple. But it seems that the SKIP LOCKED mode should not turn this into
an error; it should treat it as if the tuple were already locked. Why
would it want to lock the tuple (representing the job) if another worker
has already finished its UPDATE of the job to mark it as "done" (which
is what makes the tuple move to the "completed" partition)?
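To make the scenario concrete, here is the interleaving I mean, sketched
with hypothetical names (this is not our actual schema):

    -- session 1: claims a job and marks it done
    BEGIN;
    SELECT id FROM task_queue
     WHERE status = 'pending'
     LIMIT 1
       FOR UPDATE SKIP LOCKED;       -- say this claims job 42
    UPDATE task_queue SET status = 'done'
     WHERE id = 42;                  -- moves it to the "completed" partition
    COMMIT;

    -- session 2: starts its scan while session 1 is still in flight
    SELECT id FROM task_queue
     WHERE status = 'pending'
     LIMIT 1
       FOR UPDATE SKIP LOCKED;
    -- if this scan reaches job 42's old row version after session 1 has
    -- committed, the lock attempt fails with
    --   ERROR: tuple to be locked was already moved to another partition
    --          due to concurrent update
    -- instead of the row simply being skipped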
Either the SELECT for jobs to do returned the wrong tuple, one that was
already updated, or there is some lapse in the locking.
Either way, it seems a waste of time to throw all these errors when the
tuple should not even have been selected for update and locking in the
first place.
I wonder if anybody knows anything about this issue? Of course you'll
want to see the DDL and SQL queries, etc., but you can't really try it
out unless you do some massively parallel magic, so I figured I'd just
ask.
regards,
-Gunther