Re: long-standing data loss bug in initial sync of logical replication

2025-04-25 Thread Amit Kapila
On Fri, Apr 25, 2025 at 10:45 AM Shlok Kyal wrote: > > In the commits, I saw that the filenames are misspelled for files > invalidation_distrubution.out and invalidation_distrubution.spec. > This is present in branches from REL_13 to HEAD. I have attached > patches to fix the same. > Thanks for s

Re: long-standing data loss bug in initial sync of logical replication

2025-04-24 Thread Shlok Kyal
On Thu, 24 Apr 2025 at 14:39, Amit Kapila wrote: > > On Wed, Apr 23, 2025 at 10:28 PM Masahiko Sawada > wrote: > > > > > > > > Fair enough. OTOH, we can leave the 13 branch considering following: > > > (a) it is near EOL, (b) bug happens in rare cases (when the DDLs like > > > ALTER PUBLICATION

Re: long-standing data loss bug in initial sync of logical replication

2025-04-24 Thread Amit Kapila
On Wed, Apr 23, 2025 at 10:28 PM Masahiko Sawada wrote: > > > > > Fair enough. OTOH, we can leave the 13 branch considering following: > > (a) it is near EOL, (b) bug happens in rare cases (when the DDLs like > > ALTER PUBLICATION ... ADD TABLE ... or ALTER TYPE ... that don't take > > a strong l

Re: long-standing data loss bug in initial sync of logical replication

2025-04-23 Thread Masahiko Sawada
On Tue, Apr 22, 2025 at 11:31 PM Amit Kapila wrote: > > On Tue, Apr 22, 2025 at 10:57 PM Masahiko Sawada > wrote: > > > > On Tue, Mar 18, 2025 at 2:55 AM Amit Kapila wrote: > > > > > > On Mon, Mar 17, 2025 at 4:56 PM Hayato Kuroda (Fujitsu) > > > wrote: > > > > > > > > > Regarding the PG13, it

Re: long-standing data loss bug in initial sync of logical replication

2025-04-22 Thread Amit Kapila
On Tue, Apr 22, 2025 at 10:57 PM Masahiko Sawada wrote: > > On Tue, Mar 18, 2025 at 2:55 AM Amit Kapila wrote: > > > > On Mon, Mar 17, 2025 at 4:56 PM Hayato Kuroda (Fujitsu) > > wrote: > > > > > > > Regarding the PG13, it cannot be > > > > applied > > > > as-is thus some adjustments are needed.

Re: long-standing data loss bug in initial sync of logical replication

2025-04-22 Thread Masahiko Sawada
On Tue, Mar 18, 2025 at 2:55 AM Amit Kapila wrote: > > On Mon, Mar 17, 2025 at 4:56 PM Hayato Kuroda (Fujitsu) > wrote: > > > > > Regarding the PG13, it cannot be > > > applied > > > as-is thus some adjustments are needed. I will share it in upcoming posts. > > > > Here is a patch set for PG13. A

Re: long-standing data loss bug in initial sync of logical replication

2025-04-10 Thread Tomas Vondra
On 4/10/25 11:45, Amit Kapila wrote: > On Tue, Apr 8, 2025 at 3:40 PM Amit Kapila wrote: >> >> On Tue, Mar 18, 2025 at 3:25 PM Amit Kapila wrote: >>> >>> Sawada-San, and others involved here, do you have any suggestions on >>> this matter? >>> >> >> Seeing no responses for a long time, I am plann

Re: long-standing data loss bug in initial sync of logical replication

2025-04-10 Thread Amit Kapila
On Tue, Apr 8, 2025 at 3:40 PM Amit Kapila wrote: > > On Tue, Mar 18, 2025 at 3:25 PM Amit Kapila wrote: > > > > Sawada-San, and others involved here, do you have any suggestions on > > this matter? > > > > Seeing no responses for a long time, I am planning to push the fix > till 14 tomorrow unle

Re: long-standing data loss bug in initial sync of logical replication

2025-04-08 Thread Amit Kapila
On Tue, Mar 18, 2025 at 3:25 PM Amit Kapila wrote: > > On Mon, Mar 17, 2025 at 4:56 PM Hayato Kuroda (Fujitsu) > wrote: > > > > > Regarding the PG13, it cannot be > > > applied > > > as-is thus some adjustments are needed. I will share it in upcoming posts. > > > > Here is a patch set for PG13. A

Re: long-standing data loss bug in initial sync of logical replication

2025-04-05 Thread Amit Kapila
On Mon, Mar 17, 2025 at 4:56 PM Hayato Kuroda (Fujitsu) wrote: > > > Regarding the PG13, it cannot be > > applied > > as-is thus some adjustments are needed. I will share it in upcoming posts. > > Here is a patch set for PG13. Apart from PG14-17, the patch could be created > as-is, > because... >

RE: long-standing data loss bug in initial sync of logical replication

2025-03-17 Thread Hayato Kuroda (Fujitsu)
Dear hackers, Attached patch set contains proper commit message. It briefly describes the background and handlings. Regarding the PG13, the same commit message is used for 0001. 0002 is still rough. Renamed backpatches to .txt, to make cfbot happy. Thanks Hou working for it. Best regards, Haya

RE: long-standing data loss bug in initial sync of logical replication

2025-03-17 Thread Hayato Kuroda (Fujitsu)
Dear hackers, > Regarding the PG13, it cannot be > applied > as-is thus some adjustments are needed. I will share it in upcoming posts. Here is a patch set for PG13. Apart from PG14-17, the patch could be created as-is, because... 1. WAL record for invalidation messages (XLOG_XACT_INVALIDATIONS

RE: long-standing data loss bug in initial sync of logical replication

2025-03-17 Thread Hayato Kuroda (Fujitsu)
Dear Amit, > A few comments: > == > 1. > +SnapBuildDistributeSnapshotAndInval(SnapBuild *builder, XLogRecPtr > lsn, TransactionId xid) > { > dlist_iter txn_i; > ReorderBufferTXN *txn; > + ReorderBufferTXN *curr_txn; > + > + curr_txn = ReorderBufferTXNByXid(builder->reorder, xid, false,

Re: long-standing data loss bug in initial sync of logical replication

2025-03-16 Thread Amit Kapila
On Mon, Mar 17, 2025 at 6:53 AM Hayato Kuroda (Fujitsu) wrote: > > OK, let me share patched for back branches. Mostly the same fix patched as > master > can be used for PG14-PG17, like attached. > A few comments: == 1. +SnapBuildDistributeSnapshotAndInval(SnapBuild *builder, XLogRecP

RE: long-standing data loss bug in initial sync of logical replication

2025-03-16 Thread Hayato Kuroda (Fujitsu)
Dear Amit, > IIUC, workloads C and D will have regression in back branches, and > HEAD will have regression only for workload D. We have avoided > workload C regression in HEAD via commits 7c99dc587a and 3abe9dc188. Right. > We can backpatch those commits if required, but I think it is better >

Re: long-standing data loss bug in initial sync of logical replication

2025-03-14 Thread Amit Kapila
On Thu, Mar 13, 2025 at 2:12 PM Hayato Kuroda (Fujitsu) wrote: > > > Workload C. DDL is happening on publication but on unrelated table > > We did not run the workload because we expected this could be same results as > D. > 588acf6 is needed to optimi

RE: long-standing data loss bug in initial sync of logical replication

2025-03-13 Thread Hayato Kuroda (Fujitsu)
Hi hackers, > Our team (mainly Shlok) did a performance testing with several workloads. Let > me > share them on -hackers. We did it for master/REL_17 branches, and in this post > master's one will be discussed. I posted benchmark results for master [1]. In this post contains a result for back br

RE: long-standing data loss bug in initial sync of logical replication

2025-03-13 Thread Hayato Kuroda (Fujitsu)
Hi Hackers, Our team (mainly Shlok) did a performance testing with several workloads. Let me share them on -hackers. We did it for master/REL_17 branches, and in this post master's one will be discussed. The observed trend is: We observed that the performance regression exists primarily during fr

RE: long-standing data loss bug in initial sync of logical replication

2025-03-12 Thread Hayato Kuroda (Fujitsu)
Dear hackers, I found that the patch needs to be rebased due to ac4494, PSA new version. It could be applied atop HEAD and tests worked well. Best regards, Hayato Kuroda FUJITSU LIMITED v18-0001-Distribute-invalidatons-if-change-in-catalog-tab.patch Description: v18-0001-Distribute-invalidato

Re: long-standing data loss bug in initial sync of logical replication

2025-03-03 Thread Benoit Lobréau
On 3/3/25 8:41 AM, Zhijie Hou (Fujitsu) wrote: A nitpick with the data for the Concurrent Transaction (2000) case. The results show that the HEAD's data appears worse than the patch data, which seems unusual. However, I confirmed that the details in the attachment are as expected, so, this seems

RE: long-standing data loss bug in initial sync of logical replication

2025-03-02 Thread Zhijie Hou (Fujitsu)
On Friday, February 28, 2025 4:28 PM Benoit Lobréau wrote: > > It took me a while but I ran the test on my laptop with 20 runs per test. I > asked > for a dedicated server and will re-run the tests if/when I have it. > > count of partitions | Head (sec) |Fix (sec) |Degradation (%) >

Re: long-standing data loss bug in initial sync of logical replication

2025-02-28 Thread Amit Kapila
On Fri, Feb 28, 2025 at 6:15 AM Masahiko Sawada wrote: > > On Thu, Feb 27, 2025 at 12:14 AM Zhijie Hou (Fujitsu) > wrote: > > > > On Monday, February 24, 2025 5:50 PM Amit Kapila > > wrote: > > > > > > On Wed, Dec 11, 2024 at 12:37 PM Masahiko Sawada > > > wrote: > > > > > > > > I confirmed th

Re: long-standing data loss bug in initial sync of logical replication

2025-02-28 Thread Amit Kapila
On Fri, Feb 28, 2025 at 9:45 AM Amit Kapila wrote: > > On Fri, Feb 28, 2025 at 6:15 AM Masahiko Sawada wrote: > > > > On Thu, Feb 27, 2025 at 12:14 AM Zhijie Hou (Fujitsu) > > > > > > I think distributing invalidations to a transaction that has not yet > > > built a > > > base snapshot is un-nec

Re: long-standing data loss bug in initial sync of logical replication

2025-02-28 Thread Amit Kapila
On Mon, Feb 24, 2025 at 4:49 PM Shlok Kyal wrote: > > Patches need a rebase. Attached the rebased patch. > I would like to discuss 0002 patch: publication_invalidation_cb(Datum arg, int cacheid, uint32 hashvalue) { publications_valid = false; - - /* - * Also invalidate per-relation cache so t

Re: long-standing data loss bug in initial sync of logical replication

2025-02-28 Thread Benoit Lobréau
Hi, It took me a while but I ran the test on my laptop with 20 runs per test. I asked for a dedicated server and will re-run the tests if/when I have it. count of partitions | Head (sec) |Fix (sec) |Degradation (%)

Re: long-standing data loss bug in initial sync of logical replication

2025-02-27 Thread Masahiko Sawada
On Thu, Feb 27, 2025 at 12:14 AM Zhijie Hou (Fujitsu) wrote: > > On Monday, February 24, 2025 5:50 PM Amit Kapila > wrote: > > > > On Wed, Dec 11, 2024 at 12:37 PM Masahiko Sawada > > wrote: > > > > > > I confirmed that the proposed patch fixes these issues. I have one > > > question about the

RE: long-standing data loss bug in initial sync of logical replication

2025-02-27 Thread Zhijie Hou (Fujitsu)
On Monday, February 24, 2025 5:50 PM Amit Kapila wrote: > > On Wed, Dec 11, 2024 at 12:37 PM Masahiko Sawada > wrote: > > > > I confirmed that the proposed patch fixes these issues. I have one > > question about the patch: > > > > In the main loop in SnapBuildDistributeSnapshotAndInval(), we ha

Re: long-standing data loss bug in initial sync of logical replication

2025-02-26 Thread Amit Kapila
On Wed, Feb 26, 2025 at 9:21 AM Shlok Kyal wrote: > > I have done the performance testing for cases where we distribute a > small amount of invalidation messages to many concurrently decoded > transactions. > Here are results: > > Concurrent Txn |Head (sec)|Patch (sec) | Degradati

Re: long-standing data loss bug in initial sync of logical replication

2025-02-25 Thread Shlok Kyal
On Wed, 11 Dec 2024 at 12:37, Masahiko Sawada wrote: > > On Tue, Oct 8, 2024 at 2:51 AM Shlok Kyal wrote: > > > > On Wed, 31 Jul 2024 at 03:27, Masahiko Sawada wrote: > > > > > > On Wed, Jul 24, 2024 at 9:53 PM Amit Kapila > > > wrote: > > > > > > > > On Wed, Jul 17, 2024 at 5:25 PM vignesh C

Re: long-standing data loss bug in initial sync of logical replication

2025-02-25 Thread Amit Kapila
On Tue, Feb 25, 2025 at 3:56 PM Benoit Lobréau wrote: > > After reading the thread and doing a bit of testing, the problem seems > significant and is still present. The fact that it's probably not well > known makes it more concerning, in my opinion. I was wondering what > could be done to help mo

Re: long-standing data loss bug in initial sync of logical replication

2025-02-25 Thread Benoit Lobréau
Hi, After reading the thread and doing a bit of testing, the problem seems significant and is still present. The fact that it's probably not well known makes it more concerning, in my opinion. I was wondering what could be done to help move this topic forward (given my limited abilities)? --

Re: long-standing data loss bug in initial sync of logical replication

2025-02-24 Thread Shlok Kyal
On Wed, 11 Dec 2024 at 09:13, Shlok Kyal wrote: > > On Tue, 8 Oct 2024 at 11:11, Shlok Kyal wrote: > > > > Hi Kuroda-san, > > > > > > I have also modified the tests in 0001 patch. These changes are only > > > > related to syntax of writing tests. > > > > > > LGTM. I found small improvements, plea

Re: long-standing data loss bug in initial sync of logical replication

2025-02-24 Thread Amit Kapila
On Wed, Dec 11, 2024 at 12:37 PM Masahiko Sawada wrote: > > I confirmed that the proposed patch fixes these issues. I have one > question about the patch: > > In the main loop in SnapBuildDistributeSnapshotAndInval(), we have the > following code: > >/* > * If we don't have a base

Re: long-standing data loss bug in initial sync of logical replication

2024-12-10 Thread Masahiko Sawada
On Tue, Oct 8, 2024 at 2:51 AM Shlok Kyal wrote: > > On Wed, 31 Jul 2024 at 03:27, Masahiko Sawada wrote: > > > > On Wed, Jul 24, 2024 at 9:53 PM Amit Kapila wrote: > > > > > > On Wed, Jul 17, 2024 at 5:25 PM vignesh C wrote: > > > > > > > > On Wed, 17 Jul 2024 at 11:54, Amit Kapila > > > > w

Re: long-standing data loss bug in initial sync of logical replication

2024-12-10 Thread Shlok Kyal
On Tue, 8 Oct 2024 at 11:11, Shlok Kyal wrote: > > Hi Kuroda-san, > > > > I have also modified the tests in 0001 patch. These changes are only > > > related to syntax of writing tests. > > > > LGTM. I found small improvements, please find the attached. > > I have applied the changes and updated th

Re: long-standing data loss bug in initial sync of logical replication

2024-12-10 Thread Michael Paquier
On Tue, Dec 10, 2024 at 01:50:16PM -0800, Masahiko Sawada wrote: > Sorry I lost track of this thread. I'll check the test results and > patch soon. Thanks. -- Michael signature.asc Description: PGP signature

Re: long-standing data loss bug in initial sync of logical replication

2024-12-10 Thread Masahiko Sawada
On Mon, Dec 9, 2024 at 10:27 PM Michael Paquier wrote: > > On Tue, Oct 08, 2024 at 03:21:38PM +0530, Shlok Kyal wrote: > > I have tested the scenario shared by you on the thread [1]. And I > > confirm that the latest patch [2] fixes this issue. > > > > [1] > > https://www.postgresql.org/message-i

Re: long-standing data loss bug in initial sync of logical replication

2024-12-09 Thread Michael Paquier
On Tue, Oct 08, 2024 at 03:21:38PM +0530, Shlok Kyal wrote: > I have tested the scenario shared by you on the thread [1]. And I > confirm that the latest patch [2] fixes this issue. > > [1] > https://www.postgresql.org/message-id/CAD21AoAenVqiMjpN-PvGHL1N9DWnHSq673bfgr6phmBUzx=k...@mail.gmail.com

Re: long-standing data loss bug in initial sync of logical replication

2024-10-08 Thread Shlok Kyal
On Wed, 31 Jul 2024 at 03:27, Masahiko Sawada wrote: > > On Wed, Jul 24, 2024 at 9:53 PM Amit Kapila wrote: > > > > On Wed, Jul 17, 2024 at 5:25 PM vignesh C wrote: > > > > > > On Wed, 17 Jul 2024 at 11:54, Amit Kapila wrote: > > > > > > > > On Tue, Jul 16, 2024 at 6:54 PM vignesh C wrote: > >

Re: long-standing data loss bug in initial sync of logical replication

2024-10-07 Thread Shlok Kyal
Hi Kuroda-san, > > I have also modified the tests in 0001 patch. These changes are only > > related to syntax of writing tests. > > LGTM. I found small improvements, please find the attached. I have applied the changes and updated the patch. Thanks & Regards, Shlok Kyal From 07f94de76be177d0e397

RE: long-standing data loss bug in initial sync of logical replication

2024-10-07 Thread Hayato Kuroda (Fujitsu)
Dear Shlok, > I have tested this part. I observed that ,whenever we insert data in a > partition table, the function 'get_rel_sync_entry' is called and a > hash entry is created for the corresponding leaf node relid. So I feel > while invalidating here we can specify 'PUBLICATION_PART_LEAF' . I >

Re: long-standing data loss bug in initial sync of logical replication

2024-10-07 Thread Shlok Kyal
On Fri, 4 Oct 2024 at 12:52, Shlok Kyal wrote: > > Hi Kuroda-san, > > Thanks for reviewing the patch. > > > > 1. > > I feel the name of SnapBuildDistributeNewCatalogSnapshot() should be > > updated because it > > distributes two objects: catalog snapshot and invalidation messages. Do you > > hav

Re: long-standing data loss bug in initial sync of logical replication

2024-10-04 Thread Shlok Kyal
Hi Kuroda-san, Thanks for reviewing the patch. > > 1. > I feel the name of SnapBuildDistributeNewCatalogSnapshot() should be updated > because it > distributes two objects: catalog snapshot and invalidation messages. Do you > have good one > in your mind? I considered > "SnapBuildDistributeNewC

RE: long-standing data loss bug in initial sync of logical replication

2024-10-03 Thread Hayato Kuroda (Fujitsu)
Dear Shlok-san, Thanks for updating the patch. Here are comments. 1. I feel the name of SnapBuildDistributeNewCatalogSnapshot() should be updated because it distributes two objects: catalog snapshot and invalidation messages. Do you have good one in your mind? I considered "SnapBuildDistribute

Re: long-standing data loss bug in initial sync of logical replication

2024-10-02 Thread Shlok Kyal
On Mon, 30 Sept 2024 at 16:56, Hayato Kuroda (Fujitsu) wrote: > > > 2. > > Similarly with above, the relcache won't be invalidated when ALTER > > PUBLICATION > > OWNER TO is executed. This means that privilage checks may be ignored if the > > entry > > is still valid. Not sure, but is there a poss

Re: long-standing data loss bug in initial sync of logical replication

2024-10-02 Thread Shlok Kyal
> > I have addressed the comment for 0002 patch and attached the patches. > > Also, I have moved the tests in the 0002 to 0001 patch. > > Thanks for updating the patch. 0002 patch seems to remove cache invalidations > from publication_invalidation_cb(). Related with it, I found an issue and had >

RE: long-standing data loss bug in initial sync of logical replication

2024-09-30 Thread Hayato Kuroda (Fujitsu)
> 2. > Similarly with above, the relcache won't be invalidated when ALTER > PUBLICATION > OWNER TO is executed. This means that privilage checks may be ignored if the > entry > is still valid. Not sure, but is there a possibility this causes an > inconsistency? Hmm, IIUC, the attribute pubowner i

RE: long-standing data loss bug in initial sync of logical replication

2024-09-30 Thread Hayato Kuroda (Fujitsu)
Dear Shlok, > I have addressed the comment for 0002 patch and attached the patches. > Also, I have moved the tests in the 0002 to 0001 patch. Thanks for updating the patch. 0002 patch seems to remove cache invalidations from publication_invalidation_cb(). Related with it, I found an issue and had

Re: long-standing data loss bug in initial sync of logical replication

2024-09-29 Thread Shlok Kyal
On Thu, 26 Sept 2024 at 11:39, Shlok Kyal wrote: > > > In the v7 patch, I am looping through the reorder buffer of the > > current committed transaction and storing all invalidation messages in > > a list. Then I am distributing those invalidations. > > But I found that for a transaction we alread

Re: long-standing data loss bug in initial sync of logical replication

2024-09-27 Thread Shlok Kyal
Hi Kuroda-san, Thanks for reviewing the patch. > > Solution: > > 1. When we alter a publication using commands like ‘ALTER PUBLICATION > > pub_name DROP TABLE table_name’, first all tables in the publications > > are invalidated using the function ‘rel_sync_cache_relation_cb’. Then > > again ‘rel

RE: long-standing data loss bug in initial sync of logical replication

2024-09-26 Thread Hayato Kuroda (Fujitsu)
Dear Shlok, > Hi, > > I tried to add changes to selectively invalidate the cache to reduce > the performance degradation during the distribution of invalidations. Thanks for improving the patch! >... > > Solution: > 1. When we alter a publication using commands like ‘ALTER PUBLICATION > pub_na

Re: long-standing data loss bug in initial sync of logical replication

2024-09-25 Thread Shlok Kyal
> In the v7 patch, I am looping through the reorder buffer of the > current committed transaction and storing all invalidation messages in > a list. Then I am distributing those invalidations. > But I found that for a transaction we already store all the > invalidation messages (see [1]). So we don

Re: long-standing data loss bug in initial sync of logical replication

2024-09-12 Thread Shlok Kyal
On Mon, 9 Sept 2024 at 10:41, Shlok Kyal wrote: > > On Mon, 2 Sept 2024 at 10:12, Amit Kapila wrote: > > > > On Fri, Aug 30, 2024 at 3:06 PM Shlok Kyal wrote: > > > > > > Next I am planning to test solely on the logical decoding side and > > > will share the results. > > > > > > > Thanks, the ne

Re: long-standing data loss bug in initial sync of logical replication

2024-09-10 Thread Nitin Motiani
On Thu, Sep 5, 2024 at 4:04 PM Amit Kapila wrote: > > On Mon, Sep 2, 2024 at 9:19 PM Nitin Motiani wrote: > > > > I think that the partial data replication for one table is a bigger > > issue than the case of data being sent for a subset of the tables in > > the transaction. This can lead to inco

RE: long-standing data loss bug in initial sync of logical replication

2024-09-09 Thread Zhijie Hou (Fujitsu)
On Friday, August 9, 2024 7:21 PM Shlok Kyal wrote: Hi, > > In the v7 patch, I am looping through the reorder buffer of the current > committed > transaction and storing all invalidation messages in a list. Then I am > distributing those invalidations. > But I found that for a transaction we a

Re: long-standing data loss bug in initial sync of logical replication

2024-09-08 Thread Shlok Kyal
On Mon, 2 Sept 2024 at 10:12, Amit Kapila wrote: > > On Fri, Aug 30, 2024 at 3:06 PM Shlok Kyal wrote: > > > > Next I am planning to test solely on the logical decoding side and > > will share the results. > > > > Thanks, the next set of proposed tests makes sense to me. It will also > be useful

Re: long-standing data loss bug in initial sync of logical replication

2024-09-08 Thread Shlok Kyal
On Mon, 2 Sept 2024 at 10:12, Amit Kapila wrote: > > On Fri, Aug 30, 2024 at 3:06 PM Shlok Kyal wrote: > > > > Next I am planning to test solely on the logical decoding side and > > will share the results. > > > > Thanks, the next set of proposed tests makes sense to me. It will also > be useful

Re: long-standing data loss bug in initial sync of logical replication

2024-09-05 Thread Amit Kapila
On Mon, Sep 2, 2024 at 9:19 PM Nitin Motiani wrote: > > I think that the partial data replication for one table is a bigger > issue than the case of data being sent for a subset of the tables in > the transaction. This can lead to inconsistent data if the same row is > updated multiple times or de

Re: long-standing data loss bug in initial sync of logical replication

2024-09-02 Thread Nitin Motiani
On Tue, Aug 20, 2024 at 4:10 PM Amit Kapila wrote: > > On Thu, Aug 15, 2024 at 9:31 PM vignesh C wrote: > > Since we are applying invalidations to all in-progress transactions, > > the publisher will only replicate half of the transaction data up to > > the point of invalidation, while the remain

Re: long-standing data loss bug in initial sync of logical replication

2024-09-01 Thread Amit Kapila
On Fri, Aug 30, 2024 at 3:06 PM Shlok Kyal wrote: > > Next I am planning to test solely on the logical decoding side and > will share the results. > Thanks, the next set of proposed tests makes sense to me. It will also be useful to generate some worst-case scenarios where the number of invalidat

Re: long-standing data loss bug in initial sync of logical replication

2024-08-30 Thread Shlok Kyal
> BTW, we should do some performance testing by having a mix of DML and > DDLs to see the performance impact of this patch. > > [1] - > https://www.postgresql.org/message-id/CAD21AoAenVqiMjpN-PvGHL1N9DWnHSq673bfgr6phmBUzx=k...@mail.gmail.com > I did some performance testing and I found some perfo

Re: long-standing data loss bug in initial sync of logical replication

2024-08-20 Thread vignesh C
On Tue, 20 Aug 2024 at 16:10, Amit Kapila wrote: > > On Thu, Aug 15, 2024 at 9:31 PM vignesh C wrote: > > > > On Thu, 8 Aug 2024 at 16:24, Shlok Kyal wrote: > > > > > > On Wed, 31 Jul 2024 at 11:17, Shlok Kyal wrote: > > > > > > > > > > Created a patch for distributing invalidations. > > > Here

Re: long-standing data loss bug in initial sync of logical replication

2024-08-20 Thread Amit Kapila
On Thu, Aug 15, 2024 at 9:31 PM vignesh C wrote: > > On Thu, 8 Aug 2024 at 16:24, Shlok Kyal wrote: > > > > On Wed, 31 Jul 2024 at 11:17, Shlok Kyal wrote: > > > > > > > Created a patch for distributing invalidations. > > Here we collect the invalidation messages for the current transaction > >

Re: long-standing data loss bug in initial sync of logical replication

2024-08-15 Thread vignesh C
On Thu, 8 Aug 2024 at 16:24, Shlok Kyal wrote: > > On Wed, 31 Jul 2024 at 11:17, Shlok Kyal wrote: > > > > On Wed, 31 Jul 2024 at 09:36, Amit Kapila wrote: > > > > > > On Wed, Jul 31, 2024 at 3:27 AM Masahiko Sawada > > > wrote: > > > > > > > > On Wed, Jul 24, 2024 at 9:53 PM Amit Kapila > >

Re: long-standing data loss bug in initial sync of logical replication

2024-08-09 Thread Shlok Kyal
On Thu, 8 Aug 2024 at 16:24, Shlok Kyal wrote: > > On Wed, 31 Jul 2024 at 11:17, Shlok Kyal wrote: > > > > On Wed, 31 Jul 2024 at 09:36, Amit Kapila wrote: > > > > > > On Wed, Jul 31, 2024 at 3:27 AM Masahiko Sawada > > > wrote: > > > > > > > > On Wed, Jul 24, 2024 at 9:53 PM Amit Kapila > >

Re: long-standing data loss bug in initial sync of logical replication

2024-08-08 Thread Shlok Kyal
On Wed, 31 Jul 2024 at 11:17, Shlok Kyal wrote: > > On Wed, 31 Jul 2024 at 09:36, Amit Kapila wrote: > > > > On Wed, Jul 31, 2024 at 3:27 AM Masahiko Sawada > > wrote: > > > > > > On Wed, Jul 24, 2024 at 9:53 PM Amit Kapila > > > wrote: > > > > > > > > On Wed, Jul 17, 2024 at 5:25 PM vignesh

Re: long-standing data loss bug in initial sync of logical replication

2024-07-30 Thread Shlok Kyal
On Wed, 31 Jul 2024 at 09:36, Amit Kapila wrote: > > On Wed, Jul 31, 2024 at 3:27 AM Masahiko Sawada wrote: > > > > On Wed, Jul 24, 2024 at 9:53 PM Amit Kapila wrote: > > > > > > On Wed, Jul 17, 2024 at 5:25 PM vignesh C wrote: > > > > > > > > On Wed, 17 Jul 2024 at 11:54, Amit Kapila > > > >

Re: long-standing data loss bug in initial sync of logical replication

2024-07-30 Thread Amit Kapila
On Wed, Jul 31, 2024 at 3:27 AM Masahiko Sawada wrote: > > On Wed, Jul 24, 2024 at 9:53 PM Amit Kapila wrote: > > > > On Wed, Jul 17, 2024 at 5:25 PM vignesh C wrote: > > > > > > On Wed, 17 Jul 2024 at 11:54, Amit Kapila wrote: > > > > > > > > On Tue, Jul 16, 2024 at 6:54 PM vignesh C wrote: >

Re: long-standing data loss bug in initial sync of logical replication

2024-07-30 Thread Masahiko Sawada
On Wed, Jul 24, 2024 at 9:53 PM Amit Kapila wrote: > > On Wed, Jul 17, 2024 at 5:25 PM vignesh C wrote: > > > > On Wed, 17 Jul 2024 at 11:54, Amit Kapila wrote: > > > > > > On Tue, Jul 16, 2024 at 6:54 PM vignesh C wrote: > > > > > > BTW, I noticed that we don't take any table-level locks for C

Re: long-standing data loss bug in initial sync of logical replication

2024-07-24 Thread Amit Kapila
On Wed, Jul 17, 2024 at 5:25 PM vignesh C wrote: > > On Wed, 17 Jul 2024 at 11:54, Amit Kapila wrote: > > > > On Tue, Jul 16, 2024 at 6:54 PM vignesh C wrote: > > > > BTW, I noticed that we don't take any table-level locks for Create > > Publication .. For ALL TABLES (and Drop Publication). Can

Re: long-standing data loss bug in initial sync of logical replication

2024-07-18 Thread Nitin Motiani
On Thu, Jul 18, 2024 at 3:30 PM Nitin Motiani wrote: > > On Thu, Jul 18, 2024 at 3:25 PM Nitin Motiani wrote: > > > > On Thu, Jul 18, 2024 at 3:05 PM Nitin Motiani > > wrote: > > > > > > On Wed, Jul 17, 2024 at 5:25 PM vignesh C wrote: > > > > > > > > I tested these scenarios, and as you expec

Re: long-standing data loss bug in initial sync of logical replication

2024-07-18 Thread Nitin Motiani
On Thu, Jul 18, 2024 at 3:25 PM Nitin Motiani wrote: > > On Thu, Jul 18, 2024 at 3:05 PM Nitin Motiani wrote: > > > > On Wed, Jul 17, 2024 at 5:25 PM vignesh C wrote: > > > > > > I tested these scenarios, and as you expected, it throws an error for > > > the create publication case: > > > 2024-0

Re: long-standing data loss bug in initial sync of logical replication

2024-07-18 Thread Nitin Motiani
On Thu, Jul 18, 2024 at 3:05 PM Nitin Motiani wrote: > > On Wed, Jul 17, 2024 at 5:25 PM vignesh C wrote: > > > > I tested these scenarios, and as you expected, it throws an error for > > the create publication case: > > 2024-07-17 14:50:01.145 IST [481526] 481526 ERROR: could not receive > > d

Re: long-standing data loss bug in initial sync of logical replication

2024-07-18 Thread Nitin Motiani
On Wed, Jul 17, 2024 at 5:25 PM vignesh C wrote: > > On Wed, 17 Jul 2024 at 11:54, Amit Kapila wrote: > > > > On Tue, Jul 16, 2024 at 6:54 PM vignesh C wrote: > > > > > > On Tue, 16 Jul 2024 at 11:59, Amit Kapila wrote: > > > > > > > > On Tue, Jul 16, 2024 at 9:29 AM Amit Kapila > > > > wrote

Re: long-standing data loss bug in initial sync of logical replication

2024-07-17 Thread vignesh C
On Wed, 17 Jul 2024 at 11:54, Amit Kapila wrote: > > On Tue, Jul 16, 2024 at 6:54 PM vignesh C wrote: > > > > On Tue, 16 Jul 2024 at 11:59, Amit Kapila wrote: > > > > > > On Tue, Jul 16, 2024 at 9:29 AM Amit Kapila > > > wrote: > > > > > > > > One related comment: > > > > @@ -1219,8 +1219,14 @

Re: long-standing data loss bug in initial sync of logical replication

2024-07-16 Thread Amit Kapila
On Tue, Jul 16, 2024 at 6:54 PM vignesh C wrote: > > On Tue, 16 Jul 2024 at 11:59, Amit Kapila wrote: > > > > On Tue, Jul 16, 2024 at 9:29 AM Amit Kapila wrote: > > > > > > One related comment: > > > @@ -1219,8 +1219,14 @@ AlterPublicationTables(AlterPublicationStmt > > > *stmt, HeapTuple tup, >

Re: long-standing data loss bug in initial sync of logical replication

2024-07-16 Thread vignesh C
On Tue, 16 Jul 2024 at 11:59, Amit Kapila wrote: > > On Tue, Jul 16, 2024 at 9:29 AM Amit Kapila wrote: > > > > One related comment: > > @@ -1219,8 +1219,14 @@ AlterPublicationTables(AlterPublicationStmt > > *stmt, HeapTuple tup, > > oldrel = palloc(sizeof(PublicationRelInfo)); > > oldrel->wh

Re: long-standing data loss bug in initial sync of logical replication

2024-07-15 Thread Amit Kapila
On Tue, Jul 16, 2024 at 9:29 AM Amit Kapila wrote: > > One related comment: > @@ -1219,8 +1219,14 @@ AlterPublicationTables(AlterPublicationStmt > *stmt, HeapTuple tup, > oldrel = palloc(sizeof(PublicationRelInfo)); > oldrel->whereClause = NULL; > oldrel->columns = NIL; > + > + /* > + * Data

Re: long-standing data loss bug in initial sync of logical replication

2024-07-15 Thread Amit Kapila
On Tue, Jul 16, 2024 at 12:48 AM Nitin Motiani wrote: > > A couple of questions on the latest patch : > > 1. I see there is this logic in PublicationDropSchemas to first check > if there is a valid entry for the schema in pg_publication_namespace > > psid = GetSysCacheOid2(PUBLICAT

Re: long-standing data loss bug in initial sync of logical replication

2024-07-15 Thread Nitin Motiani
On Mon, Jul 15, 2024 at 11:42 PM vignesh C wrote: > > On Mon, 15 Jul 2024 at 15:31, Amit Kapila wrote: > > > > On Thu, Jul 11, 2024 at 6:19 PM Nitin Motiani > > wrote: > > > I looked further into the scenario of adding the tables in schema to > > > the publication. Since in that case, the entry

Re: long-standing data loss bug in initial sync of logical replication

2024-07-15 Thread vignesh C
On Mon, 15 Jul 2024 at 15:31, Amit Kapila wrote: > > On Thu, Jul 11, 2024 at 6:19 PM Nitin Motiani wrote: > > > > On Wed, Jul 10, 2024 at 11:22 PM Nitin Motiani > > wrote: > > > > > > On Wed, Jul 10, 2024 at 10:39 PM vignesh C wrote: > > > > > > > > On Wed, 10 Jul 2024 at 12:28, Amit Kapila

Re: long-standing data loss bug in initial sync of logical replication

2024-07-15 Thread Amit Kapila
On Thu, Jul 11, 2024 at 6:19 PM Nitin Motiani wrote: > > On Wed, Jul 10, 2024 at 11:22 PM Nitin Motiani > wrote: > > > > On Wed, Jul 10, 2024 at 10:39 PM vignesh C wrote: > > > > > > On Wed, 10 Jul 2024 at 12:28, Amit Kapila wrote: > > > > The patch missed to use the ShareRowExclusiveLock for

Re: long-standing data loss bug in initial sync of logical replication

2024-07-11 Thread Nitin Motiani
On Wed, Jul 10, 2024 at 11:22 PM Nitin Motiani wrote: > > On Wed, Jul 10, 2024 at 10:39 PM vignesh C wrote: > > > > On Wed, 10 Jul 2024 at 12:28, Amit Kapila wrote: > > > The patch missed to use the ShareRowExclusiveLock for partitions, see > > > attached. I haven't tested it but they should als

Re: long-standing data loss bug in initial sync of logical replication

2024-07-10 Thread Nitin Motiani
On Wed, Jul 10, 2024 at 10:39 PM vignesh C wrote: > > On Wed, 10 Jul 2024 at 12:28, Amit Kapila wrote: > > The patch missed to use the ShareRowExclusiveLock for partitions, see > > attached. I haven't tested it but they should also face the same > > problem. Apart from that, I have changed the co

Re: long-standing data loss bug in initial sync of logical replication

2024-07-10 Thread vignesh C
On Wed, 10 Jul 2024 at 12:28, Amit Kapila wrote: > > On Tue, Jul 9, 2024 at 8:14 PM vignesh C wrote: > > > > On Tue, 9 Jul 2024 at 17:05, Amit Kapila wrote: > > > > > > On Mon, Jul 1, 2024 at 10:51 AM vignesh C wrote: > > > > > > > > > > > > This issue is present in all supported versions. I wa

Re: long-standing data loss bug in initial sync of logical replication

2024-07-09 Thread Amit Kapila
On Tue, Jul 9, 2024 at 8:14 PM vignesh C wrote: > > On Tue, 9 Jul 2024 at 17:05, Amit Kapila wrote: > > > > On Mon, Jul 1, 2024 at 10:51 AM vignesh C wrote: > > > > > > > > > This issue is present in all supported versions. I was able to > > > reproduce it using the steps recommended by Andres a

Re: long-standing data loss bug in initial sync of logical replication

2024-07-09 Thread vignesh C
On Tue, 9 Jul 2024 at 17:05, Amit Kapila wrote: > > On Mon, Jul 1, 2024 at 10:51 AM vignesh C wrote: > > > > > > This issue is present in all supported versions. I was able to > > reproduce it using the steps recommended by Andres and Tomas's > > scripts. I also conducted a small test through TAP

Re: long-standing data loss bug in initial sync of logical replication

2024-07-09 Thread Amit Kapila
On Mon, Jul 1, 2024 at 10:51 AM vignesh C wrote: > > > This issue is present in all supported versions. I was able to > reproduce it using the steps recommended by Andres and Tomas's > scripts. I also conducted a small test through TAP tests to verify the > problem. Attached is the alternate_lock_

Re: long-standing data loss bug in initial sync of logical replication

2024-06-30 Thread vignesh C
On Thu, 27 Jun 2024 at 08:38, Amit Kapila wrote: > > On Wed, Jun 26, 2024 at 4:57 PM Tomas Vondra > wrote: > > > > On 6/25/24 07:04, Amit Kapila wrote: > > > On Mon, Jun 24, 2024 at 8:06 PM Tomas Vondra > > > wrote: > > >> > > >> On 6/24/24 12:54, Amit Kapila wrote: > > >>> ... > > > >

Re: long-standing data loss bug in initial sync of logical replication

2024-06-26 Thread Amit Kapila
On Wed, Jun 26, 2024 at 4:57 PM Tomas Vondra wrote: > > On 6/25/24 07:04, Amit Kapila wrote: > > On Mon, Jun 24, 2024 at 8:06 PM Tomas Vondra > > wrote: > >> > >> On 6/24/24 12:54, Amit Kapila wrote: > >>> ... > > >> I'm not sure there are any cases where using SRE instead of AE would >

Re: long-standing data loss bug in initial sync of logical replication

2024-06-26 Thread Tomas Vondra
On 6/25/24 07:04, Amit Kapila wrote: > On Mon, Jun 24, 2024 at 8:06 PM Tomas Vondra > wrote: >> >> On 6/24/24 12:54, Amit Kapila wrote: >>> ... >> I'm not sure there are any cases where using SRE instead of AE would >> cause >> problems for logical decoding, but it seems very har

Re: long-standing data loss bug in initial sync of logical replication

2024-06-24 Thread Amit Kapila
On Mon, Jun 24, 2024 at 8:06 PM Tomas Vondra wrote: > > On 6/24/24 12:54, Amit Kapila wrote: > > ... > >> > I'm not sure there are any cases where using SRE instead of AE would > cause > problems for logical decoding, but it seems very hard to prove. I'd be > very > surp

Re: long-standing data loss bug in initial sync of logical replication

2024-06-24 Thread Tomas Vondra
On 6/24/24 12:54, Amit Kapila wrote: > ... >> I'm not sure there are any cases where using SRE instead of AE would cause problems for logical decoding, but it seems very hard to prove. I'd be very surprised if just using SRE would not lead to corrupted cache contents in some >>

Re: long-standing data loss bug in initial sync of logical replication

2024-06-24 Thread Amit Kapila
On Sun, Nov 19, 2023 at 7:48 AM Andres Freund wrote: > > On 2023-11-19 02:15:33 +0100, Tomas Vondra wrote: > > > > If understand correctly, with the current code (which only gets > > ShareUpdateExclusiveLock), we may end up in a situation like this > > (sessions A and B): > > > > A: starts "ALTE

Re: long-standing data loss bug in initial sync of logical replication

2024-01-16 Thread Vadim Lakt
Hi, On 19.11.2023 09:18, Andres Freund wrote: Yea, the situation is much simpler during logical decoding than "originally" - there's no concurrency. Greetings, Andres Freund We've encountered a similar error on our industrial server. The case: After adding a table to logical replication, ta

Re: long-standing data loss bug in initial sync of logical replication

2023-11-18 Thread Andres Freund
On 2023-11-19 02:15:33 +0100, Tomas Vondra wrote: > > > On 11/18/23 22:05, Andres Freund wrote: > > Hi, > > > > On 2023-11-18 21:45:35 +0100, Tomas Vondra wrote: > >> On 11/18/23 19:12, Andres Freund wrote: > If we increase the locks from ShareUpdateExclusive to ShareRowExclusive, > we

Re: long-standing data loss bug in initial sync of logical replication

2023-11-18 Thread Tomas Vondra
On 11/18/23 22:05, Andres Freund wrote: > Hi, > > On 2023-11-18 21:45:35 +0100, Tomas Vondra wrote: >> On 11/18/23 19:12, Andres Freund wrote: If we increase the locks from ShareUpdateExclusive to ShareRowExclusive, we're making it conflict with RowExclusive. Which is just DML, and I

Re: long-standing data loss bug in initial sync of logical replication

2023-11-18 Thread Andres Freund
Hi, On 2023-11-18 21:45:35 +0100, Tomas Vondra wrote: > On 11/18/23 19:12, Andres Freund wrote: > >> If we increase the locks from ShareUpdateExclusive to ShareRowExclusive, > >> we're making it conflict with RowExclusive. Which is just DML, and I > >> think we need to do that. > > > > From what

Re: long-standing data loss bug in initial sync of logical replication

2023-11-18 Thread Tomas Vondra
On 11/18/23 19:12, Andres Freund wrote: > Hi, > > On 2023-11-18 11:56:47 +0100, Tomas Vondra wrote: >>> I guess it's not really feasible to just increase the lock level here though >>> :(. The use of ShareUpdateExclusiveLock isn't new, and suddenly using AEL >>> would perhaps lead to new deadlocks

Re: long-standing data loss bug in initial sync of logical replication

2023-11-18 Thread Andres Freund
Hi, On 2023-11-18 11:56:47 +0100, Tomas Vondra wrote: > > I guess it's not really feasible to just increase the lock level here though > > :(. The use of ShareUpdateExclusiveLock isn't new, and suddenly using AEL > > would perhaps lead to new deadlocks and such? But it also seems quite wrong. > >

  1   2   >