On Thu, Feb 4, 2021 at 4:00 PM Amit Kapila <amit.kapil...@gmail.com> wrote:
> > > About 0001, have we tried to reproduce the actual bug here which means
> > > when the error_callback is called we should face some problem? I feel
> > > with the correct testcase we should hit the Assert
> > > (Assert(IsTransactionState());) in SearchCatCacheInternal because
> > > there we expect the transaction to be in a valid state. I understand
> > > that the transaction is in a broken state at that time but having a
> > > testcase to hit the actual bug makes it easy to test the fix.
> >
> > I have not tried hitting the Assert(IsTransactionState() in
> > SearchCatCacheInternal. To do that, I need to figure out hitting
> > "incorrect binary data format in logical replication column" error in
> > either slot_modify_data or slot_store_data so that we will enter the
> > error callback slot_store_error_callback and then IsTransactionState()
> > should return false i.e. txn shouldn't be in TRANS_INPROGRESS.
> >
>
> Even, if you hit that via debugger it will be sufficient or you can
> write another elog/ereport there to achieve the same. The exact test
> case to hit that error is not mandatory.

Thanks Amit. I verified it with gdb. I attached gdb to the logical
replication worker. In slot_store_data's for loop, I intentionally set
CurrentTransactionState->state = TRANS_DEFAULT and jumped to the
existing error "incorrect binary data format in logical replication
column", so that slot_store_error_callback is called.

While we are in the error context callback:

On master: since the system catalogues are accessed in
slot_store_error_callback, the Assert(IsTransactionState()) in
SearchCatCacheInternal fails and the error we intended to see is not
logged. Instead, we see the following in the subscriber server log, and
the session in the subscriber gets restarted.

2021-02-04 17:26:27.517 IST [2269230] ERROR: could not send data to WAL stream: server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
2021-02-04 17:26:27.518 IST [2269190] LOG: background worker "logical replication worker" (PID 2269230) exited with exit code 1

With patch: since we avoid system catalogue access in
slot_store_error_callback, we see the error that we intentionally
jumped to in the subscriber server log.

2021-02-04 17:27:37.542 IST [2269424] ERROR: incorrect binary data format in logical replication column 1
2021-02-04 17:27:37.542 IST [2269424] CONTEXT: processing remote data for replication target relation "public.t1" column "a1", remote type integer, local type integer

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com