How to solve the problem of one backend process crashing and causing other processes to restart?

2023-11-12 Thread yuansong
In PostgreSQL, when a backend process crashes, it can cause other backend 
processes to also require a restart, primarily to ensure data consistency. I 
understand that the correct approach is to analyze and identify the cause of 
the crash and resolve it. However, it is also important to be able to handle a 
backend process crash without affecting the operation of other processes, thus 
minimizing the scope of negative impact and improving availability. To achieve 
this goal, could we mimic the Oracle process by introducing a "pmon" process 
dedicated to rolling back crashed process transactions and performing resource 
cleanup? I wonder if anyone has attempted such a strategy or if there have been 
previous discussions on this topic.

Re:Re: How to solve the problem of one backend process crashing and causing other processes to restart?

2023-11-13 Thread yuansong
Enhancing the overall fault tolerance of the entire system with this feature is 
quite important. No one can avoid bugs, and I don't believe the current approach 
is the more advanced one. It might be worth considering adding this to the 
roadmap so that interested parties can conduct relevant research.

The current major issue is that when one process crashes, resetting all 
connections has a significant impact on other connections. Is it possible to 
only disconnect the crashed connection and have the other connections simply 
roll back the current transaction without reconnecting? Perhaps this problem 
can be mitigated through the use of a connection pool.

If we want to implement this feature, would it be sufficient to clean up or 
restore the shared memory and disk changes caused by the crashed backend? 
Besides clearing various known locks, what else needs to be changed? Does 
anyone have any insights? Please help me. Thank you.

At 2023-11-13 13:53:29, "Laurenz Albe"  wrote:
>On Sun, 2023-11-12 at 21:55 -0500, Tom Lane wrote:
>> yuansong  writes:
>> > In PostgreSQL, when a backend process crashes, it can cause other backend
>> > processes to also require a restart, primarily to ensure data consistency.
>> > I understand that the correct approach is to analyze and identify the
>> > cause of the crash and resolve it. However, it is also important to be
>> > able to handle a backend process crash without affecting the operation of
>> > other processes, thus minimizing the scope of negative impact and
>> > improving availability. To achieve this goal, could we mimic the Oracle
>> > process by introducing a "pmon" process dedicated to rolling back crashed
>> > process transactions and performing resource cleanup? I wonder if anyone
>> > has attempted such a strategy or if there have been previous discussions
>> > on this topic.
>> 
>> The reason we force a database-wide restart is that there's no way to
>> be certain that the crashed process didn't corrupt anything in shared
>> memory.  (Even with the forced restart, there's a window where bad
>> data could reach disk before we kill off the other processes that
>> might write it.  But at least it's a short window.)  "Corruption"
>> here doesn't just involve bad data placed into disk buffers; more
>> often it's things like unreleased locks, which would block other
>> processes indefinitely.
>> 
>> I seriously doubt that anything like what you're describing
>> could be made reliable enough to be acceptable.  "Oracle does
>> it like this" isn't a counter-argument: they have a much different
>> (and non-extensible) architecture, and they also have an army of
>> programmers to deal with minutiae like undoing resource acquisition.
>> Even with that, you'd have to wonder about the number of bugs
>> existing in such necessarily-poorly-tested code paths.
>
>Yes.
>I think that PostgreSQL's approach is superior: rather than investing in
>code to mitigate the impact of data corruption caused by a crash, invest
>in quality code that doesn't crash in the first place.
>
>Euphemistically naming a crash "ORA-600 error" seems to be part of
>their strategy.
>
>Yours,
>Laurenz Albe
>


Re:Re:Re: How to solve the problem of one backend process crashing and causing other processes to restart?

2023-11-20 Thread yuansong
Thanks. After reconsideration, I realized that what I really want is for other 
connections to remain unaffected when a process crashes. This is something that 
a connection pool cannot solve.

At 2023-11-14 09:41:03, "Thomas wen"  wrote:

Hi yuansong,
  There is a connection pooling patch 
(https://commitfest.postgresql.org/34/3043/), but it has been dormant for a few 
years. You can check that patch to see whether it covers what you need.
From: yuansong 
Sent: November 13, 2023 17:13
To: Laurenz Albe 
Cc: pgsql-hackers@lists.postgresql.org 
Subject: Re:Re: How to solve the problem of one backend process crashing and causing 
other processes to restart?
 



Re:Re: Re:Re:Re: backup server core when redo btree_xlog_insert that type is XLOG_BTREE_INSERT_POST

2024-12-01 Thread yuansong
Sorry, I wrote something wrong in my previous message.

I confirmed that the issue is that the loop does not end early. When I enter the 
if (unlikely(result == 0 && key->scantid != NULL)) block, low = 2, mid = 4, and 
high = 4. At this point insertstate->postingoff is computed from the mid value, 
which is 4. However, the loop does not terminate; it continues, and the value of 
low ends up as 3. This is where I applied my fix, based on master.




I also don't know how to fix the bug and submit it to the community's master branch.


At 2024-12-01 21:33:28, "Peter Geoghegan"  wrote:
>On Sun, Dec 1, 2024 at 8:11 AM yuansong  wrote:
>> the _bt_binsrch_insert function always returns low, but during the post list 
>> search, are there cases where low and mid are unequal?
>>
>> If so, this could potentially cause an offset in the subsequent 
>> _bt_insertonpg function.
>>
>> maybe we fix it like this ?
>
>Why?
>
>The "if (unlikely(result == 0 && key->scantid != NULL))" block is only
>hit when we have an "exactly-equal" index tuple -- which means an
>index tuple with a posting list covering a range of heap TIDs that our
>scantid is covered by. There should be at most one index tuple like
>this on any leaf page -- otherwise the index is corrupt.
>
>The hardening that I added for the posting list split stuff was added
>to 13.4 and 13.5. You're running a very old point release (13.2), so
>you weren't running a version of the server with that hardening.
>
>--
>Peter Geoghegan


0001-Fix-the-error-in-the-B-tree-binary-search-for-the-po.patch
Description: Binary data


Re:Re: Re:Re:Re: backup server core when redo btree_xlog_insert that type is XLOG_BTREE_INSERT_POST

2024-12-01 Thread yuansong
I confirmed that the issue is that the loop does not end early. When I enter the 
if (unlikely(result == 0 && key->scantid != NULL)) block, low = 2, mid = 4, and 
high = 4. At this point insertstate->postingoff is computed from the mid value, 
which is 3. However, the loop does not terminate; it continues, and the value of 
low ends up as 4. This is where I applied my fix, based on master.




I also don't know how to fix the bug and submit it to the community's master branch.


At 2024-12-01 21:33:28, "Peter Geoghegan"  wrote:
>On Sun, Dec 1, 2024 at 8:11 AM yuansong  wrote:
>> the _bt_binsrch_insert function always returns low, but during the post list 
>> search, are there cases where low and mid are unequal?
>>
>> If so, this could potentially cause an offset in the subsequent 
>> _bt_insertonpg function.
>>
>> maybe we fix it like this ?
>
>Why?
>
>The "if (unlikely(result == 0 && key->scantid != NULL))" block is only
>hit when we have an "exactly-equal" index tuple -- which means an
>index tuple with a posting list covering a range of heap TIDs that our
>scantid is covered by. There should be at most one index tuple like
>this on any leaf page -- otherwise the index is corrupt.
>
>The hardening that I added for the posting list split stuff was added
>to 13.4 and 13.5. You're running a very old point release (13.2), so
>you weren't running a version of the server with that hardening.
>
>--
>Peter Geoghegan


0001-Fix-the-error-in-the-B-tree-binary-search-for-the-po.patch
Description: Binary data


Re:Re:Re:Re: backup server core when redo btree_xlog_insert that type is XLOG_BTREE_INSERT_POST

2024-12-04 Thread yuansong
The _bt_binsrch_insert function always returns low, but during the posting-list 
search, are there cases where low and mid are unequal?

If so, this could potentially cause an offset mismatch in the subsequent 
_bt_insertonpg call.

Maybe we could fix it like this?




OffsetNumber
_bt_binsrch_insert(Relation rel, BTInsertState insertstate)
{
    ...

    while (high > low)
    {
        OffsetNumber mid = low + ((high - low) / 2);

        /* ... _bt_compare() call and low/high adjustment elided ... */

        /*
         * If tuple at offset located by binary search is a posting list whose
         * TID range overlaps with caller's scantid, perform posting list
         * binary search to set postingoff for caller.  Caller must split the
         * posting list when postingoff is set.  This should happen
         * infrequently.
         */
        if (unlikely(result == 0 && key->scantid != NULL))
        {
            /*
             * postingoff should never be set more than once per leaf page
             * binary search.  That would mean that there are duplicate table
             * TIDs in the index, which is never okay.  Check for that here.
             */
            if (insertstate->postingoff != 0)
                ereport(ERROR,
                        (errcode(ERRCODE_INDEX_CORRUPTED),
                         errmsg_internal("table tid from new index tuple (%u,%u) cannot find insert offset between offsets %u and %u of block %u in index \"%s\"",
                                         ItemPointerGetBlockNumber(key->scantid),
                                         ItemPointerGetOffsetNumber(key->scantid),
                                         low, stricthigh,
                                         BufferGetBlockNumber(insertstate->buf),
                                         RelationGetRelationName(rel))));

            insertstate->postingoff = _bt_binsrch_posting(key, page, mid);

            /*
             * Here, will low and mid ever be unequal?  If low is returned in
             * such cases, would it result in an error?  Maybe we fix it like
             * this?
             */
            /* low = mid; */
            /* break; */
        }
    }

    ...

    return low;
}
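
To make the inline question above concrete, here is a small standalone toy 
program (not PostgreSQL code; the array contents and the ">= 1" comparison that 
mimics cmpval = 1 are assumptions made purely for illustration). It shows that, 
in general, a lower-bound binary search that keeps narrowing after it has seen 
an exact match at some mid can return a low that differs from that mid; whether 
this can actually happen for a posting-list match on a non-corrupt leaf page is 
a separate question.

#include <stdio.h>

/*
 * Toy lower-bound binary search over a 1-based array of sorted keys.
 * "match_mid" records the first mid at which an exact match was seen,
 * while the function itself keeps narrowing and finally returns "low".
 */
static int
lower_bound_search(const int *items, int nitems, int key, int *match_mid)
{
    int     low = 1;
    int     high = nitems + 1;

    *match_mid = 0;             /* 0 means "no exact match seen yet" */

    while (high > low)
    {
        int     mid = low + ((high - low) / 2);
        int     result = key - items[mid];

        if (result >= 1)        /* key > items[mid]: search the right half */
            low = mid + 1;
        else
        {
            if (result == 0 && *match_mid == 0)
                *match_mid = mid;   /* exact match seen here ... */
            high = mid;             /* ... but the loop keeps narrowing */
        }
    }
    return low;                 /* may differ from *match_mid */
}

int
main(void)
{
    /* items[0] is unused so that offsets are 1-based */
    int     items[] = {0, 10, 30, 30, 30, 40};
    int     match_mid;
    int     low = lower_bound_search(items, 5, 30, &match_mid);

    /* prints: returned low = 2, exact match first seen at mid = 3 */
    printf("returned low = %d, exact match first seen at mid = %d\n",
           low, match_mid);
    return 0;
}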




At 2024-11-27 18:53:20, "yuansong"  wrote:

We found the crash reason.




We have identified the cause of the crash: the XLOG_BTREE_INSERT_POST WAL record 
carried an OffsetNumber offnum that was one less than what was stored in the 
index. I experimented with adding +1, and the index data remained normal in both 
cases. This issue is likely caused by concurrent operations on the B-tree; upon 
reviewing the corresponding WAL records, we found SPLIT_L and INSERT_LEAF 
operations on the same block before the crash. This might be a bug. I'm not sure 
if there's a related fix.




At 2024-11-21 23:58:03, "Peter Geoghegan"  wrote:
>On Thu, Nov 21, 2024 at 10:03 AM yuansong  wrote:
>> Should nhtids be less than or equal to IndexTupleSize(oposting)?
>>  Why is nhtids larger than IndexTupleSize(oposting) ? I think there should 
>> be an error in the master host writing the wal log.
>> Does anyone know when this will happen?
>
>It'll happen whenever there is a certain kind of data corruption.
>
>There were complaints about issues like this in the past. But those
>complaints seem to have gone away when more hardening was added to the
>code that runs during original execution (not the REDO routine code,
>which can only do what it is told to do by the WAL record).
>
>You're using PostgreSQL 13.2, which is a very old point release that
>lacks this hardening -- the current 13 point release is 13.18, so
>you're missing a lot. Had you been on a later point release you'd very
>probably have still had the issue with corruption (which could be from
>bad hardware), but you likely would have avoided the problem with the
>REDO routine crashing like this.
>
>-- 
>Peter Geoghegan


backup server core when redo btree_xlog_insert that type is XLOG_BTREE_INSERT_POST

2024-11-21 Thread yuansong



My database version is PostgreSQL 13.2. The backup (standby) server dumps core 
when redoing a btree_xlog_insert record whose type is XLOG_BTREE_INSERT_POST.




#0  0x2aab66695d86 in __memmove_ssse3 () from /lib64/libc.so.6
#1  0x004f5574 in _bt_swap_posting (newitem=0x125d998, oposting=0x2aabcd55dcc8, postingoff=13) at nbtdedup.c:796
#2  0x0050cf08 in btree_xlog_insert (isleaf=true, ismeta=false, posting=true, record=0x122d7b8) at nbtxlog.c:224
#3  0x0050ed53 in btree_redo (record=0x122d7b8) at nbtxlog.c:969
#4  0x005487b5 in StartupXLOG () at xlog.c:7320
#5  0x008292ab in StartupProcessMain () at startup.c:204
#6  0x0055c916 in AuxiliaryProcessMain (argc=2, argv=0x7fff071f8a00) at bootstrap.c:443
#7  0x0082820d in StartChildProcess (type=StartupProcess) at postmaster.c:5492
#8  0x008236ce in PostmasterMain (argc=1, argv=0x11fb2e0) at postmaster.c:1404
#9  0x007398f7 in main (argc=1, argv=0x11fb2e0) at main.c:210




(gdb) f 1
#1  0x004f5574 in _bt_swap_posting (newitem=0x125d998, oposting=0x2aabcd55dcc8, postingoff=13) at nbtdedup.c:796
796

(gdb) p ((Size)((oposting)->t_info & 0x1FFF))    // oposting index tuple size
$12 = 16
(gdb) p nhtids
$13 = 17
(gdb) p postingoff
$14 = 13
(gdb) p nmovebytes
$15 = 18




IndexTuple
_bt_swap_posting(IndexTuple newitem, IndexTuple oposting, int postingoff)
{
    int         nhtids;
    char       *replacepos;
    char       *replaceposright;
    Size        nmovebytes;
    IndexTuple  nposting;

    nhtids = BTreeTupleGetNPosting(oposting);
    /* ===  nhtids = 17, postingoff = 13 */

    Assert(_bt_posting_valid(oposting));
    Assert(postingoff > 0 && postingoff < nhtids);

    /*
     * Move item pointers in posting list to make a gap for the new item's
     * heap TID.  We shift TIDs one place to the right, losing original
     * rightmost TID. (nmovebytes must not include TIDs to the left of
     * postingoff, nor the existing rightmost/max TID that gets overwritten.)
     */
    nposting = CopyIndexTuple(oposting);
    /* ===  IndexTupleSize(oposting) = 16, so nposting is a 16-byte copy */

    replacepos = (char *) BTreeTupleGetPostingN(nposting, postingoff);
    replaceposright = (char *) BTreeTupleGetPostingN(nposting, postingoff + 1);
    nmovebytes = (nhtids - postingoff - 1) * sizeof(ItemPointerData);
    memmove(replaceposright, replacepos, nmovebytes);
    /* ===  core here: nmovebytes = 18, but nposting is only 16 bytes,
     *      so this memmove writes out of bounds */
    ...
}
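
For reference, the overrun can be read directly off the values gdb printed, 
assuming the usual 6-byte ItemPointerData (a 4-byte block number plus a 2-byte 
offset number):

/*
 * nmovebytes = (nhtids - postingoff - 1) * sizeof(ItemPointerData)
 *            = (17 - 13 - 1) * 6
 *            = 18 bytes
 *
 * but IndexTupleSize(oposting) is only 16 bytes, so the copied tuple
 * nposting cannot possibly hold 17 TIDs, and the 18-byte memmove runs
 * past the end of the copy.
 */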




Should nhtids be less than or equal to IndexTupleSize(oposting)?
 Why is nhtids larger than IndexTupleSize(oposting)? I think there must be an 
error in how the master host wrote the WAL record.
Does anyone know when this can happen?



Re:Re: backup server core when redo btree_xlog_insert that type is XLOG_BTREE_INSERT_POST

2024-11-21 Thread yuansong
There may be something wrong with my previous description. In "Should nhtids be 
less than or equal to IndexTupleSize(oposting)? Why is nhtids larger than 
IndexTupleSize(oposting)?", nhtids should have been nmovebytes.

It is normal for nhtids to be either larger or smaller than 
IndexTupleSize(oposting).

The real question is: should nmovebytes be smaller than IndexTupleSize(oposting)?




I checked the latest nbtree code on the master branch (src/backend/access/nbtree) 
and did not find any related fixes.
This is also a low-probability event and is difficult to reproduce.

At 2024-11-21 23:58:03, "Peter Geoghegan"  wrote:
>On Thu, Nov 21, 2024 at 10:03 AM yuansong  wrote:
>> Should nhtids be less than or equal to IndexTupleSize(oposting)?
>>  Why is nhtids larger than IndexTupleSize(oposting) ? I think there should 
>> be an error in the master host writing the wal log.
>> Does anyone know when this will happen?
>
>It'll happen whenever there is a certain kind of data corruption.
>
>There were complaints about issues like this in the past. But those
>complaints seem to have gone away when more hardening was added to the
>code that runs during original execution (not the REDO routine code,
>which can only do what it is told to do by the WAL record).
>
>You're using PostgreSQL 13.2, which is a very old point release that
>lacks this hardening -- the current 13 point release is 13.18, so
>you're missing a lot. Had you been on a later point release you'd very
>probably have still had the issue with corruption (which could be from
>bad hardware), but you likely would have avoided the problem with the
>REDO routine crashing like this.
>
>-- 
>Peter Geoghegan


Re:Re: Re:Re:Re: backup server core when redo btree_xlog_insert that type is XLOG_BTREE_INSERT_POST

2025-01-11 Thread yuansong
Thank you very much to Peter Geoghegan for his help. The issue has been resolved 
for some time now. It was indeed caused by corrupted index data, which resulted 
from collation differences between the hosts before and after a migration. 
Although the collation names were the same, the differing glibc versions on the 
servers led to collation discrepancies, which caused the index ordering to 
become inconsistent.

However, I noticed that the possible values of insertstate->postingoff are a bit 
muddled. In _bt_binsrch_posting, the if (!BTreeTupleIsPosting(itup)) return 0; 
branch returns 0, but the normal binary search's return value low also starts 
from 0, so in theory a legitimate search could return 0 as well. Therefore, it 
might be worth representing the non-posting-list case as -1 and ItemIdIsDead as 
-2, so that any other valid value is greater than 0.

Additionally, I don't think it is a good implementation for _bt_binsrch_insert 
to keep searching after the correct postingoff has already been found by the 
binary search. The current approach is confusing and inefficient. Moreover, 
continuing the search just to check whether another postingoff exists, in order 
to raise an error, seems to have limited value, since it only detects certain 
cases, and it leaves the function doing unrelated work. If such index anomalies 
need to be checked, it would be better to do so with external tools rather than 
in the core search path, following the principle that a function should have a 
single responsibility. If you agree with this proposal, I can try to implement 
the related fix later.