On Tue, Mar 16, 2021 at 9:50 AM Avinash Kumar <avinash.vallar...@gmail.com> wrote:
> Yes, it was on the failover-over server where the issue is currently seen.
> Took a snapshot of the data directory so that the issue can be analyzed.

I would be very cautious when using LVM snapshots with a Postgres data directory, or VM-based snapshotting tools. There are many things that can go wrong with these tools, which are usually not sensitive to the very specific requirements of a database system like Postgres (e.g. inconsistencies between WAL and data files can emerge in many scenarios). My general recommendation is to avoid these tools completely -- consistently use a backup solution like pgBackrest instead.

BTW, running pg_repack is something that creates additional risk of database corruption, at least to some degree. That seems less likely to have been the problem here (I think that it's probably something with snapshots). Something to consider.

> I can do this. But, to add here, when we do a pg_repack or rebuild of
> Indexes, automatically this is resolved.

Your bug report was useful to me, because it made me realize that the posting list split code in _bt_swap_posting() is unnecessarily trusting of the on-disk data -- especially compared to _bt_split(), the page split code. While I consider it unlikely that the problem that you see is truly a bug in Postgres, it is still true that the crash that you saw should probably have just been an error. We don't promise that the database cannot crash even with corrupt data, but we do try to avoid it whenever possible.

I may be able to harden _bt_swap_posting(), to make failures like this a little more friendly. It's an infrequently hit code path, so we can easily afford to make the code more careful/less trusting.

--
Peter Geoghegan