On Tue, Mar 16, 2021 at 9:50 AM Avinash Kumar
<avinash.vallar...@gmail.com> wrote:
> Yes, it was on the failed-over server where the issue is currently seen.
> Took a snapshot of the data directory so that the issue can be analyzed.

I would be very cautious when using LVM snapshots with a Postgres data
directory, or VM-based snapshotting tools. There are many things that
can go wrong with these tools, which are usually not sensitive to the
very specific requirements of a database system like Postgres (e.g.
inconsistencies between WAL and data files can emerge in many
scenarios).

My general recommendation is to avoid these tools completely --
consistently use a backup solution like pgBackRest instead.

BTW, running pg_repack is something that creates additional risk of
database corruption, at least to some degree. That seems less likely
to have been the problem here (I think that it's probably something
with snapshots). Something to consider.

> I can do this. But, to add here, when we run pg_repack or rebuild the
> indexes, this is resolved automatically.

Your bug report was useful to me, because it made me realize that the
posting list split code in _bt_swap_posting() is unnecessarily
trusting of the on-disk data -- especially compared to _bt_split(),
the page split code. While I consider it unlikely that the problem
that you see is truly a bug in Postgres, it is still true that the
crash that you saw should probably have just been an error.

We don't promise that the database cannot crash even with corrupt
data, but we do try to avoid it whenever possible. I may be able to
harden _bt_swap_posting(), to make failures like this a little more
friendly. It's an infrequently hit code path, so we can easily afford
to make the code more careful/less trusting.
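
As a rough illustration of the kind of hardening described above (this is
not the actual nbtree code), the sketch below validates an offset before
using it to shift array elements, and reports failure to the caller
instead of reading or writing out of bounds. The posting_list struct and
swap_posting_checked() are hypothetical, simplified stand-ins; in the
server itself the failure path would presumably raise an error rather
than return false.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified stand-in for a posting list: a sorted id array. */
typedef struct
{
    uint32_t   *ids;
    int         nids;
} posting_list;

/*
 * Place newid at position off, shifting later ids right by one and handing
 * the old rightmost id back through *evicted.
 *
 * Returns false without touching memory when off is not strictly inside the
 * list. That is the "less trusting" part: a bad offset derived from corrupt
 * input becomes a reportable failure rather than an out-of-bounds memmove.
 */
static bool
swap_posting_checked(posting_list *pl, int off, uint32_t newid,
                     uint32_t *evicted)
{
    if (pl == NULL || pl->ids == NULL || evicted == NULL)
        return false;
    if (pl->nids < 2 || off <= 0 || off >= pl->nids)
        return false;

    *evicted = pl->ids[pl->nids - 1];
    memmove(&pl->ids[off + 1], &pl->ids[off],
            (size_t) (pl->nids - off - 1) * sizeof(uint32_t));
    pl->ids[off] = newid;
    return true;
}
```

The extra comparisons are cheap next to the cost of the operation itself,
which is why a rarely executed path is a good place for this sort of check.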

-- 
Peter Geoghegan

