Re: Unexpected behavior after OOM errors

Matthias van de Meent Thu, 18 Jun 2026 02:27:59 -0700

On Thu, 18 Jun 2026 at 06:37, Michael Paquier <[email protected]> wrote:
>
> On Wed, Jun 17, 2026 at 02:27:25PM +0200, Matthias van de Meent wrote:
> > On Wed, 17 Jun 2026 at 08:00, Alexander Lakhin <[email protected]> wrote:
> >> 1) An issue in lookup_type_cache()
> >
> > I believe this is caused by partial subsystem initialization. Attached
> > patch 0001 should address this failure without causing the server to
> > restart on OOM.
>
> Hmm.  I think that this is an ordering problem.  We could make the
> callbacks be registered last, once we are sure that the two hash
> tables and the in-progress list have been initialized.


I don't disagree that there's an ordering problem, but in my view the
main problem is the fallible initialization of N components, gated
behind a single condition. The question of when to register the
callbacks is just one part of many.

> I am not sure
> that this requires a new facility; it is also an advantage to keep the
> initialization sequence in a one code path, without an abstraction.

I'm not a huge fan of templating if(!global) { init_global(); } all
around. But, it works.

> RelIdToTypeIdCacheHash and RelIdToTypeIdCacheHash are in the
> TopMemoryContext, static to the process, so we could just check them
> for NULL-ness to make the initialization repeatable.  That gives me
> the attached v2.  Reusing Alexander's randomness trick, that looks
> stable here.

This un-fixes one of the unlikely issues that was fixed in my patch -
though unrelated to OOM:

Each of the calls to
CacheRegisterSyscacheCallback/CacheRegisterRelcacheCallback can throw
an ERROR when all slots have been used.  This would leave the typcache
in an invalid state, so I think that must be wrapped in a critical
section: neither syscache nor relcache has options to release
callbacks, and we can't safely continue without the callbacks
installed, so once an error is thrown here this backend can't ever be
properly initialized.  This is unlike OOMs, whose conditions for
failure may (and often do) change as workloads change in other
backends.

I think Heikki's suggestion for a FATAL critical section option would
be a good alternative. It wouldn't always be sufficient, but would fix
issues here.

> >> 2) An issue in GetSnapshotData()
> >
> > Again, caused by partial initialization, though in this case it's of a
> > SnapshotData* which is later checked again. Attached patch 0002 should
> > address this failure.
>
> Yeah, that seems right to make repeated calls of GetSnapshotData()
> able to work.  LGTM.

Thanks for committing!


Kind regards,

Matthias van de Meent
Databricks (https://www.databricks.com)

Re: Unexpected behavior after OOM errors

Reply via email to