On Mon, Jun 14, 2021 at 2:32 AM Amit Kapila <amit.kapil...@gmail.com> wrote:
> Why do you think we don't need to check index AM functions? Say we
> have an index expression that uses function and if its parallel safety
> is changed then probably that also impacts whether we can do insert in
> parallel. Because otherwise, we will end up executing some parallel
> unsafe function in parallel mode during index insertion.
I'm not saying that we don't need to check index expressions. I agree that we need to check those. The index AM functions are things like btint4cmp(). I don't think that a function like that should ever be parallel-unsafe.

> Yeah, this could be one idea but I think even if we use pg_proc OID,
> we still need to check all the rel cache entries to find which one
> contains the invalidated OID and that could be expensive. I wonder
> can't we directly find the relation involved and register invalidation
> for the same? We are able to find the relation to which trigger
> function is associated during drop function via findDependentObjects
> by scanning pg_depend. Assuming, we are able to find the relation for
> trigger function by scanning pg_depend, what kinds of problems do we
> envision in registering the invalidation for the same?

I don't think that finding the relation involved and registering an invalidation for it will work properly. Suppose there is a concurrently running transaction which has created a new table and attached a trigger function to it. You can't see any of the catalog entries for that relation yet, so you can't invalidate it, but invalidation needs to happen. Even if you used some snapshot that can see those catalog entries before they are committed, I doubt that fixes the race condition. You can't hold any lock on that relation, because the creating transaction holds AccessExclusiveLock, but the whole invalidation mechanism is built around the assumption that the sender puts messages into the shared queue first and then releases locks, while the receiver first acquires a conflicting lock and then processes messages from the queue. Without locks, that synchronization algorithm can't work reliably.

As a consequence of all that, I believe that, not just in this particular case but in general, the invalidation message needs to describe the thing that has actually changed, rather than any derived property.
We can make invalidations that say "some function's parallel-safety flag has changed" or "this particular function's parallel-safety flag has changed" or "this particular function has changed in some way" (this one we have already), but anything that involves trying to figure out what the consequences of such a change might be and saying "hey, you, please update XYZ because I changed something somewhere that could affect that" is not going to be correct.

> I think we probably need to worry about the additional cost to find
> dependent objects and if there are any race conditions in doing so as
> pointed out by Tom in his email [1]. The concern related to cost could
> be addressed by your idea of registering such an invalidation only
> when the user changes the parallel safety of the function which we
> don't expect to be a frequent operation. Now, here the race condition,
> I could think of could be that by the time we change parallel-safety
> (say making it unsafe) of a function, some of the other sessions might
> have already started processing insert on a relation where that
> function is associated via trigger or check constraint in which case
> there could be a problem. I think to avoid that we need to acquire an
> Exclusive lock on the relation as we are doing in Rename Policy kind
> of operations.

Well, the big issue here is that we don't actually lock functions while they are in use. So there's absolutely nothing that prevents a function from being altered in any arbitrary way, or even dropped, while code that uses it is running. I don't really know what happens in practice if you do that sort of thing: can you get the same query to run with one function definition for the first part of execution and some other definition for the rest of execution? I tend to doubt it, because I suspect we cache the function definition at some point.
If that's the case, caching the parallel-safety marking at the same point seems OK too, or at least no worse than what we're doing already. On the other hand, if it is possible for a query's notion of the function definition to shift while the query is in flight, then this is just another example of that and no worse than any other. Instead of changing the parallel-safety flag, somebody could redefine the function so that it divides by zero or produces a syntax error and, kaboom, running queries break. Either way, I don't see what the big deal is. As long as we make the handling of parallel-safety consistent with other ways the function could be concurrently redefined, it won't suck any more than the current system already does, or in any fundamentally new ways.

Even if this line of thinking is correct, there's a big issue for partitioning hierarchies, because there you need to know stuff about relations that you have no other reason to open. I'm just arguing that if there's no partitioning, the problem seems reasonably solvable: either you changed something about the relation, in which case you've got to lock it and issue invalidations, or you've changed something about the function, which could be handled via a new type of invalidation.

I don't really see why the cost would be particularly bad. Suppose that for every relation, you have a flag which is either PARALLEL_DML_SAFE, PARALLEL_DML_RESTRICTED, PARALLEL_DML_UNSAFE, or PARALLEL_DML_SAFETY_UNKNOWN. When someone sends a message saying "some existing function's parallel-safety changed!" you reset that flag for every relation in the relcache to PARALLEL_DML_SAFETY_UNKNOWN. Then if somebody does DML on that relation and we want to consider parallelism, it's got to recompute that flag. None of that sounds horribly expensive.
I mean, it could be somewhat annoying if you have 100k relations open and sit around all day flipping parallel-safety markings on and off and then doing a single-row insert after each flip, but if that's the only scenario where we incur significant extra overhead from this kind of design, it seems clearly better than forcing users to set a flag manually. Maybe it isn't, but I don't really see what the other problem would be right now. Except, of course, for partitioning, which I'm not quite sure what to do about.

-- 
Robert Haas
EDB: http://www.enterprisedb.com