On Mon, Nov 20, 2023 at 04:53:45PM +0530, Ashutosh Bapat wrote:
> On Wed, Oct 25, 2023 at 9:43 AM Michael Paquier <mich...@paquier.xyz> wrote:
>> I have added some documentation to explain that, as well.  I am not
>> wedded to the name proposed in the patch, so if you feel there is
>> better, feel free to propose ideas.
>
> Actually, with the Attach and Detach terminology, INJECTION_POINT
> becomes the place where we "declare" the injection point.  So the
> documentation needs to first explain INJECTION_POINT and then explain
> the other operations.

Sure.
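To make sure we are describing the same flow, here is a minimal sketch
of how the three operations are meant to fit together (the point,
library and callback names are made up, and I am assuming the detach
routine is spelled InjectionPointDetach):

    /* In core code: declare the injection point.  This is a no-op
     * unless a callback has been attached to the point. */
    INJECTION_POINT("hypothetical-point");

    /* In test code: attach a callback loaded from an extension's
     * shared library, then remove it once the test is done. */
    InjectionPointAttach("hypothetical-point", "my_test_module",
                         "my_test_callback");
    InjectionPointDetach("hypothetical-point");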
>> This facility is hidden behind a specific configure/Meson switch,
>> making it a no-op by default:
>> --enable-injection-points
>> -Dinjection_points={ true | false }
>
> That's useful, but we will also see demand to enable those in
> production (under controlled circumstances).  But let the
> functionality mature under a separate flag and developer builds
> before it is used in production.

Hmm.  Okay.  I'd still keep that under a compile switch for now
anyway, to keep a cleaner separation and to avoid issues where it
would be possible to inject code in a production build.  Note that I
was planning to switch one of my buildfarm animals to use it should
this stuff make it into the tree.  And people would be free to use it
if they want.  Using it in production would be a risk, IMO.

> +/*
> + * Local cache of injection callbacks already loaded, stored in
> + * TopMemoryContext.
> + */
> +typedef struct InjectionPointArrayEntry
> +{
> +    char name[INJ_NAME_MAXLEN];
> +    InjectionPointCallback callback;
> +} InjectionPointArrayEntry;
> +
> +static InjectionPointArrayEntry *InjectionPointCacheArray = NULL;
>
> The initial size of this array is small, but given the amount of code
> in a given path to be tested, I fear that this may not be sufficient.
> I feel it's better to use a hash table whose APIs are already
> available.

I've never seen in recent years a need for a given test to use more
than 4~5 points, but I guess that you've seen more than that wanted in
a production environment with a fork of Postgres?

I'm OK with switching that to a hash table in the initial
implementation; even for a low number of entries with points that are
not in hot code paths, that should be OK, at least for my cases
related to testing.  Am I right to assume that you mean a HTAB in
TopMemoryContext?
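If so, this is roughly what I would try first (a sketch only: the
table name, initial size and init routine name are arbitrary, and it
reuses the entry structure quoted above):

    #include "postgres.h"
    #include "utils/hsearch.h"
    #include "utils/memutils.h"

    /* Local cache of callbacks, as a HTAB stored in TopMemoryContext,
     * keyed by the injection point name. */
    static HTAB *InjectionPointCache = NULL;

    static void
    injection_point_cache_init(void)
    {
        HASHCTL     ctl;

        ctl.keysize = INJ_NAME_MAXLEN;
        ctl.entrysize = sizeof(InjectionPointArrayEntry);
        ctl.hcxt = TopMemoryContext;

        InjectionPointCache =
            hash_create("InjectionPoint cache",
                        16,     /* arbitrary initial size */
                        &ctl,
                        HASH_ELEM | HASH_STRINGS | HASH_CONTEXT);
    }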
> I find it pretty useless to expose that as a SQL API.  Also, adding
> it in tests would make it usable as an example.  In this patchset
> INJECTION_POINT should be the only way to trigger an injection point.

That's useful to test the cache logic while providing simple coverage
for the core API, and it's cheap, so I'd rather keep this test module
around with these tests.

> That also brings another question.  The INJECTION_POINT needs to be
> added in the core code but can only be used through an extension.  I
> think there should be some rudimentary albeit raw test in core for
> this.  The extension does some good things, like providing built-in
> actions when the injection is triggered.  So we should keep those
> too.

Yeah, I'd like to agree with that, but everything I've seen in recent
years is that every setup ends up with different assumptions about
what it wants to do, so my intention is to trim the core of the patch
down to a bare acceptable minimum and then work on top of that.

> I am glad that it covers the frequently needed injections: error,
> notice and wait.  If someone adds multiple injection points for wait
> and all of them are triggered (in different backends),
> test_injection_points_wake() would wake them all.  When debugging
> cascaded functionality it's not easy to follow the sequence add,
> trigger, wake for multiple injections.  We should associate a
> condition variable with the required injection points: Attach should
> create a condition variable in shared memory for that injection point
> and Detach should remove it.  I see this being mentioned in the
> commit message, but I think it's something we need for the first
> version of "wait" to be useful.

More to the point, I actually disagree with that, because it could be
possible as well that the same condition variable is used across
multiple points.  At the end, IMHO, the central hash table should hold
zero meta-data associated with an injection point (a counter, an elog,
a condition variable, a sleep time, etc., or any combination of
these), and should just know how to load a callback from a library
path and a routine name.  I understand that this leaves the
responsibility for what a callback does to the individual developer
implementing the callback in their own extension, where they should be
free to have conditions of their own.

Something that I agree would be very useful for the in-core APIs,
depending on the cases, is to be able to push some information to the
callback at runtime to let the callback decide what to do depending on
a running state, including a condition registered when a point was
attached.  See my argument about an _ARG macro that passes a (void *)
down to the callback.

> At some point we may want to control how many times an injection is
> triggered by using a count.  Often, I have wanted an ERROR to be
> thrown in a code path once or twice and then to stop triggering it,
> for example to test WAL sender restart after a certain event.  We
> can't really time Detach correctly to avoid multiple restarts.  A
> count is useful in such a case.

Yeah.  That's also something that can be achieved outside the shmem
hash table, so this is intentionally not part of InjectionPointHash.
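For example, nothing prevents a callback from keeping its own counter
(a sketch only; the callback name is made up, and I am assuming a
callback signature that receives the point name, as in the quoted
struct):

    /* Extension-side callback: throw an ERROR the first two times the
     * point is run in this backend, then become a no-op.  The counter
     * here is backend-local; an extension could store it in its own
     * shmem area instead. */
    static int  error_count = 0;

    void
    injection_error_twice(const char *name)
    {
        if (error_count >= 2)
            return;
        error_count++;
        elog(ERROR, "error triggered for injection point %s", name);
    }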
> +/*
> + * Attach a new injection point.
> + */
> +void
> +InjectionPointAttach(const char *name,
> +                     const char *library,
> +                     const char *function)
> +{
> +#ifdef USE_INJECTION_POINTS
>
> In a non-injection-point build this function would be compiled, and a
> call to it would throw an error.  This means that any test we write
> which uses injection points cannot do so optionally.  Tests which can
> be run with and without injection points built in would reduce
> duplication.  We should define this function as a no-op in
> non-injection-point builds to facilitate such tests.

The argument goes both ways, I guess.  I'd rather choose a hard
failure so as to know that what I am doing is not silently ignored,
which is why I made this choice in the patch.

> Those tests need to also install the extension.  That's another pain
> point: anyone who wants to run the tests needs to compile the
> extension too.  I am wondering whether we should have this
> functionality somewhere in core itself, where it would only be
> useful when built with injection points.

That's something that may be discussed on top of the backend APIs,
though this comes down to how and what kind of meta-data should be
associated with the central shmem hash table.  Keeping the shmem hash
as small as possible, to keep the traces of this code in core minimal,
is a design choice that I'd rather not change.

> Many times it's only a single backend which needs to be subjected to
> an injection.  For inducing an ERROR or a NOTICE, many times it's
> also the same backend attached to the client session.

Yep.  I've used that across multiple sessions.  For the basic
facility, I think that's the absolute minimum.

> For WAIT, however, we need a way to inject from some other session.

You can do that already with the patch, no?  If you know that a
different session will cross a given path, you can set a macro in it.
If you wish for this session to wait before that, it is possible to
use a second point to make it do so.

I've used such techniques as well for more complex reproducible
failures than what I've posted in the patch series.  In the last
months, the most I've needed in a single TAP test was 5 deterministic
points, I think.  Or perhaps 6.  That was a fun exercise, for a TAP
test coded while complaining to myself about the core backend code
that does not make this stuff easier.

> We might be able to use the current signalling mechanism for that
> (wake sends SIGUSR1 with a reason).  Leaving aside WAIT for a moment,
> when the same backend's behaviour is being controlled, do we really
> need shared memory, and do we need to affect all the running
> backends?  I see some discussion about being able to trigger only for
> a given PID, but when that PID is the current backend's own, shared
> memory is not required.

I am not convinced that there is any need for signalling in most
cases, as long as you know beforehand the PID of the session you'd
like to stop, because this would still require a second session to
register a condition based on that known PID.  Another possibility
that I can think of is to use a custom wait event with a second point
to set up a different condition.  At the end, my point is that it is
possible to control everything in some extension code that holds the
callbacks, with an extra shmem area in the extension that associates
some meta-data with a point name.
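For instance, a callback could apply a PID-based condition that the
extension stores itself (a sketch only; the structure and the lookup
helper injection_condition_find() are hypothetical extension code, not
part of the patch):

    /* Extension-side meta-data kept in the extension's own shmem
     * area, keyed by injection point name. */
    typedef struct InjectionCondition
    {
        char    name[INJ_NAME_MAXLEN];
        pid_t   pid;        /* backend allowed to trigger the point */
    } InjectionCondition;

    void
    injection_error_for_pid(const char *name)
    {
        /* hypothetical lookup in the extension's shmem hash */
        InjectionCondition *cond = injection_condition_find(name);

        /* Only the registered backend reports the error. */
        if (cond == NULL || cond->pid != MyProcPid)
            return;

        elog(ERROR, "error triggered for injection point %s", name);
    }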
--
Michael