On Mon, Nov 20, 2023 at 04:53:45PM +0530, Ashutosh Bapat wrote:
> On Wed, Oct 25, 2023 at 9:43 AM Michael Paquier <mich...@paquier.xyz> wrote:
>> I have added some documentation to explain that, as well.  I am not
>> wedded to the name proposed in the patch, so if you feel there is
>> better, feel free to propose ideas.
>
> Actually, with the Attach and Detach terminology, INJECTION_POINT
> becomes the place where we "declare" the injection point.  So the
> documentation needs to first explain INJECTION_POINT and then explain
> the other operations.

Sure.
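To make sure we are describing the same flow, here is a minimal sketch
of how the three operations are meant to fit together (the point,
library and callback names are made up, and I am assuming the detach
routine is spelled InjectionPointDetach):

    /* In core code: declare the injection point.  This is a no-op
     * unless a callback has been attached to the point. */
    INJECTION_POINT("hypothetical-point");

    /* In test code: attach a callback loaded from an extension's
     * shared library, then remove it once the test is done. */
    InjectionPointAttach("hypothetical-point", "my_test_module",
                         "my_test_callback");
    InjectionPointDetach("hypothetical-point");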
>> This facility is hidden behind a specific configure/Meson switch,
>> making it a no-op by default:
>> --enable-injection-points
>> -Dinjection_points={ true | false }
>
> That's useful, but we will also see demand to enable those in
> production (under controlled circumstances).  But let the
> functionality mature under a separate flag and developer builds
> before it is used in production.

Hmm.  Okay.  I'd still keep that under a compile switch for now
anyway, to keep a cleaner separation and to avoid issues where it
would be possible to inject code in a production build.  Note that I
was planning to switch one of my buildfarm animals to use it should
this stuff make it into the tree.  And people would be free to use it
if they want.  Using it in production would be a risk, IMO.

> +/*
> + * Local cache of injection callbacks already loaded, stored in
> + * TopMemoryContext.
> + */
> +typedef struct InjectionPointArrayEntry
> +{
> +    char name[INJ_NAME_MAXLEN];
> +    InjectionPointCallback callback;
> +} InjectionPointArrayEntry;
> +
> +static InjectionPointArrayEntry *InjectionPointCacheArray = NULL;
>
> The initial size of this array is small, but given the amount of code
> in a given path to be tested, I fear that this may not be sufficient.
> I feel it's better to use a hash table whose APIs are already
> available.

I've never seen in recent years a need for a given test to use more
than 4~5 points, but I guess that you've seen more than that wanted in
a production environment with a fork of Postgres?

I'm OK with switching that to a hash table in the initial
implementation; even for a low number of entries with points that are
not in hot code paths, that should be OK, at least for my cases
related to testing.  Am I right to assume that you mean a HTAB in
TopMemoryContext?
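If so, this is roughly what I would try first (a sketch only: the
table name, initial size and init routine name are arbitrary, and it
reuses the entry structure quoted above):

    #include "postgres.h"
    #include "utils/hsearch.h"
    #include "utils/memutils.h"

    /* Local cache of callbacks, as a HTAB stored in TopMemoryContext,
     * keyed by the injection point name. */
    static HTAB *InjectionPointCache = NULL;

    static void
    injection_point_cache_init(void)
    {
        HASHCTL     ctl;

        ctl.keysize = INJ_NAME_MAXLEN;
        ctl.entrysize = sizeof(InjectionPointArrayEntry);
        ctl.hcxt = TopMemoryContext;

        InjectionPointCache =
            hash_create("InjectionPoint cache",
                        16,     /* arbitrary initial size */
                        &ctl,
                        HASH_ELEM | HASH_STRINGS | HASH_CONTEXT);
    }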
> I find it pretty useless to expose that as a SQL API.  Also, adding
> it in tests would make it usable as an example.  In this patchset
> INJECTION_POINT should be the only way to trigger an injection point.

That's useful to test the cache logic while providing simple coverage
for the core API, and it's cheap, so I'd rather keep this test module
around with these tests.

> That also brings another question.  The INJECTION_POINT needs to be
> added in the core code but can only be used through an extension.  I
> think there should be some rudimentary albeit raw test in core for
> this.  The extension does some good things, like providing built-in
> actions when the injection is triggered.  So we should keep those
> too.

Yeah, I'd like to agree with that, but everything I've seen in recent
years is that every setup ends up with different assumptions about
what it wants to do, so my intention is to trim the core of the patch
down to a bare acceptable minimum and then work on top of that.

> I am glad that it covers the frequently needed injections: error,
> notice and wait.  If someone adds multiple injection points for wait
> and all of them are triggered (in different backends),
> test_injection_points_wake() would wake them all.  When debugging
> cascaded functionality it's not easy to follow the sequence add,
> trigger, wake for multiple injections.  We should associate a
> condition variable with the required injection points: Attach should
> create a condition variable in shared memory for that injection point
> and Detach should remove it.  I see this being mentioned in the
> commit message, but I think it's something we need for the first
> version of "wait" to be useful.

More to the point, I actually disagree with that, because it could be
possible as well that the same condition variable is used across
multiple points.  At the end, IMHO, the central hash table should hold
zero meta-data associated with an injection point (a counter, an elog,
a condition variable, a sleep time, etc., or any combination of
these), and should just know how to load a callback from a library
path and a routine name.  I understand that this leaves the
responsibility for what a callback does to the individual developer
implementing the callback in their own extension, where they should be
free to have conditions of their own.

Something that I agree would be very useful for the in-core APIs,
depending on the cases, is to be able to push some information to the
callback at runtime to let the callback decide what to do depending on
a running state, including a condition registered when a point was
attached.  See my argument about an _ARG macro that passes a (void *)
down to the callback.

> At some point we may want to control how many times an injection is
> triggered by using a count.  Often, I have wanted an ERROR to be
> thrown in a code path once or twice and then to stop triggering it,
> for example to test WAL sender restart after a certain event.  We
> can't really time Detach correctly to avoid multiple restarts.  A
> count is useful in such a case.

Yeah.  That's also something that can be achieved outside the shmem
hash table, so this is intentionally not part of InjectionPointHash.
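For example, nothing prevents a callback from keeping its own counter
(a sketch only; the callback name is made up, and I am assuming a
callback signature that receives the point name, as in the quoted
struct):

    /* Extension-side callback: throw an ERROR the first two times the
     * point is run in this backend, then become a no-op.  The counter
     * here is backend-local; an extension could store it in its own
     * shmem area instead. */
    static int  error_count = 0;

    void
    injection_error_twice(const char *name)
    {
        if (error_count >= 2)
            return;
        error_count++;
        elog(ERROR, "error triggered for injection point %s", name);
    }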
> +/*
> + * Attach a new injection point.
> + */
> +void
> +InjectionPointAttach(const char *name,
> +                     const char *library,
> +                     const char *function)
> +{
> +#ifdef USE_INJECTION_POINTS
>
> In a non-injection-point build this function would be compiled, and a
> call to it would throw an error.  This means that any test we write
> which uses injection points cannot do so optionally.  Tests which can
> be run with and without injection points built in would reduce
> duplication.  We should define this function as a no-op in
> non-injection-point builds to facilitate such tests.

The argument goes both ways, I guess.  I'd rather choose a hard
failure so as to know that what I am doing is not silently ignored,
which is why I made this choice in the patch.

> Those tests need to also install the extension.  That's another pain
> point: anyone who wants to run the tests needs to compile the
> extension too.  I am wondering whether we should have this
> functionality somewhere in core itself, where it would only be
> useful when built with injection points.

That's something that may be discussed on top of the backend APIs,
though this comes down to how and what kind of meta-data should be
associated with the central shmem hash table.  Keeping the shmem hash
as small as possible, to keep the traces of this code in core minimal,
is a design choice that I'd rather not change.

> Many times it's only a single backend which needs to be subjected to
> an injection.  For inducing an ERROR or a NOTICE, many times it's
> also the same backend attached to the client session.

Yep.  I've used that across multiple sessions.  For the basic
facility, I think that's the absolute minimum.

> For WAIT, however, we need a way to inject from some other session.

You can do that already with the patch, no?  If you know that a
different session will cross a given path, you can set a macro in it.
If you wish for this session to wait before that, it is possible to
use a second point to make it do so.

I've used such techniques as well for more complex reproducible
failures than what I've posted in the patch series.  In the last
months, the most I've needed in a single TAP test was 5 deterministic
points, I think.  Or perhaps 6.  That was a fun exercise, for a TAP
test coded while complaining to myself about the core backend code
that does not make this stuff easier.

> We might be able to use the current signalling mechanism for that
> (wake sends SIGUSR1 with a reason).  Leaving aside WAIT for a moment,
> when the same backend's behaviour is being controlled, do we really
> need shared memory, and do we need to affect all the running
> backends?  I see some discussion about being able to trigger only for
> a given PID, but when that PID is the current backend's own, shared
> memory is not required.

I am not convinced that there is any need for signalling in most
cases, as long as you know beforehand the PID of the session you'd
like to stop, because this would still require a second session to
register a condition based on that known PID.  Another possibility
that I can think of is to use a custom wait event with a second point
to set up a different condition.  At the end, my point is that it is
possible to control everything in some extension code that holds the
callbacks, with an extra shmem area in the extension that associates
some meta-data with a point name.
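For instance, a callback could apply a PID-based condition that the
extension stores itself (a sketch only; the structure and the lookup
helper injection_condition_find() are hypothetical extension code, not
part of the patch):

    /* Extension-side meta-data kept in the extension's own shmem
     * area, keyed by injection point name. */
    typedef struct InjectionCondition
    {
        char    name[INJ_NAME_MAXLEN];
        pid_t   pid;        /* backend allowed to trigger the point */
    } InjectionCondition;

    void
    injection_error_for_pid(const char *name)
    {
        /* hypothetical lookup in the extension's shmem hash */
        InjectionCondition *cond = injection_condition_find(name);

        /* Only the registered backend reports the error. */
        if (cond == NULL || cond->pid != MyProcPid)
            return;

        elog(ERROR, "error triggered for injection point %s", name);
    }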
--
Michael