Hi,

On Sat, Jan 25, 2025 at 3:50 AM Tomas Vondra <to...@vondra.me> wrote:
> On 1/24/25 14:47, Rahila Syed wrote:
> > Hi,
> >
> >     Just an idea; as another option, how about blocking new requests to
> >     the target process (e.g., causing them to fail with an error or
> >     returning NULL with a warning) if a previous request is still
> >     pending? Users can simply retry the request if it fails. IMO failing
> >     quickly seems preferable to getting stuck for a while in cases with
> >     concurrent requests.
> >
> > Thank you for the suggestion. I agree that it is better to fail
> > early and avoid waiting for a timeout in such cases. I will add a
> > "pending request" tracker for this in shared memory. This approach
> > will help prevent sending a concurrent request if a request for the
> > same backend is still being processed.
> >
>
> AFAIK these failures should be extremely rare - we're only talking about
> that because the workload I used for testing is highly concurrent, i.e.
> it requests memory context info extremely often. I doubt anyone sane is
> going to do that in practice ...

Yes, that makes sense.

> > IMO, one downside of throwing an error in such cases is that the
> > users might wonder if they need to take a corrective action, even
> > though the issue is actually going to solve itself and they just
> > need to retry. Therefore, issuing a warning or displaying previously
> > updated statistics might be a better alternative to throwing an
> > error.
> >
>
> Wouldn't this be mostly mitigated by adding proper detail/hint to the
> error message? Sure, the user can always ignore that (especially when
> calling this from a script), but well ... we can only do so much.
>

OK.

> All this makes me think about how we shared pgstat data before the shmem
> approach was introduced in PG15. Until then the process signaled the
> pgstat collector, and the collector wrote the statistics into a file,
> with a timestamp. And the process used the timestamp to decide if it's
> fresh enough ... Wouldn't the same approach work here?
>
> I imagined it would work something like this:
>
> requesting backend:
> -------------------
> * set request_ts to current timestamp
> * signal the target process, to generate memory context info
> * wait until the DSA gets filled with stats_ts > request_ts
> * return the data, don't erase anything
>
> target backend
> --------------
> * clear the signal
> * generate the statistics
> * set stats_ts to current timestamp
> * wake all the backends waiting for the stats (through CV)
>
> I see v11 does almost this, except that it accepts somewhat stale data.

That's correct.

> But why would that be necessary? I don't think it's needed, and I don't
> think we should accept data from before the process sends the signal.
>

This is done in an attempt to prevent concurrent requests from timing
out. In such cases, the data generated in response to an earlier request
is likely to already be in the dynamic shared memory. Hence, instead of
waiting for the latest data and risking a timeout, the current approach
displays any available statistics that are newer than a defined
threshold. Additionally, since we can't distinguish between sequential
and concurrent requests, we accept somewhat stale data for all requests.
I realize this approach has some issues, mainly regarding how to
determine an appropriate threshold for how old the data may be.

Therefore, I agree that it makes sense to display only data that is
published after the request is made. If such data can't be published due
to concurrent requests or other delays, the function should detect this
and return as soon as possible.
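To make that concrete, here is a rough sketch of how the requesting
backend's wait loop could look under that scheme. This is illustrative
only, not patch code: the struct, the PROCSIG_GET_MEMORY_CONTEXT reason,
the wait event, and the timeout constant are all placeholder names, and
the locking around the shared timestamp is elided.

#include "postgres.h"
#include "storage/condition_variable.h"
#include "storage/procsignal.h"
#include "utils/timestamp.h"

#define MEMSTATS_WAIT_TIMEOUT_MS 5000   /* placeholder timeout */

typedef struct MemCtxShmemState
{
    ConditionVariable cv;       /* broadcast by target after publishing */
    TimestampTz stats_ts;       /* when stats were last published */
    /* ... DSA handle/pointer for the published statistics ... */
} MemCtxShmemState;

/* requesting backend: returns true if fresh stats were published */
static bool
wait_for_fresh_stats(MemCtxShmemState *state, pid_t target_pid,
                     ProcNumber target_proc)
{
    TimestampTz request_ts = GetCurrentTimestamp();

    /* ask the target process to publish its memory context statistics */
    if (SendProcSignal(target_pid, PROCSIG_GET_MEMORY_CONTEXT,
                       target_proc) < 0)
        return false;           /* target no longer exists */

    /* sleep until stats newer than our request appear, or we time out */
    ConditionVariablePrepareToSleep(&state->cv);
    while (state->stats_ts <= request_ts)
    {
        /* (real code would read stats_ts under a lock or atomically) */
        if (ConditionVariableTimedSleep(&state->cv,
                                        MEMSTATS_WAIT_TIMEOUT_MS,
                                        WAIT_EVENT_MEM_CTX_PUBLISH))
            break;              /* timed out, give up */
    }
    ConditionVariableCancelSleep();

    /* only data published after we signaled counts as fresh */
    return state->stats_ts > request_ts;
}

On the target side, matching the pseudocode above, the next
CHECK_FOR_INTERRUPTS() after the signal would publish the statistics
into the DSA, set stats_ts to GetCurrentTimestamp(), and call
ConditionVariableBroadcast(&state->cv) to wake all waiting backends.
A caller that gets false back can then emit a WARNING with a hint to
retry, rather than returning stale data.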
Thank you,
Rahila Syed