Hi,
> > Just an idea; as another option, how about blocking new requests to
> > the target process (e.g., causing them to fail with an error or
> > returning NULL with a warning) if a previous request is still
> > pending? Users can simply retry the request if it fails. IMO failing
> > quickly seems preferable to getting stuck for a while in cases with
> > concurrent requests.
>
> Thank you for the suggestion. I agree that it is better to fail early
> and avoid waiting for a timeout in such cases. I will add a "pending
> request" tracker for this in shared memory. This approach will help
> prevent sending a concurrent request if a request for the same backend
> is still being processed.

Please find attached a patch that adds a request_pending field in shared
memory. This allows us to detect concurrent requests early and return a
WARNING message immediately, avoiding unnecessary waiting and potential
timeouts. This is added in the v12-0002* patch.

> I imagined it would work something like this:
>
> requesting backend:
> -------------------
> * set request_ts to current timestamp
> * signal the target process, to generate memory context info
> * wait until the DSA gets filled with stats_ts > request_ts
> * return the data, don't erase anything
>
> target backend
> --------------
> * clear the signal
> * generate the statistics
> * set stats_ts to current timestamp
> * wake all the backends waiting for the stats (through CV)

The attached v12-0002* patch implements this. We determine the latest
statistics based on the stats timestamp: if it is greater than the
timestamp at which the request was sent, the statistics are considered up
to date and are returned immediately. Otherwise, the client waits for the
latest statistics to be published, until the timeout is reached.
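To make the handshake concrete, below is a minimal standalone C model of
the fail-fast and timestamp-freshness logic described above. It
intentionally uses pthreads and millisecond timestamps instead of
PostgreSQL's ProcSignal/DSA/ConditionVariable machinery, and all names
(mcxt_request, request_stats, publish_stats) are invented for
illustration, so treat it as a sketch of the idea rather than the patch
code:

/*
 * Standalone model of the handshake described above (not the patch code):
 * the requester fails fast if another request is already pending;
 * otherwise it records request_ts, "signals" the target, and waits on a
 * condition variable until stats_ts > request_ts or the timeout expires.
 */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <time.h>

typedef struct
{
    pthread_mutex_t lock;
    pthread_cond_t  cv;             /* stands in for a ConditionVariable */
    bool            request_pending;
    long            request_ts;     /* when the request was sent (ms) */
    long            stats_ts;       /* when stats were last published (ms) */
} mcxt_request;                     /* invented name, for illustration */

static long
now_ms(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_REALTIME, &ts);
    return ts.tv_sec * 1000L + ts.tv_nsec / 1000000L;
}

/* Requesting backend: fail fast instead of queueing behind another request. */
bool
request_stats(mcxt_request *req, long timeout_ms)
{
    bool            fresh;
    struct timespec deadline;
    long            deadline_ms;

    pthread_mutex_lock(&req->lock);
    if (req->request_pending)
    {
        pthread_mutex_unlock(&req->lock);
        fprintf(stderr, "WARNING: cannot process the request at the moment\n"
                "HINT: Another request is pending, try again\n");
        return false;
    }
    req->request_pending = true;
    req->request_ts = now_ms();
    /* In the real patch, the target backend would be signalled here. */

    deadline_ms = req->request_ts + timeout_ms;
    deadline.tv_sec = deadline_ms / 1000;
    deadline.tv_nsec = (deadline_ms % 1000) * 1000000L;

    while (req->stats_ts <= req->request_ts)
    {
        if (pthread_cond_timedwait(&req->cv, &req->lock, &deadline) != 0)
            break;              /* timed out waiting for fresh statistics */
    }
    fresh = (req->stats_ts > req->request_ts);
    req->request_pending = false;
    pthread_mutex_unlock(&req->lock);
    return fresh;               /* caller reads the shared stats only if fresh */
}

/* Target backend: publish the statistics and wake any waiters. */
void
publish_stats(mcxt_request *req)
{
    pthread_mutex_lock(&req->lock);
    /* ... fill the shared area with the memory context statistics here ... */
    req->stats_ts = now_ms();
    pthread_cond_broadcast(&req->cv);
    pthread_mutex_unlock(&req->lock);
}

The broadcast in publish_stats() corresponds to the "wake all the backends
waiting for the stats (through CV)" step above, so one freshly published
snapshot can satisfy every waiter whose request_ts predates stats_ts.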
With the latest changes, I don't see a dip in tps even when concurrent
requests are run in the pgbench script:

pgbench -n -f monitoring.sql -P 1 postgres -T 60
pgbench (18devel)
progress: 1.0 s, 816.9 tps, lat 1.218 ms stddev 0.317, 0 failed
progress: 2.0 s, 821.9 tps, lat 1.216 ms stddev 0.177, 0 failed
progress: 3.0 s, 817.1 tps, lat 1.224 ms stddev 0.209, 0 failed
progress: 4.0 s, 791.0 tps, lat 1.262 ms stddev 0.292, 0 failed
progress: 5.0 s, 780.8 tps, lat 1.280 ms stddev 0.326, 0 failed
progress: 6.0 s, 675.2 tps, lat 1.482 ms stddev 0.503, 0 failed
progress: 7.0 s, 674.0 tps, lat 1.482 ms stddev 0.387, 0 failed
progress: 8.0 s, 821.0 tps, lat 1.217 ms stddev 0.272, 0 failed
progress: 9.0 s, 903.0 tps, lat 1.108 ms stddev 0.196, 0 failed
progress: 10.0 s, 886.9 tps, lat 1.128 ms stddev 0.160, 0 failed
progress: 11.0 s, 887.1 tps, lat 1.126 ms stddev 0.243, 0 failed
progress: 12.0 s, 871.0 tps, lat 1.147 ms stddev 0.227, 0 failed
progress: 13.0 s, 735.0 tps, lat 1.361 ms stddev 0.329, 0 failed
progress: 14.0 s, 655.9 tps, lat 1.522 ms stddev 0.331, 0 failed
progress: 15.0 s, 674.0 tps, lat 1.484 ms stddev 0.254, 0 failed
progress: 16.0 s, 659.0 tps, lat 1.517 ms stddev 0.289, 0 failed
progress: 17.0 s, 641.0 tps, lat 1.558 ms stddev 0.281, 0 failed
progress: 18.0 s, 707.8 tps, lat 1.412 ms stddev 0.324, 0 failed
progress: 19.0 s, 746.3 tps, lat 1.341 ms stddev 0.219, 0 failed
progress: 20.0 s, 659.9 tps, lat 1.513 ms stddev 0.372, 0 failed
progress: 21.0 s, 651.8 tps, lat 1.533 ms stddev 0.372, 0 failed
WARNING: cannot process the request at the moment
HINT: Another request is pending, try again
progress: 22.0 s, 635.2 tps, lat 1.574 ms stddev 0.519, 0 failed
WARNING: cannot process the request at the moment
HINT: Another request is pending, try again
progress: 23.0 s, 730.0 tps, lat 1.369 ms stddev 0.408, 0 failed
WARNING: cannot process the request at the moment
HINT: Another request is pending, try again
WARNING: cannot process the request at the moment
HINT: Another request is pending, try again

where monitoring.sql is as follows:

SELECT * FROM pg_get_process_memory_contexts(
        (SELECT pid FROM pg_stat_activity
          WHERE pid != pg_backend_pid()
          ORDER BY random() LIMIT 1),
        false);

I have split the patch in two: v12-0001* contains the fixes needed to
allow using MemoryContextStatsInternals for this feature, and v12-0002*
contains all the remaining changes for the feature.

A few outstanding issues are as follows:

1. Currently, one DSA is created per backend when the first request for
statistics is made, and it remains for the lifetime of the server. I
think I should add logic to periodically destroy DSAs when memory context
statistics are not being *actively* queried from a backend, as determined
by the statistics timestamp.

2. The two issues reported by Fujii-san here: [1]
   i. I have proposed a fix for the first issue here: [2]
   ii. I am able to reproduce the second issue. It happens when we try to
   query the statistics of a backend running infinite_recurse.sql. While
   I am still working on finding the root cause, I think some memory is
   being overwritten due to a stack-depth violation, as the issue is not
   seen when I reduce max_stack_depth to 100kB.

[1] https://www.postgresql.org/message-id/a1a7e2b7-8f33-4313-baff-42e92ec14fd3%40oss.nttdata.com
[2] https://www.postgresql.org/message-id/CAH2L28shr0j3JE5V3CXDFmDH-agTSnh2V8pR23X0UhRMbDQD9Q%40mail.gmail.com
v12-0002-Function-to-report-memory-context-statistics.patch
Description: Binary data
v12-0001-Preparatory-changes-for-reporting-memory-context-sta.patch
Description: Binary data