Hi,

On Mon, Jul 01, 2024 at 04:59:25AM +0000, Bertrand Drouvot wrote:
> Hi,
>
> On Fri, Jun 28, 2024 at 08:07:39PM +0000, Imseih (AWS), Sami wrote:
> > > 46ebdfe164 will interrupt the leaders sleep every time a parallel workers reports
> > > progress, and we currently don't handle interrupts by restarting the sleep with
> > > the remaining time. nanosleep does provide the ability to restart with the remaining
> > > time [1], but I don't think it's worth the effort to ensure more accurate
> > > vacuum delays for the leader process.
> >
> > After discussing offline with Bertrand, it may be better to have
> > a solution to deal with the interrupts and allows the sleep to continue to
> > completion. This will simplify this patch and will be useful
> > for other cases in which parallel workers need to send a message
> > to the leader. This is the thread [1] for that discussion.
> >
> > [1] https://www.postgresql.org/message-id/01000190606e3d2a-116ead16-84d2-4449-8d18-5053da66b1f4-000000%40email.amazonses.com
> >
>
> Yeah, I think it would make sense to put this thread on hold until we know more
> about [1] (you mentioned above) outcome.
As it looks like we have a consensus not to wait on [0] (as reducing the number
of interrupts makes sense on its own), please find attached v4, a rebased
version (that also makes clear in the doc that the new field might show
slightly old values, as mentioned in [1]).

[0]: https://www.postgresql.org/message-id/flat/01000190606e3d2a-116ead16-84d2-4449-8d18-5053da66b1f4-000000%40email.amazonses.com
[1]: https://www.postgresql.org/message-id/ZruMe-ppopQX4uP8%40nathan

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
From 90196125d1262095d02f0df74bb6cab0d03c75ff Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot <bertranddrouvot...@gmail.com>
Date: Mon, 24 Jun 2024 08:43:26 +0000
Subject: [PATCH v4] Report the total amount of time that vacuum has been
 delayed due to cost delay

This commit adds one column: time_delayed to the pg_stat_progress_vacuum system
view to show the total amount of time in milliseconds that vacuum has been
delayed.

This uses the new parallel message type for progress reporting added by
f1889729dd.

In the case of a parallel worker, to avoid interrupting the leader too
frequently (while it might be sleeping for cost delay), the report is done only
if the last report was made more than 1 second ago.

Having a purely time-based approach to throttle the reporting of the parallel
workers sounds reasonable.

Indeed when deciding about the throttling:

1. The number of parallel workers should not come into play:

 1.1) the more parallel workers are used, the less the impact of the leader on
 the vacuum index phase duration/workload is (because the work is distributed
 across more processes).

 1.2) the fewer parallel workers there are, the less the leader will be
 interrupted (fewer parallel workers would report their delayed time).

2. The cost limit should not come into play as that value is distributed
proportionally among the parallel workers (so we're back to the previous point).

3. The cost delay does not come into play as the leader could be interrupted at
the beginning, the middle or any other part of the wait, and we are more
interested in the frequency of the interrupts.

4. A 1 second reporting "throttling" looks like a reasonable threshold as:

 4.1 the idea is to have a significant impact when the leader could have been
 interrupted say hundreds/thousands of times per second.

 4.2 it does not make much sense for any tool to sample pg_stat_progress_vacuum
 multiple times per second (so a one second reporting granularity seems ok).
Bump catversion because this changes the definition of pg_stat_progress_vacuum. --- doc/src/sgml/monitoring.sgml | 13 ++++++++ src/backend/catalog/system_views.sql | 2 +- src/backend/commands/vacuum.c | 49 ++++++++++++++++++++++++++++ src/include/catalog/catversion.h | 2 +- src/include/commands/progress.h | 1 + src/test/regress/expected/rules.out | 3 +- 6 files changed, 67 insertions(+), 3 deletions(-) 23.5% doc/src/sgml/ 4.2% src/backend/catalog/ 63.4% src/backend/commands/ 4.6% src/include/ 4.0% src/test/regress/expected/ diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml index 55417a6fa9..d87604331a 100644 --- a/doc/src/sgml/monitoring.sgml +++ b/doc/src/sgml/monitoring.sgml @@ -6307,6 +6307,19 @@ FROM pg_stat_get_backend_idset() AS backendid; <literal>cleaning up indexes</literal>. </para></entry> </row> + + <row> + <entry role="catalog_table_entry"><para role="column_definition"> + <structfield>time_delayed</structfield> <type>bigint</type> + </para> + <para> + Total amount of time spent in milliseconds waiting due to <varname>vacuum_cost_delay</varname> + or <varname>autovacuum_vacuum_cost_delay</varname>. In case of parallel + vacuum the reported time is across all the workers and the leader. This + column is updated at a 1 Hz frequency (one time per second) so could show + slightly old values. 
+ </para></entry> + </row> </tbody> </tgroup> </table> diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql index 19cabc9a47..875df7d0e4 100644 --- a/src/backend/catalog/system_views.sql +++ b/src/backend/catalog/system_views.sql @@ -1218,7 +1218,7 @@ CREATE VIEW pg_stat_progress_vacuum AS S.param4 AS heap_blks_vacuumed, S.param5 AS index_vacuum_count, S.param6 AS max_dead_tuple_bytes, S.param7 AS dead_tuple_bytes, S.param8 AS num_dead_item_ids, S.param9 AS indexes_total, - S.param10 AS indexes_processed + S.param10 AS indexes_processed, S.param11 AS time_delayed FROM pg_stat_get_progress_info('VACUUM') AS S LEFT JOIN pg_database D ON S.datid = D.oid; diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c index 7d8e9d2045..5bf2e37d3f 100644 --- a/src/backend/commands/vacuum.c +++ b/src/backend/commands/vacuum.c @@ -40,6 +40,7 @@ #include "catalog/pg_inherits.h" #include "commands/cluster.h" #include "commands/defrem.h" +#include "commands/progress.h" #include "commands/vacuum.h" #include "miscadmin.h" #include "nodes/makefuncs.h" @@ -60,6 +61,12 @@ #include "utils/snapmgr.h" #include "utils/syscache.h" +/* + * Minimum amount of time (in ms) between two reports of the delayed time from a + * parallel worker to the leader. The goal is to avoid the leader to be + * interrupted too frequently while it might be sleeping for cost delay. + */ +#define WORKER_REPORT_DELAY_INTERVAL 1000 /* * GUC parameters @@ -103,6 +110,16 @@ pg_atomic_uint32 *VacuumSharedCostBalance = NULL; pg_atomic_uint32 *VacuumActiveNWorkers = NULL; int VacuumCostBalanceLocal = 0; +/* + * In case of parallel workers, the last time the delay has been reported to + * the leader. + * We assume this initializes to zero. 
+ */ +static instr_time last_report_time; + +/* total nap time between two reports */ +double nap_time_since_last_report = 0; + /* non-export function prototypes */ static List *expand_vacuum_rel(VacuumRelation *vrel, MemoryContext vac_context, int options); @@ -2377,13 +2394,45 @@ vacuum_delay_point(void) /* Nap if appropriate */ if (msec > 0) { + instr_time delay_start; + instr_time delay_end; + instr_time delayed_time; + if (msec > vacuum_cost_delay * 4) msec = vacuum_cost_delay * 4; pgstat_report_wait_start(WAIT_EVENT_VACUUM_DELAY); + INSTR_TIME_SET_CURRENT(delay_start); pg_usleep(msec * 1000); + INSTR_TIME_SET_CURRENT(delay_end); pgstat_report_wait_end(); + /* Report the amount of time we slept */ + INSTR_TIME_SET_ZERO(delayed_time); + INSTR_TIME_ACCUM_DIFF(delayed_time, delay_end, delay_start); + + /* Parallel worker */ + if (IsParallelWorker()) + { + instr_time time_since_last_report; + + INSTR_TIME_SET_ZERO(time_since_last_report); + INSTR_TIME_ACCUM_DIFF(time_since_last_report, delay_end, + last_report_time); + nap_time_since_last_report += INSTR_TIME_GET_MILLISEC(delayed_time); + + if (INSTR_TIME_GET_MILLISEC(time_since_last_report) > WORKER_REPORT_DELAY_INTERVAL) + { + pgstat_progress_parallel_incr_param(PROGRESS_VACUUM_TIME_DELAYED, + nap_time_since_last_report); + nap_time_since_last_report = 0; + last_report_time = delay_end; + } + } + else + pgstat_progress_incr_param(PROGRESS_VACUUM_TIME_DELAYED, + INSTR_TIME_GET_MILLISEC(delayed_time)); + /* * We don't want to ignore postmaster death during very long vacuums * with vacuum_cost_delay configured. 
We can't use the usual diff --git a/src/include/catalog/catversion.h b/src/include/catalog/catversion.h index 9a0ae27823..ec1f13748f 100644 --- a/src/include/catalog/catversion.h +++ b/src/include/catalog/catversion.h @@ -57,6 +57,6 @@ */ /* yyyymmddN */ -#define CATALOG_VERSION_NO 202408122 +#define CATALOG_VERSION_NO 202408201 #endif diff --git a/src/include/commands/progress.h b/src/include/commands/progress.h index 5616d64523..9a0c2358c6 100644 --- a/src/include/commands/progress.h +++ b/src/include/commands/progress.h @@ -28,6 +28,7 @@ #define PROGRESS_VACUUM_NUM_DEAD_ITEM_IDS 7 #define PROGRESS_VACUUM_INDEXES_TOTAL 8 #define PROGRESS_VACUUM_INDEXES_PROCESSED 9 +#define PROGRESS_VACUUM_TIME_DELAYED 10 /* Phases of vacuum (as advertised via PROGRESS_VACUUM_PHASE) */ #define PROGRESS_VACUUM_PHASE_SCAN_HEAP 1 diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out index 862433ee52..2bef31a66d 100644 --- a/src/test/regress/expected/rules.out +++ b/src/test/regress/expected/rules.out @@ -2052,7 +2052,8 @@ pg_stat_progress_vacuum| SELECT s.pid, s.param7 AS dead_tuple_bytes, s.param8 AS num_dead_item_ids, s.param9 AS indexes_total, - s.param10 AS indexes_processed + s.param10 AS indexes_processed, + s.param11 AS time_delayed FROM (pg_stat_get_progress_info('VACUUM'::text) s(pid, datid, relid, param1, param2, param3, param4, param5, param6, param7, param8, param9, param10, param11, param12, param13, param14, param15, param16, param17, param18, param19, param20) LEFT JOIN pg_database d ON ((s.datid = d.oid))); pg_stat_recovery_prefetch| SELECT stats_reset, -- 2.34.1