Re: Track the amount of time waiting due to cost_delay

Bertrand Drouvot Fri, 06 Dec 2024 01:42:42 -0800

Hi,

On Thu, Dec 05, 2024 at 10:43:51AM +0000, Bertrand Drouvot wrote:
> Yeah, people would likely use this new field to monitor long running vacuum.
> Long enough that this error should be acceptable. Do you agree?


OTOH, adding the 100% accuracy looks as simple as v9-0002 attached (0001 is
same as for v8), so I think we should provide it.

Thoughts?

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

>From 8be8a71eb3c010d51bd6749dce33794f763c8572 Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot <[email protected]>
Date: Mon, 24 Jun 2024 08:43:26 +0000
Subject: [PATCH v9 1/2] Report the total amount of time that vacuum has been
 delayed due to cost delay

This commit adds one column: time_delayed to the pg_stat_progress_vacuum system
view to show the total amount of time in milliseconds that vacuum has been
delayed.

This uses the new parallel message type for progress reporting added
by f1889729dd.

In case of parallel worker, to avoid the leader to be interrupted too frequently
(while it might be sleeping for cost delay), the report is done only if the last
report has been done more than 1 second ago.

Having a time based only approach to throttle the reporting of the parallel
workers sounds reasonable.

Indeed when deciding about the throttling:

1. The number of parallel workers should not come into play:

 1.1) the more parallel workers is used, the less the impact of the leader on
 the vacuum index phase duration/workload is (because the repartition is done
 on more processes).

 1.2) the less parallel workers is, the less the leader will be interrupted (
 less parallel workers would report their delayed time).

2. The cost limit should not come into play as that value is distributed
proportionally among the parallel workers (so we're back to the previous point).

3. The cost delay does not come into play as the leader could be interrupted at
the beginning, the midle or whatever part of the wait and we are more interested
about the frequency of the interrupts.

3. A 1 second reporting "throttling" looks a reasonable threshold as:

 3.1 the idea is to have a significant impact when the leader could have been
interrupted say hundred/thousand times per second.

 3.2 it does not make that much sense for any tools to sample pg_stat_progress_vacuum
multiple times per second (so a one second reporting granularity seems ok).

Would need to bump catversion because this changes the definition of
pg_stat_progress_vacuum.
---
 doc/src/sgml/monitoring.sgml         | 13 +++++++
 src/backend/catalog/system_views.sql |  2 +-
 src/backend/commands/vacuum.c        | 53 ++++++++++++++++++++++++++++
 src/include/commands/progress.h      |  1 +
 src/test/regress/expected/rules.out  |  3 +-
 5 files changed, 70 insertions(+), 2 deletions(-)
  22.8% doc/src/sgml/
   4.0% src/backend/catalog/
  67.6% src/backend/commands/
   3.8% src/test/regress/expected/

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 840d7f8161..7386f7333d 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -6428,6 +6428,19 @@ FROM pg_stat_get_backend_idset() AS backendid;
        <literal>cleaning up indexes</literal>.
       </para></entry>
      </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>time_delayed</structfield> <type>bigint</type>
+      </para>
+      <para>
+       Total amount of time spent in milliseconds waiting due to <xref linkend="guc-vacuum-cost-delay"/>
+       or <xref linkend="guc-autovacuum-vacuum-cost-delay"/>. In case of parallel
+       vacuum the reported time is across all the workers and the leader. The
+       workers update the column no more frequently than once per second, so it
+       could show slightly old values.
+      </para></entry>
+     </row>
     </tbody>
    </tgroup>
   </table>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index da9a8fe99f..013bd06222 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1222,7 +1222,7 @@ CREATE VIEW pg_stat_progress_vacuum AS
         S.param4 AS heap_blks_vacuumed, S.param5 AS index_vacuum_count,
         S.param6 AS max_dead_tuple_bytes, S.param7 AS dead_tuple_bytes,
         S.param8 AS num_dead_item_ids, S.param9 AS indexes_total,
-        S.param10 AS indexes_processed
+        S.param10 AS indexes_processed, S.param11 AS time_delayed
     FROM pg_stat_get_progress_info('VACUUM') AS S
         LEFT JOIN pg_database D ON S.datid = D.oid;
 
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index bb639ef51f..6f9e515f56 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -39,6 +39,7 @@
 #include "catalog/pg_inherits.h"
 #include "commands/cluster.h"
 #include "commands/defrem.h"
+#include "commands/progress.h"
 #include "commands/vacuum.h"
 #include "miscadmin.h"
 #include "nodes/makefuncs.h"
@@ -59,6 +60,16 @@
 #include "utils/snapmgr.h"
 #include "utils/syscache.h"
 
+/*
+ * Minimum amount of time (in ms) between two reports of the delayed time from a
+ * parallel worker to the leader. The goal is to avoid the leader to be
+ * interrupted too frequently while it might be sleeping for cost delay.
+ *
+ * Note that unsent delayed_time will disappear when the parallel workers exit
+ * because they are not considered in parallel_vacuum_end(). That's an acceptable
+ * error for the use cases.
+ */
+#define WORKER_REPORT_DELAY_INTERVAL 1000
 
 /*
  * GUC parameters
@@ -102,6 +113,16 @@ pg_atomic_uint32 *VacuumSharedCostBalance = NULL;
 pg_atomic_uint32 *VacuumActiveNWorkers = NULL;
 int			VacuumCostBalanceLocal = 0;
 
+/*
+ * In case of parallel workers, the last time the delay has been reported to
+ * the leader.
+ * We assume this initializes to zero.
+ */
+static instr_time last_report_time;
+
+/* total nap time between two reports */
+static double nap_time_since_last_report = 0;
+
 /* non-export function prototypes */
 static List *expand_vacuum_rel(VacuumRelation *vrel,
 							   MemoryContext vac_context, int options);
@@ -2402,13 +2423,45 @@ vacuum_delay_point(void)
 	/* Nap if appropriate */
 	if (msec > 0)
 	{
+		instr_time	delay_start;
+		instr_time	delay_end;
+		instr_time	delayed_time;
+
 		if (msec > vacuum_cost_delay * 4)
 			msec = vacuum_cost_delay * 4;
 
 		pgstat_report_wait_start(WAIT_EVENT_VACUUM_DELAY);
+		INSTR_TIME_SET_CURRENT(delay_start);
 		pg_usleep(msec * 1000);
+		INSTR_TIME_SET_CURRENT(delay_end);
 		pgstat_report_wait_end();
 
+		/* Report the amount of time we slept */
+		INSTR_TIME_SET_ZERO(delayed_time);
+		INSTR_TIME_ACCUM_DIFF(delayed_time, delay_end, delay_start);
+
+		/* Parallel worker */
+		if (IsParallelWorker())
+		{
+			instr_time	time_since_last_report;
+
+			INSTR_TIME_SET_ZERO(time_since_last_report);
+			INSTR_TIME_ACCUM_DIFF(time_since_last_report, delay_end,
+								  last_report_time);
+			nap_time_since_last_report += INSTR_TIME_GET_MILLISEC(delayed_time);
+
+			if (INSTR_TIME_GET_MILLISEC(time_since_last_report) > WORKER_REPORT_DELAY_INTERVAL)
+			{
+				pgstat_progress_parallel_incr_param(PROGRESS_VACUUM_TIME_DELAYED,
+													nap_time_since_last_report);
+				nap_time_since_last_report = 0;
+				last_report_time = delay_end;
+			}
+		}
+		else
+			pgstat_progress_incr_param(PROGRESS_VACUUM_TIME_DELAYED,
+									   INSTR_TIME_GET_MILLISEC(delayed_time));
+
 		/*
 		 * We don't want to ignore postmaster death during very long vacuums
 		 * with vacuum_cost_delay configured.  We can't use the usual
diff --git a/src/include/commands/progress.h b/src/include/commands/progress.h
index 5616d64523..9a0c2358c6 100644
--- a/src/include/commands/progress.h
+++ b/src/include/commands/progress.h
@@ -28,6 +28,7 @@
 #define PROGRESS_VACUUM_NUM_DEAD_ITEM_IDS		7
 #define PROGRESS_VACUUM_INDEXES_TOTAL			8
 #define PROGRESS_VACUUM_INDEXES_PROCESSED		9
+#define PROGRESS_VACUUM_TIME_DELAYED			10
 
 /* Phases of vacuum (as advertised via PROGRESS_VACUUM_PHASE) */
 #define PROGRESS_VACUUM_PHASE_SCAN_HEAP			1
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 3014d047fe..8b1154efac 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2056,7 +2056,8 @@ pg_stat_progress_vacuum| SELECT s.pid,
     s.param7 AS dead_tuple_bytes,
     s.param8 AS num_dead_item_ids,
     s.param9 AS indexes_total,
-    s.param10 AS indexes_processed
+    s.param10 AS indexes_processed,
+    s.param11 AS time_delayed
    FROM (pg_stat_get_progress_info('VACUUM'::text) s(pid, datid, relid, param1, param2, param3, param4, param5, param6, param7, param8, param9, param10, param11, param12, param13, param14, param15, param16, param17, param18, param19, param20)
      LEFT JOIN pg_database d ON ((s.datid = d.oid)));
 pg_stat_recovery_prefetch| SELECT stats_reset,
-- 
2.34.1

>From 5b382029eb8330bf9daef50093bb483d8e48e6c2 Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot <[email protected]>
Date: Fri, 6 Dec 2024 07:39:19 +0000
Subject: [PATCH v9 2/2] Report the amount of time we slept before exiting
 parallel workers

or we might get incomplete data due to WORKER_REPORT_DELAY_INTERVAL
---
 src/backend/commands/vacuum.c         | 6 +-----
 src/backend/commands/vacuumparallel.c | 7 +++++++
 src/include/commands/vacuum.h         | 1 +
 3 files changed, 9 insertions(+), 5 deletions(-)
  90.3% src/backend/commands/
   9.6% src/include/commands/

diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 6f9e515f56..7ad42a9507 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -64,10 +64,6 @@
  * Minimum amount of time (in ms) between two reports of the delayed time from a
  * parallel worker to the leader. The goal is to avoid the leader to be
  * interrupted too frequently while it might be sleeping for cost delay.
- *
- * Note that unsent delayed_time will disappear when the parallel workers exit
- * because they are not considered in parallel_vacuum_end(). That's an acceptable
- * error for the use cases.
  */
 #define WORKER_REPORT_DELAY_INTERVAL 1000
 
@@ -121,7 +117,7 @@ int			VacuumCostBalanceLocal = 0;
 static instr_time last_report_time;
 
 /* total nap time between two reports */
-static double nap_time_since_last_report = 0;
+double		nap_time_since_last_report = 0;
 
 /* non-export function prototypes */
 static List *expand_vacuum_rel(VacuumRelation *vrel,
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index 67cba17a56..a09b655a13 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -1087,6 +1087,13 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
 	InstrEndParallelQuery(&buffer_usage[ParallelWorkerNumber],
 						  &wal_usage[ParallelWorkerNumber]);
 
+	/*
+	 * Report the amount of time we slept (or we might get incomplete data due
+	 * to WORKER_REPORT_DELAY_INTERVAL).
+	 */
+	pgstat_progress_parallel_incr_param(PROGRESS_VACUUM_TIME_DELAYED,
+										nap_time_since_last_report);
+
 	TidStoreDetach(dead_items);
 
 	/* Pop the error context stack */
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index 759f9a87d3..7a3ff07ec0 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -312,6 +312,7 @@ extern PGDLLIMPORT int VacuumCostBalanceLocal;
 extern PGDLLIMPORT bool VacuumFailsafeActive;
 extern PGDLLIMPORT double vacuum_cost_delay;
 extern PGDLLIMPORT int vacuum_cost_limit;
+extern PGDLLIMPORT double nap_time_since_last_report;
 
 /* in commands/vacuum.c */
 extern void ExecVacuum(ParseState *pstate, VacuumStmt *vacstmt, bool isTopLevel);
-- 
2.34.1

Re: Track the amount of time waiting due to cost_delay

Reply via email to