On Fri, Jul 9, 2021 at 5:15 AM Fabien COELHO <coe...@cri.ensmp.fr> wrote:
> > Thanks!  This doesn't seem to address the complaint, though.  Don't
> > you need to do something like this?  (See also attached.)
> >
> > + initStats(&aggs, start - (start + epoch_shift) % 1000000);
>
> ISTM that this is: (start + epoch_shift) / 1000000 * 1000000
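To make the arithmetic concrete, here is an illustrative sketch (Python, with made-up microsecond values) of the two roundings under discussion: the first variant rounds in the raw-timer domain so that start + epoch_shift lands on a whole Unix second, the second rounds the Unix-epoch value directly, and they coincide once both are expressed in the same epoch:

```python
# All values in microseconds; both are hypothetical, chosen so that raw
# timer reading 7_000_000 corresponds to Unix time 1_625_782_245.000000.
start = 7_654_321                                 # raw high-resolution timer value
epoch_shift = 1_625_782_245_000_000 - 7_000_000   # unix_usec - raw_usec at startup

# Variant 1: round in the raw-timer domain.
v1 = start - (start + epoch_shift) % 1_000_000

# Variant 2: round the Unix-epoch value itself down to a whole second.
v2 = (start + epoch_shift) // 1_000_000 * 1_000_000

# The two agree once expressed in the same epoch:
assert v1 + epoch_shift == v2
```

This is only the identity floor(x) = x - x % 1s written two ways, which is why either form works as long as doLog() uses the matching epoch.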
Sure, it's equivalent if you also change doLog() to match the change in
epoch, as your v16 did.  Your version seems fine to me.  The key thing is
that we only start new aggregates when the Unix epoch time crosses over a
XXX.000000 boundary, since we're only logging the XXX part.  That's much
like(*) pgbench13, and I'm happy however you want to code that.  Case
closed on that point.  Thanks!

> > Isn't it better if we only have to throw away the first one?).
>
> This should be the user decision to drop it or not, not the tool
> producing it, IMO.

Let me try this complaint again.  It's admittedly a minor detail, but my
goal in this thread is to match pgbench13's treatment of aggregate
boundaries and partial aggregates, so that we can close the open item for
14 with a minimal fix that doesn't make any unnecessary changes.
Discussions about improvements or pre-existing problems can wait.

First, let me demonstrate that pgbench13 throws away final partial
aggregates.  I hacked REL_13_STABLE like so:

	if (agg_interval > 0)
	{
		/* log aggregated but not yet reported transactions */
+		fprintf(thread->logfile, "XXX log aggregated but not yet reported transactions: aggs.cnt = %ld\n", aggs.cnt);
		doLog(thread, state, &aggs, false, 0, 0);
	}
	fclose(thread->logfile);

I ran "pgbench -T6 --aggregate-interval 1 -l -S postgres", and it
produced a log file containing:

===BOF===
1625782245 7974 428574 24998642 49 683
1625782246 19165 998755 53217669 49 310
1625782247 19657 998868 51577774 47 189
1625782248 19707 998898 51660408 47 189
1625782249 19969 998867 50454999 48 156
1625782250 19845 998877 51071013 47 210
XXX log aggregated but not yet reported transactions: aggs.cnt = 10988
===EOF===

We can see three interesting things:

1.  The first aggregate is partial (only ~7k transactions, because it
started partway through a second).  Users have to throw away that first
aggregate because its statistics are noise.  That is the price to pay to
have later aggregates start at the time they say.

2.  We get 5 more full aggregates (~19k transactions).  That's a total of
6 aggregates, which makes intuitive sense with -T6.

3.  At the end, the extra call to doLog() did nothing, and yet cnt =
10988.  That's data we're throwing away, because Unix epoch time has not
advanced far enough to reach a new aggregate start time (it's not
impossible, but the probability is very low).  Checking the commit log, I
see that the code that claims to log the final aggregate came from commit
b6037664960 (2016); apparently it doesn't do what the comments seem to
think it does (did that ever work?  Should perhaps be cleaned up, but
separately, it's not an open item for 14).

Now, in your recent patches you force that final partial aggregate to be
logged in that case with that new flag mechanism, as we can see:

===BOF===
1625783726 11823 609143 32170321 48 549
1625783727 19530 998995 52383115 47 210
1625783728 19468 999026 52208898 46 181
1625783729 19826 999001 51238427 46 185
1625783730 19195 999110 52841674 49 172
1625783731 18572 998992 56028876 48 318
1625783732 7484 388620 20951100 48 316
===EOF===

1.  We get a partial initial aggregate, just like 13.  That sacrificial
aggregate helps us synchronize the rest of the aggregates with the logged
timestamps.  Good.

2.  We get 5 full aggregates (~19k transactions), just like 13.  As in
13, that's quite likely, because the first one was "short", so we almost
always reach the end of the 6th one before -T6 runs out of sand.  Good.

3.  We get a new partial aggregate at the end.  Users would have to throw
that one away too.  This is not a big deal, but it's a change in
behaviour that should be discussed.

Given that that last one is a waste of pixels and a (so far) unjustified
change in behaviour, I propose, this time with a little more force and an
updated patch, that we abandon that part of the change.
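As an aside, for anyone eyeballing these log excerpts: the sum and sum-of-squares columns can be folded back into a mean and standard deviation. A minimal sketch (assuming the 13-style six-column aggregate format shown above; column names are mine):

```python
import math

def summarize(line):
    """Recover mean and stddev (microseconds) from one aggregate line:
    interval_start cnt lat_sum lat_sum2 lat_min lat_max."""
    _ts, cnt, s, s2, _lo, _hi = (float(x) for x in line.split())
    mean = s / cnt
    # variance via the sum-of-squares identity: E[x^2] - (E[x])^2
    var = s2 / cnt - mean * mean
    return mean, math.sqrt(max(var, 0.0))

# Second line of the log shown above: ~19k transactions, which works out
# to a mean latency of ~52us with a standard deviation of ~8us.
mean, sd = summarize("1625782246 19165 998755 53217669 49 310")
```

This is also why the first partial aggregate is safe to discard: dropping a line only removes its cnt/sum/sum2 contributions, without disturbing any other interval.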
I submit that you only added that because your earlier patches didn't
have the partial aggregate at the start, so they often didn't produce the
6th line of output.  So, you forced it to log whatever it had left, even
though the full time hadn't elapsed yet.  Now we don't need that.  The
patch and resulting code are simpler, and the user experience matches 13.
See attached.

> > I don't really see the value in the test that checks that $delay
> > falls in the range 1.5s - 2.5s and then ignores the result.  If it
> > hangs forever, we'll find out about it, and otherwise no human or
> > machine will ever care about that test.  I removed it from this
> > version.  Were you really attached to it?
>
> YES, REALLY!  It would just have caught quite a few of the regressions
> we are trying to address here.  I want it there even if ignored because
> I'll look for it to avoid regressions in the future.  If the test is
> actually removed, recreating it is a pain.  If you really want to
> deactivate it, use if(0), but PLEASE leave it there so that it can be
> reactivated for tests very simply; it's no good maintaining some test
> outside of the tree.

Ok, you win :-)

> When/if you get to commit this patch, eventually, do not forget that
> I'm pushing forward fixes contributed by others, including Kyotaro
> Horiguchi and Yugo Nagata.

Fixed, thanks.

* I say "much like" and not "exactly like"; of course there may be a
subtle difference if ntpd adjusts your clock while a benchmark is
running.  Something must give, and 14's coding prefers to keep the
duration of aggregates stable at exactly X seconds according to the high
precision timer, so that the statistics it reports are meaningful, but 13
prefers to keep the timestamps it logs in sync with other software using
gettimeofday() and will give you a weird short or long aggregate to
achieve that (producing bad statistics).
I can see arguments for both, but I'm OK with that change, and I see that
it is in line with your general goal of switching to modern, accurate
time interfaces.
From 5a5074c1ba971197569a7ac83f6f80c56479b806 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.mu...@gmail.com>
Date: Fri, 9 Jul 2021 12:09:16 +1200
Subject: [PATCH v17] Fix bugs in pgbench log output.

Commit 547f04e switched from using instr_time to pg_time_usec_t, but
introduced a couple of bugs:

1. The log previously used Unix time stamps, but it was changed to
write out times based on a system-dependent epoch.  Restore the Unix
epoch-based reporting, and make sure to begin aggregates on the
boundaries of Unix epoch seconds, as before.

2. The aggregation interval needed to be scaled appropriately for the
new accounting unit, to avoid logging bogus empty aggregates.

Also add simple new TAP tests to verify the logged output format.
Since one of the checks is likely to fail on a heavily loaded system,
ignore failures in that test.

Back-patch to 14.

Author: Fabien COELHO <coe...@cri.ensmp.fr>
Author: Yugo NAGATA <nag...@sraoss.co.jp>
Author: Kyotaro Horiguchi <horikyota....@gmail.com>
Reviewed-by: Michael Paquier <mich...@paquier.xyz>
Reviewed-by: Alvaro Herrera <alvhe...@alvh.no-ip.org>
Reported-by: YoungHwan Joo <rul...@gmail.com>
Reported-by: Gregory Smith <gregsmithpg...@gmail.com>
Discussion: https://postgr.es/m/CAF7igB1r6wRfSCEAB-iZBKxkowWY6%2BdFF2jObSdd9%2BiVK%2BvHJg%40mail.gmail.com
Discussion: https://postgr.es/m/CAHLJuCW_8Vpcr0%3Dt6O_gozrg3wXXWXZXDioYJd3NhvKriqgpfQ%40mail.gmail.com
---
 src/bin/pgbench/pgbench.c                    | 43 ++++++++++++---
 src/bin/pgbench/t/001_pgbench_with_server.pl | 58 +++++++++++++++++++-
 2 files changed, 92 insertions(+), 9 deletions(-)

diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 4aeccd93af..a54958930b 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -343,6 +343,12 @@ typedef struct StatsData
 	SimpleStats lag;
 } StatsData;
 
+/*
+ * For displaying Unix epoch timestamps, as some time functions may have
+ * another reference.
+ */
+pg_time_usec_t epoch_shift;
+
 /*
  * Struct to keep random state.
  */
@@ -3771,7 +3777,11 @@ executeMetaCommand(CState *st, pg_time_usec_t *now)
 
 /*
  * Print log entry after completing one transaction.
  *
- * We print Unix-epoch timestamps in the log, so that entries can be
+ * The function behavior changes depending on sample_rate (a fraction of
+ * transaction is reported) and agg_interval (transactions are aggregated
+ * and reported once every agg_interval seconds).
+ *
+ * We use Unix-epoch timestamps in the log, so that entries can be
  * correlated against other logs.  On some platforms this could be obtained
  * from the caller, but rather than get entangled with that, we just eat
  * the cost of an extra syscall in all cases.
@@ -3781,7 +3791,7 @@ doLog(TState *thread, CState *st,
 	  StatsData *agg, bool skipped, double latency, double lag)
 {
 	FILE	   *logfile = thread->logfile;
-	pg_time_usec_t now = pg_time_now();
+	pg_time_usec_t now = pg_time_now() + epoch_shift;
 
 	Assert(use_log);
 
@@ -3796,17 +3806,19 @@ doLog(TState *thread, CState *st,
 	/* should we aggregate the results or not? */
 	if (agg_interval > 0)
 	{
+		pg_time_usec_t next;
+
 		/*
 		 * Loop until we reach the interval of the current moment, and print
 		 * any empty intervals in between (this may happen with very low tps,
 		 * e.g. --rate=0.1).
 		 */
-		while (agg->start_time + agg_interval <= now)
+		while ((next = agg->start_time + agg_interval * INT64CONST(1000000)) <= now)
 		{
 			/* print aggregated report to logfile */
 			fprintf(logfile, INT64_FORMAT " " INT64_FORMAT " %.0f %.0f %.0f %.0f",
-					agg->start_time,
+					agg->start_time / 1000000,	/* seconds since Unix epoch */
 					agg->cnt,
 					agg->latency.sum,
 					agg->latency.sum2,
@@ -3825,7 +3837,7 @@ doLog(TState *thread, CState *st,
 			fputc('\n', logfile);
 
 			/* reset data and move to next interval */
-			initStats(agg, agg->start_time + agg_interval);
+			initStats(agg, next);
 		}
 
 		/* accumulate the current transaction */
@@ -5458,7 +5470,8 @@ printProgressReport(TState *threads, int64 test_start, pg_time_usec_t now,
 
 		if (progress_timestamp)
 		{
-			snprintf(tbuf, sizeof(tbuf), "%.3f s", PG_TIME_GET_DOUBLE(now));
+			snprintf(tbuf, sizeof(tbuf), "%.3f s",
+					 PG_TIME_GET_DOUBLE(now + epoch_shift));
 		}
 		else
 		{
@@ -5808,6 +5821,15 @@ main(int argc, char **argv)
 	char	   *env;
 	int			exit_code = 0;
+	struct timeval tv;
+
+	/*
+	 * Record difference between Unix epoch and high resolution timer's epoch.
+	 * We'll use this for logging and aggregation with Unix epoch-based
+	 * buckets.
+	 */
+	gettimeofday(&tv, NULL);
+	epoch_shift = tv.tv_sec * INT64CONST(1000000) + tv.tv_usec - pg_time_now();
 
 	pg_logging_init(argv[0]);
 	progname = get_progname(argv[0]);
@@ -6637,7 +6659,14 @@ threadRun(void *arg)
 	thread->bench_start = start;
 	thread->throttle_trigger = start;
 
-	initStats(&aggs, start);
+	/*
+	 * The log format currently has Unix epoch timestamps with whole numbers
+	 * of seconds.  Round the first aggregate's start time down to the nearest
+	 * Unix epoch second (the very first aggregate might really have started a
+	 * fraction of a second later, but later aggregates are measured from the
+	 * whole number time that is actually logged).
+	 */
+	initStats(&aggs, (start + epoch_shift) / 1000000 * 1000000);
 
 	last = aggs;
 
 	/* loop till all clients have terminated */
diff --git a/src/bin/pgbench/t/001_pgbench_with_server.pl b/src/bin/pgbench/t/001_pgbench_with_server.pl
index 3aa9d5d753..3bf42fad1b 100644
--- a/src/bin/pgbench/t/001_pgbench_with_server.pl
+++ b/src/bin/pgbench/t/001_pgbench_with_server.pl
@@ -8,6 +8,7 @@ use PostgresNode;
 use TestLib;
 use Test::More;
 use Config;
+use Time::HiRes qw(time);
 
 # start a pgbench specific server
 my $node = get_new_node('main');
@@ -54,12 +55,14 @@ sub pgbench
 
 	push @cmd, @args;
 
+	my $start = time();
 	$node->command_checks_all(\@cmd, $stat, $out, $err, $name);
+	my $stop = time();
 
 	# cleanup?
 	#unlink @filenames or die "cannot unlink files (@filenames): $!";
 
-	return;
+	return $stop - $start;
 }
 
 # tablespace for testing, because partitioned tables cannot use pg_default
@@ -1187,7 +1190,7 @@ sub check_pgbench_logs
 	# $prefix is simple enough, thus does not need escaping
 	my @logs = list_files($dir, qr{^$prefix\..*$});
-	ok(@logs == $nb, "number of log files");
+	ok(@logs == $nb, "number of log files (@logs)");
 	ok(grep(/\/$prefix\.\d+(\.\d+)?$/, @logs) == $nb, "file name format");
 
 	my $log_number = 0;
@@ -1220,6 +1223,57 @@ sub check_pgbench_logs
 my $bdir = $node->basedir;
 
 # Run with sampling rate, 2 clients with 50 transactions each.
+#
+# Test time-sensitive features on a light read-only transaction:
+#
+# -T: bench duration, 2 seconds to exercise progress & logs
+# -P: progress report
+# --aggregate-interval: periodic aggregated logs
+# --rate: schedule load
+# --latency-limit: max delay, not deeply exercised
+#
+# note: the --rate behavior is probabilistic in nature.
+# note: --progress-timestamp is not tested.
+my $delay = pgbench(
+	'-T 2 -P 1 -l --aggregate-interval=1 -S -b se@2'
+	. ' --rate=20 --latency-limit=1000 -j ' . $nthreads
+	. ' -c 3 -r',
+	0,
+	[   qr{type: multiple},
+		qr{clients: 3},
+		qr{threads: $nthreads},
+		qr{duration: 2 s},
+		qr{script 1: .* select only},
+		qr{script 2: .* select only},
+		qr{statement latencies in milliseconds},
+		qr{FROM pgbench_accounts} ],
+	[ qr{vacuum}, qr{progress: 1\b} ],
+	'pgbench progress', undef,
+	"--log-prefix=$bdir/001_pgbench_log_1");
+
+# check that we are around 2 seconds
+TODO: {
+	local $TODO = "possibly unreliable on slow hosts or unlucky runs";
+
+	# The rate may result in an unlucky schedule which triggers an
+	# early exit, hence the loose bound.
+
+	# Also, the delay check may totally fail on very slow or overloaded
+	# hosts, valgrind runs...
+
+	ok(1.5 < $delay && $delay < 2.5, "-T 2 run around 2 seconds");
+}
+
+# $nthreads threads, 2 seconds, but due to timing imprecision we might get
+# only 1 or as many as 3 progress reports per thread.
+# aggregate log format is:
+#   unix_epoch_time #tx sum sum2 min max [sum sum2 min max [skipped]]
+# the first series is about latency; the second about lag (--rate);
+# skipped is only present if --latency-limit is set.
+check_pgbench_logs($bdir, '001_pgbench_log_1', $nthreads, 1, 3,
+	qr{^\d{10,} \d{1,2} \d+ \d+ \d+ \d+ \d+ \d+ \d+ \d+ \d+$});
+
+# with sampling rate, 2 clients with 50 tx each
 pgbench(
 	"-n -S -t 50 -c 2 --log --sampling-rate=0.5",
 	0,
 	[ qr{select only}, qr{processed: 100/100} ],
 	[qr{^$}],
-- 
2.30.2
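PS: for anyone wanting to eyeball their own log files against the same shape the TAP test expects, here is a rough Python transliteration of the test's per-line regex (the sample line is fabricated; with --rate and --latency-limit both in effect, each aggregate line carries the lag series and a skipped count, eleven fields in all):

```python
import re

# unix_epoch cnt lat_sum lat_sum2 lat_min lat_max
#            lag_sum lag_sum2 lag_min lag_max skipped
# \d{1,2} for cnt reflects the test's --rate=20 with 1s intervals.
agg_line = re.compile(r'^\d{10,} \d{1,2} \d+ \d+ \d+ \d+ \d+ \d+ \d+ \d+ \d+$')

sample = "1625783727 19 99895 5238311 47 210 3901 120934 0 42 0"
assert agg_line.match(sample)

# A plain six-field line (no --rate/--latency-limit) does not match:
assert not agg_line.match("1625783727 19 998995 52383115 47 210")
```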