Hello Tom,

If this is done, some people with zipfian distribution that currently
work might be unhappy.

After giving it some thought, I think that this cannot be fully fixed for
12.

Just to clarify --- my complaint about "over engineering" referred to
the fact that a cache exists at all; fooling around with the overflow
behavior isn't really going to answer that.

Having some cache really makes sense for s < 1, AFAICS.

The bigger picture here is that a benchmarking tool that contains its
own performance surprises is not a nice thing to have.

Hmmm. It seems that it depends. Sometimes self inflected wounds are the soldier's responsability, sometimes they must be prevented.

It's not very hard to imagine somebody wasting a lot of time trying to explain weird results, only to find out that the cause is unstable performance of random_zipfian. Or worse, people might draw totally incorrect conclusions if they fail to drill down enough to realize that there's a problem in pgbench rather than on the server side.

Yep, benchmarking is tougher than it looks: it is very easy to get it wrong without knowing, whatever tool you use.

Given the constraint of Jim Gray's approximated method for s in (0, 1),
which really does zipfian for the first two integers and then uses an
exponential approximation, the only approach is that the parameters must
be computed in a partial eval preparation phase before the bench code is
run. This means that only (mostly) constants would be allowed as
parameters when s is in (0, 1), but I think that this is acceptable
because anyway the method fundamentaly requires it.

Yeah, if we could store all the required harmonic numbers before the
test run timing starts, that would address the concern about stable
performance. But I have to wonder whether zipfian with s < 1 is useful
enough to justify so much code.

I do not know about that. I do not think that Jim Gray chose to invent an approximated method for s < 1 because he thought it was fun. I think he did it because his benchmark data required it. If you need it, you need it. If you do not need it, you do not care.

The attached other attached patch illustrate what I call poor performance
for stupid parameters (no point in doing zipfian on 2 integers…) :
...
Maybe the implementation could impose that s is at least 1.001 to avoid
the lower performance?

I was wondering about that too.  It seems like it'd be a wise idea to
further constrain s and/or n to ensure that the s > 1 code path doesn't do
anything too awful ...

Yep. The attached version enforces s >= 1.001, which avoids the worse cost
of iterating, according to my small tests.

unless someone comes up with a better implementation that has stable performance without such constraints.

Hmmm…

--
Fabien.
diff --git a/doc/src/sgml/ref/pgbench.sgml b/doc/src/sgml/ref/pgbench.sgml
index 24833f46bc..d9c2b71c75 100644
--- a/doc/src/sgml/ref/pgbench.sgml
+++ b/doc/src/sgml/ref/pgbench.sgml
@@ -1598,23 +1598,21 @@ f(x) = PHI(2.0 * parameter * (x - mu) / (max - min + 1)) /
     </listitem>
     <listitem>
      <para>
-      <literal>random_zipfian</literal> generates an approximated bounded Zipfian
-      distribution. For <replaceable>parameter</replaceable> in (0, 1), an
-      approximated algorithm is taken from
-      "Quickly Generating Billion-Record Synthetic Databases",
-      Jim Gray et al, SIGMOD 1994. For <replaceable>parameter</replaceable>
-      in (1, 1000), a rejection method is used, based on
+      <literal>random_zipfian</literal> generates an bounded Zipfian distribution,
+      for <replaceable>parameter</replaceable> in [1.001, 1000].
+      A rejection method is used, based on
       "Non-Uniform Random Variate Generation", Luc Devroye, p. 550-551,
       Springer 1986. The distribution is not defined when the parameter's
-      value is 1.0. The function's performance is poor for parameter values
-      close and above 1.0 and on a small range.
+      value is 1.0.
+      The implementation enforces
+      <literal><repleceable>parameter</replaceable> >= 1.001</literal>
+      in order to avoid the method's poor drawing performance
+      for parameter values close and above 1.0 and on a small range.
      </para>
      <para>
       <replaceable>parameter</replaceable> defines how skewed the distribution
       is. The larger the <replaceable>parameter</replaceable>, the more
       frequently values closer to the beginning of the interval are drawn.
-      The closer to 0 <replaceable>parameter</replaceable> is,
-      the flatter (more uniform) the output distribution.
       The distribution is such that, assuming the range starts from 1,
       the ratio of the probability of drawing <replaceable>k</replaceable>
       versus drawing <replaceable>k+1</replaceable> is
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 4789ab92ee..81c64c7d71 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -137,7 +137,9 @@ static int	pthread_join(pthread_t th, void **thread_return);
 #define ZIPF_CACHE_SIZE	15		/* cache cells number */
 
 #define MIN_GAUSSIAN_PARAM		2.0 /* minimum parameter for gauss */
-#define MAX_ZIPFIAN_PARAM		1000	/* maximum parameter for zipfian */
+
+#define MIN_ZIPFIAN_PARAM		1.001	/* minimum parameter for zipfian */
+#define MAX_ZIPFIAN_PARAM		1000.0	/* maximum parameter for zipfian */
 
 int			nxacts = 0;			/* number of transactions per client */
 int			duration = 0;		/* duration in seconds */
@@ -409,35 +411,6 @@ typedef struct
 	int			ecnt;			/* error count */
 } CState;
 
-/*
- * Cache cell for random_zipfian call
- */
-typedef struct
-{
-	/* cell keys */
-	double		s;				/* s - parameter of random_zipfian function */
-	int64		n;				/* number of elements in range (max - min + 1) */
-
-	double		harmonicn;		/* generalizedHarmonicNumber(n, s) */
-	double		alpha;
-	double		beta;
-	double		eta;
-
-	uint64		last_used;		/* last used logical time */
-} ZipfCell;
-
-/*
- * Zipf cache for zeta values
- */
-typedef struct
-{
-	uint64		current;		/* counter for LRU cache replacement algorithm */
-
-	int			nb_cells;		/* number of filled cells */
-	int			overflowCount;	/* number of cache overflows */
-	ZipfCell	cells[ZIPF_CACHE_SIZE];
-} ZipfCache;
-
 /*
  * Thread state
  */
@@ -459,8 +432,6 @@ typedef struct
 
 	int64		throttle_trigger;	/* previous/next throttling (us) */
 	FILE	   *logfile;		/* where to log, or NULL */
-	ZipfCache	zipf_cache;		/* for thread-safe  zipfian random number
-								 * generation */
 
 	/* per thread collected stats */
 	instr_time	start_time;		/* thread start time */
@@ -982,77 +953,12 @@ getPoissonRand(RandomState *random_state, double center)
 	return (int64) (-log(uniform) * center + 0.5);
 }
 
-/* helper function for getZipfianRand */
-static double
-generalizedHarmonicNumber(int64 n, double s)
-{
-	int			i;
-	double		ans = 0.0;
-
-	for (i = n; i > 1; i--)
-		ans += pow(i, -s);
-	return ans + 1.0;
-}
-
-/* set harmonicn and other parameters to cache cell */
-static void
-zipfSetCacheCell(ZipfCell *cell, int64 n, double s)
-{
-	double		harmonic2;
-
-	cell->n = n;
-	cell->s = s;
-
-	harmonic2 = generalizedHarmonicNumber(2, s);
-	cell->harmonicn = generalizedHarmonicNumber(n, s);
-
-	cell->alpha = 1.0 / (1.0 - s);
-	cell->beta = pow(0.5, s);
-	cell->eta = (1.0 - pow(2.0 / n, 1.0 - s)) / (1.0 - harmonic2 / cell->harmonicn);
-}
-
-/*
- * search for cache cell with keys (n, s)
- * and create new cell if it does not exist
- */
-static ZipfCell *
-zipfFindOrCreateCacheCell(ZipfCache *cache, int64 n, double s)
-{
-	int			i,
-				least_recently_used = 0;
-	ZipfCell   *cell;
-
-	/* search cached cell for given parameters */
-	for (i = 0; i < cache->nb_cells; i++)
-	{
-		cell = &cache->cells[i];
-		if (cell->n == n && cell->s == s)
-			return &cache->cells[i];
-
-		if (cell->last_used < cache->cells[least_recently_used].last_used)
-			least_recently_used = i;
-	}
-
-	/* create new one if it does not exist */
-	if (cache->nb_cells < ZIPF_CACHE_SIZE)
-		i = cache->nb_cells++;
-	else
-	{
-		/* replace LRU cell if cache is full */
-		i = least_recently_used;
-		cache->overflowCount++;
-	}
-
-	zipfSetCacheCell(&cache->cells[i], n, s);
-
-	cache->cells[i].last_used = cache->current++;
-	return &cache->cells[i];
-}
-
 /*
  * Computing zipfian using rejection method, based on
  * "Non-Uniform Random Variate Generation",
  * Luc Devroye, p. 550-551, Springer 1986.
+ *
+ * This works for s > 1.0
  */
 static int64
 computeIterativeZipfian(RandomState *random_state, int64 n, double s)
@@ -1079,39 +985,16 @@ computeIterativeZipfian(RandomState *random_state, int64 n, double s)
 	return (int64) x;
 }
 
-/*
- * Computing zipfian using harmonic numbers, based on algorithm described in
- * "Quickly Generating Billion-Record Synthetic Databases",
- * Jim Gray et al, SIGMOD 1994
- */
-static int64
-computeHarmonicZipfian(ZipfCache *zipf_cache, RandomState *random_state,
-					   int64 n, double s)
-{
-	ZipfCell   *cell = zipfFindOrCreateCacheCell(zipf_cache, n, s);
-	double		uniform = pg_erand48(random_state->xseed);
-	double		uz = uniform * cell->harmonicn;
-
-	if (uz < 1.0)
-		return 1;
-	if (uz < 1.0 + cell->beta)
-		return 2;
-	return 1 + (int64) (cell->n * pow(cell->eta * uniform - cell->eta + 1.0, cell->alpha));
-}
-
 /* random number generator: zipfian distribution from min to max inclusive */
 static int64
-getZipfianRand(ZipfCache *zipf_cache, RandomState *random_state, int64 min,
-			   int64 max, double s)
+getZipfianRand(RandomState *random_state, int64 min, int64 max, double s)
 {
 	int64		n = max - min + 1;
 
 	/* abort if parameter is invalid */
-	Assert(s > 0.0 && s != 1.0 && s <= MAX_ZIPFIAN_PARAM);
+	Assert(1.0 <= s && s <= MAX_ZIPFIAN_PARAM);
 
-	return min - 1 + ((s > 1)
-					  ? computeIterativeZipfian(random_state, n, s)
-					  : computeHarmonicZipfian(zipf_cache, random_state, n, s));
+	return min - 1 + computeIterativeZipfian(random_state, n, s);
 }
 
 /*
@@ -2433,17 +2316,16 @@ evalStandardFunc(TState *thread, CState *st,
 					}
 					else if (func == PGBENCH_RANDOM_ZIPFIAN)
 					{
-						if (param <= 0.0 || param == 1.0 || param > MAX_ZIPFIAN_PARAM)
+						if (param < MIN_ZIPFIAN_PARAM || param > MAX_ZIPFIAN_PARAM)
 						{
 							fprintf(stderr,
-									"zipfian parameter must be in range (0, 1) U (1, %d]"
-									" (got %f)\n", MAX_ZIPFIAN_PARAM, param);
+									"zipfian parameter must be in range [%.3f, %.0f]"
+									" (got %f)\n",
+									MIN_ZIPFIAN_PARAM, MAX_ZIPFIAN_PARAM, param);
 							return false;
 						}
 						setIntValue(retval,
-									getZipfianRand(&thread->zipf_cache,
-												   &st->cs_func_rs,
-												   imin, imax, param));
+									getZipfianRand(&st->cs_func_rs, imin, imax, param));
 					}
 					else		/* exponential */
 					{
@@ -5012,8 +4894,6 @@ printResults(TState *threads, StatsData *total, instr_time total_time,
 				tps_include,
 				tps_exclude;
 	int64		ntx = total->cnt - total->skipped;
-	int			i,
-				totalCacheOverflows = 0;
 
 	time_include = INSTR_TIME_GET_DOUBLE(total_time);
 
@@ -5041,15 +4921,6 @@ printResults(TState *threads, StatsData *total, instr_time total_time,
 		printf("number of transactions actually processed: " INT64_FORMAT "\n",
 			   ntx);
 	}
-	/* Report zipfian cache overflow */
-	for (i = 0; i < nthreads; i++)
-	{
-		totalCacheOverflows += threads[i].zipf_cache.overflowCount;
-	}
-	if (totalCacheOverflows > 0)
-	{
-		printf("zipfian cache array overflowed %d time(s)\n", totalCacheOverflows);
-	}
 
 	/* Remaining stats are nonsensical if we failed to execute any xacts */
 	if (total->cnt <= 0)
@@ -5937,9 +5808,6 @@ main(int argc, char **argv)
 		initRandomState(&thread->ts_sample_rs);
 		thread->logfile = NULL; /* filled in later */
 		thread->latency_late = 0;
-		thread->zipf_cache.nb_cells = 0;
-		thread->zipf_cache.current = 0;
-		thread->zipf_cache.overflowCount = 0;
 		initStats(&thread->stats, 0);
 
 		nclients_dealt += thread->nstate;
diff --git a/src/bin/pgbench/t/001_pgbench_with_server.pl b/src/bin/pgbench/t/001_pgbench_with_server.pl
index 45888dc12e..a63adacc80 100644
--- a/src/bin/pgbench/t/001_pgbench_with_server.pl
+++ b/src/bin/pgbench/t/001_pgbench_with_server.pl
@@ -676,13 +676,13 @@ SELECT LEAST(}.join(', ', (':i') x 256).q{)}
 	[
 		'set zipfian param to 1',
 		2,
-		[qr{zipfian parameter must be in range \(0, 1\) U \(1, \d+\]}],
+		[qr{zipfian parameter must be in range \[1\.001, 1000\]}],
 		q{\set i random_zipfian(0, 10, 1)}
 	],
 	[
 		'set zipfian param too large',
 		2,
-		[qr{zipfian parameter must be in range \(0, 1\) U \(1, \d+\]}],
+		[qr{zipfian parameter must be in range \[1\.001, 1000\]}],
 		q{\set i random_zipfian(0, 10, 1000000)}
 	],
 	[
@@ -815,33 +815,6 @@ for my $e (@errors)
 		{ $n => $script });
 }
 
-# zipfian cache array overflow
-pgbench(
-	'-t 1', 0,
-	[ qr{processed: 1/1}, qr{zipfian cache array overflowed 1 time\(s\)} ],
-	[qr{^}],
-	'pgbench zipfian array overflow on random_zipfian',
-	{
-		'001_pgbench_random_zipfian' => q{
-\set i random_zipfian(1, 100, 0.5)
-\set i random_zipfian(2, 100, 0.5)
-\set i random_zipfian(3, 100, 0.5)
-\set i random_zipfian(4, 100, 0.5)
-\set i random_zipfian(5, 100, 0.5)
-\set i random_zipfian(6, 100, 0.5)
-\set i random_zipfian(7, 100, 0.5)
-\set i random_zipfian(8, 100, 0.5)
-\set i random_zipfian(9, 100, 0.5)
-\set i random_zipfian(10, 100, 0.5)
-\set i random_zipfian(11, 100, 0.5)
-\set i random_zipfian(12, 100, 0.5)
-\set i random_zipfian(13, 100, 0.5)
-\set i random_zipfian(14, 100, 0.5)
-\set i random_zipfian(15, 100, 0.5)
-\set i random_zipfian(16, 100, 0.5)
-}
-	});
-
 # throttling
 pgbench(
 	'-t 100 -S --rate=100000 --latency-limit=1000000 -c 2 -n -r',

Reply via email to