Hello Tomas,

What would a user do with this information, and how would they know
what to do?

Sure, but it was unclear what to do. Extending the cache to avoid that would look like over-engineering.

That seems like a rather strange argument. What exactly is so complex
about resizing the cache that it qualifies as over-engineering?

Because the cache size can diverge on a bad script, and the search is currently linear. Even with a logarithmic search it does not look attractive.

If the choice is between reporting the failure to the user, and
addressing the failure, surely the latter would be the default option?
Particularly if the user can't really address the issue easily
(recompiling psql is not a very practical solution).

I remain of the opinion that we ought to simply rip out support for
zipfian with s < 1.

+1 to that

If this is done, some people whose zipfian-distribution scripts currently work might be unhappy.

Some people really want zipfian because it reflects their data access
pattern, possibly with s < 1.

We cannot help it: real life seems zipfianly distributed :-)

Sure. But that hardly means we ought to provide algorithms that we know
are not suitable for benchmarking tools, which I'd argue is this case.

Hmmm. It really depends on the values actually used.

Also, we have two algorithms for generating zipfian distributions. Why
wouldn't it be sufficient to keep just one of them?

Because the two methods work for different values of the parameter, so it depends on which distribution is targeted. If you want s < 1, the rejection method does not help (precisely, the real "infinite" zipfian distribution is not mathematically defined when s < 1 because the underlying sum diverges; having s < 1 can only be done on a finite set).
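
To make the divergence point concrete (standard textbook material, not
taken from the patch): the zipfian probability of rank k over 1..n is

    \[ \Pr[X = k] = \frac{k^{-s}}{H_{n,s}}, \qquad
       H_{n,s} = \sum_{k=1}^{n} k^{-s}. \]

For s > 1, H_{n,s} converges to the Riemann zeta value \zeta(s) as n
grows, so an unbounded "infinite" distribution exists; for s <= 1 the
sum diverges, and the distribution can only be normalized on a finite
range.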

It's not useful for benchmarking purposes to have a random-number
function with such poor computational properties.

This is mostly a startup cost (one O(n) pass summing n powers per
distinct (n, s) pair); the generation cost while a bench is running is
reasonable. How best to implement the precomputation is an open
question.

... which means it's not a startup cost. IMHO this simply shows pgbench
does not have the necessary infrastructure to provide this feature.

Well, a quick one has been proposed with a cache. Now I can imagine alternatives, but I'm wondering how much it would be worth.

E.g., restrict random_zipfian to more-or-less constant parameters when s < 1, so that a partial evaluation could collect the needed (n, s) pairs and perform the precomputations before the bench is started.

Now, that looks like over-engineering, and then someone would complain of the restrictions.

As a reviewer I was not thrilled by the cache stuff, but I had no better
idea that would not fall under "over-over engineering" or the like.

Maybe it could error out and say "recompile me", but then someone
would have said "that is unacceptable".

Maybe it could auto-extend the cache, but that is still more unnecessary
over-engineering, IMHO.

I'm puzzled. Why would that be over-engineering?

As stated above: the cache size can diverge, and the lookup is an O(n) linear search. Even an O(log n) search would be bad, IMO.
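
For readers following along, here is a minimal sketch in C of the kind
of lookup being discussed, loosely modelled on the patch's
zipfFindOrCreateCacheCell; the names, struct layout, and fixed capacity
are illustrative assumptions, not the exact pgbench code:

#include <stddef.h>
#include <stdint.h>

typedef int64_t int64;		/* stand-in for the postgres typedef */

#define ZIPF_CACHE_SIZE 15	/* assumed fixed capacity */

/* One precomputed (n, s) parameter pair. */
typedef struct
{
	int64	n;			/* range size (max - min + 1) */
	double	s;			/* zipfian parameter */
	double	harmonicn;	/* precomputed generalizedHarmonicNumber(n, s) */
} ZipfCell;

typedef struct
{
	int			nb_cells;
	ZipfCell	cells[ZIPF_CACHE_SIZE];
} ZipfCacheSketch;

/*
 * Linear O(nb_cells) scan on every generated value: each distinct
 * (n, s) pair used by a script occupies one cell, so a script using
 * many pairs both fills the cache and pays the full scan each time.
 */
static ZipfCell *
zipfCacheLookup(ZipfCacheSketch *cache, int64 n, double s)
{
	for (int i = 0; i < cache->nb_cells; i++)
	{
		ZipfCell   *cell = &cache->cells[i];

		if (cell->n == n && cell->s == s)
			return cell;
	}
	return NULL;	/* caller must create a cell, paying the O(n) setup */
}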

I think leaving it in there is just a foot-gun: we'd be a lot better
off throwing an error that tells people to use some other distribution.

When s < 1, the startup cost is indeed a pain. However, it is a pain
prescribed by a Turing Award.

Then we shouldn't have it, probably. Or we should at least implement a
proper startup phase, so that the costly precomputation is not included
in the test interval.

That can be done, but I'm not sure of a very elegant design. And I was just the reviewer on this one.

Or if we do leave it in there, we for sure have to have documentation
that *actually* explains how to use it, which this patch still doesn't.

I'm not sure what explaining there could be about how to use it: one
calls the function to obtain pseudo-random integers with the desired
distribution?
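
For instance, a script fragment could look like this (illustrative
only, assuming the standard pgbench_accounts schema and the documented
random_zipfian(lb, ub, parameter) form; it is not part of the patch):

    \set aid random_zipfian(1, 100000 * :scale, 0.5)
    SELECT abalance FROM pgbench_accounts WHERE aid = :aid;

With parameter 0.5 this takes the s < 1 path, so the first evaluation
of each distinct (size, parameter) pair pays the harmonic-number
precomputation discussed above.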

Well, I'd argue the current description "performance is poor" is not
particularly clear.

There's nothing suggesting that you'd better not use a large number of
different (n,s) combinations.

Indeed, there is no caveat about this point, as noted above.
Please find an updated patch for the documentation, pointing out the
existence of the cache and advising not to overdo it.
It does not solve the underlying problem which raised your complaint,
but at least it is documented.


I think you forgot to attach the patch ...

Indeed, this is becoming a habit :-( Attached. Hopefully.

--
Fabien.
diff --git a/doc/src/sgml/ref/pgbench.sgml b/doc/src/sgml/ref/pgbench.sgml
index 9d18524834..7c961842fe 100644
--- a/doc/src/sgml/ref/pgbench.sgml
+++ b/doc/src/sgml/ref/pgbench.sgml
@@ -1605,7 +1605,8 @@ f(x) = PHI(2.0 * parameter * (x - mu) / (max - min + 1)) /
       "Non-Uniform Random Variate Generation", Luc Devroye, p. 550-551,
       Springer 1986. The distribution is not defined when the parameter's
       value is 1.0. The function's performance is poor for parameter values
-      close and above 1.0 and on a small range.
+      very close to and above 1.0, and on a small range. Also, there is a
+      large setup cost when the parameter is in (0, 1) and the interval is large.
      </para>
      <para>
       <replaceable>parameter</replaceable> defines how skewed the distribution
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 19532cfb54..b173cec59b 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -410,7 +410,7 @@ typedef struct
 } CState;
 
 /*
- * Cache cell for random_zipfian call
+ * Cache cell for random_zipfian calls when s is in (0.0, 1.0)
  */
 typedef struct
 {
@@ -418,7 +418,7 @@ typedef struct
 	double		s;				/* s - parameter of random_zipfian function */
 	int64		n;				/* number of elements in range (max - min + 1) */
 
-	double		harmonicn;		/* generalizedHarmonicNumber(n, s) */
+	double		harmonicn;		/* caches the expensive generalizedHarmonicNumber(n, s) */
 	double		alpha;
 	double		beta;
 	double		eta;
@@ -977,15 +977,17 @@ getPoissonRand(RandomState *random_state, double center)
 	return (int64) (-log(uniform) * center + 0.5);
 }
 
-/* helper function for getZipfianRand */
+/* expensive helper function for getZipfianRand when s is in (0.0, 1.0) */
 static double
 generalizedHarmonicNumber(int64 n, double s)
 {
-	int			i;
+	int64			i;
 	double		ans = 0.0;
 
+	/* computed in reverse order for better numerical precision */
 	for (i = n; i > 1; i--)
 		ans += pow(i, -s);
+
 	return ans + 1.0;
 }
 
@@ -1048,6 +1050,12 @@ zipfFindOrCreateCacheCell(ZipfCache *cache, int64 n, double s)
  * Computing zipfian using rejection method, based on
  * "Non-Uniform Random Variate Generation",
  * Luc Devroye, p. 550-551, Springer 1986.
+ *
+ * This method applies when s > 1.0 and works with an unbounded n,
+ * although the implementation is bounded.
+ *
+ * It is inefficient when s is close to 1.0 or when n is small, because
+ * the rejection loop is then more likely to iterate.
  */
 static int64
 computeIterativeZipfian(RandomState *random_state, int64 n, double s)
@@ -1078,6 +1086,12 @@ computeIterativeZipfian(RandomState *random_state, int64 n, double s)
  * Computing zipfian using harmonic numbers, based on algorithm described in
  * "Quickly Generating Billion-Record Synthetic Databases",
  * Jim Gray et al, SIGMOD 1994
+ *
+ * This method applies when s is in (0.0, 1.0). It handles the first
+ * two values exactly, and then uses an exponential approximation
+ * starting from the third value.
+ *
+ * There is a potentially very large setup cost when n is large.
  */
 static int64
 computeHarmonicZipfian(ZipfCache *zipf_cache, RandomState *random_state,
@@ -1104,7 +1118,7 @@ getZipfianRand(ZipfCache *zipf_cache, RandomState *random_state, int64 min,
 	/* abort if parameter is invalid */
 	Assert(s > 0.0 && s != 1.0 && s <= MAX_ZIPFIAN_PARAM);
 
-	return min - 1 + ((s > 1)
+	return min - 1 + ((s > 1.0)
 					  ? computeIterativeZipfian(random_state, n, s)
 					  : computeHarmonicZipfian(zipf_cache, random_state, n, s));
 }
