Hello Tomas,

I'm trying to use random_zipfian() for benchmarking skewed data sets, and I ran head-first into an issue with rather excessive CPU costs. [...] This happens because generalizedHarmonicNumber() does this:

        for (i = n; i > 1; i--)
                ans += pow(i, -s);

where n happens to be 1000000000 (range passed to random_zipfian), so
the loop takes quite a bit of time.

If you find a better formula for the harmonic number, you are welcome to contribute it, and would probably get your name on it:-)

So much so that we only ever complete a
single transaction, because this work happens in the context of the
first transaction, and so it counts toward the 30-second limit.

That is why the value is cached, so it is only computed once per size & parameter value.

If you want skewed but not especially zipfian, use the exponential distribution, which is quite cheap. Also, zipfian with a parameter above 1.0 does not have to compute the harmonic number, so the cost depends on the parameter.

Please also beware that non-uniform keys are correlated (the most frequent values are close to one another), which is somewhat unrepresentative of what you would expect in real life. This is why I submitted a pseudo-random permutation function, which alas has not gotten much momentum from committers.

The docs actually do mention performance of this function:

   The function's performance is poor for parameter values close and
   above 1.0 and on a small range.

But that does not seem to cover the example I just posted, because 0.1
is not particularly close to 1.0 (or above it), and 1e9 values hardly
count as a "small range".

Yep. The warning is about the repeated cost of calling random_zipfian, which is not good when the parameter is close to and above 1.0. It is not about the setup cost when the parameter is in (0, 1). This could indeed deserve a warning.

Now, if you do benchmarking on a database, you probably want to run it for hours to level out checkpointing and other background tasks, so the setup cost should be negligible, in the end...

I see this part of random_zipfian comes from "Quickly Generating
Billion-Record Synthetic Databases" which deals with generating data
sets, where wasting a bit of time is not a huge deal. But when running
benchmarks it matters much more. So maybe there's a disconnect here?

Hmmm. The first author of this paper got a Turing award:-) The disconnect is just that the setup cost is neglected, or computed offline.

Interestingly enough, I don't see this issue on values above 1.0, no
matter how close to 1.0 those are. Although the throughput seems lower
than with uniform access, so this part of random_zipfian is not quite
cheap either.

Indeed. Pg provides an underlying uniform pseudo-random function, so generating uniform values is cheap. Other distributions need more or less expensive transformations.

Now, I don't know what to do about this. Ideally, we'd have a faster
algorithm to generate zipfian distributions

You are welcome to find one, and get famous (hmmm... among some specialized mathematicians at least:-) for it.

- I don't know if such a thing exists, though. Or maybe we could precompute the distribution first, not counting it toward the benchmark duration.

Could be done, but this would require significant partial-evaluation efforts, and would only work when the size and parameter are constants (although using the function with variable parameters would be a very bad idea anyway).

But I guess the very least we can/should do is improving the docs to
make it more obvious which cases are expected to be slow.

Yep. Attached is a doc & comment improvement.

--
Fabien.
diff --git a/doc/src/sgml/ref/pgbench.sgml b/doc/src/sgml/ref/pgbench.sgml
index 9d18524834..1481be64ea 100644
--- a/doc/src/sgml/ref/pgbench.sgml
+++ b/doc/src/sgml/ref/pgbench.sgml
@@ -1605,7 +1605,8 @@ f(x) = PHI(2.0 * parameter * (x - mu) / (max - min + 1)) /
       "Non-Uniform Random Variate Generation", Luc Devroye, p. 550-551,
       Springer 1986. The distribution is not defined when the parameter's
       value is 1.0. The function's performance is poor for parameter values
-      close and above 1.0 and on a small range.
+      close and above 1.0 and on a small range. Also, there is a large setup
+      cost when the parameter is in (0, 1) and the interval is large.
      </para>
      <para>
       <replaceable>parameter</replaceable> defines how skewed the distribution
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 19532cfb54..d1fb11ca0e 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -410,7 +410,7 @@ typedef struct
 } CState;
 
 /*
- * Cache cell for random_zipfian call
+ * Cache cell for random_zipfian call for when s in (0.0, 1.0)
  */
 typedef struct
 {
@@ -977,15 +977,17 @@ getPoissonRand(RandomState *random_state, double center)
 	return (int64) (-log(uniform) * center + 0.5);
 }
 
-/* helper function for getZipfianRand */
+/* helper function for getZipfianRand when s in (0.0, 1.0) */
 static double
 generalizedHarmonicNumber(int64 n, double s)
 {
 	int			i;
 	double		ans = 0.0;
 
+	/* computed in reverse order for better numerical precision */
 	for (i = n; i > 1; i--)
 		ans += pow(i, -s);
+
 	return ans + 1.0;
 }
 
@@ -1048,6 +1050,12 @@ zipfFindOrCreateCacheCell(ZipfCache *cache, int64 n, double s)
  * Computing zipfian using rejection method, based on
  * "Non-Uniform Random Variate Generation",
  * Luc Devroye, p. 550-551, Springer 1986.
+ *
+ * This method applies when s > 1.0 and works with an unbounded n,
+ * although the implementation is bounded.
+ *
+ * It is inefficient when s is close to 1.0 or n is small because the
+ * while loop is more likely to be iterated.
  */
 static int64
 computeIterativeZipfian(RandomState *random_state, int64 n, double s)
@@ -1078,6 +1086,12 @@ computeIterativeZipfian(RandomState *random_state, int64 n, double s)
  * Computing zipfian using harmonic numbers, based on algorithm described in
  * "Quickly Generating Billion-Record Synthetic Databases",
  * Jim Gray et al, SIGMOD 1994
+ *
+ * This method applies when s in (0.0, 1.0). It does zipf for the first
+ * two values, and then uses an exponential approximation starting from
+ * the third value.
+ *
+ * There is a potentially very large setup cost when n is large.
  */
 static int64
 computeHarmonicZipfian(ZipfCache *zipf_cache, RandomState *random_state,
@@ -1104,7 +1118,7 @@ getZipfianRand(ZipfCache *zipf_cache, RandomState *random_state, int64 min,
 	/* abort if parameter is invalid */
 	Assert(s > 0.0 && s != 1.0 && s <= MAX_ZIPFIAN_PARAM);
 
-	return min - 1 + ((s > 1)
+	return min - 1 + ((s > 1.0)
 					  ? computeIterativeZipfian(random_state, n, s)
 					  : computeHarmonicZipfian(zipf_cache, random_state, n, s));
 }
