Hello Ildar,

so that different instances of hash function within one script would
have different seeds. Yes, that is a good idea, I can do that.

Added this feature in attached patch. But on a second thought this could
be something that user won't expect. For example, they may want to run
pgbench with two scripts:
- the first one updates row by key that is a hashed random_zipfian value;
- the second one reads row by key generated the same way
(that is actually what YCSB workloads A and B do)

It feels natural to write something like this:
\set rnd random_zipfian(0, 1000000, 0.99)
\set key abs(hash(:rnd)) % 1000
in both scripts and expect that they both would have the same
distribution. But they wouldn't. We could of course describe this
implicit behaviour in documentation, but ISTM that shared seed would be
more clear.

I think that it depends on the use case, that both can be useful, so there should be a way to do either.

With "always different" default seed, distinct distributions are achieved
with:

   -- DIFF distinct seeds inside and between runs
   \set i1 abs(hash(:r1)) % 1000
   \set j1 abs(hash(:r2)) % 1000

and the same distribution can be done with an explicit seed:

   -- DIFF same seed inside and between runs
   \set i1 abs(hash(:r1, 5432)) % 1000
   \set j1 abs(hash(:r2, 5432)) % 1000

The drawback is that the same seed is used between runs in this case, which is not desirable. This could be circumvented by adding the random seed as an automatic variable and using it, eg:

   -- DIFF same seed inside run, distinct between runs
   \set i1 abs(hash(:r1, :random_seed + 5432)) % 1000
   \set j1 abs(hash(:r2, :random_seed + 5432)) % 1000


Now with a shared hash_seed, the same distribution is the default:

   -- SHARED same underlying hash_seed inside run, distinct between runs
   \set i1 abs(hash(:r1)) % 1000
   \set j1 abs(hash(:r2)) % 1000

However some trick is needed now to get distinct seeds. With

   -- SHARED distinct seed inside run, but same between runs
   \set i1 abs(hash(:r1, 5432)) % 1000
   \set j1 abs(hash(:r2, 2345)) % 1000

we are back to the same issue as in the previous case, because the distribution is then the same from one run to the next, which is not desirable. I found this workaround trick:

   -- SHARED distinct seeds inside and between runs
   \set i1 abs(hash(:r1, hash(5432))) % 1000
   \set j1 abs(hash(:r2, hash(2345))) % 1000
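
To illustrate why this trick works, here is a minimal Python sketch. The seeded_hash stand-in (built on BLAKE2b) and the helper names are assumptions made for illustration only; they are not pgbench's hash function or API. The point is just that hash(5432) itself depends on the shared per-run seed, so a seed derived from it differs between runs and between call sites.

```python
import hashlib
import struct

MASK64 = 0xFFFFFFFFFFFFFFFF


def seeded_hash(value: int, seed: int) -> int:
    """Stand-in 64-bit seeded hash (assumption, NOT pgbench's implementation)."""
    data = struct.pack("<QQ", value & MASK64, seed & MASK64)
    digest = hashlib.blake2b(data, digest_size=8).digest()
    return struct.unpack("<q", digest)[0]  # signed 64-bit result


def hash_with_default(value: int, run_seed: int, seed=None) -> int:
    """hash(value) uses the shared per-run seed; hash(value, seed) overrides it."""
    return seeded_hash(value, run_seed if seed is None else seed)


def keys(run_seed: int, site_constant: int, inputs) -> list:
    """Models: \\set k abs(hash(:r, hash(site_constant))) % 1000"""
    site_seed = hash_with_default(site_constant, run_seed)  # hash(5432)
    return [abs(hash_with_default(r, run_seed, site_seed)) % 1000 for r in inputs]


inputs = range(50)
same_site = keys(1, 5432, inputs) == keys(1, 5432, inputs)
other_site = keys(1, 5432, inputs) != keys(1, 2345, inputs)
other_run = keys(1, 5432, inputs) != keys(2, 5432, inputs)
print(same_site, other_site, other_run)
```

Reusing the same constant at two call sites gives them the same derived seed within a run, which is the "equal if desired" case.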

Or with a new :hash_seed or :random_seed automatic variable, we could also have:

   -- SHARED distinct seeds inside and between runs
   \set i1 abs(hash(:r1, :hash_seed + 5432)) % 1000
   \set j1 abs(hash(:r2, :hash_seed + 2345)) % 1000

This provides seeds that differ between runs, yet can be made equal within a run if desired, by reusing the same value to be hashed as a seed.

I also agree with your argument that the user may reasonably expect that hash(5432) == hash(5432) inside and between scripts, at least within the same run, and so would be surprised if that were not the case.

So I've changed my mind, I'm sorry for making you go back and forth on the subject. I'm now okay with one shared 64-bit hash seed, with clear documentation about the fact, and an outline of the trick to achieve distinct distributions inside a run if desired and why it would be desirable to avoid correlations. Also, I think that providing the seed as an automatic variable (:hash_seed or :hseed or whatever) would make some sense as well. Maybe this could be used as a way to fix the seed explicitly, eg:

   pgbench -D hash_seed=1234 ...

would use this value instead of the randomly generated one. Also, with that, the default inserted second argument could simply be ":hash_seed", which would simplify the executor, as it would not have to check for an optional second argument.
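
As a rough sketch of that idea (in Python rather than pgbench's C, and with hypothetical helper names resolve_hash_seed and expand_hash_args that exist nowhere in pgbench), the seed would be taken from a user-defined variable when present and drawn at random otherwise, and the parser would fill in the missing second argument so the executor always sees two:

```python
import random


def resolve_hash_seed(variables: dict) -> int:
    """Use a user-supplied hash_seed (e.g. from -D) if set, else a random 64-bit one."""
    if "hash_seed" not in variables:
        variables["hash_seed"] = random.getrandbits(64)
    return int(variables["hash_seed"])


def expand_hash_args(args: list) -> list:
    """Parser-side default: hash(x) becomes hash(x, :hash_seed)."""
    if len(args) == 1:
        return [args[0], ":hash_seed"]
    if len(args) == 2:
        return list(args)
    raise ValueError("hash() takes one or two arguments")


print(resolve_hash_seed({"hash_seed": "1234"}))
print(expand_hash_args([":r1"]))
print(expand_hash_args([":r1", "5432"]))
```

With this shape, only the parser knows about the default, and an explicit second argument is passed through untouched.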

--
Fabien.
