Re: General purpose hashing func in pgbench

Fabien COELHO Sat, 13 Jan 2018 00:17:42 -0800


Hello Ildar,

so that different instances of hash function within one script would
have different seeds. Yes, that is a good idea, I can do that.

Added this feature in attached patch. But on a second thought this could
be something that user won't expect. For example, they may want to run
pgbench with two scripts:
- the first one updates row by key that is a hashed random_zipfian value;
- the second one reads row by key generated the same way
(that is actually what YCSB workloads A and B do)

It feels natural to write something like this:
\set rnd random_zipfian(0, 1000000, 0.99)
\set key abs(hash(:rnd)) % 1000
in both scripts and expect that they both would have the same
distribution. But they wouldn't. We could of course describe this
implicit behaviour in documentation, but ISTM that shared seed would be
more clear.

I think that it depends on the use case: both can be useful, so there should be a way to do either.
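
To make the two behaviours concrete, here is a minimal Python sketch. The seeded_hash function is a stand-in of my own (an FNV-1a variant mixed with the seed), not the hash from the patch, and all names are hypothetical. It only illustrates that a writer and a reader script agree on keys when they share a seed, and disagree when their seeds differ.

```python
MASK64 = (1 << 64) - 1
FNV_OFFSET = 0xCBF29CE484222325
FNV_PRIME = 0x100000001B3


def seeded_hash(value: int, seed: int) -> int:
    """Stand-in seeded 64-bit hash: FNV-1a over the 8 bytes of value."""
    h = (FNV_OFFSET ^ seed) & MASK64
    for byte in (value & MASK64).to_bytes(8, "little"):
        h = ((h ^ byte) * FNV_PRIME) & MASK64
    return h


def key(value: int, seed: int, buckets: int = 1000) -> int:
    """Map a value to a bucket, like abs(hash(:rnd)) % 1000 in the script."""
    return seeded_hash(value, seed) % buckets


values = range(10)
writer_keys = [key(v, 5432) for v in values]   # first script
reader_same = [key(v, 5432) for v in values]   # second script, same seed
reader_other = [key(v, 2345) for v in values]  # second script, other seed

print(writer_keys == reader_same)
print(writer_keys == reader_other)
```

With a shared seed, the reader hits exactly the rows the writer updates; with distinct seeds the two key sequences are unrelated, which is the surprise described above.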


With "always different" default seed, distinct distributions are achieved
with:

   -- DIFF distinct seeds inside and between runs
   \set i1 abs(hash(:r1)) % 1000
   \set j1 abs(hash(:r2)) % 1000

and the same distribution can be done with an explicit seed:

   -- DIFF same seed inside and between runs
   \set i1 abs(hash(:r1, 5432)) % 1000
   \set j1 abs(hash(:r2, 5432)) % 1000

The drawback is that the same seed is used between runs in this case, which is not desirable. This could be circumvented by adding the random seed as an automatic variable and using it, e.g.:


   -- DIFF same seed inside run, distinct between runs
   \set i1 abs(hash(:r1, :random_seed + 5432)) % 1000
   \set j1 abs(hash(:r2, :random_seed + 5432)) % 1000


Now, with a shared hash_seed, the same distribution is obtained by default:

   -- SHARED same underlying hash_seed inside run, distinct between runs
   \set i1 abs(hash(:r1)) % 1000
   \set j1 abs(hash(:r2)) % 1000

However, some trick is now needed to get distinct seeds. With

   -- SHARED distinct seed inside run, but same between runs
   \set i1 abs(hash(:r1, 5432)) % 1000
   \set j1 abs(hash(:r2, 2345)) % 1000

we are back to the same issue as in the previous case, because the distribution is then the same from one run to the next, which is not desirable. I found this workaround trick:


   -- SHARED distinct seeds inside and between runs
   \set i1 abs(hash(:r1, hash(5432))) % 1000
   \set j1 abs(hash(:r2, hash(2345))) % 1000
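
The trick can be sketched in Python as follows. Again, seeded_hash is a hypothetical stand-in (an FNV-1a variant), not the patch's function, and the run seeds 111 and 222 are made-up values. The inner call plays the role of hash(5432) evaluated under the shared seed, so the derived seed changes with both the constant and the run.

```python
MASK64 = (1 << 64) - 1
FNV_OFFSET = 0xCBF29CE484222325
FNV_PRIME = 0x100000001B3


def seeded_hash(value: int, seed: int) -> int:
    """Stand-in seeded 64-bit hash: FNV-1a over the 8 bytes of value."""
    h = (FNV_OFFSET ^ seed) & MASK64
    for byte in (value & MASK64).to_bytes(8, "little"):
        h = ((h ^ byte) * FNV_PRIME) & MASK64
    return h


def derived_seed(shared_seed: int, label: int) -> int:
    """Mimic hash(label) computed with the run's shared seed."""
    return seeded_hash(label, shared_seed)


def key(value: int, shared_seed: int, label: int, buckets: int = 1000) -> int:
    """Mimic abs(hash(:r, hash(label))) % 1000."""
    return seeded_hash(value, derived_seed(shared_seed, label)) % buckets


run_a, run_b = 111, 222  # hypothetical shared seeds of two separate runs

seed_i_a = derived_seed(run_a, 5432)  # seed used for i1 in run A
seed_j_a = derived_seed(run_a, 2345)  # seed used for j1 in run A
seed_i_b = derived_seed(run_b, 5432)  # seed used for i1 in run B

print(seed_i_a != seed_j_a)  # distinct inside a run
print(seed_i_a != seed_i_b)  # distinct between runs
print(key(42, run_a, 5432), key(42, run_a, 2345))
```

Using the same label in two places gives back equal seeds inside a run, so the user keeps control over which distributions are correlated.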

Or with a new :hash_seed or :random_seed automatic variable, we could also have:


   -- SHARED distinct seeds inside and between runs
   \set i1 abs(hash(:r1, :hash_seed + 5432)) % 1000
   \set j1 abs(hash(:r2, :hash_seed + 2345)) % 1000

This provides controllable seeds that are distinct between runs, yet can be made equal where desired by reusing the same value to be hashed as a seed.

I also agree with your argument that the user may reasonably expect that hash(5432) == hash(5432) inside and between scripts, at least within the same run, and would be surprised if that were not the case.

So I've changed my mind; I'm sorry for making you go back and forth on the subject. I'm now okay with one shared 64-bit hash seed, with clear documentation of that fact and an outline of the trick to achieve distinct distributions inside a run if desired, and of why it would be desirable to avoid correlations. Also, I think that providing the seed as an automatic variable (:hash_seed or :hseed or whatever) would make some sense as well. Maybe this could be used as a way to fix the seed explicitly, e.g.:


   pgbench -D hash_seed=1234 ...

This would use the given value instead of the randomly generated one. Also, with that, the default inserted second argument could simply be ":hash_seed", which would simplify the executor, as it would not have to check for an optional second argument.
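
As a rough sketch of that idea, in Python rather than in pgbench's C code: the seed is taken from the user's variables when present and drawn at random otherwise, and one-argument calls get ":hash_seed" appended when the expression is built. The function names and the dict-based variable store are assumptions for illustration only, not the patch's design.

```python
import secrets


def resolve_hash_seed(variables: dict) -> int:
    """Return the run's hash seed, honoring a user-supplied hash_seed."""
    if "hash_seed" not in variables:
        # No -D hash_seed=... given: pick a random 64-bit default.
        variables["hash_seed"] = str(secrets.randbits(64))
    return int(variables["hash_seed"])


def expand_hash_call(args: list) -> list:
    """Append ':hash_seed' when hash() was written with a single argument."""
    if len(args) == 2:
        return list(args)
    return [args[0], ":hash_seed"]


variables = {"hash_seed": "1234"}  # as if set with -D hash_seed=1234
print(resolve_hash_seed(variables))
print(expand_hash_call([":r1"]))
print(expand_hash_call([":r1", "5432"]))
```

Since every call then carries two arguments, the evaluation step needs no special case for the optional seed.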


--
Fabien.
