Joe McDonnell created IMPALA-13943:
--------------------------------------

             Summary: Add option to seed rand() with scan range information
                 Key: IMPALA-13943
                 URL: https://issues.apache.org/jira/browse/IMPALA-13943
             Project: IMPALA
          Issue Type: Task
          Components: Backend
    Affects Versions: Impala 5.0.0
            Reporter: Joe McDonnell

For conditions that use rand() in a scan node, the PRNG behind rand() is started fresh for each scan range. This means that each scan range can produce the same sequence of random numbers. For example:

{noformat}
create table randtest (i int);
-- Create multiple files with the same rows
insert into randtest values (1),(2),(3),(4),(5),(6),(7),(8),(9),(10);
insert into randtest values (1),(2),(3),(4),(5),(6),(7),(8),(9),(10);
insert into randtest values (1),(2),(3),(4),(5),(6),(7),(8),(9),(10);
select i, count(*) from randtest where rand() < 0.5 group by i;
+----------------------------------+-----------------------+
| default.randtest.i (tid=1 sid=1) | count() (tid=1 sid=2) |
+----------------------------------+-----------------------+
| 4                                | 3                     |
| 6                                | 3                     |
| 5                                | 3                     |
| 8                                | 3                     |
| 1                                | 3                     |
| 3                                | 3                     |
+----------------------------------+-----------------------+
{noformat}

Since each scan range gets the same sequence of random numbers from the PRNG, each scan range returns the same rows: every value that appears has a count of 3, and the values 2, 7, 9, and 10 never appear. If the draws were independent across scan ranges, the query would likely return all of the values 1-10.

One option is to have a mode that hashes the scan range information and uses it to seed the PRNG, which gives better randomness in this case. This is still deterministic for unchanging files.

Another option is to have a mode where rand() uses a random seed for true non-determinism.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
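
A minimal sketch of the first option (seeding from hashed scan range information), written in Python only to illustrate the idea. It is not Impala code. The function names, the choice of SHA-256, and the (path, offset, length) fields used to describe a scan range are all assumptions made for illustration, and the file paths are hypothetical.

```python
import hashlib
import random


def scan_range_seed(path, offset, length, base_seed=0):
    """Derive a 64-bit seed from hypothetical scan range fields.

    The same (base_seed, path, offset, length) always yields the same seed,
    so results stay deterministic as long as the files do not change.
    """
    key = f"{base_seed}:{path}:{offset}:{length}".encode("utf-8")
    return int.from_bytes(hashlib.sha256(key).digest()[:8], "little")


def filter_range(rows, seed, threshold=0.5):
    """Emulate `where rand() < threshold` over one scan range."""
    rng = random.Random(seed)  # PRNG started fresh for this scan range
    return [row for row in rows if rng.random() < threshold]


rows = list(range(1, 11))

# Three hypothetical files holding identical rows, one scan range each.
scan_ranges = [
    ("/warehouse/randtest/file0", 0, 40),
    ("/warehouse/randtest/file1", 0, 40),
    ("/warehouse/randtest/file2", 0, 40),
]

# Behavior described in this issue: the same seed for every scan range,
# so every range passes exactly the same rows.
same_seed = [filter_range(rows, 0) for _ in scan_ranges]

# Proposed mode: a per-range seed derived from the scan range information.
seeds = [scan_range_seed(*sr) for sr in scan_ranges]
per_range = [filter_range(rows, seed) for seed in seeds]

print("same seed :", same_seed)
print("per range :", per_range)
```

With one seed shared by every range, the three filtered lists are always identical, which matches the counts of 3 in the output above. With per-range seeds, each range draws its own sequence, and rerunning against the same files reproduces the same result.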