Carter Shanklin created HIVE-16802:
--------------------------------------

             Summary: Standard Table Sampling
                 Key: HIVE-16802
                 URL: https://issues.apache.org/jira/browse/HIVE-16802
             Project: Hive
          Issue Type: Sub-task
            Reporter: Carter Shanklin


Hive's table sampling implementation has 2 major issues:
1. It makes assumptions about file layout. The main sampling approach requires 
tables to be bucketed. Bucketed tables are becoming less common as the need to 
bucket has diminished over time, so many users cannot benefit from sampling.
2. The syntax is nonstandard.

The SQL standard defines a TABLESAMPLE operator that requires a sampling method 
and a percentage p to be supplied. The number of output records is approximately 
N * p/100, where N is the number of records in the table. There are two 
sampling methods defined: BERNOULLI and SYSTEM.

With the BERNOULLI sampling method, each record is evaluated as an independent 
Bernoulli trial.
The SYSTEM sampling method only controls the size of the output set; there is 
no independence guarantee between rows. It's common for SYSTEM sampling to be 
done at a block or page level, and if a block is selected, all records from the 
block are returned. Hive's current sampling methods are effectively types of 
SYSTEM sampling.
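
To make the difference between the two methods concrete, here is a minimal 
Python sketch of the semantics described above. It is not Hive code and not 
how Hive would necessarily implement the feature: the function names and the 
block_size parameter are hypothetical, and block-level selection is only one 
common way to realize SYSTEM sampling.

```python
import random

def bernoulli_sample(rows, p, rng):
    """Keep each row independently with probability p/100."""
    return [r for r in rows if rng.random() * 100 < p]

def system_sample(rows, p, rng, block_size=100):
    """Keep or drop whole blocks; each block is kept with probability p/100.

    block_size is a hypothetical stand-in for a storage block or page.
    """
    out = []
    for start in range(0, len(rows), block_size):
        if rng.random() * 100 < p:
            # The block was selected, so every row in it is returned.
            out.extend(rows[start:start + block_size])
    return out

rows = list(range(10_000))
rng = random.Random(42)
bernoulli_rows = bernoulli_sample(rows, 30, rng)
system_rows = system_sample(rows, 30, rng)

# Both sizes are only approximately N * p/100 = 3000.
print(len(bernoulli_rows), len(system_rows))
```

In this sketch the SYSTEM result always consists of whole blocks, so rows in 
the same block are either all present or all absent, which is why it offers 
no independence guarantee between rows.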

The standard also allows you to seed the PRNG used for trials using the 
REPEATABLE clause. The same input table with the same p value and the same 
REPEATABLE value produces the same output.
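
The effect of REPEATABLE can be illustrated with a short Python sketch, 
assuming the seed simply initializes the PRNG used for the trials. The 
function name repeatable_sample is hypothetical and this is not Hive's 
implementation.

```python
import random

def repeatable_sample(rows, p, seed=None):
    """Bernoulli sampling with an optional seed, mimicking REPEATABLE (seed)."""
    rng = random.Random(seed)  # seed=None gives a different stream on each call
    return [r for r in rows if rng.random() * 100 < p]

rows = list(range(1000))

# Same input, same p, same seed: identical output on every run.
first = repeatable_sample(rows, 50, seed=1234)
second = repeatable_sample(rows, 50, seed=1234)
print(first == second)
```

The guarantee only holds while the input stays the same; in this sketch, 
reordering or changing rows would change which rows the trials select.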

Some examples:
select * from t tablesample bernoulli ( 50 );
select * from t tablesample system ( 30 ) repeatable (1234);




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
