Carter Shanklin created HIVE-16802:
--------------------------------------
Summary: Standard Table Sampling
Key: HIVE-16802
URL: https://issues.apache.org/jira/browse/HIVE-16802
Project: Hive
Issue Type: Sub-task
Reporter: Carter Shanklin
Hive's table sampling implementation has 2 major issues:
1. It makes assumptions about file layout. The main sampling approach requires
tables to be bucketed. Bucketed tables are becoming less common as the need to
bucket has diminished over time, so many users cannot benefit from sampling.
2. The syntax is non-standard.
The SQL standard defines a TABLESAMPLE operator that requires a sampling method
and a sampling percentage p. The number of output records is approximately
N * p/100, where N is the number of records in the table. There are two
sampling methods defined: BERNOULLI and SYSTEM.
With the BERNOULLI sampling method, each record is evaluated as an independent
Bernoulli trial.
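
To make the BERNOULLI semantics concrete, here is a minimal sketch in Python.
It is not Hive's implementation; the function name bernoulli_sample and the
in-memory list of rows are assumptions made purely for illustration.

```python
import random

def bernoulli_sample(rows, p, seed=None):
    """Keep each row independently with probability p/100 (hypothetical helper)."""
    rng = random.Random(seed)
    # One independent trial per row: rng.random() is uniform on [0, 1).
    return [row for row in rows if rng.random() * 100 < p]

rows = list(range(100_000))
sample = bernoulli_sample(rows, 50, seed=1234)
# The sample size is close to N * p/100 = 50,000, but not exactly that.
print(len(sample))
```

Because every row is its own trial, the output size varies around N * p/100
rather than hitting it exactly.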
The SYSTEM sampling method only controls the size of the output set; there is
no independence guarantee between rows. It's common for SYSTEM sampling to be
done at a block or page level: if a block is selected, all records from the
block are returned. Hive's current sampling methods are effectively types of
SYSTEM sampling.
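
A block-level SYSTEM sample can be sketched the same way. This is one possible
strategy under stated assumptions, not what Hive or the standard mandates: the
function name system_sample is hypothetical, and a table is modeled as a list
of blocks of 1,000 rows each.

```python
import random

def system_sample(blocks, p, seed=None):
    """Select whole blocks with probability p/100 (hypothetical helper)."""
    rng = random.Random(seed)
    out = []
    for block in blocks:
        # One trial per block; a selected block contributes all of its rows.
        if rng.random() * 100 < p:
            out.extend(block)
    return out

# 200 blocks of 1,000 rows each, i.e. N = 200,000.
blocks = [list(range(i * 1000, (i + 1) * 1000)) for i in range(200)]
sample = system_sample(blocks, 30, seed=1234)
# Rows arrive in whole blocks, so they are not independent of each other.
print(len(sample))
```

Only one random draw per block is needed, which is why this style of sampling
is cheap, but rows that share a block are always selected or rejected together.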
The standard also allows you to seed the PRNG used for trials using the
REPEATABLE clause. The same input table with the same p value and the same
REPEATABLE value produces the same output.
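
The REPEATABLE guarantee amounts to seeding the generator deterministically.
The sketch below assumes a hypothetical helper sample_repeatable and uses
Python's random module as a stand-in for whatever PRNG an engine would use.

```python
import random

def sample_repeatable(rows, p, repeatable):
    """Bernoulli-style sample whose PRNG is seeded by the REPEATABLE value."""
    rng = random.Random(repeatable)
    return [row for row in rows if rng.random() * 100 < p]

rows = list(range(10_000))
first = sample_repeatable(rows, 30, 1234)
second = sample_repeatable(rows, 30, 1234)
other = sample_repeatable(rows, 30, 4321)

# Same table, same p, same seed: identical output.
print(first == second)
# A different seed generally selects a different set of rows.
print(first == other)
```

Note that the guarantee only holds if the rows are visited in the same order
on every run, which this sketch gets for free from the in-memory list.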
Some examples:
select * from t tablesample bernoulli ( 50 );
select * from t tablesample system ( 30 ) repeatable (1234);