Rohan Garg created CALCITE-5084:
-----------------------------------

             Summary: Support ROWS syntax with TABLESAMPLE
                 Key: CALCITE-5084
                 URL: https://issues.apache.org/jira/browse/CALCITE-5084
             Project: Calcite
          Issue Type: Task
            Reporter: Rohan Garg


Calcite currently supports the TABLESAMPLE clause, which lets users sample the
data being processed. It takes two main parameters:
1. sampling algorithm (BERNOULLI or SYSTEM)
2. sampling percentage (a value between 0 and 100 giving the sampling rate)
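
To make the two parameters concrete, here is a minimal Python sketch of how the
two algorithms are commonly understood: BERNOULLI decides row by row, while
SYSTEM decides on larger units such as blocks or pages. This is an illustration
only, not Calcite's implementation. The function names and the block_size
parameter are hypothetical, and the exact behavior of SYSTEM depends on the
underlying engine.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def bernoulli_sample(rows: Sequence[T], percentage: float,
                     rng: random.Random) -> List[T]:
    """Keep each row independently with probability percentage / 100."""
    if not 0 <= percentage <= 100:
        raise ValueError("percentage must be between 0 and 100")
    p = percentage / 100.0
    return [row for row in rows if rng.random() < p]


def system_sample(rows: Sequence[T], percentage: float,
                  rng: random.Random, block_size: int = 4) -> List[T]:
    """Keep or drop whole blocks of rows (block_size is an assumption)."""
    if not 0 <= percentage <= 100:
        raise ValueError("percentage must be between 0 and 100")
    p = percentage / 100.0
    sampled: List[T] = []
    for start in range(0, len(rows), block_size):
        # One random draw per block: the whole block is kept or skipped.
        if rng.random() < p:
            sampled.extend(rows[start:start + block_size])
    return sampled
```

In both cases the number of rows returned varies from run to run; the
percentage only fixes the expected sampling rate.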

While a percentage is generally adequate, it is hard to choose a good value
when the user does not know the row count. In the case of subqueries (assuming
that the underlying system handles TABLESAMPLE on subqueries), estimating the
right percentage becomes even more difficult.
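
The difficulty can be shown with a short Python sketch. Converting a desired
row count into a percentage requires the total row count up front, whereas a
rows-based sample can be drawn in a single pass without it. The second function
uses reservoir sampling (Algorithm R) purely as an illustration; it is not
claimed to be how Calcite or any of the systems mentioned in this issue
implement row sampling, and both function names are hypothetical.

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def percentage_for_rows(target_rows: int, total_rows: int) -> float:
    """Percentage needed to get about target_rows; needs total_rows."""
    if total_rows <= 0:
        raise ValueError("total_rows must be positive")
    return min(100.0, 100.0 * target_rows / total_rows)


def sample_n_rows(rows: Iterable[T], n: int, rng: random.Random) -> List[T]:
    """Return min(n, row count) rows without knowing the row count."""
    reservoir: List[T] = []
    for i, row in enumerate(rows):
        if i < n:
            # Fill the reservoir with the first n rows.
            reservoir.append(row)
        else:
            # Replace an existing entry with probability n / (i + 1).
            j = rng.randint(0, i)
            if j < n:
                reservoir[j] = row
    return reservoir
```

For example, asking for 100 rows out of a table of 1,000,000 rows means
supplying a percentage of about 0.01, which the user can only work out if the
row count is known. With a rows-based sample the user states the 100 directly.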

The 'n ROWS' syntax is most likely not part of the SQL standard, which would
explain why it is not in the default Calcite grammar. However, several systems
have implemented it in their dialects:
1. MS SQL Server : 
[https://docs.microsoft.com/en-us/sql/t-sql/queries/from-transact-sql?view=sql-server-ver15#tablesample-clause]
2. Snowflake : 
[https://docs.snowflake.com/en/sql-reference/constructs/sample.html]
3. Google Spanner : 
[https://cloud.google.com/spanner/docs/reference/standard-sql/query-syntax#tablesample_operator]
4. Apache Spark : 
[https://spark.apache.org/docs/latest/sql-ref-syntax-qry-select-sampling.html]
Supporting the ROWS syntax would therefore be a useful addition to Calcite.

Derived from https://issues.apache.org/jira/browse/CALCITE-5074



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
