theirix opened a new issue, #13563:
URL: https://github.com/apache/datafusion/issues/13563

   ### Is your feature request related to a problem or challenge?
   
   It is helpful to have sampling support for queries to ease the exploration 
of data.
   
   ### Describe the solution you'd like
   
   Sampling should be supported at the SQL level (`SAMPLE` or `TABLESAMPLE` syntax). The sampling construct should be passed to the table source so that sampling is performed during the scan (e.g. in an optimised Parquet reader).
   
   This feature could be implemented in three sequential stages:
   1. Support additional SQL syntax but fail in the physical plan builder
   2. Transparently convert to `WHERE RANDOM() < P` filter
   3. For eligible data sources push the sampling to the table source
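
Stage 2 above can be sketched as a plain rewrite of the sampling clause into a filter. This is only an illustration and not DataFusion code: the helper name `tablesample_to_filter` and the string-based rewrite are assumptions made for the example, and a real implementation would presumably rewrite the logical plan rather than SQL text.

```python
def tablesample_to_filter(table: str, percent: float) -> str:
    """Hypothetical helper: express `TABLESAMPLE (percent)` as a RANDOM() filter."""
    if not 0.0 <= percent <= 100.0:
        raise ValueError("percent must be between 0 and 100")
    # A 10 percent sample keeps each row with probability 0.1.
    fraction = percent / 100.0
    return f"SELECT * FROM {table} WHERE RANDOM() < {fraction}"


print(tablesample_to_filter("tbl", 10))
# prints: SELECT * FROM tbl WHERE RANDOM() < 0.1
```

Because the predicate is evaluated per row, this form only approximates the requested percentage and still reads every row, which is why stage 3 pushes the sampling into the source.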
   
   ### Describe alternatives you've considered
   
   It is possible to use a `WHERE RANDOM() < 0.1` filter (see the discussion in https://github.com/apache/datafusion/issues/13268 ), but dedicated SQL syntax is clearer.
   
   Existing query engines and databases already implement sampling, but the syntax is not consistent across them. There are different flavours, but essentially they let the user choose a sampling method and a percentage (or sometimes a number of rows):
   `TABLESAMPLE [SYSTEM | BERNOULLI] (PERCENTAGE | ROWS)`
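
To make the two method names concrete: `BERNOULLI` conventionally decides row by row, while `SYSTEM` decides per storage block and keeps or drops the whole block. The sketch below illustrates that difference on an in-memory list. The function names, the fixed `block_size`, and the seeding are assumptions for the example; what counts as a block (a page, a row group, a file) depends on the engine.

```python
import random


def bernoulli_sample(rows, fraction, seed=None):
    """Row-level sampling: keep each row independently with probability `fraction`."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]


def system_sample(rows, fraction, block_size, seed=None):
    """Block-level sampling: keep or drop whole blocks of `block_size` rows."""
    rng = random.Random(seed)
    sampled = []
    for start in range(0, len(rows), block_size):
        # One random draw per block, not per row.
        if rng.random() < fraction:
            sampled.extend(rows[start:start + block_size])
    return sampled


rows = list(range(1000))
by_row = bernoulli_sample(rows, 0.1, seed=42)
by_block = system_sample(rows, 0.1, block_size=100, seed=42)
print(len(by_row), len(by_block))
```

Block-level sampling can skip reading unselected blocks entirely, which is what makes it attractive for pushdown into a scan, at the cost of a less uniform sample.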
   
   [DuckDB](https://duckdb.org/docs/sql/samples.html#table-samples):
   ```sql
   SELECT * FROM tbl TABLESAMPLE SYSTEM (10%);
   ```
   
   
   [PostgreSQL](https://www.postgresql.org/docs/current/sql-select.html#SQL-FROM) and [Trino](https://trino.io/docs/current/sql/select.html#tablesample):
   ```sql
   SELECT * FROM tbl TABLESAMPLE SYSTEM (10);
   ```
   
   
   [Spark](https://spark.apache.org/docs/latest/sql-ref-syntax-qry-select-sampling.html):
   ```sql
   SELECT * FROM tbl TABLESAMPLE (10 PERCENT);
   ```
   
   
   [ClickHouse](https://clickhouse.com/docs/en/sql-reference/statements/select/sample) is different:
   ```sql
   SELECT * FROM tbl SAMPLE 0.1;
   ```
   
   ### Additional context
   
   Also requested in #11554. The filter for sampling was refined in #13268.

