alamb opened a new issue, #14608:
URL: https://github.com/apache/datafusion/issues/14608

   ### Is your feature request related to a problem or challenge?
   
   [TPC-H](https://www.tpc.org/tpch/) is an important, well-known, and widely 
studied benchmark that is used for testing many database optimizations. TPC-H 
is especially important for academic research projects and classes. For 
example, see the CMU optimizer class 
(https://github.com/apache/datafusion/issues/14373).
   
   Generating this dataset today is quite a pain, as it requires compiling and 
running a very old C program, dbgen.c (see XXX)
   
   DuckDB has a TPCH extension that makes it very easy to create the dataset 
and queries: https://duckdb.org/docs/extensions/tpch.html
   
   I also think this is another reason why DuckDB is so popular in academic 
research: it lowers the barrier to getting this dataset
   
   Today, to generate this dataset, the DataFusion 
[bench.sh](https://github.com/apache/datafusion/blob/main/benchmarks/bench.sh) 
program runs the TPC-H dbgen from this container: 
https://github.com/scalytics/TPCH-Docker/pkgs/container/tpch-docker
   
   The dbgen code that ends up being run is here: 
https://github.com/scalytics/TPCH-Docker/tree/main/data/tpch/2.18.0_rc2/dbgen
   
   This setup is non-ideal for several reasons:
   1. It requires Docker and takes quite a while to run
   2. It makes CSV files, which are almost never what is used in practice 
(people use Parquet, etc.)
   3. It isn't part of datafusion-cli 
   
   ### Describe the solution you'd like
   
   I would like it to be very easy to create data and run TPCH queries in 
datafusion-cli
   
   
   
   ### Describe alternatives you've considered
   
   
   ## Idea 1: Instructions + Precalculate the Data
   The idea here would be to precalculate the data and find somewhere to host 
it (I am sure the ASF has places to host files, but we would need to research 
what the limits are, etc).
   
   Here is an example repo: https://github.com/aleaugustoplus/tpch-data (maybe 
we can do the same or even use that one)
   
   Then we would provide instructions / a script on how to download and use the 
files with `datafusion-cli`
   
   Ideally we would provide the data in parquet format
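
   As a rough sketch of what such a script might look like, the Python below 
prints one `CREATE EXTERNAL TABLE` statement per TPC-H table. The directory 
name `./tpch-sf1` and the one-file-per-table layout (`<table>.parquet`) are 
assumptions made for illustration only; the actual hosted layout is not decided 
in this issue.

```python
# The eight TPC-H table names are fixed by the benchmark specification.
TPCH_TABLES = [
    "part", "supplier", "partsupp", "customer",
    "orders", "lineitem", "nation", "region",
]


def create_table_statements(data_dir: str) -> list[str]:
    """Build one CREATE EXTERNAL TABLE statement per TPC-H table.

    Assumes (hypothetically) one file per table named `<table>.parquet`
    directly under `data_dir`.
    """
    base = data_dir.rstrip("/")
    return [
        f"CREATE EXTERNAL TABLE {table} STORED AS PARQUET "
        f"LOCATION '{base}/{table}.parquet';"
        for table in TPCH_TABLES
    ]


if __name__ == "__main__":
    # Print a script that could be saved and passed to datafusion-cli.
    print("\n".join(create_table_statements("./tpch-sf1")))
```

   The output could be saved to a file (say `tpch.sql`) and run with 
`datafusion-cli -f tpch.sql`, after which the tables can be queried by name.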
   
   ## Idea 2: Integrate the dbgen function into `datafusion-cli` (like DuckDB)
   
   The idea here would be to implement some/all of the syntax from DuckDB: 
https://duckdb.org/docs/extensions/tpch.html
   
   That would mean a command like this
   
   ```sql
   CALL dbgen(sf = 1);
   ```
   
   That would create the 8 TPCH tables, along with commands to show the queries 
and answers
   
   The trick here is that `dbgen` is an ancient C program and there is no 
Rust version I could find. It is critical that the generated data is exactly 
the same.
   
   Transliterating dbgen.c from C to Rust might be a fun project (and maybe 
someone could figure out how to make it parallel while they are at it)
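
   One way the "exactly the same data, but parallel" goal might be approached 
is skip-ahead on the random number stream, so each chunk of rows can start from 
the right generator state without producing the earlier rows. The sketch below 
assumes a multiplicative linear congruential generator and uses the MINSTD 
constants purely for illustration; whether dbgen's generator has this form, and 
how many random draws each row consumes, would need to be checked against its 
source. All function names here are hypothetical.

```python
# MINSTD Lehmer generator constants, used here purely for illustration.
MULTIPLIER = 16807
MODULUS = 2**31 - 1


def next_state(state: int) -> int:
    """Advance the generator by one step."""
    return (MULTIPLIER * state) % MODULUS


def skip_ahead(seed: int, n: int) -> int:
    """Return the state after n steps, in O(log n) via modular exponentiation."""
    return (pow(MULTIPLIER, n, MODULUS) * seed) % MODULUS


def generate_chunk(seed: int, start: int, count: int) -> list[int]:
    """Generate values for rows [start, start + count) without visiting earlier rows."""
    state = skip_ahead(seed, start)
    values = []
    for _ in range(count):
        state = next_state(state)
        values.append(state)
    return values


if __name__ == "__main__":
    seed, total_rows, chunk_rows = 42, 1000, 250
    serial = generate_chunk(seed, 0, total_rows)
    # Each chunk depends only on (seed, start), so chunks could run on separate threads.
    chunked = []
    for start in range(0, total_rows, chunk_rows):
        chunked.extend(generate_chunk(seed, start, chunk_rows))
    assert chunked == serial
```

   Because each chunk depends only on the seed and its starting row, the 
chunked output matches the serial output value for value, which is the property 
a parallel port would need to preserve.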
   
   
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

