lmwnshn commented on issue #14608: URL: https://github.com/apache/datafusion/issues/14608#issuecomment-2651044600
If you prefer Java to C, CMU-DB's BenchBase project does implement support for generating and loading TPC-H data in parallel: https://github.com/cmu-db/benchbase/tree/main/src/main/java/com/oltpbenchmark/benchmarks/tpch

Another alternative that I explored is using DuckDB to generate the data, exporting that as Parquet, and then ingesting it into DataFusion (schema may require fixing):

```
./duckdb data/tpch.db -c "INSTALL tpch; LOAD tpch; CALL dbgen(sf = 1); EXPORT DATABASE './data/' (FORMAT PARQUET);"
```

But personally I think native integration makes for the best user experience.
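
For the ingestion half of the DuckDB route, here is a minimal sketch rather than a tested recipe. It assumes the export wrote one Parquet file per table, named `<table>.parquet`, into `./data`; the variable names `DATA_DIR` and `SQL_FILE` and the output file name `register_tpch.sql` are made up for illustration. The script only writes the `CREATE EXTERNAL TABLE` statements to a file and does not fix any schema mismatches.

```shell
#!/bin/sh
# Assumption: DuckDB's EXPORT DATABASE wrote <table>.parquet files into DATA_DIR.
# DATA_DIR and SQL_FILE are hypothetical names used only in this sketch.
DATA_DIR="${DATA_DIR:-./data}"
SQL_FILE="${SQL_FILE:-register_tpch.sql}"

# Start from an empty SQL script.
: > "$SQL_FILE"

# Emit one CREATE EXTERNAL TABLE statement per TPC-H table.
for t in customer lineitem nation orders part partsupp region supplier; do
  printf "CREATE EXTERNAL TABLE %s STORED AS PARQUET LOCATION '%s/%s.parquet';\n" \
    "$t" "$DATA_DIR" "$t" >> "$SQL_FILE"
done

# Append a sanity query to run once the tables are registered.
printf 'SELECT count(*) FROM lineitem;\n' >> "$SQL_FILE"

echo "wrote $SQL_FILE"
```

If `datafusion-cli` is installed, the generated file can then be run with `datafusion-cli -f register_tpch.sql`. Any column type or name differences between the DuckDB export and what DataFusion expects would still need to be handled separately.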