zhuqi-lucas commented on PR #16243: URL: https://github.com/apache/datafusion/pull/16243#issuecomment-2939581266
> > It looks like no performance improvement for h2o_window benchmark result...
>
> Now that I think about it, the h2o benchmark may not have any string columns 🤔
>
> Do the TPCH benchmarks read from CSV? Maybe we could just get some manual benchmarks?

Thank you @alamb, this is a good point. I did some investigation of the benchmark code:

```bash
# Runs the tpch benchmark
run_tpch() {
    SCALE_FACTOR=$1
    if [ -z "$SCALE_FACTOR" ] ; then
        echo "Internal error: Scale factor not specified"
        exit 1
    fi
    TPCH_DIR="${DATA_DIR}/tpch_sf${SCALE_FACTOR}"
    RESULTS_FILE="${RESULTS_DIR}/tpch_sf${SCALE_FACTOR}.json"
    echo "RESULTS_FILE: ${RESULTS_FILE}"
    echo "Running tpch benchmark..."
    # Optional query filter to run specific query
    QUERY=$([ -n "$ARG3" ] && echo "--query $ARG3" || echo "")
    debug_run $CARGO_COMMAND --bin tpch -- benchmark datafusion --iterations 5 --path "${TPCH_DIR}" --prefer_hash_join "${PREFER_HASH_JOIN}" --format parquet -o "${RESULTS_FILE}" $QUERY
}
```

```rust
/// File format: `csv` or `parquet`
#[structopt(short = "f", long = "format", default_value = "csv")]
file_format: String,
```

It looks like the benchmark script always runs tpch with `--format parquet`, although the `tpch` binary also supports `csv` (which is the default value of its `--format` option). I will try to create a PR that exposes CSV through the tpch benchmark parameters.

The data generation code for tpch already creates the `tbl` (CSV format) files, so it is reasonable for us to support a CSV benchmark as well. I will create a PR soon. Thanks!

```bash
# Create 'tbl' (CSV format) data into $DATA_DIR if it does not already exist
FILE="${TPCH_DIR}/supplier.tbl"
if test -f "${FILE}"; then
    echo " tbl files exist ($FILE exists)."
else
    echo " creating tbl files with tpch_dbgen..."
    docker run -v "${TPCH_DIR}":/data -it --rm ghcr.io/scalytics/tpch-docker:main -vf -s "${SCALE_FACTOR}"
fi
```

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org