zhuqi-lucas commented on PR #16243:
URL: https://github.com/apache/datafusion/pull/16243#issuecomment-2939581266

   > > It looks like no performance improvement for h2o_window benchmark result...
   > 
   > Now that I think about it, the h2o benchmark may not have any string columns 🤔
   > 
   > Do the TPCH benchmarks read from CSV? Maybe we could just get some manual benchmarks?
   
   
   Thank you @alamb, this is a good point. I did some investigation of the benchmark code:
   
   ```bash
   # Runs the tpch benchmark
   run_tpch() {
       SCALE_FACTOR=$1
       if [ -z "$SCALE_FACTOR" ] ; then
           echo "Internal error: Scale factor not specified"
           exit 1
       fi
       TPCH_DIR="${DATA_DIR}/tpch_sf${SCALE_FACTOR}"
   
       RESULTS_FILE="${RESULTS_DIR}/tpch_sf${SCALE_FACTOR}.json"
       echo "RESULTS_FILE: ${RESULTS_FILE}"
       echo "Running tpch benchmark..."
       # Optional query filter to run specific query
       QUERY=$([ -n "$ARG3" ] && echo "--query $ARG3" || echo "")
       debug_run $CARGO_COMMAND --bin tpch -- benchmark datafusion --iterations 5 \
           --path "${TPCH_DIR}" --prefer_hash_join "${PREFER_HASH_JOIN}" \
           --format parquet -o "${RESULTS_FILE}" $QUERY
   }
   ```
   
   ```rust
       /// File format: `csv` or `parquet`
       #[structopt(short = "f", long = "format", default_value = "csv")]
       file_format: String,
   ```
   
   
   It looks like the benchmark script always passes `--format parquet` for tpch, even though the `tpch` binary also supports `csv`. I will try to create a PR that exposes the file format as a tpch benchmark parameter.
   
   
   The tpch data generator code also produces the CSV-format `tbl` files, so it is reasonable for us to support a CSV benchmark as well. I will create a PR soon. Thanks!
   
   ```bash
       # Create 'tbl' (CSV format) data into $DATA_DIR if it does not already exist
       FILE="${TPCH_DIR}/supplier.tbl"
       if test -f "${FILE}"; then
           echo " tbl files exist ($FILE exists)."
       else
           echo " creating tbl files with tpch_dbgen..."
           docker run -v "${TPCH_DIR}":/data -it --rm \
               ghcr.io/scalytics/tpch-docker:main -vf -s "${SCALE_FACTOR}"
       fi
   ```
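
   As a rough sketch of what the CSV support could look like on the script side (not the actual PR): the helper below picks the value for the existing `--format` flag from an environment variable. The names `tpch_format_arg` and `TPCH_FORMAT` are hypothetical and do not exist in the benchmark script today; only `--format`, `csv`, and `parquet` come from the code quoted above.

   ```bash
   # Hypothetical helper: build the --format argument for the tpch binary.
   # TPCH_FORMAT is an assumed variable name, not an existing script option.
   tpch_format_arg() {
       # Fall back to parquet so the current behavior stays the default
       local format="${TPCH_FORMAT:-parquet}"
       case "$format" in
           csv|parquet)
               echo "--format ${format}"
               ;;
           *)
               echo "Unsupported tpch format: ${format} (expected csv or parquet)" >&2
               return 1
               ;;
       esac
   }

   # Possible use inside run_tpch, replacing the hardcoded "--format parquet":
   #   debug_run $CARGO_COMMAND --bin tpch -- benchmark datafusion ... $(tpch_format_arg) ...
   ```

   Keeping parquet as the fallback would leave existing benchmark runs unchanged, while `TPCH_FORMAT=csv` would exercise the CSV reading path.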


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

