kosiew commented on PR #20500: URL: https://github.com/apache/datafusion/pull/20500#issuecomment-4134062456
https://github.com/Samyak2/datafusion/blob/fix-repartition-string-view-counting/datafusion/common/src/config.rs#L738-L740

For the above benchmark runs, the Parquet-backed benchmark data is expected to use view types by default.

Why:
- DataFusion's Parquet config defaults `schema_force_view_types` to `true`
- when that option is enabled, Parquet string columns are read as `Utf8View` and binary columns as `BinaryView`
- the TPC-H benchmark constructs `ParquetFormat` using the session's Parquet table options, so it inherits that default behavior
- the ClickBench benchmark also uses the session Parquet defaults and additionally sets `binary_as_string = true` so legacy binary-encoded string columns in the `hits_partitioned` dataset are treated as strings

That means both of the benchmark outputs under discussion should be assumed to have string view arrays enabled for Parquet-backed string columns unless view types were explicitly disabled.
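To make the reasoning above concrete, here is a minimal, self-contained sketch of how the two options interact. It is a simplified model, not DataFusion's actual code: `ParquetReadOptions`, `ColumnType`, and `effective_type` are hypothetical names introduced only for illustration, and the order in which the two options are applied is an assumption (in this model either order gives the same result).

```rust
/// Hypothetical stand-in for the two Parquet reader options discussed above.
/// The field names mirror the DataFusion option names; the struct itself is
/// illustrative and is not a DataFusion type.
#[derive(Debug, Clone, Copy)]
struct ParquetReadOptions {
    schema_force_view_types: bool,
    binary_as_string: bool,
}

/// Simplified set of column types relevant to this discussion.
#[derive(Debug, Clone, Copy, PartialEq)]
enum ColumnType {
    Utf8,
    Binary,
    Utf8View,
    BinaryView,
}

/// Returns the type a column would be read as under this simplified model.
fn effective_type(file_type: ColumnType, opts: ParquetReadOptions) -> ColumnType {
    // Step 1 (assumed order): `binary_as_string` reinterprets binary columns as strings.
    let t = match (file_type, opts.binary_as_string) {
        (ColumnType::Binary, true) => ColumnType::Utf8,
        (ColumnType::BinaryView, true) => ColumnType::Utf8View,
        (other, _) => other,
    };
    // Step 2: `schema_force_view_types` maps string/binary columns to their view variants.
    if opts.schema_force_view_types {
        match t {
            ColumnType::Utf8 => ColumnType::Utf8View,
            ColumnType::Binary => ColumnType::BinaryView,
            other => other,
        }
    } else {
        t
    }
}
```

Under this model, the TPC-H case (defaults only) reads a string column as `Utf8View`, and the ClickBench case (defaults plus `binary_as_string = true`) reads a binary-encoded string column as `Utf8View` as well. Only with `schema_force_view_types` set to `false` would the columns stay `Utf8` or `Binary`.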
