alamb opened a new issue, #12510:
URL: https://github.com/apache/datafusion/issues/12510

   ### Is your feature request related to a problem or challenge?
   
   In the [ClickBench benchmark queries, there are two datasets we 
use](https://github.com/ClickHouse/ClickBench?tab=readme-ov-file#data-loading). 
A "single file" `hits.parquet` and "partitioned" which has 100 files in a 
directory. They hold the same data. 
   
   However, DataFusion resolves `hits.parquet` such that columns like `URL` are 
`Utf8` or `Utf8View`, while the same columns in the partitioned files are 
resolved as `Binary` or `BinaryView`.
   
   This has caused some small slowdowns while enabling StringView by default -- 
see https://github.com/apache/datafusion/issues/12509
   
   
   To see the schema resolution, first download both datasets:
   ```shell
   cd benchmarks
   # download hits.parquet
   ./bench.sh data clickbench_1
   # download hits_partitioned
   ./bench.sh data clickbench_partitioned
   ```
   
   Then run `datafusion-cli`:
   
   ```shell
   cd data
   # hits.parquet has Utf8 columns
   datafusion-cli -c 'describe "hits.parquet"' | grep Utf8
   | Title                 | Utf8      | NO          |
   | URL                   | Utf8      | NO          |
   | Referer               | Utf8      | NO          |
   ...
   | UTMContent            | Utf8      | NO          |
   | UTMTerm               | Utf8      | NO          |
   | FromTag               | Utf8      | NO          |
   
   # hits_partitioned has Binary type for the same columns
   datafusion-cli -c 'describe "hits_partitioned"' | grep Binary
   | Title                 | Binary    | YES         |
   | URL                   | Binary    | YES         |
   | Referer               | Binary    | YES         |
   ...
   | UTMContent            | Binary    | YES         |
   | UTMTerm               | Binary    | YES         |
   | FromTag               | Binary    | YES         |
   ```
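
The two listings above can also be compared mechanically. The following is a minimal Python sketch, not part of DataFusion: `parse_describe` and `schema_diff` are hypothetical helper names, and the parser assumes the three-cell `| name | type | nullable |` row layout shown above, with a header row starting with `column_name`. Note that, in the output above, the nullability differs between the two datasets as well as the type.

```python
def parse_describe(output: str) -> dict:
    """Parse `describe` table rows into {column: (data_type, is_nullable)}.

    Assumes rows look like `| URL | Utf8 | NO |`, as in the listings above.
    """
    columns = {}
    for line in output.splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        # Keep only data rows: exactly three cells, skipping the header row,
        # border lines such as +----+----+, and elided "..." lines.
        if len(cells) != 3 or cells[0] in ("", "column_name"):
            continue
        columns[cells[0]] = (cells[1], cells[2])
    return columns


def schema_diff(left: dict, right: dict) -> dict:
    """Return {column: (left_entry, right_entry)} for shared columns that differ."""
    return {
        name: (left[name], right[name])
        for name in left
        if name in right and left[name] != right[name]
    }


# Sample rows copied from the output above (hypothetical variable names).
single = parse_describe("""
| Title                 | Utf8      | NO          |
| URL                   | Utf8      | NO          |
""")
partitioned = parse_describe("""
| Title                 | Binary    | YES         |
| URL                   | Binary    | YES         |
""")

for name, (a, b) in schema_diff(single, partitioned).items():
    print(f"{name}: hits.parquet={a} hits_partitioned={b}")
```

In practice the two strings would be the captured output of the two `datafusion-cli -c 'describe ...'` commands rather than pasted samples.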
   
   It seems that, for some reason, the individual files are all resolved to `Binary`:
   
   ```shell
   datafusion-cli -c 'describe "hits_partitioned/hits_99.parquet"' | grep Binary
   | Title                 | Binary    | YES         |
   | URL                   | Binary    | YES         |
   | Referer               | Binary    | YES         |
   | FlashMinor2           | Binary    | YES         |
   | UserAgentMinor        | Binary    | YES         |
   ...
   datafusion-cli -c 'describe "hits_partitioned/hits_60.parquet"' | grep Binary
   | Title                 | Binary    | YES         |
   | URL                   | Binary    | YES         |
   | Referer               | Binary    | YES         |
   | FlashMinor2           | Binary    | YES         |
   | UserAgentMinor        | Binary    | YES         |
   ...
   ```
   
   ### Describe the solution you'd like
   
   Ideally, the ClickBench queries would resolve to the same schema for both 
datasets, in this case `Utf8`, given the contents of the files and the queries 
that treat these columns as strings.
   
   
   
   ### Describe alternatives you've considered
   
   _No response_
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

