adriangb commented on PR #12978:
URL: https://github.com/apache/datafusion/pull/12978#issuecomment-2423935723

   I made a big parquet file as follows:
   
   ```python
   import random
   import string
   import polars as pl
   
   df = pl.DataFrame({
       'col': [
           "A" + "".join(random.choices(string.ascii_letters, k=1_000))
           for _ in range(1_000_000)
       ]
   })
   df.write_parquet('data.parquet', compression='uncompressed')
   ```
   
   This came out to ~1GB. I then uploaded it to a GCS bucket.
   
   I ran queries filtering on `col = 'Z'` and `col like 'Z'` against it and got 2s and 23s 
respectively. IMO that means the `like` predicate is not getting pushed down.
   
   The explain plans show the following (`=` first, then `like`):
   
   ```
   ParquetExec: file_groups={10 groups: [[data.parquet:0..100890471], 
[data.parquet:100890471..201780942], [data.parquet:201780942..302671413], 
[data.parquet:302671413..403561884], [data.parquet:403561884..504452355], 
...]}, projection=[col], predicate=col@0 = Z, pruning_predicate=CASE WHEN 
col_null_count@2 = col_row_count@3 THEN false ELSE col_min@0 <= Z AND Z <= 
col_max@1 END, required_guarantees=[col in (Z)], metrics=[output_rows=0, 
elapsed_compute=10ns, predicate_evaluation_errors=0, bytes_scanned=19368790, 
row_groups_pruned_bloom_filter=0, row_groups_pruned_statistics=3, 
pushdown_rows_filtered=0, page_index_rows_filtered=0, 
row_groups_matched_statistics=0, row_groups_matched_bloom_filter=0, 
file_scan_errors=0, file_open_errors=0, num_predicate_creation_errors=0, 
time_elapsed_scanning_until_data=18.748µs, time_elapsed_opening=7.717746249s, 
time_elapsed_processing=64.457827ms, page_index_eval_time=10.134µs, 
pushdown_eval_time=20ns, time_elapsed_scanning_total=19.21µs]
   ```
   
   ```
   ParquetExec: file_groups={10 groups: [[data.parquet:0..100890471], 
[data.parquet:100890471..201780942], [data.parquet:201780942..302671413], 
[data.parquet:302671413..403561884], [data.parquet:403561884..504452355], 
...]}, projection=[col], predicate=col@0 LIKE Z, metrics=[output_rows=1000000, 
elapsed_compute=10ns, predicate_evaluation_errors=0, bytes_scanned=1006955145, 
row_groups_pruned_bloom_filter=0, row_groups_pruned_statistics=0, 
pushdown_rows_filtered=0, page_index_rows_filtered=0, 
row_groups_matched_statistics=0, row_groups_matched_bloom_filter=0, 
file_scan_errors=0, file_open_errors=0, num_predicate_creation_errors=0, 
time_elapsed_scanning_until_data=49.346124581s, time_elapsed_opening=2.18377s, 
time_elapsed_processing=1.545583231s, page_index_eval_time=20ns, 
pushdown_eval_time=20ns, time_elapsed_scanning_total=49.654700084s]
   ```
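
To make the difference between the two plans concrete, here is a minimal sketch of what the `pruning_predicate` in the first plan evaluates per row group. It is plain Python, not DataFusion code: the function name and the statistics values are hypothetical, chosen only to match the shape of the data (every value starts with `"A"`). Python compares strings by code point, which agrees with byte-wise comparison for the ASCII values used here.

```python
def row_group_may_match(col_min, col_max, null_count, row_count, literal):
    """Illustrative mirror of the pruning_predicate from the `=` plan:

    CASE WHEN col_null_count = col_row_count THEN false
         ELSE col_min <= literal AND literal <= col_max END
    """
    if null_count == row_count:
        # Every value is null, so an equality predicate can never match.
        return False
    return col_min <= literal <= col_max


# Hypothetical row-group statistics: all values begin with "A".
stats = dict(col_min="AAbc", col_max="Azzz", null_count=0, row_count=100_000)

# 'Z' sorts after every value starting with "A", so the row group can be skipped.
print(row_group_may_match(literal="Z", **stats))   # False

# A literal inside [col_min, col_max] cannot be ruled out by statistics alone.
print(row_group_may_match(literal="Ab", **stats))  # True
```

The second plan has no `pruning_predicate` at all, so nothing like this check runs for `col like 'Z'`, which is consistent with `bytes_scanned` being roughly the whole file there.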


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

