adriangb commented on PR #12978:
URL: https://github.com/apache/datafusion/pull/12978#issuecomment-2423935723
I made a big parquet file as follows:
```python
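import random
import string
import polars as pl

# 1,000,000 rows, each "A" followed by 1,000 random ASCII letters
df = pl.DataFrame({
    'col': [
        "A" + "".join(random.choices(string.ascii_letters, k=1_000))
        for _ in range(1_000_000)
    ]
})
# Write uncompressed so the file size reflects the raw string data
df.write_parquet('data.parquet', compression='uncompressed')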
```
This came out to ~1GB. I then uploaded it to a GCS bucket.
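As a rough sanity check on that size, here is a back-of-the-envelope estimate. It assumes the strings end up PLAIN-encoded with a 4-byte length prefix per value (I have not checked what encoding polars actually picked) and ignores page headers, statistics and the footer:

```python
# Back-of-the-envelope size estimate for data.parquet.
ROWS = 1_000_000
VALUE_BYTES = 1 + 1_000  # "A" plus 1,000 ASCII letters, one byte each
LENGTH_PREFIX = 4        # assumption: PLAIN BYTE_ARRAY stores a 4-byte length per value

# Ignores page headers, column statistics and the file footer.
estimated_bytes = ROWS * (VALUE_BYTES + LENGTH_PREFIX)
print(f"{estimated_bytes / 1e9:.3f} GB")
```

That lands close to the `bytes_scanned=1006955145` reported for the `LIKE` query, which suggests that query read essentially the whole file.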
I ran the queries `col = 'Z'` and `col like 'Z'` against it; they took 2s and 23s
respectively. IMO that means the `like` predicate is not getting pushed down.
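The explain plans show the following. The `=` plan has a `pruning_predicate` and scans ~19MB; the `LIKE` plan has no `pruning_predicate` and scans the full ~1GB: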
```
ParquetExec: file_groups={10 groups: [[data.parquet:0..100890471],
[data.parquet:100890471..201780942], [data.parquet:201780942..302671413],
[data.parquet:302671413..403561884], [data.parquet:403561884..504452355],
...]}, projection=[col], predicate=col@0 = Z, pruning_predicate=CASE WHEN
col_null_count@2 = col_row_count@3 THEN false ELSE col_min@0 <= Z AND Z <=
col_max@1 END, required_guarantees=[col in (Z)], metrics=[output_rows=0,
elapsed_compute=10ns, predicate_evaluation_errors=0, bytes_scanned=19368790,
row_groups_pruned_bloom_filter=0, row_groups_pruned_statistics=3,
pushdown_rows_filtered=0, page_index_rows_filtered=0,
row_groups_matched_statistics=0, row_groups_matched_bloom_filter=0,
file_scan_errors=0, file_open_errors=0, num_predicate_creation_errors=0,
time_elapsed_scanning_until_data=18.748µs, time_elapsed_opening=7.717746249s,
time_elapsed_processing=64.457827ms, page_index_eval_time=10.134µs,
pushdown_eval_time=20ns, time_elapsed_scanning_total=19.21µs]
```
```
ParquetExec: file_groups={10 groups: [[data.parquet:0..100890471],
[data.parquet:100890471..201780942], [data.parquet:201780942..302671413],
[data.parquet:302671413..403561884], [data.parquet:403561884..504452355],
...]}, projection=[col], predicate=col@0 LIKE Z, metrics=[output_rows=1000000,
elapsed_compute=10ns, predicate_evaluation_errors=0, bytes_scanned=1006955145,
row_groups_pruned_bloom_filter=0, row_groups_pruned_statistics=0,
pushdown_rows_filtered=0, page_index_rows_filtered=0,
row_groups_matched_statistics=0, row_groups_matched_bloom_filter=0,
file_scan_errors=0, file_open_errors=0, num_predicate_creation_errors=0,
time_elapsed_scanning_until_data=49.346124581s, time_elapsed_opening=2.18377s,
time_elapsed_processing=1.545583231s, page_index_eval_time=20ns,
pushdown_eval_time=20ns, time_elapsed_scanning_total=49.654700084s]
```
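To make the difference concrete, here is a minimal sketch of what the `pruning_predicate` in the first plan evaluates per row group. This is plain Python for illustration, not DataFusion's implementation; the function names and the statistics values are hypothetical. The `LIKE` helper only shows the idea that a pattern without wildcards could reuse the equality check. It is an assumption about one possible approach, not a description of what the PR does:

```python
def row_group_may_match_eq(literal, col_min, col_max, null_count, row_count):
    """Mirror of the pruning predicate shown in the first plan:

    CASE WHEN col_null_count = col_row_count THEN false
         ELSE col_min <= literal AND literal <= col_max END

    Returns False when the row group can be skipped.
    """
    if null_count == row_count:
        return False  # an all-null row group can never satisfy col = literal
    return col_min <= literal <= col_max


def row_group_may_match_like(pattern, col_min, col_max, null_count, row_count):
    """Hypothetical LIKE pruning for the simplest case only.

    A pattern with no wildcards and no escape character matches exactly
    one string, so the equality check applies. Anything else is kept
    (conservative: never prune when unsure).
    """
    if any(ch in pattern for ch in ("%", "_", "\\")):
        return True
    return row_group_may_match_eq(pattern, col_min, col_max, null_count, row_count)


# Hypothetical statistics: every value in the file starts with "A",
# so both min and max start with "A" and sort below "Z".
stats = dict(col_min="AAbc", col_max="Azyx", null_count=0, row_count=1000)

print(row_group_may_match_eq("Z", **stats))     # False: row group skipped
print(row_group_may_match_like("Z", **stats))   # False: same reasoning applies
print(row_group_may_match_like("Z%", **stats))  # True: this sketch gives up on wildcards
```

With statistics like these, every row group is skipped for `col = 'Z'`, which matches `output_rows=0` and the small `bytes_scanned` in the first plan. The second plan has no pruning predicate at all, so nothing is skipped.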
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]