adriangb commented on code in PR #19639:
URL: https://github.com/apache/datafusion/pull/19639#discussion_r2660387539
##########
datafusion/datasource-parquet/src/opener.rs:
##########
@@ -1576,13 +1858,16 @@ mod test {
assert_eq!(num_batches, 1);
assert_eq!(num_rows, 1);
- // Filter should not match the partition value or the data value
+ // Filter should not match the partition value or the data value.
+ // With adaptive selectivity tracking, unknown filters start in post_scan
+ // to learn their effectiveness. So the file is read and then filtered,
+ // resulting in 1 batch with 0 rows (rather than pruning the file entirely).
Review Comment:
TODO: check this, maybe set the selectivity high for this test?
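
For reference, a minimal sketch of what "start in post_scan and learn effectiveness" could look like. Every name here (`Placement`, `Tracker`, `promote_threshold`, `min_rows`) is hypothetical and not taken from the PR; it only illustrates why a filter with no recorded statistics would not be used to prune up front.

```rust
// Hypothetical sketch: none of these types or fields come from the PR.

/// Where a filter is evaluated.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Placement {
    /// Evaluate after the scan has produced the batch.
    PostScan,
    /// Evaluate inside the scan, as a pushed-down row filter.
    RowFilter,
}

/// Tracks how many rows a single filter has removed so far.
struct Tracker {
    /// Minimum fraction of rows the filter must remove to be promoted.
    promote_threshold: f64,
    /// Number of rows to observe before promotion is considered.
    min_rows: u64,
    rows_seen: u64,
    rows_pruned: u64,
}

impl Tracker {
    fn new(promote_threshold: f64, min_rows: u64) -> Self {
        Tracker {
            promote_threshold,
            min_rows,
            rows_seen: 0,
            rows_pruned: 0,
        }
    }

    /// Record one batch: `rows_in` rows entered the filter, `rows_out` passed.
    fn record(&mut self, rows_in: u64, rows_out: u64) {
        self.rows_seen += rows_in;
        self.rows_pruned += rows_in.saturating_sub(rows_out);
    }

    /// Unknown or under-observed filters stay in PostScan; a filter is
    /// promoted once it has removed a large enough fraction of rows.
    fn placement(&self) -> Placement {
        if self.rows_seen == 0 || self.rows_seen < self.min_rows {
            return Placement::PostScan;
        }
        let pruned_fraction = self.rows_pruned as f64 / self.rows_seen as f64;
        if pruned_fraction >= self.promote_threshold {
            Placement::RowFilter
        } else {
            Placement::PostScan
        }
    }
}
```

Under this sketch, a fresh `Tracker` always reports `PostScan`, so a test that expects the file to be pruned up front would need the filter to start out promoted. Whether the PR exposes a knob for that is not visible in this hunk.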
##########
datafusion/sqllogictest/test_files/parquet.slt:
##########
@@ -457,10 +457,7 @@ EXPLAIN
logical_plan
01)Filter: CAST(binary_as_string_default.binary_col AS Utf8View) LIKE Utf8View("%a%") AND CAST(binary_as_string_default.largebinary_col AS Utf8View) LIKE Utf8View("%a%") AND CAST(binary_as_string_default.binaryview_col AS Utf8View) LIKE Utf8View("%a%")
02)--TableScan: binary_as_string_default projection=[binary_col, largebinary_col, binaryview_col], partial_filters=[CAST(binary_as_string_default.binary_col AS Utf8View) LIKE Utf8View("%a%"), CAST(binary_as_string_default.largebinary_col AS Utf8View) LIKE Utf8View("%a%"), CAST(binary_as_string_default.binaryview_col AS Utf8View) LIKE Utf8View("%a%")]
-physical_plan
-01)FilterExec: CAST(binary_col@0 AS Utf8View) LIKE %a% AND CAST(largebinary_col@1 AS Utf8View) LIKE %a% AND CAST(binaryview_col@2 AS Utf8View) LIKE %a%
-02)--RepartitionExec: partitioning=RoundRobinBatch(2), input_partitions=1
-03)----DataSourceExec: file_groups={1 group: [[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/parquet/binary_as_string.parquet]]}, projection=[binary_col, largebinary_col, binaryview_col], file_type=parquet, predicate=CAST(binary_col@0 AS Utf8View) LIKE %a% AND CAST(largebinary_col@1 AS Utf8View) LIKE %a% AND CAST(binaryview_col@2 AS Utf8View) LIKE %a%
+physical_plan DataSourceExec: file_groups={1 group: [[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/parquet/binary_as_string.parquet]]}, projection=[binary_col, largebinary_col, binaryview_col], file_type=parquet, predicate=CAST(binary_col@0 AS Utf8View) LIKE %a% AND CAST(largebinary_col@1 AS Utf8View) LIKE %a% AND CAST(binaryview_col@2 AS Utf8View) LIKE %a%
Review Comment:
Right, I'm wondering if the perf degradation we're seeing is just a loss of parallelism.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]