adriangb commented on code in PR #20015:
URL: https://github.com/apache/datafusion/pull/20015#discussion_r2731667683
##########
datafusion/sqllogictest/test_files/projection_pushdown.slt:
##########
@@ -1038,7 +1084,295 @@ SELECT id, s['value'] + 100, s['label'] || '_test' FROM
simple_struct ORDER BY i
2 300 beta_test
#####################
-# Section 11: Cleanup
+# Section 11: FilterExec Projection Pushdown - Handling Predicate Column
Requirements
+#####################
+
+# When pushing a projection down through FilterExec, the optimizer must ensure
the projection
+# includes all columns referenced in the filter predicate. If columns are
missing, the optimizer
+# augments (adds) them to the projection, filters the data, then removes them
from output.
+# These tests verify: column augmentation, complex predicates, column reuse,
and index remapping.
+
+###
+# Test 11.1: Baseline - No augmentation needed
+# When the projection includes 'id' and the predicate references 'id', no
extra columns
+# are added. The projection pushes down directly: FilterExec -> DataSourceExec
with [id, s['value']].
+###
+
+query TT
+EXPLAIN SELECT id, s['value'] FROM simple_struct WHERE id > 1;
+----
+logical_plan
+01)Projection: simple_struct.id, get_field(simple_struct.s, Utf8("value"))
+02)--Filter: simple_struct.id > Int64(1)
+03)----TableScan: simple_struct projection=[id, s],
partial_filters=[simple_struct.id > Int64(1)]
+physical_plan
+01)ProjectionExec: expr=[id@0 as id, get_field(s@1, value) as
simple_struct.s[value]]
+02)--FilterExec: id@0 > 1
+03)----DataSourceExec: file_groups={1 group:
[[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/projection_pushdown/simple.parquet]]},
projection=[id, s], file_type=parquet, predicate=id@0 > 1,
pruning_predicate=id_null_count@1 != row_count@2 AND id_max@0 > 1,
required_guarantees=[]
+
+# Verify correctness
+query II
+SELECT id, s['value'] FROM simple_struct WHERE id > 1 ORDER BY id LIMIT 2;
+----
+2 200
+3 150
+
+###
+# Test 11.2: Complex predicate with multiple logical branches
+# Query projects only s['value'], but predicate uses 'id' in multiple
conditions:
+# (id > 1 AND (id < 4 OR id = 5)). The optimizer adds 'id' for filtering, then
removes it.
+# Verifies all column references in AND/OR branches are correctly handled.
+###
+
+query TT
+EXPLAIN SELECT s['value'] FROM simple_struct WHERE id > 1 AND (id < 4 OR id =
5);
+----
+logical_plan
+01)Projection: get_field(simple_struct.s, Utf8("value"))
+02)--Filter: simple_struct.id > Int64(1) AND (simple_struct.id < Int64(4) OR
simple_struct.id = Int64(5))
+03)----TableScan: simple_struct projection=[id, s],
partial_filters=[simple_struct.id > Int64(1), simple_struct.id < Int64(4) OR
simple_struct.id = Int64(5)]
+physical_plan
+01)ProjectionExec: expr=[get_field(s@0, value) as simple_struct.s[value]]
+02)--FilterExec: id@0 > 1 AND (id@0 < 4 OR id@0 = 5), projection=[s@1]
+03)----DataSourceExec: file_groups={1 group:
[[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/projection_pushdown/simple.parquet]]},
projection=[id, s], file_type=parquet, predicate=id@0 > 1 AND (id@0 < 4 OR
id@0 = 5), pruning_predicate=id_null_count@1 != row_count@2 AND id_max@0 > 1
AND (id_null_count@1 != row_count@2 AND id_min@3 < 4 OR id_null_count@1 !=
row_count@2 AND id_min@3 <= 5 AND 5 <= id_max@0), required_guarantees=[]
+
+# Verify correctness - should return rows where (id > 1) AND ((id < 4) OR (id
= 5))
+# That's: id=2,3 (1<id<4) and id=5 (id=5 from second branch)
+query I
+SELECT s['value'] FROM simple_struct WHERE id > 1 AND (id < 4 OR id = 5) ORDER
BY s['value'];
+----
+150
+200
+250
+
+###
+# Test 11.3: Column reused in multiple conditions
+# When 'id' appears multiple times (id > 1 AND id < 5), it's added to the
projection
+# only once, and all predicate references point to the same column index.
+###
+
+query TT
+EXPLAIN SELECT s['value'] FROM simple_struct WHERE id > 1 AND id < 5;
+----
+logical_plan
+01)Projection: get_field(simple_struct.s, Utf8("value"))
+02)--Filter: simple_struct.id > Int64(1) AND simple_struct.id < Int64(5)
+03)----TableScan: simple_struct projection=[id, s],
partial_filters=[simple_struct.id > Int64(1), simple_struct.id < Int64(5)]
+physical_plan
+01)ProjectionExec: expr=[get_field(s@0, value) as simple_struct.s[value]]
+02)--FilterExec: id@0 > 1 AND id@0 < 5, projection=[s@1]
+03)----DataSourceExec: file_groups={1 group:
[[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/projection_pushdown/simple.parquet]]},
projection=[id, s], file_type=parquet, predicate=id@0 > 1 AND id@0 < 5,
pruning_predicate=id_null_count@1 != row_count@2 AND id_max@0 > 1 AND
id_null_count@1 != row_count@2 AND id_min@3 < 5, required_guarantees=[]
+
+# Verify correctness - should return rows where 1 < id < 5 (id=2,3,4)
+query I
+SELECT s['value'] FROM simple_struct WHERE id > 1 AND id < 5 ORDER BY
s['value'];
+----
+150
+200
+300
+
+###
+# Test 11.4: Core augmentation with column reordering
+# Projects: s['value'], s['label'], id (note 'id' is LAST in the SELECT list)
+# Predicate: id > 1
+# When 'id' is added for filtering, columns are reordered to match input
schema order.
+# Result: 'id' moves from SELECT position 3 to physical plan index 2.
+# Physical plan shows: FilterExec: id@2 > 1 (note the index after reordering).
+###
+
+query TT
+EXPLAIN SELECT s['value'], s['label'], id FROM simple_struct WHERE id > 1;
+----
+logical_plan
+01)Projection: get_field(simple_struct.s, Utf8("value")),
get_field(simple_struct.s, Utf8("label")), simple_struct.id
+02)--Filter: simple_struct.id > Int64(1)
+03)----TableScan: simple_struct projection=[id, s],
partial_filters=[simple_struct.id > Int64(1)]
+physical_plan
+01)ProjectionExec: expr=[get_field(s@1, value) as simple_struct.s[value],
get_field(s@1, label) as simple_struct.s[label], id@0 as id]
+02)--FilterExec: id@0 > 1
+03)----DataSourceExec: file_groups={1 group:
[[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/projection_pushdown/simple.parquet]]},
projection=[id, s], file_type=parquet, predicate=id@0 > 1,
pruning_predicate=id_null_count@1 != row_count@2 AND id_max@0 > 1,
required_guarantees=[]
Review Comment:
Amended in 37e75b1
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]