Hello,

I hope this is the right place to ask this.

While working on a project based on arrow-datafusion, I came across some
unexpected behavior: a projection was not eliminated as expected, which broke
an assumption made by a custom optimizer rule (I won't go into the rule's
details here, as they are not important for this question).

Specifically, I found that a logical plan like:

Projection [col1, col2]
     TableScan projection=[col1, col2, col3], full_filters=[ ... col3 ...]

was not simplified to:

TableScan projection=[col1, col2], full_filters=[... col3 ...]
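To make my expectation concrete: since the full_filters run inside the scan,
I assumed the optimizer could shrink the scan's projection to exactly the
columns the outer projection needs, and then drop the outer Projection as
redundant. A toy sketch of that intuition in plain Rust over column names
(my own illustration, not actual DataFusion code):

    /// Toy sketch (not DataFusion code): the rewrite I expected keeps only
    /// the columns the outer projection asks for, regardless of which
    /// columns the pushed-down filters reference.
    fn expected_scan_projection<'a>(
        outer_projection: &[&'a str],
        scan_projection: &[&'a str],
    ) -> Vec<&'a str> {
        scan_projection
            .iter()
            .copied()
            .filter(|col| outer_projection.contains(col))
            .collect()
    }

    fn main() {
        // Mirrors the plan above: the outer projection wants [col1, col2],
        // while the scan currently reads [col1, col2, col3] for the filter.
        let new_projection =
            expected_scan_projection(&["col1", "col2"], &["col1", "col2", "col3"]);
        // Expected result: the scan shrinks to [col1, col2], making the
        // outer Projection redundant.
        assert_eq!(new_projection, vec!["col1", "col2"]);
    }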

This does seem to be intentional, as I found this
snippet <https://github.com/apache/arrow-datafusion/blob/c97048d178594b10b813c6bcd1543f157db4ba3f/datafusion/optimizer/src/push_down_projection.rs#L174>
in the optimizer rule:

...
LogicalPlan::TableScan(scan)
    if !scan.projected_schema.fields().is_empty() =>
{
    let mut used_columns: HashSet<Column> = HashSet::new();
    // filter expr may not exist in expr in projection.
    // like: TableScan: t1 projection=[bool_col, int_col], full_filters=[t1.id = Int32(1)]
    // projection=[bool_col, int_col] don't contain `t1.id`.
    exprlist_to_columns(&scan.filters, &mut used_columns)?;
...
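If I read this correctly, the rule collects the columns referenced by the
pushed-down filters and treats them as required scan output, so the scan's
projection ends up being the union of the outer projection's columns and the
filters' columns. A toy illustration of that behavior (again plain Rust over
column names, not the actual implementation):

    use std::collections::HashSet;

    /// Toy sketch (not DataFusion code): what the rule appears to compute.
    /// The scan must output every column referenced by either the outer
    /// projection or a pushed-down filter, so `col3` survives.
    fn actual_scan_columns<'a>(
        projection_columns: &[&'a str],
        filter_columns: &[&'a str],
    ) -> HashSet<&'a str> {
        projection_columns
            .iter()
            .chain(filter_columns)
            .copied()
            .collect()
    }

    fn main() {
        let required = actual_scan_columns(&["col1", "col2"], &["col3"]);
        // The scan keeps reading col3, so its output schema differs from
        // [col1, col2] and the outer Projection cannot be removed.
        assert!(required.contains("col3"));
    }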

However, the comment does not explain why the extra projection must be kept
and the extra column returned. After all, the filters inside the scan are
internal to that scan and should not affect the plan around it, right?

I am looking forward to any opinions.

Best regards, Markus Appel

