adragomir opened a new issue, #14540: URL: https://github.com/apache/datafusion/issues/14540
### Describe the bug With the following table structure: ``` CREATE OR REPLACE TABLE t1 ( date DATE, timestamp TIMESTAMP_S, ids STRUCT( id1 VARCHAR, extra INT4 ), struct STRUCT( var1 VARCHAR, extra VARCHAR ) ); ``` and the following query ``` WITH events AS ( SELECT ids.id1 as device, struct.var1 as user, timestamp FROM t1 WHERE date='2025-01-03' ) SELECT *, LAG(user, 1) OVER (PARTITION BY device ORDER BY timestamp) AS prev FROM events WHERE device IS NOT NULL AND device != '' AND user IS NOT NULL AND user != '' LIMIT 100 ``` I get into a situation where the 2 optimization steps given above fight between them: 1. First, push_down_filter pushes the filter down ``` Projection: events.device, events.user, events.timestamp, lag(events.user,Int64(1)) PARTITION BY [events.device] ORDER BY [events.timestamp ASC NULLS LAST] RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW AS prev Limit: skip=0, fetch=100 WindowAggr: windowExpr=[[lag(events.user, Int64(1)) PARTITION BY [events.device] ORDER BY [events.timestamp ASC NULLS LAST] RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW]] SubqueryAlias: events Projection: get_field(t1.ids, Utf8("id1")) AS device, get_field(t1.struct, Utf8("var1")) AS user, t1.timestamp Filter: get_field(t1.ids, Utf8("id1")) IS NOT NULL AND get_field(t1.ids, Utf8("id1")) != Utf8("") AND get_field(t1.struct, Utf8("var1")) IS NOT NULL AND get_field(t1.struct, Utf8("var1")) != Utf8("") AND t1._ACP_DATE = Date32("2025-01-03") TableScan: t1, partial_filters=[get_field(t1.ids, Utf8("id1")) IS NOT NULL, get_field(t1.ids, Utf8("id1")) != Utf8(""), get_field(t1.struct, Utf8("var1")) IS NOT NULL, get_field(t1.struct, Utf8("var1")) != Utf8(""), t1._ACP_DATE = Date32("2025-01-03")] ``` 2. Then the common_sub_expression_eliminate acts on the `Filter`, and of course, adds a `Projection` below it to contain the aliases: ``` Projection: events.DeviceId, events.UserId, events.timestamp, lag(events.UserId,Int64(1)) PARTITION BY [events.DeviceId] ORDER BY [events.timestamp ASC NULLS LAST] RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW AS PreviousUserColName Limit: skip=0, fetch=100 WindowAggr: windowExpr=[[lag(events.UserId, Int64(1)) PARTITION BY [events.DeviceId] ORDER BY [events.timestamp ASC NULLS LAST] RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW]] SubqueryAlias: events Projection: get_field(t1.endUserIDs, Utf8("id1")) AS DeviceId, get_field(t1.struct, Utf8("var1")) AS UserId, t1.timestamp Filter: __common_expr_1 IS NOT NULL AND __common_expr_1 != Utf8("") AND get_field(t1.struct, Utf8("var1")) IS NOT NULL AND get_field(t1.struct, Utf8("var1")) != Utf8("") AND t1._ACP_DATE = Date32("2025-01-03") Projection: get_field(t1.endUserIDs, Utf8("id1")) AS __common_expr_1, t1._ACP_DATE, t1.timestamp, t1.endUserIDs, t1.struct TableScan: t1 projection=[_ACP_DATE, timestamp, endUserIDs, struct], partial_filters=[get_field(t1.endUserIDs, Utf8("id1")) IS NOT NULL, get_field(t1.endUserIDs, Utf8("id1")) != Utf8(""), get_field(t1.struct, Utf8("var1")) IS NOT NULL, get_field(t1.struct, Utf8("var1")) != Utf8(""), t1._ACP_DATE = Date32("2025-01-03")] ``` 3. At the next run, `push_down_filter` sees a Filter on top a Projection and pushes it down, deleting the alias probably because Filters can't have aliases ``` Projection: events.DeviceId, events.UserId, events.timestamp, lag(events.UserId,Int64(1)) PARTITION BY [events.DeviceId] ORDER BY [events.timestamp ASC NULLS LAST] RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW AS PreviousUserColName Limit: skip=0, fetch=100 WindowAggr: windowExpr=[[lag(events.UserId, Int64(1)) PARTITION BY [events.DeviceId] ORDER BY [events.timestamp ASC NULLS LAST] RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW]] SubqueryAlias: events Projection: get_field(t1.endUserIDs, Utf8("id1")) AS DeviceId, get_field(t1.struct, Utf8("var1")) AS UserId, t1.timestamp Projection: get_field(t1.endUserIDs, Utf8("id1")) AS __common_expr_1, t1._ACP_DATE, t1.timestamp, t1.endUserIDs, t1.struct Filter: get_field(t1.endUserIDs, Utf8("id1")) IS NOT NULL AND get_field(t1.endUserIDs, Utf8("id1")) != Utf8("") AND get_field(t1.struct, Utf8("var1")) IS NOT NULL AND get_field(t1.struct, Utf8("var1")) != Utf8("") AND t1._ACP_DATE = Date32("2025-01-03") TableScan: t1 projection=[_ACP_DATE, timestamp, endUserIDs, struct], partial_filters=[get_field(t1.endUserIDs, Utf8("id1")) IS NOT NULL, get_field(t1.endUserIDs, Utf8("id1")) != Utf8(""), get_field(t1.struct, Utf8("var1")) IS NOT NULL, get_field(t1.struct, Utf8("var1")) != Utf8(""), t1._ACP_DATE = Date32("2025-01-03")] ``` And no progress is made. This seems like it could happen in other cases. Maybe we could have something like hints, like a flag on the logical plan that can say that the plan was added during a previous optimization step ? The problem seems complicated though. ### To Reproduce _No response_ ### Expected behavior _No response_ ### Additional context _No response_ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org