adragomir opened a new issue, #14540:
URL: https://github.com/apache/datafusion/issues/14540

   ### Describe the bug
   
   With the following table structure: 
   ```
   CREATE OR REPLACE TABLE t1 (
        date DATE,
        timestamp TIMESTAMP_S,
     ids STRUCT(
       id1 VARCHAR,
       extra INT4
     ),
     struct STRUCT(
       var1 VARCHAR,
       extra VARCHAR
     )
   );
   ```
   and the following query 
   
   ```
           WITH events AS (
               SELECT
                   ids.id1 as device,
                   struct.var1 as user,
                   timestamp
               FROM t1
               WHERE
                   date='2025-01-03'
           )
           SELECT
               *,
               LAG(user, 1) OVER (PARTITION BY device ORDER BY timestamp) AS 
prev
           FROM events
           WHERE
               device IS NOT NULL AND device != ''
               AND user IS NOT NULL AND user != ''
           LIMIT 100
   
   ```
   
   I get into a situation where the 2 optimization steps given above fight 
between them: 
   1. First, push_down_filter pushes the filter down
   
   ```
       Projection: events.device, events.user, events.timestamp, 
lag(events.user,Int64(1)) PARTITION BY [events.device] ORDER BY 
[events.timestamp ASC NULLS LAST] RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT 
ROW AS prev
         Limit: skip=0, fetch=100
           WindowAggr: windowExpr=[[lag(events.user, Int64(1)) PARTITION BY 
[events.device] ORDER BY [events.timestamp ASC NULLS LAST] RANGE BETWEEN 
UNBOUNDED PRECEDING AND CURRENT ROW]]
             SubqueryAlias: events
               Projection: get_field(t1.ids, Utf8("id1")) AS device, 
get_field(t1.struct, Utf8("var1")) AS user, t1.timestamp
                 Filter: get_field(t1.ids, Utf8("id1")) IS NOT NULL AND 
get_field(t1.ids, Utf8("id1")) != Utf8("") AND get_field(t1.struct, 
Utf8("var1")) IS NOT NULL AND get_field(t1.struct, Utf8("var1")) != Utf8("") 
AND t1._ACP_DATE = Date32("2025-01-03")
                   TableScan: t1, partial_filters=[get_field(t1.ids, 
Utf8("id1")) IS NOT NULL, get_field(t1.ids, Utf8("id1")) != Utf8(""), 
get_field(t1.struct, Utf8("var1")) IS NOT NULL, get_field(t1.struct, 
Utf8("var1")) != Utf8(""), t1._ACP_DATE = Date32("2025-01-03")]
   ```
   
   2. Then the common_sub_expression_eliminate acts on the `Filter`, and of 
course, adds a `Projection` below it to contain the aliases: 
   
   ```
       Projection: events.DeviceId, events.UserId, events.timestamp, 
lag(events.UserId,Int64(1)) PARTITION BY [events.DeviceId] ORDER BY 
[events.timestamp ASC NULLS LAST] RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT 
ROW AS PreviousUserColName
         Limit: skip=0, fetch=100
           WindowAggr: windowExpr=[[lag(events.UserId, Int64(1)) PARTITION BY 
[events.DeviceId] ORDER BY [events.timestamp ASC NULLS LAST] RANGE BETWEEN 
UNBOUNDED PRECEDING AND CURRENT ROW]]
             SubqueryAlias: events
               Projection: get_field(t1.endUserIDs, Utf8("id1")) AS DeviceId, 
get_field(t1.struct, Utf8("var1")) AS UserId, t1.timestamp
                 Filter: __common_expr_1 IS NOT NULL AND __common_expr_1 != 
Utf8("") AND get_field(t1.struct, Utf8("var1")) IS NOT NULL AND 
get_field(t1.struct, Utf8("var1")) != Utf8("") AND t1._ACP_DATE = 
Date32("2025-01-03")
                   Projection: get_field(t1.endUserIDs, Utf8("id1")) AS 
__common_expr_1, t1._ACP_DATE, t1.timestamp, t1.endUserIDs, t1.struct
                     TableScan: t1 projection=[_ACP_DATE, timestamp, 
endUserIDs, struct], partial_filters=[get_field(t1.endUserIDs, Utf8("id1")) IS 
NOT NULL, get_field(t1.endUserIDs, Utf8("id1")) != Utf8(""), 
get_field(t1.struct, Utf8("var1")) IS NOT NULL, get_field(t1.struct, 
Utf8("var1")) != Utf8(""), t1._ACP_DATE = Date32("2025-01-03")]
   ```
   
   3. At the next run, `push_down_filter` sees a Filter on top a Projection and 
pushes it down, deleting the alias probably because Filters can't have aliases 
   
   ```
       Projection: events.DeviceId, events.UserId, events.timestamp, 
lag(events.UserId,Int64(1)) PARTITION BY [events.DeviceId] ORDER BY 
[events.timestamp ASC NULLS LAST] RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT 
ROW AS PreviousUserColName
         Limit: skip=0, fetch=100
           WindowAggr: windowExpr=[[lag(events.UserId, Int64(1)) PARTITION BY 
[events.DeviceId] ORDER BY [events.timestamp ASC NULLS LAST] RANGE BETWEEN 
UNBOUNDED PRECEDING AND CURRENT ROW]]
             SubqueryAlias: events
               Projection: get_field(t1.endUserIDs, Utf8("id1")) AS DeviceId, 
get_field(t1.struct, Utf8("var1")) AS UserId, t1.timestamp
                 Projection: get_field(t1.endUserIDs, Utf8("id1")) AS 
__common_expr_1, t1._ACP_DATE, t1.timestamp, t1.endUserIDs, t1.struct
                   Filter: get_field(t1.endUserIDs, Utf8("id1")) IS NOT NULL 
AND get_field(t1.endUserIDs, Utf8("id1")) != Utf8("") AND get_field(t1.struct, 
Utf8("var1")) IS NOT NULL AND get_field(t1.struct, Utf8("var1")) != Utf8("") 
AND t1._ACP_DATE = Date32("2025-01-03")
                     TableScan: t1 projection=[_ACP_DATE, timestamp, 
endUserIDs, struct], partial_filters=[get_field(t1.endUserIDs, Utf8("id1")) IS 
NOT NULL, get_field(t1.endUserIDs, Utf8("id1")) != Utf8(""), 
get_field(t1.struct, Utf8("var1")) IS NOT NULL, get_field(t1.struct, 
Utf8("var1")) != Utf8(""), t1._ACP_DATE = Date32("2025-01-03")]
   ```
   
   And no progress is made. 
   
   This seems like it could happen in other cases. Maybe we could have 
something like hints, like a flag on the logical plan that can say that the 
plan was added during a previous optimization step ? The problem seems 
complicated though. 
   
   
   
   ### To Reproduce
   
   _No response_
   
   ### Expected behavior
   
   _No response_
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Reply via email to