hozan23 opened a new issue, #19353:
URL: https://github.com/apache/datafusion/issues/19353

   ### Describe the bug
   
   When a query contains a `GROUP BY` clause, DataFusion collapses complex 
projection expressions into a single Column whose name is a stringified version 
of the full expression. As a result, the original expression tree (`Expr`) is 
lost from the projection and survives only inside the expressions of the 
`Aggregate` plan node.
   
   For the same projection expression:
   
   Without `GROUP BY`: the projection remains a fully composed `Expr` (e.g. 
Cast, ScalarFunction, etc.).
   
   With `GROUP BY`: the projection is replaced with `Alias(Column(<stringified 
expression>))`.
   
   This prevents analyzer or optimizer rules from inspecting or rewriting the 
original expression.
   
   ### To Reproduce
   
   
   ```rust
   use std::sync::Arc;
   
   use datafusion::{
       arrow::{
           array::{Int32Array, RecordBatch, StringArray, TimestampSecondArray},
           datatypes::{DataType, Field, Schema, TimeUnit},
       },
       common::tree_node::{Transformed, TreeNode},
       datasource::MemTable,
       execution::{SessionStateBuilder, context::SessionContext},
       logical_expr::LogicalPlan,
   };
   
   fn create_test_session() -> SessionContext {
       let state = SessionStateBuilder::new().with_default_features().build();
       let ctx = SessionContext::new_with_state(state);
   
       let schema = Arc::new(Schema::new(vec![
           Field::new("id", DataType::Int32, false),
           Field::new("name", DataType::Utf8, false),
           Field::new(
               "_modifiedDateTime",
               DataType::Timestamp(TimeUnit::Second, None),
               false,
           ),
       ]));
   
       let batch = RecordBatch::try_new(
           schema.clone(),
           vec![
               Arc::new(Int32Array::from(vec![1, 2, 3])),
               Arc::new(StringArray::from(vec!["Alice", "Bob", "Charlie"])),
               Arc::new(TimestampSecondArray::from(vec![
                   chrono::Utc::now().timestamp(),
                   chrono::Utc::now().timestamp(),
                   chrono::Utc::now().timestamp(),
               ])),
           ],
       )
       .unwrap();
   
       let table = MemTable::try_new(schema, vec![vec![batch]]).unwrap();
       ctx.register_table("users", Arc::new(table)).unwrap();
   
       ctx
   }
   
   #[tokio::main]
   async fn main() {
       let ctx = create_test_session();
       let sql = r#"
           SELECT 
               CAST(
                   DATE_FORMAT(
                       CAST(
                           TO_DATE(
                               DATE_FORMAT(
                                   CAST(`_modifiedDateTime` AS TIMESTAMP), 
                                   '%Y-%m-%dT%H:%i:%S'
                               ), 
                               '%Y-%m-%dT%H:%i:%S'
                           ) AS TIMESTAMP 
                       ), 
                       '%Y-%m-%d %H:%i:%s'
                   ) AS TIMESTAMP 
               ) AS qt_5anmnq5myd
           FROM users group by qt_5anmnq5myd;
       "#;
   
       let plan = ctx.state().create_logical_plan(sql).await.unwrap();
   
       plan.transform(|p| match p {
           LogicalPlan::Projection(ref projection) => {
               println!("DEBUG {:?}", projection.expr);
               Ok(Transformed::no(p))
           }
           _ => Ok(Transformed::no(p)),
       })
       .unwrap();
   }
   
   ```
   
   ### Expected behavior
   
   The projection expression should remain a structured `Expr` tree regardless 
of whether a `GROUP BY` clause is present.
   
   Specifically:
   
   The logical plan should preserve the original composed expression (Cast, 
ScalarFunction, etc.).
   
   Analyzer and optimizer rules should be able to traverse and rewrite the 
projection expression consistently for both grouped and non-grouped queries.
   
   ### Additional context
   
   Tracing through `datafusion/sql/src/select.rs` shows that the expression is 
rewritten during aggregation planning:
   
   ```
   if !group_by_exprs.is_empty() || !aggr_exprs.is_empty() {
       self.aggregate(...)
   }
   ```
   
   
   Before aggregation: `select_exprs` contains the fully composed `Expr`.
   
   After aggregation: `select_exprs_post_aggr` contains 
`Alias(Column(<stringified expression>))`, and the schema contains only this 
single stringified column.
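   
   The collapse described above can be illustrated with a small, self-contained 
toy model. This is only a sketch: the `Expr` enum and the `rebase_on_group_by` 
function below are hypothetical stand-ins written for this example, not 
DataFusion's actual types or planner code. It assumes the planner replaces any 
projection subtree that equals a `GROUP BY` expression with a column reference 
named after that subtree's string form.
   
   ```rust
   use std::fmt;
   
   // Toy stand-in for an expression tree; NOT DataFusion's `Expr`.
   #[derive(Clone, Debug, PartialEq)]
   enum Expr {
       Column(String),
       Literal(String),
       Cast(Box<Expr>, String),
       ScalarFunction(String, Vec<Expr>),
       Alias(Box<Expr>, String),
   }
   
   impl fmt::Display for Expr {
       fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
           match self {
               Expr::Column(name) => write!(f, "{name}"),
               Expr::Literal(value) => write!(f, "'{value}'"),
               Expr::Cast(inner, ty) => write!(f, "CAST({inner} AS {ty})"),
               Expr::ScalarFunction(name, args) => {
                   let args: Vec<String> = args.iter().map(|a| a.to_string()).collect();
                   write!(f, "{name}({})", args.join(", "))
               }
               Expr::Alias(inner, name) => write!(f, "{inner} AS {name}"),
           }
       }
   }
   
   // Hypothetical model of the rewrite: any subtree equal to a group-by
   // expression becomes a column named after the subtree's string form.
   fn rebase_on_group_by(expr: &Expr, group_by: &[Expr]) -> Expr {
       if group_by.contains(expr) {
           return Expr::Column(expr.to_string());
       }
       match expr {
           Expr::Cast(inner, ty) => {
               Expr::Cast(Box::new(rebase_on_group_by(inner, group_by)), ty.clone())
           }
           Expr::ScalarFunction(name, args) => Expr::ScalarFunction(
               name.clone(),
               args.iter().map(|a| rebase_on_group_by(a, group_by)).collect(),
           ),
           Expr::Alias(inner, name) => {
               Expr::Alias(Box::new(rebase_on_group_by(inner, group_by)), name.clone())
           }
           leaf => leaf.clone(),
       }
   }
   
   fn main() {
       // A composed expression, analogous to the one in the reproduction.
       let composed = Expr::Cast(
           Box::new(Expr::ScalarFunction(
               "date_format".into(),
               vec![Expr::Column("ts".into()), Expr::Literal("%Y-%m-%d".into())],
           )),
           "TIMESTAMP".into(),
       );
       let select_expr = Expr::Alias(Box::new(composed.clone()), "qt".into());
   
       // The query groups by the same composed expression.
       let post_aggr = rebase_on_group_by(&select_expr, &[composed]);
       println!("before: {select_expr:?}");
       println!("after:  {post_aggr:?}");
   }
   ```
   
   In this model the "after" value is an `Alias` wrapping a single `Column`, so 
a rule that only looks at the projection no longer sees the `Cast` or 
`ScalarFunction` nodes.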
   
   This causes problems for downstream `AnalyzerRule` implementations. In our 
case, we have an analyzer that rewrites schemas to match a remote database 
schema, but for grouped queries we can no longer identify or rewrite the 
original projection expression.
   
   We implemented a workaround that attempts to reconstruct the projection 
expression by matching it back to the `GROUP BY` expression and re-stringifying 
it, but this is fragile and not ideal.
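   
   A minimal sketch of why string matching of this kind is fragile. The helper 
`recover_original` is hypothetical and is not our actual workaround code; plain 
strings stand in for expression trees, and the only assumption is that each 
group-by expression can be rendered with `Display`.
   
   ```rust
   use std::fmt::Display;
   
   // Hypothetical helper: find the group-by expression whose string form
   // equals the name of a collapsed projection column.
   fn recover_original<'a, E: Display>(column_name: &str, group_by: &'a [E]) -> Option<&'a E> {
       group_by.iter().find(|e| e.to_string() == column_name)
   }
   
   fn main() {
       // Strings stand in for expression trees; any `Display` type works.
       let group_by = vec!["CAST(date_format(ts) AS TIMESTAMP)".to_string()];
   
       // An exact textual match recovers the expression.
       let hit = recover_original("CAST(date_format(ts) AS TIMESTAMP)", &group_by);
       assert!(hit.is_some());
   
       // Any formatting difference (here, letter case) defeats the lookup.
       let miss = recover_original("cast(date_format(ts) AS TIMESTAMP)", &group_by);
       assert!(miss.is_none());
   }
   ```
   
   The lookup succeeds only while both sides are stringified identically, which 
is why this approach breaks if the column name and the group-by expression are 
ever formatted differently.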
   
   A possible improvement would be to delay collapsing projection expressions 
into stringified Columns until later (e.g. after analyzer rules or during 
optimization planning), so that expression structure is preserved during 
logical analysis.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

