peter-toth commented on code in PR #10333:
URL: https://github.com/apache/datafusion/pull/10333#discussion_r1595242837
##########
datafusion/optimizer/src/common_subexpr_eliminate.rs:
##########
@@ -656,24 +656,16 @@ enum VisitRecord {
EnterMark(usize),
/// the node's children were skipped => jump to f_up on same node
JumpMark,
- /// Accumulated identifier of sub expression.
- ExprItem(Identifier),
}
impl ExprIdentifierVisitor<'_> {
/// Find the first `EnterMark` in the stack, and accumulates every `ExprItem`
/// before it.
- fn pop_enter_mark(&mut self) -> Option<(usize, Identifier)> {
- let mut desc = String::new();
-
- while let Some(item) = self.visit_stack.pop() {
+ fn pop_enter_mark(&mut self) -> Option<usize> {
Review Comment:
We shouldn't change this part.
The logic that builds up an identifier using `visit_stack` and the 3 kinds of
`VisitRecord` is necessary, and it is actually a very clever way to build up an
identifier from the current node and its sub-identifiers. (Making an identifier
a `String` was not such a clever decision and will be fixed in
https://github.com/apache/datafusion/issues/10426, but that's a different
issue.)
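To illustrate the mechanism, here is a minimal, self-contained sketch of the stack-based accumulation. It is not the actual DataFusion code: `Node`, `IdBuilder`, `text` and `identifiers` are hypothetical stand-ins, `JumpMark` is omitted, and the surrounding braces of the real identifiers are left out.

```rust
// Hypothetical, simplified expression tree (stand-in for `Expr`).
enum Node {
    Leaf(&'static str),
    Add(Box<Node>, Box<Node>),
}

enum VisitRecord {
    /// Marks where the traversal of a node began (payload: pre-order index).
    EnterMark(usize),
    /// Accumulated identifier of an already finished sub-expression.
    ExprItem(String),
}

struct IdBuilder {
    visit_stack: Vec<VisitRecord>,
    next_index: usize,
    /// (pre-order index, identifier) pairs, pushed in post-order.
    ids: Vec<(usize, String)>,
}

/// Plain stringified form of a node, e.g. `a + b`.
fn text(node: &Node) -> String {
    match node {
        Node::Leaf(name) => (*name).to_string(),
        Node::Add(l, r) => format!("{} + {}", text(l), text(r)),
    }
}

impl IdBuilder {
    /// Pops records until the nearest `EnterMark`, concatenating every
    /// `ExprItem` met on the way (children come off the stack in reverse).
    fn pop_enter_mark(&mut self) -> Option<(usize, String)> {
        let mut desc = String::new();
        while let Some(item) = self.visit_stack.pop() {
            match item {
                VisitRecord::EnterMark(idx) => return Some((idx, desc)),
                VisitRecord::ExprItem(id) => {
                    desc.push('|');
                    desc.push_str(&id);
                }
            }
        }
        None
    }

    fn visit(&mut self, node: &Node) {
        // "f_down": remember where this node started.
        self.visit_stack.push(VisitRecord::EnterMark(self.next_index));
        self.next_index += 1;
        if let Node::Add(l, r) = node {
            self.visit(l);
            self.visit(r);
        }
        // "f_up": own text plus the identifiers of all children.
        if let Some((idx, sub_ids)) = self.pop_enter_mark() {
            let id = format!("{}{}", text(node), sub_ids);
            self.ids.push((idx, id.clone()));
            self.visit_stack.push(VisitRecord::ExprItem(id));
        }
    }
}

/// Returns the identifier of every node of the tree rooted at `root`.
fn identifiers(root: &Node) -> Vec<(usize, String)> {
    let mut builder = IdBuilder { visit_stack: vec![], next_index: 0, ids: vec![] };
    builder.visit(root);
    builder.ids
}
```

In this sketch the root of `a + b` gets the identifier `a + b|b|a`, while a single leaf named `a + b` gets just `a + b`, so the two stay distinct.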
This PR shouldn't change what an identifier is or how it is built up,
otherwise we end up with identifier collision bugs again. The `IdArray`,
`ExprStats` and `CommonExprs` data structures require an identifier to represent
a full expression subtree. This means that:
```
fn expr_identifier(expr: &Expr) -> Identifier {
format!("#{{{expr}}}")
}
```
would cause bugs, as shown in point 1 of
https://github.com/apache/datafusion/pull/10396.
I.e. if we encountered both `col("a") + col("b")` and `col("a + b")` in the
expression list to be CSEd and used `"{expr}"` (the non-unique stringified
representation) as identifiers, then the identical identifiers (`"a + b"`) of
those 2 different expressions would collide. We would count 2 occurrences for
one of the 2 expressions (and the other expression's count would be lost),
resulting in wrong CSE.
Please note that currently the identifier of `col("a") + col("b")` is `"{a +
b|b|a}"` so it doesn't collide with `col("a + b")`'s identifier: `"{a + b}"`.
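The counting problem can be reproduced outside DataFusion with a small sketch. Everything here is a hypothetical stand-in: the `Expr` enum is a toy with only columns and addition, and `naive_identifier`, `subtree_identifier` and `count_by_identifier` are illustrative names, not DataFusion APIs. The subtree variant only mimics the `"{a + b|b|a}"` shape mentioned above.

```rust
use std::collections::HashMap;
use std::fmt;

// Hypothetical, simplified stand-in for DataFusion's `Expr`.
#[derive(Debug, Clone)]
enum Expr {
    Column(String),
    Add(Box<Expr>, Box<Expr>),
}

impl fmt::Display for Expr {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Expr::Column(name) => write!(f, "{name}"),
            Expr::Add(l, r) => write!(f, "{l} + {r}"),
        }
    }
}

/// Identifier made only from the stringified expression (collision-prone).
fn naive_identifier(expr: &Expr) -> String {
    format!("{{{expr}}}")
}

/// Stringified expression followed by the descriptions of its children,
/// right child first (the order in which a stack would hand them back).
fn subtree_description(expr: &Expr) -> String {
    match expr {
        Expr::Column(name) => name.clone(),
        Expr::Add(l, r) => format!(
            "{expr}|{}|{}",
            subtree_description(r),
            subtree_description(l)
        ),
    }
}

/// Identifier that encodes the whole subtree, e.g. `{a + b|b|a}`.
fn subtree_identifier(expr: &Expr) -> String {
    format!("{{{}}}", subtree_description(expr))
}

/// Counts occurrences per identifier, as a CSE pass would do conceptually.
fn count_by_identifier(
    exprs: &[Expr],
    id_fn: fn(&Expr) -> String,
) -> HashMap<String, usize> {
    let mut counts = HashMap::new();
    for expr in exprs {
        *counts.entry(id_fn(expr)).or_insert(0) += 1;
    }
    counts
}
```

With the two expressions `col("a") + col("b")` and `col("a + b")` as input, `naive_identifier` yields a single entry `{a + b}` with count 2, whereas `subtree_identifier` yields two entries (`{a + b|b|a}` and `{a + b}`), each with count 1.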
Again, this is hard to test now because of the resolution bug:
https://github.com/apache/datafusion/issues/10413.
I.e. if we wrote a test where we have
```
select a + b, "a + b" from (
select 1 as a, 2 as b, 1 as "a + b"
)
```
then currently it gets resolved as
```
select "a + b", "a + b" from (
select 1 as a, 2 as b, 1 as "a + b"
)
```
and this prevents me from creating a test case for CSE identifier collision.
(Please note that I'm simplifying the identifier collision example, as simple
columns (`col("a + b")`) are not subject to CSE.)
What this PR can do is change the aliases (use something other than
identifiers) to make the plans more readable.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]