Re: [PR] Introduce `CompoundExpr` to handle the complex field access chain [datafusion-sqlparser-rs]

via GitHub Fri, 06 Dec 2024 08:41:20 -0800


iffyio commented on code in PR #1551:
URL: 
https://github.com/apache/datafusion-sqlparser-rs/pull/1551#discussion_r1871791291



##########
src/dialect/snowflake.rs:
##########
@@ -234,6 +234,10 @@ impl Dialect for SnowflakeDialect {
             RESERVED_FOR_IDENTIFIER.contains(&kw)
         }
     }
+
+    fn supports_partiql(&self) -> bool {
+        true
+    }

Review Comment:
   Ah I don't think this is necessarily correct since partiql is a redshift 
feature, was this required somewhere?



##########
src/ast/mod.rs:
##########
@@ -1094,6 +1053,25 @@ impl fmt::Display for Subscript {
     }
 }
 
+/// The contents inside the `.` in an access chain.
+/// It can be an expression or a subscript.
+#[derive(Debug, Clone, PartialEq, PartialOrd, Eq, Ord, Hash)]
+#[cfg_attr(feature = "serde", derive(Serialize, Deserialize))]
+#[cfg_attr(feature = "visitor", derive(Visit, VisitMut))]
+pub enum AccessField {
+    Expr(Expr),
+    Subscript(Subscript),

Review Comment:
   Can we add a doc  comment on the individual variant mentioning an example? 
e.g.
   ```sql
   `root.field1`
   `root[1:2:3]`
   ```
   so that its clear what the syntax looks like if someone ends up on the enum 
definition.



##########
src/ast/mod.rs:
##########
@@ -1094,6 +1053,25 @@ impl fmt::Display for Subscript {
     }
 }
 
+/// The contents inside the `.` in an access chain.
+/// It can be an expression or a subscript.
+#[derive(Debug, Clone, PartialEq, PartialOrd, Eq, Ord, Hash)]
+#[cfg_attr(feature = "serde", derive(Serialize, Deserialize))]
+#[cfg_attr(feature = "visitor", derive(Visit, VisitMut))]
+pub enum AccessField {
+    Expr(Expr),

Review Comment:
   Would it make sense to call this e.g `Dot(Expr)` or similar? Thinking if the 
idea is to include other styles later and this variant is specific to `.`



##########
src/parser/mod.rs:
##########
@@ -1144,53 +1144,52 @@ impl<'a> Parser<'a> {
         w_span: Span,
     ) -> Result<Expr, ParserError> {
         match self.peek_token().token {
-            Token::LParen | Token::Period => {
-                let mut id_parts: Vec<Ident> = vec![w.to_ident(w_span)];
-                let mut ending_wildcard: Option<TokenWithSpan> = None;
-                while self.consume_token(&Token::Period) {
-                    let next_token = self.next_token();
-                    match next_token.token {
-                        Token::Word(w) => 
id_parts.push(w.to_ident(next_token.span)),
-                        Token::Mul => {
-                            // Postgres explicitly allows funcnm(tablenm.*) 
and the
-                            // function array_agg traverses this control flow
-                            if dialect_of!(self is PostgreSqlDialect) {
-                                ending_wildcard = Some(next_token);
-                                break;
-                            } else {
-                                return self.expected("an identifier after 
'.'", next_token);
-                            }
-                        }
-                        Token::SingleQuotedString(s) => 
id_parts.push(Ident::with_quote('\'', s)),
-                        _ => {
-                            return self.expected("an identifier or a '*' after 
'.'", next_token);
-                        }
+            Token::Period => 
self.parse_compound_expr(Expr::Identifier(w.to_ident(w_span)), vec![]),
+            Token::LParen => {
+                let id_parts = vec![w.to_ident(w_span)];
+                // parse_comma_outer_join is used to parse the following 
pattern:
+                if dialect_of!(self is SnowflakeDialect | MsSqlDialect)
+                    && self.consume_tokens(&[Token::LParen, Token::Plus, 
Token::RParen])
+                {
+                    Ok(Expr::OuterJoin(Box::new(
+                        match <[Ident; 1]>::try_from(id_parts) {
+                            Ok([ident]) => Expr::Identifier(ident),
+                            Err(parts) => Expr::CompoundIdentifier(parts),
+                        },
+                    )))
+                } else {
+                    let mut expr = self.parse_function(ObjectName(id_parts))?;
+                    // consume all period if it's a method chain
+                    if self.dialect.supports_methods() {
+                        expr = self.try_parse_method(expr)?
                     }
-                }
-
-                if let Some(wildcard_token) = ending_wildcard {
-                    Ok(Expr::QualifiedWildcard(
-                        ObjectName(id_parts),
-                        AttachedToken(wildcard_token),
-                    ))
-                } else if self.consume_token(&Token::LParen) {
-                    if dialect_of!(self is SnowflakeDialect | MsSqlDialect)
-                        && self.consume_tokens(&[Token::Plus, Token::RParen])
-                    {
-                        Ok(Expr::OuterJoin(Box::new(
-                            match <[Ident; 1]>::try_from(id_parts) {
-                                Ok([ident]) => Expr::Identifier(ident),
-                                Err(parts) => Expr::CompoundIdentifier(parts),
-                            },
-                        )))
-                    } else {
-                        self.prev_token();
-                        self.parse_function(ObjectName(id_parts))
+                    let mut fields = vec![];
+                    // if the function returns an array, it can be subscripted
+                    if self.consume_token(&Token::LBracket) {
+                        self.parse_multi_dim_subscript(&mut fields)?;

Review Comment:
   Oh I think here reading I would have expected that 
`self.parse_compound_expr` below takes care of subscript :thinking: the doc 
comment on the the function seems to suggest that it would (the `a.b[1].c` 
example)



##########
src/ast/mod.rs:
##########
@@ -1094,6 +1053,25 @@ impl fmt::Display for Subscript {
     }
 }
 
+/// The contents inside the `.` in an access chain.
+/// It can be an expression or a subscript.
+#[derive(Debug, Clone, PartialEq, PartialOrd, Eq, Ord, Hash)]
+#[cfg_attr(feature = "serde", derive(Serialize, Deserialize))]
+#[cfg_attr(feature = "visitor", derive(Visit, VisitMut))]
+pub enum AccessField {

Review Comment:
   maybe the enum can be called `AccessExpr`? thinking field might rather 
suggest that its a scalar/property



##########
src/ast/mod.rs:
##########
@@ -1094,6 +1053,25 @@ impl fmt::Display for Subscript {
     }
 }
 
+/// The contents inside the `.` in an access chain.

Review Comment:
   is the 'inside the `.`' part correct given its not necessarily delimited by 
`.`?



##########
src/ast/mod.rs:
##########
@@ -624,6 +590,12 @@ pub enum Expr {
     Identifier(Ident),
     /// Multi-part identifier, e.g. `table_alias.column` or `schema.table.col`
     CompoundIdentifier(Vec<Ident>),
+    /// Multi-part Expression accessing. It's used to represent an access 
chain from a root expression.
+    /// e.g. `expr[0]`, `expr[0][0]`, or `expr.field1.filed2[1].field3`, ...

Review Comment:
   ```suggestion
       /// e.g. `expr[0]`, `expr[0][0]`, or `expr.field1.field2[1].field3`, ...
   ```



##########
src/parser/mod.rs:
##########
@@ -1144,53 +1144,52 @@ impl<'a> Parser<'a> {
         w_span: Span,
     ) -> Result<Expr, ParserError> {
         match self.peek_token().token {
-            Token::LParen | Token::Period => {
-                let mut id_parts: Vec<Ident> = vec![w.to_ident(w_span)];
-                let mut ending_wildcard: Option<TokenWithSpan> = None;
-                while self.consume_token(&Token::Period) {
-                    let next_token = self.next_token();
-                    match next_token.token {
-                        Token::Word(w) => 
id_parts.push(w.to_ident(next_token.span)),
-                        Token::Mul => {
-                            // Postgres explicitly allows funcnm(tablenm.*) 
and the
-                            // function array_agg traverses this control flow
-                            if dialect_of!(self is PostgreSqlDialect) {
-                                ending_wildcard = Some(next_token);
-                                break;
-                            } else {
-                                return self.expected("an identifier after 
'.'", next_token);
-                            }
-                        }
-                        Token::SingleQuotedString(s) => 
id_parts.push(Ident::with_quote('\'', s)),
-                        _ => {
-                            return self.expected("an identifier or a '*' after 
'.'", next_token);
-                        }
+            Token::Period => 
self.parse_compound_expr(Expr::Identifier(w.to_ident(w_span)), vec![]),
+            Token::LParen => {
+                let id_parts = vec![w.to_ident(w_span)];
+                // parse_comma_outer_join is used to parse the following 
pattern:
+                if dialect_of!(self is SnowflakeDialect | MsSqlDialect)
+                    && self.consume_tokens(&[Token::LParen, Token::Plus, 
Token::RParen])
+                {
+                    Ok(Expr::OuterJoin(Box::new(
+                        match <[Ident; 1]>::try_from(id_parts) {
+                            Ok([ident]) => Expr::Identifier(ident),
+                            Err(parts) => Expr::CompoundIdentifier(parts),
+                        },
+                    )))
+                } else {
+                    let mut expr = self.parse_function(ObjectName(id_parts))?;
+                    // consume all period if it's a method chain
+                    if self.dialect.supports_methods() {

Review Comment:
   `try_parse_method` already does the `if self.dialect.supports_methods()` 
check so that we should be able to skip the if condition?



##########
src/lib.rs:
##########
@@ -138,6 +138,7 @@ extern crate alloc;
 #[macro_use]
 #[cfg(test)]
 extern crate pretty_assertions;
+extern crate core;

Review Comment:
   Oh can this be removed if unused?



##########
src/parser/mod.rs:
##########
@@ -1427,6 +1426,112 @@ impl<'a> Parser<'a> {
         }
     }
 
+    /// Try to parse an [Expr::CompoundExpr] like `a.b.c` or `a.b[1].c`.
+    /// If all the fields are `Expr::Identifier`s, return an 
[Expr::CompoundIdentifier] instead.
+    /// If only the root exists, return the root.
+    /// If self supports [Dialect::supports_partiql], it will fall back when 
occurs [Token::LBracket] for JsonAccess parsing.
+    pub fn parse_compound_expr(
+        &mut self,
+        root: Expr,
+        mut chain: Vec<AccessField>,
+    ) -> Result<Expr, ParserError> {
+        let mut ending_wildcard: Option<TokenWithSpan> = None;
+        let mut ending_lbracket = false;
+        while self.consume_token(&Token::Period) {
+            let next_token = self.next_token();
+            match next_token.token {
+                Token::Word(w) => {
+                    let expr = Expr::Identifier(w.to_ident(next_token.span));
+                    chain.push(AccessField::Expr(expr));
+                    if self.consume_token(&Token::LBracket) {
+                        if self.dialect.supports_partiql() {
+                            ending_lbracket = true;
+                            break;
+                        } else {
+                            self.parse_multi_dim_subscript(&mut chain)?
+                        }
+                    }
+                }
+                Token::Mul => {
+                    // Postgres explicitly allows funcnm(tablenm.*) and the
+                    // function array_agg traverses this control flow
+                    if dialect_of!(self is PostgreSqlDialect) {
+                        ending_wildcard = Some(next_token);
+                        break;
+                    } else {
+                        return self.expected("an identifier after '.'", 
next_token);
+                    }
+                }
+                Token::SingleQuotedString(s) => {
+                    let expr = Expr::Identifier(Ident::with_quote('\'', s));
+                    chain.push(AccessField::Expr(expr));
+                }
+                _ => {
+                    return self.expected("an identifier or a '*' after '.'", 
next_token);
+                }
+            }
+        }
+
+        // if dialect supports partiql, we need to go back one Token::LBracket 
for the JsonAccess parsing
+        if self.dialect.supports_partiql() && ending_lbracket {
+            self.prev_token();
+        }
+
+        if let Some(wildcard_token) = ending_wildcard {

Review Comment:
   I'm wondering if it is required that `parse_compound_expr` track the 
wildcard and other cases that follow. I imagined the behavior somewhat similar 
to what the doc comment suggests, in that the method only looks to parse a 
chain if one exists (it looks like the function and outer join parsing is 
duplicated here which doesnt seem ideal).
   
   Would it be feasible to have this function only take in the root and then 
look to parse only the chain? e.g. roughly
   ```rust
   let mut chain = vec![];
   loop {
     match self.peek().token {
       Token::Period => {
         self.next_token(); // `.`
         if let Some(expr) = self.maybe_parse(Self::parse_sub_expr(
             // Guessing a bit offhand that involving precedence here
             // could help avoid the need to flatten afterwards. I _think_ that 
was
             // the intent with `exprs_to_idents`? idea is that in this context,
             // the dot is essentially an operator similar to `[]`
             self.dialect.prec_value(Percedence::Dot)
         ))? {
           chain.push(AccessField::Expr(expr));
         } else {
           self.prev_token(); // `.`
           break
         }
       }
       Token::LBracket => {
         // We can probably let the subscript method handle the `[` and `]`
         if let Some(subscript) = self.maybe_parse(Self::parse_subscript)? {
           chain.push(AccessField::Subscript(subscript));
         } else {
           break
         }
       }
     }
   }
   ```



##########
src/parser/mod.rs:
##########
@@ -1144,53 +1144,52 @@ impl<'a> Parser<'a> {
         w_span: Span,
     ) -> Result<Expr, ParserError> {
         match self.peek_token().token {
-            Token::LParen | Token::Period => {
-                let mut id_parts: Vec<Ident> = vec![w.to_ident(w_span)];
-                let mut ending_wildcard: Option<TokenWithSpan> = None;
-                while self.consume_token(&Token::Period) {
-                    let next_token = self.next_token();
-                    match next_token.token {
-                        Token::Word(w) => 
id_parts.push(w.to_ident(next_token.span)),
-                        Token::Mul => {
-                            // Postgres explicitly allows funcnm(tablenm.*) 
and the
-                            // function array_agg traverses this control flow
-                            if dialect_of!(self is PostgreSqlDialect) {
-                                ending_wildcard = Some(next_token);
-                                break;
-                            } else {
-                                return self.expected("an identifier after 
'.'", next_token);
-                            }
-                        }
-                        Token::SingleQuotedString(s) => 
id_parts.push(Ident::with_quote('\'', s)),
-                        _ => {
-                            return self.expected("an identifier or a '*' after 
'.'", next_token);
-                        }
+            Token::Period => 
self.parse_compound_expr(Expr::Identifier(w.to_ident(w_span)), vec![]),
+            Token::LParen => {
+                let id_parts = vec![w.to_ident(w_span)];
+                // parse_comma_outer_join is used to parse the following 
pattern:

Review Comment:
   is this comment out of date? it seems to suggest a method 
`parse_comma_outer_join` that doesn't exist?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [PR] Introduce `CompoundExpr` to handle the complex field access chain [datafusion-sqlparser-rs]

Reply via email to