Re: [PR] Fix tokenization of qualified identifiers with numeric prefix. [datafusion-sqlparser-rs]

via GitHub Thu, 10 Apr 2025 07:27:19 -0700


romanb commented on code in PR #1803:
URL: 
https://github.com/apache/datafusion-sqlparser-rs/pull/1803#discussion_r2037530085



##########
src/tokenizer.rs:
##########
@@ -895,7 +895,7 @@ impl<'a> Tokenizer<'a> {
         };
 
         let mut location = state.location();
-        while let Some(token) = self.next_token(&mut state)? {
+        while let Some(token) = self.next_token(&mut state, buf.last().map(|t| 
&t.token))? {

Review Comment:
   Thank you for the review and the suggestion. The latest commit implements an 
alternative approach within `Parser#parse_compound_expr`. However, I have to 
admit that I personally prefer the previous fix in the tokenizer, for a few 
reasons:
   
     1. The `Tokenizer` is part of the public API of the crate, and without 
fixing this in the tokenizer, it will continue to exhibit the following 
behavior with the Mysql dialect:
   
        * `t.1to2` tokenizes to  `t` (Word), `.1to2` (Word)
        * `t.1e2` tokenizes to `t` (Word), `.1e2` (Number)
        * ``t.`1to2` `` tokenizes to `t` (Word), `.` (Period), `1to2` (Word)
        * ``t.`1e2` `` tokenizes to `t` (Word), `.` (Period) `1e2` (Word)
   
           This could be very surprising for users and arguably be considered 
incorrect (when using the Mysql dialect).
   
     2. The handling of `Word` and `Number` tokens in `parse_compound_expr` is 
not the most elegant with having to split off the `.` and having to create the 
correct off-by-one spans accordingly.
   
     3. Unqualified identifiers that start with digits are already handled in 
the tokenizer - and I think have to be (see 
https://github.com/apache/datafusion-sqlparser-rs/pull/856). Handling the same 
problem of misinterpreted tokens for qualified identifiers in the parser seems 
a bit disconnected and a like a downstream solution to the tokenizer producing 
incorrect tokens, at least that is how I currently perceive it (see point 1).
   
   Let me know which solution you prefer (with or without the latest commit) or 
if you have another, better idea.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] Fix tokenization of qualified identifiers with numeric prefix. [datafusion-sqlparser-rs]

Reply via email to