Re: Allowing Unicode Whitespace in Lexer

2024-03-27 Thread Mich Talebzadeh
Looks fine, except that processing all Unicode whitespace characters might add overhead to parsing and could affect performance, although I think this is a moot point. +1
Mich Talebzadeh, Technologist | Solutions Architect | Data Engineer | Generative AI, London, United Kingdom

Re: Allowing Unicode Whitespace in Lexer

2024-03-27 Thread Gengliang Wang
+1, this is a reasonable change.
Gengliang
On Wed, Mar 27, 2024 at 9:54 AM serge rielau.com wrote:
> Going once, going twice, …. last call for objections
> On Mar 23, 2024 at 5:29 PM -0700, serge rielau.com wrote:
> > Hello,
> > I have a PR https://github.com/apache/spark/pull/45620 ready

Re: Allowing Unicode Whitespace in Lexer

2024-03-27 Thread serge rielau . com
Going once, going twice, …. last call for objections
On Mar 23, 2024 at 5:29 PM -0700, serge rielau.com wrote:
Hello, I have a PR https://github.com/apache/spark/pull/45620 ready to go that will extend the definition of whitespace (what separates tokens) from the small set of ASCII characters

Re: Allowing Unicode Whitespace in Lexer

2024-03-27 Thread serge rielau . com
Yeah, I heard about that. This IMHO is a bit more worrying, and we do not have the "excuse" that it is transparent. Also, which of these would be STRING and which IDENTIFIER?
On Mar 25, 2024 at 1:06 PM -0700, Alex Cruise wrote:
While we're at it, maybe consider allowing "smart quotes" too :) -0

Re: Allowing Unicode Whitespace in Lexer

2024-03-25 Thread Alex Cruise
While we're at it, maybe consider allowing "smart quotes" too :)
-0xe1a
On Sat, Mar 23, 2024 at 5:29 PM serge rielau.com wrote:
> Hello,
>
> I have a PR https://github.com/apache/spark/pull/45620 ready to go that will extend the definition of whitespace (what separates tokens) from the smal

Allowing Unicode Whitespace in Lexer

2024-03-23 Thread serge rielau . com
Hello, I have a PR https://github.com/apache/spark/pull/45620 ready to go that will extend the definition of whitespace (what separates tokens) from the small set of ASCII characters (space, tab, linefeed) to those defined in Unicode. While this is a small and safe change, it is one where we would
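
For readers of the archive, here is a minimal Python sketch of the kind of change being proposed. It is not taken from the PR: the `tokenize` helper and both character sets are illustrative assumptions. `UNICODE_WS` lists the code points that carry the Unicode `White_Space` property, and the set the PR actually accepts may differ. The sketch only shows how widening the separator set changes where token boundaries fall.

```python
# Whitespace named in the original proposal: space, tab, linefeed.
ASCII_WS = frozenset(" \t\n")

# Assumption: code points with the Unicode White_Space property.
# The exact set used by the PR is not shown in this thread.
UNICODE_WS = frozenset(
    "\u0009\u000A\u000B\u000C\u000D\u0020\u0085\u00A0\u1680"
    "\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200A"
    "\u2028\u2029\u202F\u205F\u3000"
)


def tokenize(text, whitespace):
    """Split text into tokens, treating any character in `whitespace` as a separator."""
    tokens, current = [], []
    for ch in text:
        if ch in whitespace:          # constant-time set lookup per character
            if current:
                tokens.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        tokens.append("".join(current))
    return tokens


# A query pasted from a rich-text source: no-break space, em space, ideographic space.
query = "SELECT\u00A0col\u2003FROM\u3000t"

print(tokenize(query, ASCII_WS))    # the whole string stays a single token
print(tokenize(query, UNICODE_WS))  # ['SELECT', 'col', 'FROM', 't']
```

Because membership is checked against a precomputed set, the wider definition costs the same single lookup per character as the ASCII one in this sketch. That is in line with the view elsewhere in the thread that the performance concern is probably moot, though a real lexer may behave differently.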