[Lldb-commits] [lldb] [LLDB] Add Lexer (with tests) for DIL (Data Inspection Language). (PR #123521)

Pavel Labath via lldb-commits Mon, 03 Feb 2025 03:39:40 -0800

labath wrote:

> To the best of my knowledge, all the languages that we want to support have 
> roughly the same definition of what a valid identifier is: a letter or 
> underscore, followed by a sequence of letters, digits and underscores, where 
> 'letters' are defined as 'a..z' and 'A..Z'. The ones I've been able to check 
> do not allow arbitrary characters in their identifiers. So that's what I 
> implemented (acknowledging that I currently only recognize ascii at the 
> moment, and fully plan to add utf8 support in the future). I added the 
> ability to recognize the '$' at the front specifically to allow DIL users to 
> ask about registers and LLDB convenience variables, which (to the best of my 
> knowledge) allow '$' only in the first position, and not all by itself.


I don't know how you were checking that, but I'm certain that's not the case. I 
[already gave you](https://godbolt.org/z/o7qbfeWve) an example of C code which 
contradicts all of these (note that there can be a difference between what's 
considered a valid identifier by the specification of a language, and what an 
actual compiler for that language will accept). And I'm not even mentioning all 
of the names that can be constructed by synthetic child providers.

You say you want to add utf-8 support. How do you intend to do that? Do you 
want to enumerate all of the supported characters in each language? Check which 
language supports variable names in Klingon? Some of the rules can be really 
obscure. For example, Java accepts £ (`\xA3`) as a variable name, but not © 
(`\xa9`). I'm sure they had some reason to choose that, but I'd rather not have 
to find that out.

OTOH, if you just accept every byte with the high bit set (i.e. everything 
outside 7-bit ascii) as a valid identifier character, then you can support 
utf8 with a single line of code. And you're not regressing 
anything because that's exactly what the current implementation does.
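
To make the "single line of code" point concrete, here is a minimal sketch of such a classifier. The helper names (`IsIdentifierStart`, `IsIdentifierBody`, `LexIdentifier`) are made up for illustration and are not the functions from the PR. The exact set of accepted ASCII punctuation (`$`, `@`) is an assumption based on the discussion in this thread.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper (not from the PR): may `c` start an identifier?
static bool IsIdentifierStart(unsigned char c) {
  return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || c == '_' ||
         c == '$' || c == '@' || // assumed extras: registers, status quo
         c >= 0x80;              // the "single line": any non-ASCII byte,
                                 // i.e. any byte of a multi-byte UTF-8 sequence
}

// Hypothetical helper: same as above, plus digits after the first character.
static bool IsIdentifierBody(unsigned char c) {
  return IsIdentifierStart(c) || (c >= '0' && c <= '9');
}

// Returns the length in bytes of the identifier at the start of `text`,
// or 0 if `text` does not begin with an identifier.
static std::size_t LexIdentifier(std::string_view text) {
  if (text.empty() || !IsIdentifierStart(static_cast<unsigned char>(text[0])))
    return 0;
  std::size_t len = 1;
  while (len < text.size() &&
         IsIdentifierBody(static_cast<unsigned char>(text[len])))
    ++len;
  return len;
}
```

With this sketch, `LexIdentifier("foo$ + 1")` yields 4 and `LexIdentifier("123foo")` yields 0. The lexer never decodes UTF-8; it only treats high-bit bytes as opaque identifier characters, so it does not validate that the bytes form well-formed sequences.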

I don't think this list has to be set in stone. For example, `frame variable` 
currently accepts `@` as a variable name. I believe you don't have any plans 
for that operator, so I'd just stick to that. If we can come up with some fancy 
use for it (maybe as an escape character?), then I'm certainly open to changing 
its classification.

> I am not sure I see the benefits of expanding what DIL recognizes as a valid 
> identifier beyond what the languages LLDB supports recognize?

For me the main benefits are:
- simplicity of the implementation
- being able to express a wide range of variable names, even for languages we 
don't support right now
- matching status quo

That said, I would like to hear what you think are the benefits of *not* 
recognizing a wider set of identifier values. And I'm not talking about names 
like `123foo` (it sounds like there's consensus to ban those). I'm thinking 
more of names like `$`, `foo$`, `💩`, etc. 

https://github.com/llvm/llvm-project/pull/123521
_______________________________________________
lldb-commits mailing list
lldb-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/lldb-commits
