jimingham wrote:

> > To the best of my knowledge, all the languages that we want to support have 
> > roughly the same definition of what a valid identifier is: a letter or 
> > underscore, followed by a sequence of letters, digits and underscores, 
> > where 'letters' are defined as 'a..z' and 'A..Z'. The ones I've been able 
> > to check do not allow arbitrary characters in their identifiers. So that's 
> > what I implemented (acknowledging that I currently only recognize ascii at 
> > the moment, and fully plan to add utf8 support in the future). I added the 
> > ability to recognize the '$' at the front specifically to allow DIL users 
> > to ask about registers and LLDB convenience variables, which (to the best 
> > of my knowledge) allow '$' only in the first position, and not all by 
> > itself.
> 
> I don't know how you were checking that, but I'm certain that's not the case. 
> I [already gave you](https://godbolt.org/z/o7qbfeWve) an example of C code 
> which contradicts all of these (note that there can be a difference between 
> what's considered a valid identifier by the specification of a language, and 
> what an actual compiler for that language will accept). And I'm not even 
> mentioning all of the names that can be constructed by synthetic child 
> providers.
> 
> You say you want to add utf-8 support. How do you intend to do that? Do you 
> want to enumerate all of the supported characters in each language? Check 
> which language supports variable names in Klingon? Some of the rules can be 
> really obscure. For example, Java accepts £ (`\xA3`) as a variable name, but 
> not © (`\xa9`). I'm sure they had some reason to choose that, but I'd rather 
> not have to find that out.
> 
> OTOH, if you just accept all of the high-bit ascii values as valid 
> characters, then you can support utf8 with a single line of code. And you're 
> not regressing anything because that's exactly what the current 
> implementation does.
> 
> I don't think this list has to be set in stone. For example, `frame variable` 
> currently accepts `@` as a variable name. I believe you don't have any plans 
> for that operator, so I'd just stick to that. If we can come up with some 
> fancy use for it (maybe as an escape character?), then I'm certainly open to 
> changing its classification.
> 
> > I am not sure I see the benefits of expanding what DIL recognizes as a 
> > valid identifier beyond what the languages LLDB supports recognize?
> 
> For me the main benefits are:
> 
> * simplicity of the implementation
> * being able to express a wide range of variable names, even for languages we 
> don't support right now
> * matching status quo
> 
> That said, I would like to hear what you think are the benefits of _not_ 
> recognizing a wider set of identifier values. And I'm not talking about names 
> like `123foo` (it sounds like there's consensus to ban those). I'm thinking 
> more of names like `$`, `foo$`, `💩`, etc.

Our desire to parse the widest possible range of identifiers conflicts with one 
small area: the "persistent variable" namespace. For the C family of languages, 
we reserve the initial `$` for this purpose; for Swift (where `$1` etc. were 
already taken) we use `$R` as the namespace identifier. This doesn't matter for 
the DIL at present, because `frame var` and the like don't have access to 
persistent variables, but I don't think that restriction is desirable in the 
long run. To get this right in general, though, it looks like we would have to 
figure out the language of the identifier, which we don't want to do. I'm not 
sure how to finesse this (other than to keep not allowing access to persistent 
variables in the DIL...)
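
To make the conflict concrete, here is a minimal sketch. It is not LLDB code: 
`SketchLanguage`, `IsIdentifierByte`, `IsIdentifier` and 
`IsPersistentVariableName` are hypothetical names made up for illustration, and 
treating any name that starts with the prefix as persistent is an assumption. 
The first half is a permissive lexer rule along the lines proposed above (ASCII 
letters, digits, `_`, `$`, plus any high-bit byte); the second half shows why 
the persistent-variable check needs a language.

```cpp
#include <string_view>

// Hypothetical stand-in for a language tag; not LLDB's real language enum.
enum class SketchLanguage { C, Swift };

// Permissive rule: ASCII letters, digits, '_', '$', and every byte with the
// high bit set, so UTF-8 sequences pass without per-language tables.
bool IsIdentifierByte(unsigned char c) {
  return c >= 0x80 || c == '_' || c == '$' ||
         (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
         (c >= '0' && c <= '9');
}

// A name is accepted if it is non-empty, does not start with a digit, and
// consists only of identifier bytes.
bool IsIdentifier(std::string_view name) {
  if (name.empty() || (name[0] >= '0' && name[0] <= '9'))
    return false;
  for (unsigned char c : name)
    if (!IsIdentifierByte(c))
      return false;
  return true;
}

// Assumption: a name is in the persistent-variable namespace if it starts
// with the language's reserved prefix ("$" for the C family, "$R" for Swift).
bool IsPersistentVariableName(std::string_view name, SketchLanguage lang) {
  std::string_view prefix = (lang == SketchLanguage::Swift) ? "$R" : "$";
  return name.substr(0, prefix.size()) == prefix;
}
```

With these definitions, `$1` is accepted by the lexer either way, but it only 
counts as a persistent variable under `SketchLanguage::C`; under 
`SketchLanguage::Swift` it is an ordinary name. The lexer alone cannot make that 
call without knowing the language.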

https://github.com/llvm/llvm-project/pull/123521
_______________________________________________
lldb-commits mailing list
lldb-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/lldb-commits
