llvm-beanz wrote:

> I didn't implement the tokenizer because I found that the extra level of 
> abstraction to be redundant/not beneficial with the StringRef operations.

I disagree. In general, string operations scattered through a parser tend to 
muddy the logic and can lead to inefficiencies.

> Looking at the DXC implementation, the usage of the Tokenizer is either 
> GetAndMatchToken, or, getToken for an identifier with a switch on a small 
> subset of tokens. These are effectively just 
> StringRef::consume_front/StringSwitch with the buffer abstracted into the 
> Tokenizer.

I don't think DXC's implementation is a good reference.

> Since we can just go through the buffer from left to right and construct the 
> RootElements in place, then we will not reference a previous token, and so, 
> defining/lexing an intermediate Token seems redundant.
> 
> What aspects are you referring to that would warrant it?

In general, we should lex tokens once, transform them to enums and associated 
state and move on. Let's take this root signature as an example:

```
DescriptorTable(CBV(b0, space=1))
```

This becomes a token stream something like:
* `DescriptorTable` - keyword
* `(` - lparen
* `CBV` - keyword
* `(` - lparen
* `b0` - register
* `,` - comma
* `space` - keyword
* `=` - equal
* `1` - number
* `)` - rparen
* `)` - rparen

Having a token representation where keywords and grammar tokens are converted 
to enumerations prevents having string or character operations throughout the 
parser. This is in line with Clang's tokenizer design, and seems like something 
we should also match.

Having the tokenizer also be able to pre-parse numbers and register tokens into 
constituent parts ensures that the lexing errors are simple to emit and occur 
where expected.
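
To make the token representation concrete, here is a minimal sketch of the idea. It is not the proposed implementation: it uses `std::string_view` instead of `StringRef` so it stands alone, and `TokenKind`, `Token`, `classifyWord` and `lex` are hypothetical names. Keywords and punctuators become enumerators, and register and number tokens carry their pre-parsed parts.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <cstdint>
#include <string_view>
#include <vector>

// Hypothetical token kinds for the example above; names are illustrative.
enum class TokenKind {
  kw_DescriptorTable, kw_CBV, kw_space,
  l_paren, r_paren, comma, equal,
  reg,     // register such as b0, split into class and number
  number,  // unsigned integer literal
  invalid,
};

struct Token {
  TokenKind Kind = TokenKind::invalid;
  char RegClass = 0;       // 'b', 't', 'u' or 's' for register tokens
  std::uint32_t Value = 0; // payload for number and register tokens
};

// Parses a non-empty run of decimal digits (overflow is not checked here).
static bool parseDigits(std::string_view S, std::uint32_t &Out) {
  if (S.empty())
    return false;
  Out = 0;
  for (char C : S) {
    if (!std::isdigit(static_cast<unsigned char>(C)))
      return false;
    Out = Out * 10 + static_cast<std::uint32_t>(C - '0');
  }
  return true;
}

// Converts one non-empty alphanumeric word to a token. This is the only
// place in the sketch that compares strings.
static Token classifyWord(std::string_view W) {
  if (W == "DescriptorTable") return Token{TokenKind::kw_DescriptorTable};
  if (W == "CBV") return Token{TokenKind::kw_CBV};
  if (W == "space") return Token{TokenKind::kw_space};
  std::uint32_t N = 0;
  if (parseDigits(W, N)) return Token{TokenKind::number, 0, N};
  const char C = W.front();
  if ((C == 'b' || C == 't' || C == 'u' || C == 's') &&
      parseDigits(W.substr(1), N))
    return Token{TokenKind::reg, C, N};
  return Token{TokenKind::invalid};
}

// Lexes the whole buffer once, left to right.
std::vector<Token> lex(std::string_view Src) {
  std::vector<Token> Out;
  std::size_t I = 0;
  while (I < Src.size()) {
    const unsigned char C = static_cast<unsigned char>(Src[I]);
    if (std::isspace(C)) { ++I; continue; }
    TokenKind Punct = TokenKind::invalid;
    switch (C) {
    case '(': Punct = TokenKind::l_paren; break;
    case ')': Punct = TokenKind::r_paren; break;
    case ',': Punct = TokenKind::comma; break;
    case '=': Punct = TokenKind::equal; break;
    default: break;
    }
    if (Punct != TokenKind::invalid || !std::isalnum(C)) {
      Out.push_back(Token{Punct}); // punctuator, or invalid for unknown input
      ++I;
      continue;
    }
    const std::size_t Start = I;
    while (I < Src.size() && std::isalnum(static_cast<unsigned char>(Src[I])))
      ++I;
    Out.push_back(classifyWord(Src.substr(Start, I - Start)));
  }
  return Out;
}

// Usage: a parser inspects Kind and the payload, never the spelling.
void lexExample() {
  const std::vector<Token> Toks = lex("CBV(b0, space=1)");
  assert(Toks[2].Kind == TokenKind::reg && Toks[2].RegClass == 'b');
}
```

With this shape, a malformed register or number is reported as an `invalid` token at the point where it is lexed, before any parsing logic runs.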

The lexing rules for HLSL are pretty simple. I would probably write the Lexer 
in an iterator pattern and just have a token iterator that walks token to token 
with a small copyable state. That would allow lookahead where necessary. You 
don't need to design this the way I would; Clang's model of the "Parser" 
preserving the current lexer state is also reasonable (that is more similar to 
how DXC implements this).
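
As a rough illustration of the iterator pattern, the sketch below keeps the whole lexer state in one `std::string_view` plus the current token, so lookahead is just a copy. Everything here (`TokKind`, `Tok`, `TokenIterator`, `peekNext`, `startsAssignment`) is a hypothetical name, and keyword classification is left out to keep the example short.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string_view>

enum class TokKind {
  identifier, number, l_paren, r_paren, comma, equal, end, invalid
};

struct Tok {
  TokKind Kind = TokKind::end;
  std::string_view Text; // spelling, a view into the source buffer
};

// Hypothetical forward token iterator. Its state is small and trivially
// copyable, so copying it is a cheap way to look ahead.
class TokenIterator {
  std::string_view Rest; // input that has not been lexed yet
  Tok Cur;               // token the iterator currently points at

  static bool isIdentChar(unsigned char C) {
    return std::isalnum(C) || C == '_';
  }

  // Skips whitespace and lexes the next token into Cur.
  void advance() {
    while (!Rest.empty() &&
           std::isspace(static_cast<unsigned char>(Rest.front())))
      Rest.remove_prefix(1);
    if (Rest.empty()) {
      Cur = Tok{TokKind::end, {}};
      return;
    }
    const unsigned char C = static_cast<unsigned char>(Rest.front());
    TokKind Kind = TokKind::invalid;
    std::size_t Len = 1;
    if (C == '(') Kind = TokKind::l_paren;
    else if (C == ')') Kind = TokKind::r_paren;
    else if (C == ',') Kind = TokKind::comma;
    else if (C == '=') Kind = TokKind::equal;
    else if (std::isdigit(C)) {
      Kind = TokKind::number;
      while (Len < Rest.size() &&
             std::isdigit(static_cast<unsigned char>(Rest[Len])))
        ++Len;
    } else if (std::isalpha(C) || C == '_') {
      Kind = TokKind::identifier;
      while (Len < Rest.size() &&
             isIdentChar(static_cast<unsigned char>(Rest[Len])))
        ++Len;
    }
    Cur = Tok{Kind, Rest.substr(0, Len)};
    Rest.remove_prefix(Len);
  }

public:
  explicit TokenIterator(std::string_view Src) : Rest(Src) { advance(); }
  const Tok &operator*() const { return Cur; }
  const Tok *operator->() const { return &Cur; }
  TokenIterator &operator++() { advance(); return *this; }
  bool atEnd() const { return Cur.Kind == TokKind::end; }
};

// Lookahead: the iterator is taken by value, so the caller's copy is untouched.
Tok peekNext(TokenIterator It) {
  ++It;
  return *It;
}

// Usage: check for `name = ...` without consuming anything.
bool startsAssignment(TokenIterator It) {
  return It->Kind == TokKind::identifier &&
         peekNext(It).Kind == TokKind::equal;
}

void lookaheadExample() {
  TokenIterator It("space=1");
  assert(startsAssignment(It));
  assert(It->Text == "space"); // still on the first token
}
```

The same state could instead live inside a parser object, which is closer to the Clang-style alternative mentioned above; the sketch only shows the iterator variant.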

In either case, abstracting string and pointer manipulation is really 
important. If we add new root signature keywords I shouldn't need to add new 
logic for string comparisons. I would look at how Clang's TokenKinds.def 
defines keyword and punctuator tokens, and define our parser's tokens 
similarly.
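
One way this could look is an X-macro list in the spirit of TokenKinds.def, sketched below. To stay self-contained, the list is a macro (`ROOTSIG_TOKENS`, a hypothetical name) rather than a separate `.def` file that gets included several times, and the enumerator prefixes and helper functions are likewise assumptions for illustration. The point is that adding a keyword means adding one line to the list, with no new comparison logic.

```cpp
#include <cassert>
#include <string_view>

// Stand-in for a separate .def file: one line per keyword or punctuator.
#define ROOTSIG_TOKENS(KEYWORD, PUNCTUATOR)                                    \
  KEYWORD(DescriptorTable)                                                     \
  KEYWORD(CBV)                                                                 \
  KEYWORD(space)                                                               \
  PUNCTUATOR(l_paren, '(')                                                     \
  PUNCTUATOR(r_paren, ')')                                                     \
  PUNCTUATOR(comma, ',')                                                       \
  PUNCTUATOR(equal, '=')

// Expansion 1: the enumerators.
enum class TokenKind {
#define KW(NAME) kw_##NAME,
#define PUNCT(NAME, CH) pu_##NAME,
  ROOTSIG_TOKENS(KW, PUNCT)
#undef KW
#undef PUNCT
  identifier,
  invalid,
};

// Expansion 2: keyword lookup. Words that are not keywords are identifiers.
TokenKind classifyKeyword(std::string_view Word) {
#define KW(NAME)                                                               \
  if (Word == #NAME)                                                           \
    return TokenKind::kw_##NAME;
#define PUNCT(NAME, CH)
  ROOTSIG_TOKENS(KW, PUNCT)
#undef KW
#undef PUNCT
  return TokenKind::identifier;
}

// Expansion 3: punctuator lookup from a single character.
TokenKind classifyPunctuator(char C) {
  switch (C) {
#define KW(NAME)
#define PUNCT(NAME, CH)                                                        \
  case CH:                                                                     \
    return TokenKind::pu_##NAME;
    ROOTSIG_TOKENS(KW, PUNCT)
#undef KW
#undef PUNCT
  default:
    return TokenKind::invalid;
  }
}

// Usage: both lookups are generated from the same list.
void tokenKindsExample() {
  assert(classifyKeyword("space") == TokenKind::kw_space);
  assert(classifyPunctuator('=') == TokenKind::pu_equal);
}
```

A linear chain of comparisons is used here only for brevity; the generated lookup could equally be a `StringSwitch` or a table.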

https://github.com/llvm/llvm-project/pull/121799
_______________________________________________
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
