On Mon, Jun 18, 2018 at 9:16 PM, Bart <b...@freeuk.com> wrote:
> On 18/06/2018 11:45, Chris Angelico wrote:
>> On Mon, Jun 18, 2018 at 8:33 PM, Bart <b...@freeuk.com> wrote:
>>> You're right in that neither task is that trivial.
>>>
>>> I can remove comments by writing a tokeniser which scans Python source
>>> and re-outputs tokens one at a time. Such a tokeniser normally ignores
>>> comments.
>>>
>>> But to remove type hints, a deeper understanding of the input is
>>> needed. I would need a parser rather than a tokeniser. So it is harder.
>>
>> They would actually both end up the same. To properly recognize
>> comments, you need to understand enough syntax to recognize them. To
>> properly recognize type hints, you need to understand enough syntax to
>> recognize them. And in both cases, you need to NOT discard important
>> information like consecutive whitespace.
>
> No. If syntax is defined on top of tokens, then at the token level, you
> don't need to know any syntax. The process that scans characters looking
> for the next token will usually discard comments. Job done.
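For anyone following along, the stdlib tokenize module illustrates both positions at once. A minimal sketch (the source string is made up): comments arrive as ordinary COMMENT tokens, so dropping them needs no syntax knowledge at all — but notice that untokenize() can only approximate the original layout, which is the whitespace point below:

```python
import io
import tokenize

source = "x = 1  # set x\ny = x + 2\n"

# generate_tokens() emits comments as COMMENT tokens; filtering them
# out needs no knowledge of Python syntax, just of the token stream.
tokens = [tok for tok in tokenize.generate_tokens(io.StringIO(source).readline)
          if tok.type != tokenize.COMMENT]

# untokenize() reconstructs source code, but only approximately: the
# columns the comment occupied do not come back as the original text.
stripped = tokenize.untokenize(tokens)
print(stripped)
```

The result compiles to the same bytecode as the original, but it is a functionally-equivalent reconstruction, not a byte-for-byte copy.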
And it will also usually discard formatting (consecutive whitespace, etc.). So unless you're okay with reconstructing functionally-equivalent code, rather than actually preserving the original code, you cannot merely tokenize. You have to use a special form of tokenization that actually keeps all of that.

> It is very different for type hints, as you will need to properly parse
> the source code.
>
> As a simpler example, if the task was to eliminate the "+" symbol, that
> would be one kind of token; it would just be skipped when encountered.
> But if the requirement is to eliminate only unary "+" and leave binary
> "+", then that can't be done at tokeniser level; it will not know the
> context.

Right. You can fairly easily reconstruct code that uses a single newline for any NEWLINE token, and a single space in any location where whitespace makes sense. It's not so easy to correctly reconstruct "x*y + a*b" with the original spacing.

> What will those look like? If copyright/licence comments have their own
> specific syntax, then they just become another token which has to be
> recognised.

If they have specific syntax, they're not comments, are they?

> The main complication I can see is that, if this is really a one-time
> source-to-source translator, so that you will be working with the
> result, then usually you will want to keep the comments.
>
> Then it is a question of more precisely defining the task that such a
> translator is to perform.

Right, exactly. So you need to do an actual smart parse, which - as mentioned - is functionally equivalent whether you're stripping comments or some lexical token.

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list
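Bart's unary-"+" example is easy to demonstrate with the stdlib ast module (a sketch with made-up variable names): the tokeniser emits the identical OP token "+" for both statements, and only the parse tree assigns them different roles:

```python
import ast

src = "a = +b\nc = d + e\n"
tree = ast.parse(src)

# At the token level both "+" signs are the same OP token; only the
# parser distinguishes a unary UAdd from a binary Add.
unary = [n for n in ast.walk(tree)
         if isinstance(n, ast.UnaryOp) and isinstance(n.op, ast.UAdd)]
binary = [n for n in ast.walk(tree)
          if isinstance(n, ast.BinOp) and isinstance(n.op, ast.Add)]
print(len(unary), len(binary))  # → 1 1
```

The same parse-level view is what a type-hint stripper needs — and note that round-tripping through the parser (e.g. via ast.unparse()) throws away comments and spacing entirely, which is the whitespace objection above in action.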