Bart <b...@freeuk.com> writes:

> On 18/06/2018 11:45, Chris Angelico wrote:
>> On Mon, Jun 18, 2018 at 8:33 PM, Bart <b...@freeuk.com> wrote:
>
>>> You're right in that neither task is that trivial.
>>>
>>> I can remove comments by writing a tokeniser which scans Python source and
>>> re-outputs tokens one at a time. Such a tokeniser normally ignores comments.
>>>
>>> But to remove type hints, a deeper understanding of the input is needed. I
>>> would need a parser rather than a tokeniser. So it is harder.
>>
>> They would actually both end up the same. To properly recognize
>> comments, you need to understand enough syntax to recognize them. To
>> properly recognize type hints, you need to understand enough syntax to
>> recognize them. And in both cases, you need to NOT discard important
>> information like consecutive whitespace.
>
> No. If syntax is defined on top of tokens, then at the token level,
> you don't need to know any syntax. The process that scans characters
> looking for the next token, will usually discard comments. Job done.
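
For what it's worth, the token-level approach can be sketched with Python's
own tokenize module from the standard library. This is only a rough
illustration, not anyone's actual tool: the function name strip_comments is
made up for the example, and it cuts comments out of the original lines
rather than re-emitting tokens, so that whitespace is left alone. It makes no
attempt to keep a #! line or an encoding declaration.

```python
import io
import tokenize


def strip_comments(source: str) -> str:
    """Return source with '#' comments removed and all other text untouched.

    Hypothetical helper for illustration; assumes the input tokenises cleanly.
    """
    lines = source.splitlines(keepends=True)

    # Collect the (row, col) start of every COMMENT token; rows are 1-based.
    comments = [
        tok.start
        for tok in tokenize.generate_tokens(io.StringIO(source).readline)
        if tok.type == tokenize.COMMENT
    ]

    # A comment always runs to the end of its line, so cut from its start
    # column, trim the whitespace that preceded it, and restore the newline.
    for row, col in comments:
        line = lines[row - 1]
        ending = line[len(line.rstrip("\r\n")):]
        lines[row - 1] = line[:col].rstrip() + ending

    return "".join(lines)
```

Because the tokeniser already knows where strings start and end, a '#' inside
a string literal is never reported as a comment, which is the only piece of
"syntax" the job really needs.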

You don't even need to scan for tokens other than strings. From what I
read in the documentation, a simple scanner like this would do the trick:

%option noyywrap
%x sqstr dqstr sqtstr dqtstr
%%

\'                  ECHO; BEGIN(sqstr);
\"                  ECHO; BEGIN(dqstr);
\'\'\'              ECHO; BEGIN(sqtstr);
\"\"\"              ECHO; BEGIN(dqtstr);

<dqstr>\"           |
<sqstr>\'           |
<sqtstr>\'\'\'      |
<dqtstr>\"\"\"      ECHO; BEGIN(INITIAL);

<sqstr>\\\'         |
<dqstr>\\\"         |
<sqstr,dqstr,sqtstr,dqtstr,INITIAL>.    ECHO;

#.*

%%
int main(void) { yylex(); }

and it's only this long because there are four kinds of string. I'm not
a Python expert, so there may be some corner-case errors. And really
there are comments that should not be removed, such as #! on line 1 and
encoding declarations, but they would just need another line or two.

-- 
Ben.