New submission from Ned Batchelder: When tokenizing with tokenize.generate_tokens, if the code ends with whitespace (no newline), the tokenizer produces an ERRORTOKEN for each space. Worse, each failed match of the token regex scans the remaining run of spaces, so the cost per ERRORTOKEN is linear in the number of spaces and the overall performance is O(n**2) in the length of the trailing whitespace.
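A quick timing sketch makes the quadratic growth visible: doubling the number of trailing spaces should roughly quadruple the time. (The sizes here are arbitrary, and the try/except only guards against a possible TokenError at end-of-input.)

{{{
import time
import tokenize
try:
    from cStringIO import StringIO   # Python 2
except ImportError:
    from io import StringIO          # Python 3

for n in (1000, 2000, 4000):
    code = "@" + " " * n             # code ending in n spaces, no newline
    start = time.time()
    try:
        for _ in tokenize.generate_tokens(StringIO(code).readline):
            pass
    except tokenize.TokenError:
        pass                         # tokenize may complain at end-of-input
    # Rough timing demo; exact numbers will vary from machine to machine.
    print("%6d spaces: %.2fs" % (n, time.time() - start))
}}}

If the cost really is O(n**2), each reported time should be roughly four times the previous one.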
I found this while tokenizing code samples uploaded to a public web site. One sample, for some reason, ended with 40,000 spaces, and was taking two hours to tokenize. Demonstration:

{{{
import token
import tokenize
try:
    from cStringIO import StringIO   # Python 2
except ImportError:
    from io import StringIO          # Python 3

# Code that ends with a long run of spaces and no newline.
code = "@" + " " * 10000
code_reader = StringIO(code).readline
for num, (ttyp, ttok, _, _, _) in enumerate(tokenize.generate_tokens(code_reader)):
    print("%5d %15s %r" % (num, token.tok_name[ttyp], ttok))
}}}

----------
components: Library (Lib)
messages: 172244
nosy: nedbat
priority: normal
severity: normal
stage: patch review
status: open
title: Trailing whitespace makes tokenize.generate_tokens pathological
versions: Python 2.7, Python 3.3

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue16152>
_______________________________________