New submission from Ned Batchelder:

When tokenizing with tokenize.generate_tokens, if the code ends with whitespace 
(no trailing newline), the tokenizer produces an ERRORTOKEN for each space.  
Worse, the failed regex match at each position scans linearly through the 
remaining spaces, so the overall performance is O(n**2) in the number of 
trailing spaces.

I found this while tokenizing code samples uploaded to a public web site.  One 
sample for some reason ended with 40,000 spaces, which was taking two hours to 
tokenize.

Demonstration:

{{{
import token
import tokenize

try:
    from cStringIO import StringIO
except ImportError:
    from io import StringIO

code = "@"+" "*10000
code_reader = StringIO(code).readline

for num, (ttyp, ttok, _, _, _) in enumerate(tokenize.generate_tokens(code_reader)):
    print("%5d %15s %r" % (num, token.tok_name[ttyp], ttok))
}}}
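To make the quadratic behavior visible directly, here is a minimal timing sketch (Python 3 only, using time.perf_counter; the helper name tokenize_trailing_spaces is just for illustration). On an affected version, doubling the number of trailing spaces should roughly quadruple the elapsed time:

{{{
import time
import tokenize
from io import StringIO

def tokenize_trailing_spaces(n):
    """Tokenize '@' followed by n trailing spaces; return elapsed seconds."""
    code = "@" + " " * n
    start = time.perf_counter()
    # Consume the whole token stream; on affected versions each
    # trailing space yields its own ERRORTOKEN.
    list(tokenize.generate_tokens(StringIO(code).readline))
    return time.perf_counter() - start

for n in (2000, 4000, 8000):
    print("%6d spaces: %.3f s" % (n, tokenize_trailing_spaces(n)))
}}}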

----------
components: Library (Lib)
messages: 172244
nosy: nedbat
priority: normal
severity: normal
stage: patch review
status: open
title: Trailing whitespace makes tokenize.generate_tokens pathological
versions: Python 2.7, Python 3.3

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue16152>
_______________________________________