[issue24426] re.split performance degraded significantly by capturing group

2015-06-21 Thread Serhiy Storchaka
Changes by Serhiy Storchaka: resolution: -> fixed; stage: patch review -> resolved; status: open -> closed

[issue24426] re.split performance degraded significantly by capturing group

2015-06-21 Thread Roundup Robot
Roundup Robot added the comment: New changeset 7e46a503dd16 by Serhiy Storchaka in branch 'default': Issue #24426: Fast searching optimization in regular expressions now works https://hg.python.org/cpython/rev/7e46a503dd16 nosy: +python-dev

[issue24426] re.split performance degraded significantly by capturing group

2015-06-13 Thread Patrick Maupin
Patrick Maupin added the comment: OK, thanks.

[issue24426] re.split performance degraded significantly by capturing group

2015-06-13 Thread Serhiy Storchaka
Serhiy Storchaka added the comment: This is a reason to file a feature request to regex. In 3.3 re was slower than regex in some cases: $ ./python -m timeit -s "import re; p = re.compile('\n\r'); s = ('a'*100 + '\n\r')*1000" -- "p.split(s)" Python 3.3 re : 1000 loops, best of 3: 952 usec per
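The command-line measurement quoted above can also be run from a script. The following is a minimal sketch using the standard timeit module with the same pattern and subject string; the names pattern, subject and run are illustrative, and the loop counts are arbitrary choices rather than anything taken from the thread. Timings depend on the interpreter version and machine, so no expected figure is shown.

```python
import re
import timeit

# Same pattern and subject as in the quoted command line.
pattern = re.compile("\n\r")
subject = ("a" * 100 + "\n\r") * 1000


def run():
    """Split the subject once; this is the statement being timed."""
    return pattern.split(subject)


# Best of 3 repeats, 100 loops each, reported per loop like `python -m timeit`.
loops = 100
best = min(timeit.repeat(run, number=loops, repeat=3)) / loops
print(f"{loops} loops, best of 3: {best * 1e6:.0f} usec per loop")
```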

[issue24426] re.split performance degraded significantly by capturing group

2015-06-13 Thread Patrick Maupin
Patrick Maupin added the comment: > (stuff about cPython) No, I was curious about whether somebody maintained pure-Python fixes (e.g. to the re parser and compiler). Those could be in a regular package that fixed some corner cases such as the capture group you just applied a patch for. > ...

[issue24426] re.split performance degraded significantly by capturing group

2015-06-13 Thread Serhiy Storchaka
Serhiy Storchaka added the comment: > 1) Do you know if anybody maintains a patched version of the Python code > anywhere? I could put a package up on github/PyPI, if not. Sorry, perhaps I misunderstood you. There are unofficial mirrors of CPython on Bitbucket [1] and GitHub [2]. They don't c

[issue24426] re.split performance degraded significantly by capturing group

2015-06-11 Thread Patrick Maupin
Patrick Maupin added the comment: Thank you for the quick response, Serhiy. I had started investigating and come to the conclusion that it was a problem with the compiler rather than the C engine. Interestingly, my next step was going to be to use names for the compiler constants, and then I

[issue24426] re.split performance degraded significantly by capturing group

2015-06-11 Thread Serhiy Storchaka
Serhiy Storchaka added the comment: Here is a patch that adds more optimizations for searching patterns that start with a literal string and groups. In particular, it includes the case where a pattern starts with a group containing a single character. Examples: $ ./python -m timeit -s "import re;

[issue24426] re.split performance degraded significantly by capturing group

2015-06-10 Thread Patrick Maupin
Patrick Maupin added the comment: Just to be perfectly clear, this is no exaggeration: My original file was slightly over 5GB. I have approximately 1050 bad strings in it, averaging around 11 characters per string. If I split it without capturing those 1050 strings, it takes 3.7 seconds. If

[issue24426] re.split performance degraded significantly by capturing group

2015-06-10 Thread Patrick Maupin
Patrick Maupin added the comment: 1) I have obviously oversimplified my test case, to the point where a developer thinks I'm silly enough to reach for the regex module just to split on a linefeed. 2) '\n(?<=(\n))' -- yes, of course, any casual user of the re module would immediately choose th

[issue24426] re.split performance degraded significantly by capturing group

2015-06-10 Thread Serhiy Storchaka
Serhiy Storchaka added the comment: Splitting with pattern '\n(?<=(\n))' produces the same result as with pattern '(\n)' and is as fast as with pattern '\n'.
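A small sketch of the equivalence described in this comment, using a made-up sample string: the lookbehind form matches the bare '\n' first and then captures that same character inside the lookbehind, so re.split still returns the separator between the pieces. The variable names are illustrative only.

```python
import re

text = "alpha\nbeta\n\ngamma"

# Capturing group at the start of the pattern.
with_group = re.split(r"(\n)", text)

# Literal first, then a lookbehind that captures the character just matched.
with_lookbehind = re.split(r"\n(?<=(\n))", text)

print(with_group)
print(with_lookbehind)
print(with_group == with_lookbehind)
```

Both calls return the text pieces interleaved with the captured newlines, which is why the workaround keeps the output format while letting the pattern start with a literal.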

[issue24426] re.split performance degraded significantly by capturing group

2015-06-10 Thread Serhiy Storchaka
Serhiy Storchaka added the comment: Regular expressions are optimized for the case where the pattern starts with a constant string or charset. Using '(\n)' is not a degradation; rather, '\n' benefits from an optimization.

[issue24426] re.split performance degraded significantly by capturing group

2015-06-10 Thread Ezio Melotti
Changes by Ezio Melotti: nosy: +serhiy.storchaka

[issue24426] re.split performance degraded significantly by capturing group

2015-06-10 Thread Patrick Maupin
New submission from Patrick Maupin: The addition of a capturing group in a re.split() pattern, e.g. using '(\n)' instead of '\n', causes a factor of 10 performance degradation. I use re.split() a lot, but never noticed the issue before. It was extremely noticeable on 1000 patterns in a 5GB fi
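The comparison in this report can be reproduced roughly as follows. This is a hedged sketch, not the reporter's original test: the sample data, sizes and variable names are assumptions, and on interpreters that include the fix from this issue the two timings may be close to each other.

```python
import re
import timeit

# Hypothetical sample data: 2000 lines of 80 characters each.
data = ("x" * 80 + "\n") * 2000

without_group = re.compile(r"\n")
with_group = re.compile(r"(\n)")

# The capturing group only adds the separators between the pieces.
parts = without_group.split(data)
parts_with_sep = with_group.split(data)
print(len(parts), len(parts_with_sep))

# Time each pattern: best of 3 repeats, 50 splits per repeat.
timings = {}
for name, pat in (("'\\n'", without_group), ("'(\\n)'", with_group)):
    timings[name] = min(timeit.repeat(lambda: pat.split(data), number=50, repeat=3))
    print(f"{name}: {timings[name]:.4f} s for 50 splits")
```

Dropping every second element of the capturing result gives back the plain split, so any timing gap comes from how the engine searches for the match rather than from extra output.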