[issue11454] email.message import time

2013-06-26 Thread R. David Murray
R. David Murray added the comment: I've checked in the encode version of the method. I'm going to pass on doing the other inlines, given that the improvement isn't that large. I will, however, keep the issue in mind as I make other changes to the code, and there will be a general performance

[issue11454] email.message import time

2013-06-26 Thread Roundup Robot
Roundup Robot added the comment: New changeset 520490c4c388 by R David Murray in branch 'default': #11454: Reduce email module load time, improve surrogate check efficiency. http://hg.python.org/cpython/rev/520490c4c388
nosy: +python-dev

[issue11454] email.message import time

2013-03-14 Thread Ezio Melotti
Changes by Ezio Melotti: stage: -> patch review; versions: +Python 3.4 -Python 3.3

[issue11454] email.message import time

2012-09-23 Thread R. David Murray
R. David Murray added the comment: Well, "other" surrogates will cause a different error later than with the current _has_surrogates logic, but it won't be any more mysterious than what would happen now, I think. Normally, if I understand correctly, other surrogates should never occur, so I d

[issue11454] email.message import time

2012-09-23 Thread Ezio Melotti
Ezio Melotti added the comment:
> They are precompiled because for a program processing lots of email,
> they are hot spots.
OK, I didn't know they were hot spots. Note that the regexes are not recompiled every time: they are compiled the first time and then taken from the cache (assuming they d

[issue11454] email.message import time

2012-09-23 Thread R. David Murray
R. David Murray added the comment: Oh, yeah, and the encode benchmark is very instructive, thanks Serhiy :)

[issue11454] email.message import time

2012-09-23 Thread R. David Murray
R. David Murray added the comment: Whoops. Can you explain your changes to the ecre regex (keeping in mind that I don't know much about regex syntax)?

[issue11454] email.message import time

2012-09-23 Thread R. David Murray
R. David Murray added the comment: I'm really not willing to inline any of those pre-compiled regular expressions. They are precompiled because for a program processing lots of email, they are hot spots. We could use the same "compile on demand" dodge on them, though. Can you explain your ch

[issue11454] email.message import time

2012-09-20 Thread Arfrever Frehtes Taifersar Arahesis
Changes by Arfrever Frehtes Taifersar Arahesis: nosy: +Arfrever

[issue11454] email.message import time

2012-09-19 Thread Serhiy Storchaka
Serhiy Storchaka added the comment:

    def _has_surrogates(s):
        try:
            s.encode()
            return False
        except UnicodeEncodeError:
            return True

Results:
    0.26 <- re.compile(short_regex).search
    0.06 <- try encode
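The encode-based check from this comment can be compared against the short regex in a small, self-contained sketch. The names `has_surrogates_encode`, `has_surrogates_regex`, `clean` and `escaped` are illustrative and not from the patch, and the timings will vary by machine and Python version. One behavioural difference worth noting: `s.encode()` rejects any lone surrogate, while the short regex only looks at U+DC80..U+DCFF.

```python
import re
import timeit


def has_surrogates_encode(s):
    """Return True if s cannot be encoded as UTF-8, i.e. it holds a lone surrogate."""
    try:
        s.encode()
        return False
    except UnicodeEncodeError:
        return True


# Short regex discussed in the thread: only the surrogateescape range.
has_surrogates_regex = re.compile('[\udc80-\udcff]').search

clean = "A" * 1000
# A non-ASCII byte decoded with surrogateescape becomes a lone surrogate.
escaped = b"caf\xe9".decode("ascii", "surrogateescape")

print(has_surrogates_encode(clean), has_surrogates_encode(escaped))

# Rough timing on the common case (no surrogates present).
for name, func in [("regex", has_surrogates_regex), ("encode", has_surrogates_encode)]:
    elapsed = timeit.timeit(lambda: func(clean), number=10000)
    print("%6.3f <- %s" % (elapsed, name))
```

The encode variant needs no regex compilation at import time, which is why it helps both startup and runtime in the numbers quoted above.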

[issue11454] email.message import time

2012-09-19 Thread Ezio Melotti
Ezio Melotti added the comment: It would be better to add/improve the _has_surrogates tests before committing. The patch I attached is also still valid if you want a further speed-up.

[issue11454] email.message import time

2012-09-19 Thread R. David Murray
R. David Murray added the comment: It passed the email test suite. Patch attached.
Added file: http://bugs.python.org/file27226/email_import_speedup.patch

[issue11454] email.message import time

2012-09-19 Thread Ezio Melotti
Ezio Melotti added the comment: That might work. To avoid the overhead of the cache lookup I was thinking about something like

    regex = None
    def _has_surrogates(s):
        global regex
        if regex is None:
            regex = re.compile(short_regex)
        return regex.search(s)

but I have discarded it

[issue11454] email.message import time

2012-09-19 Thread R. David Murray
R. David Murray added the comment: This issue may be about reducing the startup time, but this function is a hot spot in the email package so I would prefer to sacrifice startup time optimization for an increase in speed. However, given the improvements to import locking in 3.3, what about a s

[issue11454] email.message import time

2012-09-19 Thread Ezio Melotti
Ezio Melotti added the comment: Yes, however it has a startup cost that the function that returns re.search(short_regex, s) and the one with functools.partial don't have, because with these the compilation happens at the first call. If we use one of these two, the startup time will be reduced a
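The two deferred-compilation variants mentioned here could be written roughly as follows. This is a sketch; `SHORT_REGEX`, `has_surrogates_inline` and `has_surrogates_partial` are names made up for illustration. Neither compiles anything at import time: `re.search` compiles the pattern on the first call and afterwards fetches it from the `re` module's internal cache.

```python
import functools
import re

# Assumed value of the thread's "short_regex".
SHORT_REGEX = '[\udc80-\udcff]'


def has_surrogates_inline(s):
    """Plain function: pattern is compiled (and cached by re) on first call."""
    return re.search(SHORT_REGEX, s) is not None


# Same idea without a def: partial just pre-binds the pattern argument.
# It returns a match object or None instead of a bool.
has_surrogates_partial = functools.partial(re.search, SHORT_REGEX)
```

The price for the faster import is a cache lookup inside `re.search` on every call, which is the overhead the rest of the thread weighs against a precompiled pattern.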

[issue11454] email.message import time

2012-09-19 Thread R. David Murray
R. David Murray added the comment: So by your measurements the short search is the clear winner?

[issue11454] email.message import time

2012-09-19 Thread Ezio Melotti
Ezio Melotti added the comment: Attached new benchmark file. Results:

Testing runtime of the _has_surrogates functions
Generating chars...
Generating samples...
  1.61 <- re.compile(current_regex).search
  0.24 <- re.compile(short_regex).search
 15.13 <- return any(c in surrogates for c in s)

[issue11454] email.message import time

2012-09-19 Thread Ezio Melotti
Changes by Ezio Melotti: Removed file: http://bugs.python.org/file27203/issue11454_surr1.py

[issue11454] email.message import time

2012-09-19 Thread Ezio Melotti
Changes by Ezio Melotti: Removed file: http://bugs.python.org/file27223/issue11454_surr1.py

[issue11454] email.message import time

2012-09-19 Thread Serhiy Storchaka
Serhiy Storchaka added the comment: Faster set-version:

$ ./python -m timeit -s 'h=lambda s, hn=set(map(chr, range(0xDC80, 0xDD00))).isdisjoint: not hn(s); s = "A"*1000' 'h(s)'
1 loops, best of 3: 43.8 usec per loop
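Unpacked from the one-line lambda, the set-based check is roughly the following. `SURROGATE_ESCAPES` and `has_surrogates_set` are illustrative names, not identifiers from the thread.

```python
# The 128 code points that surrogateescape can produce: U+DC80..U+DCFF.
SURROGATE_ESCAPES = frozenset(map(chr, range(0xDC80, 0xDD00)))


def has_surrogates_set(s):
    """Return True if any character of s is a surrogate-escaped byte."""
    # isdisjoint accepts any iterable; iterating a str yields its characters.
    return not SURROGATE_ESCAPES.isdisjoint(s)
```

This version needs no `re` import at all, but it walks the string one character at a time with a hash lookup for each.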

[issue11454] email.message import time

2012-09-19 Thread Serhiy Storchaka
Serhiy Storchaka added the comment: Startup-time:

$ ./python -m timeit -s 'import re' 're.compile("([^\ud800-\udbff]|\A)[\udc00-\udfff]([^\udc00-\udfff]|\Z)").search; re.purge()'
100 loops, best of 3: 4.16 msec per loop
$ ./python -m timeit -s 'import re' 're.purge()' 're.compile("[\udc8

[issue11454] email.message import time

2012-09-19 Thread Serhiy Storchaka
Serhiy Storchaka added the comment:
> I haven't checked the startup-time, but I suspect it won't be better -- maybe
> even worse.
I suppose it will be much better.

[issue11454] email.message import time

2012-09-19 Thread Ezio Melotti
Ezio Melotti added the comment:
> What about _has_surrogates = re.compile('[^\udc80-\udcff]*\Z').match ?
The runtime is a bit slower than re.compile('[\udc80-\udcff]').search, but otherwise it's faster than all the other alternatives. I haven't checked the startup-time, but I suspect it won't

[issue11454] email.message import time

2012-09-19 Thread Serhiy Storchaka
Serhiy Storchaka added the comment: > If I change the regex to _has_surrogates = > re.compile('[\udc80-\udcff]').search, the tests still pass but there's no > improvement on startup time (note: the previous regex was matching all the > surrogates in this range too, however I'm not sure how wel

[issue11454] email.message import time

2012-09-18 Thread Ezio Melotti
Ezio Melotti added the comment: re.compile seems twice as fast as pickle.loads:

    import re
    import pickle
    import timeit
    N = 10
    s = "r = re.compile('[\\udc80-\\udcff]')"
    t = timeit.Timer(s, 'import re')
    print("%6.2f <- re.compile" % t.timeit(number=N))
    s = "r = pickle.loads(p)"
    p = pickle.du

[issue11454] email.message import time

2012-09-16 Thread R. David Murray
R. David Murray added the comment: Considering how often that test is done, I would consider the compiled version of the short regex the clear winner based on your numbers. I wonder if we could precompile the regex and load it from a pickle.
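The pickle idea can be tried with a short sketch. As far as I understand CPython's `re` module, a compiled pattern is pickled as its pattern string plus flags and is recompiled (or fetched from the `re` cache) on load, so a pickle would not be expected to skip the compilation step. Treat that as an assumption about the implementation rather than a documented guarantee. The variable names and the repetition count `N` below are illustrative.

```python
import pickle
import re
import timeit

pattern = re.compile('[\udc80-\udcff]')

# Round-trip the compiled pattern through pickle.
payload = pickle.dumps(pattern)
restored = pickle.loads(payload)

# The restored object behaves like the original.
print(restored.pattern == pattern.pattern)
print(restored.search("a\udc80b") is not None)

N = 100


def compile_fresh():
    re.purge()  # drop the re cache so compilation really happens
    return re.compile('[\udc80-\udcff]')


def load_fresh():
    re.purge()  # same treatment, so the comparison is fair
    return pickle.loads(payload)


print("%8.4f <- re.compile" % timeit.timeit(compile_fresh, number=N))
print("%8.4f <- pickle.loads" % timeit.timeit(load_fresh, number=N))
```

Clearing the cache with `re.purge()` in both branches matters: without it, either path may simply return the cached pattern and the timings say little about compilation cost.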

[issue11454] email.message import time

2012-09-15 Thread Ezio Melotti
Ezio Melotti added the comment: Given that high surrogates are U+D800..U+DBFF, and low ones are U+DC00..U+DFFF, '([^\ud800-\udbff]|\A)[\udc00-\udfff]([^\udc00-\udfff]|\Z)' means "a low surrogate, preceded by either a character that is not a high surrogate or the string beginning, and followed by either a character that is not a low surrogate or the string end".
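To see what the long pattern accepts in practice, it can be run next to the short one on a few hand-picked strings. This is a sketch; `current`, `short` and the `samples` dictionary are names chosen for the example, and the samples are illustrative rather than taken from the email test suite.

```python
import re

# Pattern quoted in the comment above (\\A and \\Z are doubled for a normal string literal).
current = re.compile('([^\ud800-\udbff]|\\A)[\udc00-\udfff]([^\udc00-\udfff]|\\Z)')
# Short pattern proposed later in the thread: only the surrogateescape range.
short = re.compile('[\udc80-\udcff]')

samples = {
    "plain": "abc",
    "escaped byte": "a\udc80b",                    # lone low surrogate between ordinary chars
    "high+low pair": "\ud83d\ude00",               # low surrogate right after a high one
    "lone low outside escape range": "\udfff",     # not something surrogateescape produces
}

for label, text in samples.items():
    print("%-30s current=%-5s short=%s" % (
        label, bool(current.search(text)), bool(short.search(text))))
```

The two patterns agree on the first three samples and differ on the last one: the long pattern flags any unpaired low surrogate, while the short one only flags code points in U+DC80..U+DCFF.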

[issue11454] email.message import time

2012-09-15 Thread R. David Murray
R. David Murray added the comment: It detects whether a string contains any characters that have been surrogate-escaped by the surrogateescape error handler. I disliked using it, but I didn't know of any better way to do that detection. It's on my long list of things to come back to eventually and try
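For readers unfamiliar with the handler, here is a minimal illustration of what "surrogate-escaped" means. The byte string is made up for the example; the mapping of an undecodable byte 0xNN to U+DCNN is the behaviour defined by PEP 383.

```python
raw = b"caf\xe9"  # 0xE9 is not valid ASCII

# Undecodable bytes are smuggled into the str as lone low surrogates.
text = raw.decode("ascii", "surrogateescape")
print(ascii(text))  # ascii() avoids printing a lone surrogate to the terminal

# The escaped character is U+DC00 + byte value.
print(hex(ord(text[-1])))

# Encoding with the same handler restores the original bytes exactly.
print(text.encode("ascii", "surrogateescape") == raw)

# Without the handler, a strict encode fails: this is what _has_surrogates detects.
try:
    text.encode("utf-8")
except UnicodeEncodeError as exc:
    print("strict encode failed:", exc.reason)
```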

[issue11454] email.message import time

2012-09-15 Thread Ezio Melotti
Ezio Melotti added the comment: I tried to remove a few unused regexes and inline some of the others (the re module has its own caching anyway and they don't seem to be documented), but it didn't get much faster (see attached patch). I then put the second list of email imports of the previo

[issue11454] email.message import time

2011-03-11 Thread Ross Lagerwall
Changes by Ross Lagerwall: title: urllib.request import time -> email.message import time