[issue12177] re.match raises MemoryError
Matthew Boehm added the comment:

Here are some Windows results with Python 2.7:

>>> import re
>>> re.match("()*?1", "1")
<_sre.SRE_Match object at 0x025C0E60>
>>> re.match("()+?1", "1")
>>> re.match("()+?1", "11")
<_sre.SRE_Match object at 0x025C0E60>
>>> re.match("()*?1", "11")
<_sre.SRE_Match object at 0x025C3C60>
>>> re.match("()*?1", "a1")
Traceback (most recent call last):
  File "", line 1, in
    re.match("()*?1", "a1")
  File "C:\Python27\lib\re.py", line 137, in match
    return _compile(pattern, flags).match(string)
MemoryError
>>> re.match("()+?1", "a1")
Traceback (most recent call last):
  File "", line 1, in
    re.match("()+?1", "a1")
  File "C:\Python27\lib\re.py", line 137, in match
    return _compile(pattern, flags).match(string)
MemoryError

Note that when the string being matched starts with "1", the matcher does not raise a MemoryError.

--
nosy: +Matthew.Boehm

___
Python tracker <http://bugs.python.org/issue12177>
___
___
Python-bugs-list mailing list
Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue12855] open() and codecs.open() treat form-feed differently
New submission from Matthew Boehm:

A file opened with codecs.open() splits lines on a form feed character (\x0c), while a file opened with open() does not.

>>> with open("formfeed.txt", "w") as f:
...     f.write("line \fone\nline two\n")
...
>>> with open("formfeed.txt", "r") as f:
...     s = f.read()
...
>>> s
'line \x0cone\nline two\n'
>>> print s
line one
line two
>>> import codecs
>>> with open("formfeed.txt", "rb") as f:
...     lines = f.readlines()
...
>>> lines
['line \x0cone\n', 'line two\n']
>>> with codecs.open("formfeed.txt", "r", encoding="ascii") as f:
...     lines2 = f.readlines()
...
>>> lines2
[u'line \x0c', u'one\n', u'line two\n']
>>>

Note that lines contains two items while lines2 contains three.

Issue 7643 has a good discussion of newlines in Python, but I did not see this discrepancy mentioned there.

--
components: Interpreter Core
messages: 143182
nosy: Matthew.Boehm
priority: normal
severity: normal
status: open
title: open() and codecs.open() treat form-feed differently
type: behavior
versions: Python 2.7

___
Python tracker <http://bugs.python.org/issue12855>
___
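
The discrepancy reported above can be reproduced as a standalone script. The sketch below assumes Python 3 (the session above is Python 2.7) and wraps a binary file with codecs.getreader("ascii") instead of calling codecs.open(); the helper name read_both_ways is hypothetical and only serves the illustration.

```python
import codecs
import os
import tempfile


def read_both_ways(text):
    """Write text to a temp file, then read its lines two ways.

    Returns (lines from built-in open(), lines from a codecs StreamReader).
    """
    fd, path = tempfile.mkstemp(suffix=".txt")
    try:
        # newline="" disables newline translation so the bytes on disk
        # are exactly the characters in `text`.
        with os.fdopen(fd, "w", encoding="ascii", newline="") as f:
            f.write(text)
        # Built-in text mode: only \n, \r and \r\n end a line.
        with open(path, "r", encoding="ascii") as f:
            builtin_lines = f.readlines()
        # codecs StreamReader: readlines() splits with str.splitlines(),
        # which also treats form feed (\x0c) as a line boundary.
        with open(path, "rb") as raw:
            codecs_lines = codecs.getreader("ascii")(raw).readlines()
    finally:
        os.remove(path)
    return builtin_lines, codecs_lines


builtin_lines, codecs_lines = read_both_ways("line \fone\nline two\n")
print(len(builtin_lines), builtin_lines)
print(len(codecs_lines), codecs_lines)
```

The first list should hold two items and the second three, matching lines and lines2 in the report.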
[issue12855] open() and codecs.open() treat form-feed differently
Matthew Boehm added the comment:

Thanks for explaining the reasoning.

Perhaps I should add this to the Python wiki (http://wiki.python.org/moin/Unicode)? It would be nice if it fit in the docs somewhere, but I'm not sure where.

I'm also curious how (or if) 2to3 would handle this, but I'm closing this issue because it's now clear to me why the two functions are expected to behave differently.

--
[issue12855] open() and codecs.open() treat form-feed differently
Changes by Matthew Boehm:

--
resolution: -> wont fix
status: open -> closed
[issue12855] open() and codecs.open() treat form-feed differently
Matthew Boehm added the comment:

I'll suggest a patch for the documentation when I get to my home computer in an hour or two.

--
assignee: -> docs@python
components: +Documentation -Interpreter Core
nosy: +docs@python
resolution: wont fix ->
status: closed -> open
[issue12855] open() and codecs.open() treat form-feed differently
Matthew Boehm added the comment:

I'm taking a look at the docs now.

I'm considering adding a table or list of the characters Python treats as newlines, but it seems this might fit better as a note in http://docs.python.org/library/stdtypes.html#str.splitlines or somewhere else in stdtypes.

I'll start working on it now, but please let me know what you think. This is my first attempt at a patch, so I greatly appreciate your help so far.

--
[issue12855] linebreak sequences should be better documented
Matthew Boehm added the comment:

I've attached a patch for Python 2.7 that adds a small note to library/stdtypes.html#str.splitlines explaining which sequences are treated as line breaks:

"""
Note: Python recognizes "\r", "\n", and "\r\n" as line boundaries for strings.

In addition to these, Unicode strings can have line boundaries of u"\x0b", u"\x0c", u"\x85", u"\u2028", and u"\u2029"
"""

Additional thoughts:

* Would it be better to put this note in a different place?

* It looks like \x0b and \x0c (vertical tab and form feed) were first considered line breaks in Python 2.7, probably related to this note from "What's New in 2.7": "The Unicode database provided by the unicodedata module is now used internally to determine which characters are numeric, whitespace, or represent line breaks." It might be worth putting a "changed in 2.7" note somewhere in the docs.

Please let me know your thoughts. I'll be glad to make any desired changes and submit a new patch.

--
keywords: +patch
title: open() and codecs.open() treat form-feed differently -> linebreak sequences should be better documented
Added file: http://bugs.python.org/file23069/linebreakdoc.py27.patch
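
The boundaries named in the proposed note can be checked directly against splitlines(). The sketch below assumes Python 3, where str plays the role of the 2.7 unicode type and bytes the role of the 2.7 str type; the names CANDIDATES, splits_text and splits_bytes are hypothetical. The candidate list is only the one quoted in the note, not a complete list of Unicode line boundaries.

```python
# Separators quoted in the proposed documentation note.
CANDIDATES = ["\n", "\r", "\r\n", "\x0b", "\x0c", "\x85", "\u2028", "\u2029"]


def splits_text(sep):
    """True if str.splitlines() treats sep as a line boundary."""
    return len(("a" + sep + "b").splitlines()) == 2


def splits_bytes(sep):
    """True/False for bytes.splitlines(); None if sep has no latin-1 byte."""
    try:
        raw = ("a" + sep + "b").encode("latin-1")
    except UnicodeEncodeError:
        return None
    return len(raw.splitlines()) == 2


for sep in CANDIDATES:
    print(repr(sep), "text:", splits_text(sep), "bytes:", splits_bytes(sep))
```

Every candidate should split a text string, while the byte string is split only on "\n", "\r" and "\r\n".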
[issue12855] linebreak sequences should be better documented
Matthew Boehm added the comment:

I can fix the patch to list all the Unicode line boundaries. The three places I've considered putting it are:

1. On the howto/unicode.html page
2. Somewhere in the stdtypes.html#typesseq description (maybe with the other notes at the bottom)
3. As a note to the stdtypes.html#str.splitlines method description (where it is in the previous patch)

I can move it to any of these places if you think it's a better fit.

I'll fix the list so that it's complete, add a note about \x0b and \x0c being added in 2.7/3.2, and possibly reference it from StreamReader.readline.

After confirming that my documentation matches the style guide, I'll build the docs, test the output, and upload a patch. I can do this for 2.7, 3.2 and 3.3 separately.

Let me know if that sounds good and if you have any further thoughts. I should be able to upload new patches in 10 hours (after work today).

--
[issue12855] linebreak sequences should be better documented
Matthew Boehm added the comment:

I've attached a patch for 2.7 and will attach one for 3.2 in a minute.

I built the docs for both 2.7 and 3.2 and verified that there were no warnings and that the resulting web pages looked okay.

Things to consider:

* Placement of the unicode.splitlines() method: I placed it next to str.splitlines. I didn't want to place it with the unicode methods further down because the docs say "The following methods are present only on unicode objects".

* The docs for codecs.readlines() already mention "Line-endings are implemented using the codec’s decoder method and are included in the list entries if keepends is true."

* Feel free to make any wording/style suggestions.

--
Added file: http://bugs.python.org/file23076/linebreakdoc.v2.py27.patch
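
The keepends behaviour quoted from the codecs docs can be illustrated with an in-memory stream. This is a minimal sketch assuming Python 3 and the StreamReader.readlines() method; DATA and stream_lines are hypothetical names used only here.

```python
import codecs
import io

# Same content as the form-feed file from the original report.
DATA = b"line \x0cone\nline two\n"


def stream_lines(keepends):
    """Read DATA through an ASCII StreamReader and return its lines."""
    reader = codecs.getreader("ascii")(io.BytesIO(DATA))
    return reader.readlines(keepends=keepends)


with_ends = stream_lines(True)
without_ends = stream_lines(False)
print(with_ends)
print(without_ends)
```

With keepends true, the form feed stays attached to the first entry as its line ending; with keepends false it is dropped along with the "\n" characters.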
[issue12855] linebreak sequences should be better documented
Changes by Matthew Boehm:

Added file: http://bugs.python.org/file23077/linebreakdoc.v2.py32.patch