[issue10759] HTMLParser.unescape() cannot handle HTML entities with incorrect syntax (e.g. &#hearts; )
New submission from Martin Potthast:

The title says it all; try the minimal example.

--
components: Library (Lib)
files: parser-fail.py
messages: 124506
nosy: Martin.Potthast
priority: normal
severity: normal
status: open
title: HTMLParser.unescape() cannot handle HTML entities with incorrect syntax (e.g. &#hearts;)
versions: Python 2.6

Added file: http://bugs.python.org/file20139/parser-fail.py

___
Python tracker <http://bugs.python.org/issue10759>
___
___
Python-bugs-list mailing list
Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue10759] HTMLParser.unescape() fails on HTML entities with incorrect syntax (e.g. &#hearts; )
Changes by Martin Potthast:

--
title: HTMLParser.unescape() cannot handle HTML entities with incorrect syntax (e.g. &#hearts;) -> HTMLParser.unescape() fails on HTML entities with incorrect syntax (e.g. &#hearts;)
[issue10759] HTMLParser.unescape() fails on HTML entities with incorrect syntax (e.g. &#hearts; )
Martin Potthast added the comment:

I'd suggest validating the input more thoroughly and returning such malformed strings unchanged.

--
type: -> behavior
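The suggestion above (validate first, return malformed references unchanged) could look roughly like the following. This is only a sketch written for Python 3, not the attached patch: the helper name lenient_unescape and the regular expression are assumptions made for illustration, and the entity table comes from html.entities rather than from HTMLParser itself.

```python
import re
from html.entities import name2codepoint

# Matches only well-formed references: "&name;", "&#123;" or "&#x1F;".
# Something like "&#hearts;" does not match and is therefore never touched.
_CHARREF = re.compile(r"&(#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);")


def lenient_unescape(text):
    """Hypothetical helper: decode valid references, keep everything else verbatim."""
    def replace(match):
        ref = match.group(1)
        try:
            if ref[0] == "#":
                digits = ref[1:]
                if digits[0] in "xX":
                    return chr(int(digits[1:], 16))  # hexadecimal reference
                return chr(int(digits))              # decimal reference
            return chr(name2codepoint[ref])          # named entity
        except (KeyError, ValueError, OverflowError):
            # Unknown name or out-of-range code point: leave the text as it was.
            return match.group(0)

    return _CHARREF.sub(replace, text)
```

With this approach the input "&#hearts;" passes through untouched, while "&hearts;" and "&#9829;" are both decoded to the heart character.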
[issue10759] HTMLParser.unescape() fails on HTML entities with incorrect syntax (e.g. &#hearts; )
Martin Potthast added the comment:

Agreed. Here's a patch for HTMLParser; that was easy enough.

As for tests, there already seems to be one called test_malformatted_charref in test_htmlparser.py. However, that test exercises the whole parser, not just HTMLParser.unescape().

HTMLParser.unescape() also carries the following comment: "# Internal -- helper to remove special character quoting". It appears the syntax check is already done in line 168, but since the unescape function is publicly visible, I'd say it should be able to handle all kinds of malformed input, despite that comment. Maybe the comment should be removed.

I'm not entirely sure how to write the test properly, since it doesn't fit into the framework provided by test_htmlparser.py, and unfortunately I'm rather short on time at the moment.

--
keywords: +patch

Added file: http://bugs.python.org/file20141/HTMLParser.py.diff
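A test that targets the unescape function directly, rather than the whole parser, might be shaped like the sketch below. It is an assumption-laden illustration: it is a standalone unittest case that does not use the test_htmlparser.py framework, the class and method names are made up, and it calls Python 3's html.unescape() as a stand-in because HTMLParser.unescape() from Python 2.6 is not available in current Python versions.

```python
import html
import unittest


class UnescapeMalformedTest(unittest.TestCase):
    """Hypothetical test case; html.unescape() stands in for HTMLParser.unescape()."""

    def test_malformed_numeric_reference_is_left_alone(self):
        # "&#hearts;" is neither a valid numeric nor a valid named reference.
        self.assertEqual(html.unescape("&#hearts;"), "&#hearts;")

    def test_well_formed_references_are_decoded(self):
        self.assertEqual(html.unescape("&hearts;"), "\u2665")
        self.assertEqual(html.unescape("&#9829;"), "\u2665")
```

Saved to a file, the case can be run with `python -m unittest <module name>`.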
[issue10759] HTMLParser.unescape() fails on HTML entities with incorrect syntax (e.g. &#hearts; )
Martin Potthast added the comment:

Why not simply remove the additional check in line 168 and leave the responsibility for validating the input to the unescape function (be it explicitly or, as now, lazily)? That way, the code changes are minimal, the existing test covers the current issue, and the function becomes more robust.

By the way, I came across this function via Stack Overflow: http://stackoverflow.com/questions/2087370

--