[issue10759] HTMLParser.unescape() cannot handle HTML entities with incorrect syntax (e.g. &#hearts; )

2010-12-22 Thread Martin Potthast

New submission from Martin Potthast :

The title says it all; try the minimal example.
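The attached parser-fail.py is not reproduced in this thread. The following is only a sketch of how the failure can arise, written in Python 3 syntax (Python 2.6 would use unichr and htmlentitydefs). The pattern ENTITY_RE and the helper naive_unescape are hypothetical stand-ins, assumed to resemble what unescape() does; they are not copied from the library.

```python
import re
from html.entities import name2codepoint

# Assumed shape of the entity pattern; it accepts "#" followed by any word.
ENTITY_RE = re.compile(r"&(#?[xX]?(?:[0-9a-fA-F]+|\w{1,8}));")


def naive_unescape(text):
    """Hypothetical stand-in for an unescape() that trusts its input."""
    def replace(match):
        body = match.group(1)
        if body[0] == "#":
            body = body[1:]
            if body[0] in "xX":
                return chr(int(body[1:], 16))
            # For "&#hearts;" this becomes int("hearts"), which raises ValueError.
            return chr(int(body))
        # Named entity: substitute if known, otherwise leave it untouched.
        if body in name2codepoint:
            return chr(name2codepoint[body])
        return match.group(0)

    return ENTITY_RE.sub(replace, text)
```

With this sketch, "&hearts;" and "&#9829;" both decode to the heart character, while "&#hearts;" raises ValueError instead of being passed through.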

--
components: Library (Lib)
files: parser-fail.py
messages: 124506
nosy: Martin.Potthast
priority: normal
severity: normal
status: open
title: HTMLParser.unescape() cannot handle HTML entities with incorrect syntax 
(e.g. &#hearts;)
versions: Python 2.6
Added file: http://bugs.python.org/file20139/parser-fail.py

___
Python tracker 
<http://bugs.python.org/issue10759>
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue10759] HTMLParser.unescape() fails on HTML entities with incorrect syntax (e.g. &#hearts; )

2010-12-22 Thread Martin Potthast

Changes by Martin Potthast :


--
title: HTMLParser.unescape() cannot handle HTML entities with incorrect syntax 
(e.g. &#hearts;) -> HTMLParser.unescape() fails on HTML entities with incorrect 
syntax (e.g. &#hearts;)




[issue10759] HTMLParser.unescape() fails on HTML entities with incorrect syntax (e.g. &#hearts; )

2010-12-22 Thread Martin Potthast

Martin Potthast  added the comment:

I'd suggest validating the input more carefully and returning malformed 
entities such as &#hearts; unchanged.
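One way to read "verify the input and return such strings unchanged" is to tighten the pattern so that only syntactically valid references are matched at all. This is a sketch under that assumption, in Python 3 syntax; STRICT_ENTITY_RE and tolerant_unescape are hypothetical names, not the contents of any attached patch.

```python
import re
from html.entities import name2codepoint

# Only decimal ("#123"), hexadecimal ("#x7B") and alphanumeric named
# references are matched; anything else is never handed to the callback.
STRICT_ENTITY_RE = re.compile(
    r"&(#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);"
)


def tolerant_unescape(text):
    """Decode valid references and leave everything else as written."""
    def replace(match):
        body = match.group(1)
        if body.startswith(("#x", "#X")):
            code = int(body[2:], 16)
        elif body.startswith("#"):
            code = int(body[1:])
        elif body in name2codepoint:
            code = name2codepoint[body]
        else:
            return match.group(0)  # unknown name: keep the original text
        try:
            return chr(code)
        except (ValueError, OverflowError):
            return match.group(0)  # code point out of range

    return STRICT_ENTITY_RE.sub(replace, text)
```

Because "&#hearts;" never matches the stricter pattern, it comes back unchanged, and valid references in the same string are still decoded.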

--
type:  -> behavior




[issue10759] HTMLParser.unescape() fails on HTML entities with incorrect syntax (e.g. &#hearts; )

2010-12-22 Thread Martin Potthast

Martin Potthast  added the comment:

Agreed. Here's a patch for HTMLParser. That was easy enough.

With regard to tests, there already seems to be one called 
test_malformatted_charref in test_htmlparser.py. However, that test exercises 
the whole parser rather than HTMLParser.unescape() in isolation.

At the same time, HTMLParser.unescape() has the following comment:
"# Internal -- helper to remove special character quoting"

It appears the syntax check is done in line 168 already, but since the unescape 
function is publicly visible, I'd say that it should be capable of handling all 
kinds of malformed input, despite that comment. Maybe this comment should be 
removed.

I'm not entirely sure how to write the test properly, since it doesn't fit into 
the framework provided by test_htmlparser.py; and unfortunately, my time is 
rather short at the moment.
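A standalone test that targets only the unescaping step might look like the sketch below. It does not use the framework in test_htmlparser.py. As an assumption made purely for illustration, it tests html.unescape() from Python 3.4+ as a stand-in for HTMLParser.unescape(); the class name and the run_tests helper are hypothetical.

```python
import unittest
from html import unescape  # Python 3.4+; stand-in for HTMLParser.unescape()


class UnescapeMalformedCharrefTest(unittest.TestCase):
    def test_malformed_numeric_reference_is_left_unchanged(self):
        # The input from the issue title must come back as it went in.
        for text in ("&#hearts;", "I &#hearts; Python"):
            self.assertEqual(unescape(text), text)

    def test_well_formed_references_are_still_decoded(self):
        self.assertEqual(
            unescape("&hearts; &#9829; &#x2665;"),
            "\u2665 \u2665 \u2665",
        )


def run_tests():
    """Run the test case above and return the unittest result object."""
    loader = unittest.defaultTestLoader
    suite = loader.loadTestsFromTestCase(UnescapeMalformedCharrefTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Swapping the imported function for the method under discussion would give the same two checks against HTMLParser.unescape() itself.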

--
keywords: +patch
Added file: http://bugs.python.org/file20141/HTMLParser.py.diff




[issue10759] HTMLParser.unescape() fails on HTML entities with incorrect syntax (e.g. &#hearts; )

2010-12-22 Thread Martin Potthast

Martin Potthast  added the comment:

Why not simply remove the additional check in line 168 and leave the 
responsibility for validating its input to the unescape function (whether 
explicitly or, as now, lazily)? That way, the code changes are minimal, 
the existing test covers the current issue, and the function becomes more 
robust.
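For comparison, "lazy" checking can be sketched as attempting the conversion and falling back to the original text when it fails, rather than validating the syntax up front. The helper decode_numeric_charref is a hypothetical name used only for this illustration, in Python 3 syntax.

```python
def decode_numeric_charref(body, original):
    """Decode the text between '&#' and ';', or return `original` on failure.

    `body` is e.g. "9829", "x2665" or "hearts"; `original` is the full
    reference as it appeared in the input, e.g. "&#hearts;".
    """
    try:
        if body[:1] in ("x", "X"):
            return chr(int(body[1:], 16))
        return chr(int(body))
    except (ValueError, OverflowError):
        # int() rejected the text, or the code point is out of range.
        return original
```

One trade-off of the lazy form is that it accepts whatever int() accepts, which is slightly broader than HTML's own syntax (surrounding whitespace, for example), so an explicit pattern check is stricter.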

By the way, I came across this function via Stackoverflow:
http://stackoverflow.com/questions/2087370

--
