[issue25258] HtmlParser doesn't handle void element tags correctly
New submission from Chenyun Yang:

For void elements such as (<link>, <img>), the XHTML empty-tag syntax is not required. HTMLParser, which relies on the XHTML empty-tag syntax, fails to handle this case.

    from HTMLParser import HTMLParser

    # create a subclass and override the handler methods
    class MyHTMLParser(HTMLParser):
        def handle_starttag(self, tag, attrs):
            print "Encountered a start tag:", tag
        def handle_endtag(self, tag):
            print "Encountered an end tag :", tag
        def handle_data(self, data):
            print "Encountered some data :", data

    >>> parser = MyHTMLParser()
    >>> parser.feed('<link><img>')
    Encountered a start tag: link
    Encountered a start tag: img
    >>> parser.feed('<link/><img/>')
    Encountered a start tag: link
    Encountered an end tag : link
    Encountered a start tag: img
    Encountered an end tag : img

Reference:
https://github.com/python/cpython/blob/bdfb14c688b873567d179881fc5bb67363a6074c/Lib/html/parser.py
http://www.w3.org/TR/html5/syntax.html#void-elements

--
components: Library (Lib)
messages: 251792
nosy: Chenyun Yang
priority: normal
severity: normal
status: open
title: HtmlParser doesn't handle void element tags correctly
versions: Python 2.7

___
Python tracker <http://bugs.python.org/issue25258>
___
___
Python-bugs-list mailing list
Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
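The report above can be reproduced on Python 3 with a short script. This is only a sketch: the names `EventRecorder` and `record` are made up for illustration, and the two input strings are minimal stand-ins for the markup in the report.

```python
from html.parser import HTMLParser


class EventRecorder(HTMLParser):
    """Record which tag handlers fire, in order."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


def record(markup):
    """Parse markup with a fresh parser and return the recorded events."""
    parser = EventRecorder()
    parser.feed(markup)
    parser.close()
    return parser.events


html_style = record("<link><img>")      # void elements, HTML syntax
xhtml_style = record("<link/><img/>")   # same elements, XHTML empty-tag syntax

print(html_style)
print(xhtml_style)
```

Only the XHTML-style input produces end-tag events, because the default `handle_startendtag` calls `handle_starttag` followed by `handle_endtag`.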
[issue25258] HtmlParser doesn't handle void element tags correctly
Chenyun Yang added the comment:

I think the bug is mostly about inconsistent behavior: <link> and <link/> shouldn't be parsed differently. This causes a problem because the parser cannot know consistently whether it has finished visiting a tag.

I propose one fix: in the `parse_internal' method call, check for void elements and call `handle_startendtag'.

On Tue, Sep 29, 2015 at 1:27 PM, Martin Panter wrote:
>
> Martin Panter added the comment:
>
> Also applies to Python 3, though I’m not sure I would consider it a bug.
>
> --
> nosy: +martin.panter
> versions: +Python 3.4, Python 3.5, Python 3.6
[issue25258] HtmlParser doesn't handle void element tags correctly
Chenyun Yang added the comment:

The example you give for <li> is a different case. <img> and <link> are void elements, which are allowed to have no close tag; <li> without </li> is a browser implementation detail, and most browsers autocomplete </li>.

Unless the parser calls handle_endtag(), client code that uses HTMLParser cannot know whether the traversal of an element is finished.

Do you have a strong reason why we should include the knowledge of void elements into the HTMLParser at this line?

https://github.com/python/cpython/blob/bdfb14c688b873567d179881fc5bb67363a6074c/Lib/html/parser.py#L341

    if end.endswith('/>') or (end.endswith('>') and tag in VOID_ELEMENTS):

On Wed, Sep 30, 2015 at 7:05 PM, Martin Panter wrote:
>
> Martin Panter added the comment:
>
> My thinking is that the knowledge that <img> does not have a closing tag
> is at a higher level than the current HTMLParser class. It is similar to
> knowing where the following HTML implicitly closes the <li> elements:
>
> <ul><li>Item A<li>Item B</ul>
>
> In both cases I would not expect the HTMLParser to report “virtual” empty
> or closing tags. I don’t think it should report an empty or closing
> tag just because that is easy to do, because it would be
> inconsistent with other implied HTML tags. But maybe see what other people
> say.
>
> I don’t know your particular use case, but I would suggest if you need to
> parse non-XML HTML tags, use the handle_starttag() method and don’t
> rely on the end tag :)
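The comparison with implied end tags can be checked directly. The sketch below assumes the quoted example is a list with unclosed <li> items; `TagRecorder` is a hypothetical helper name, and only tag events are recorded (text is ignored).

```python
from html.parser import HTMLParser


class TagRecorder(HTMLParser):
    """Record start and end tag events only."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


parser = TagRecorder()
parser.feed("<ul><li>Item A<li>Item B</ul>")
parser.close()

# The parser reports only the tags that are literally present in the input;
# it does not synthesize the implied </li> end tags.
print(parser.events)
```

In this run no end event is reported for either <li>, which is the existing behavior both sides of the discussion are reasoning from.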
[issue25258] HtmlParser doesn't handle void element tags correctly
Chenyun Yang added the comment:

I am fine with either handle_startendtag or handle_starttag.

The issue is that the behavior is consistent for the two equally valid syntaxes (<link> and <link/> are handled differently); this inconsistency cannot be fixed from a subclass, because the handle_* calls are dispatched in an internal method of HTMLParser.

On Fri, Oct 2, 2015 at 12:42 PM, Ezio Melotti wrote:
>
> Ezio Melotti added the comment:
>
> Note that HTMLParser tries to follow the HTML5 specs, and for this case
> they say [0]:
> "Set the self-closing flag of the current tag token. Switch to the data
> state. Emit the current tag token."
>
> So it seems that for <link/>, only the <link> (and not the closing </link>)
> should be emitted. HTMLParser has no way to set the self-closing flag, so
> calling handle_startendtag seems the most reasonable things to do, since it
> allows tree-builders to set the flag themselves. That said, the default
> implementation of handle_startendtag should indeed just call
> handle_starttag, however this would be a backward-incompatible change.
>
> [0]: http://www.w3.org/TR/html5/syntax.html#self-closing-start-tag-state
>
> --
> type: -> behavior
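The point about tree-builders setting the flag themselves can be sketched as follows. `TokenCollector` and its `(tag, self_closing)` tuples are illustrative assumptions, not part of HTMLParser.

```python
from html.parser import HTMLParser


class TokenCollector(HTMLParser):
    """Collect (tag, self_closing) pairs, as a tree-builder might."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        # Reached for tags written without a trailing slash.
        self.tokens.append((tag, False))

    def handle_startendtag(self, tag, attrs):
        # Reached only for tags written with the XHTML-style "/>" ending.
        # Overriding it replaces the default start-then-end dispatch.
        self.tokens.append((tag, True))


collector = TokenCollector()
collector.feed("<link><link/>")
collector.close()
print(collector.tokens)
```

The same void element ends up with a different flag depending only on how it was written, which is the inconsistency described in the comment.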
[issue25258] HtmlParser doesn't handle void element tags correctly
Chenyun Yang added the comment:

Correction to my previous comment: consistent -> not consistent

On Fri, Oct 2, 2015 at 1:16 PM, Chenyun Yang wrote:
>
> Chenyun Yang added the comment:
>
> I am fine with either handle_startendtag or handle_starttag,
>
> The issue is that the behavior is consistent for the two equally valid
> syntax (<link> and <link/> are handled differently); this inconsistent cannot
> be fixed from the inherited class as (handle_* calls are dispatched in the
> internal method of HTMLParser)
[issue25258] HtmlParser doesn't handle void element tags correctly
Chenyun Yang added the comment:

handle_startendtag is also called for non-void elements, such as <a/>, so the override example will break in those situations.

The compatible patch I propose right now is just a one-line check:

    # http://www.w3.org/TR/html5/syntax.html#void-elements
    _VOID_ELEMENT_TAGS = frozenset([
        'area', 'base', 'br', 'col', 'embed', 'hr', 'img', 'input', 'keygen',
        'link', 'meta', 'param', 'source', 'track', 'wbr'])

    # In HTMLParser.HTMLParser:

        # Internal -- handle starttag, return end or -1 if not terminated
        def parse_starttag(self, i):
            #...
            if end.endswith('/>'):
                # XHTML-style empty tag: <span attr="value" />
                self.handle_startendtag(tag, attrs)
            #PATCH#
            elif end.endswith('>') and tag in _VOID_ELEMENT_TAGS:
                self.handle_startendtag(tag, attrs)
            #PATCH#
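Until such a patch exists, a subclass can approximate its effect by routing void elements through handle_startendtag itself. This is a hedged sketch for Python 3, not the proposed patch: `VoidAwareParser`, `parse`, and the event tuples are hypothetical names, and the tag list is copied from the message above.

```python
from html.parser import HTMLParser

# Void element names as listed in the message above.
VOID_ELEMENTS = frozenset([
    'area', 'base', 'br', 'col', 'embed', 'hr', 'img', 'input', 'keygen',
    'link', 'meta', 'param', 'source', 'track', 'wbr'])


class VoidAwareParser(HTMLParser):
    """Report void elements the same way with or without a trailing slash."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_ELEMENTS:
            # Treat <link> exactly like <link/>.
            self.handle_startendtag(tag, attrs)
        else:
            self.events.append(("start", tag))

    def handle_startendtag(self, tag, attrs):
        # Also reached for non-void tags written as <a/>, matching what
        # the unpatched parser already does for "/>" endings.
        self.events.append(("startend", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


def parse(markup):
    """Parse markup with a fresh parser and return the recorded events."""
    parser = VoidAwareParser()
    parser.feed(markup)
    parser.close()
    return parser.events


print(parse("<link><img>"))
print(parse("<link/><img/>"))
print(parse("<p>text</p><a/>"))
```

Because handle_startendtag is overridden rather than left to its default start-then-end dispatch, void elements never produce a duplicate end event here.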