Bugs item #1055864, was opened at 2004-10-28 00:59
Message generated for change (Settings changed) made by fdrake
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1055864&group_id=5470
Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.

>Category: Python Library
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Luke Bradley (neptune235)
>Assigned to: Fred L. Drake, Jr. (fdrake)
Summary: HTMLParser not compliant to XHTML spec

Initial Comment:
HTMLParser has a problem related to the fact that it doesn't seem to comply
with the spec for XHTML. What I am referring to can be read about here:
http://www.w3.org/TR/xhtml1/#h-4.8

In a nutshell, HTMLParser doesn't treat data inside 'script' or 'style'
elements as #PCDATA, but rather behaves like an HTML 4 parser even for XHTML
documents, parsing only end tags. As a result, entity references in
javascript are not converted as they should be. XHTML authors writing to
spec can expect entities in script sections of XHTML documents to be
converted if the script is not explicitly escaped as a CDATA section.

Which brings up problem two: sections explicitly escaped as CDATA are also
parsed as HTML 4 'script' and 'style' sections...End tags are parsed...
My understanding is that this is bad as well:
http://www.w3.org/TR/2004/REC-xml-20040204/#dt-cdsection
because CDEnd is the only thing that's supposed to be parsed in a CDATA
section for all XML documents?

----------------------------------------------------------------------

Comment By: Luke Bradley (neptune235)
Date: 2004-10-28 18:23

Message:
Logged In: YES
user_id=178561

Sure. I'll attach it as a file: tidytest2.py

btw: I'm no guru so tell me if I'm misinterpreting the W3C. I'm just trying
to use HTMLParser in such a way that it won't mangle anybody's script
sections, and I want to have all my bases covered.

----------------------------------------------------------------------

Comment By: Martin v. Löwis (loewis)
Date: 2004-10-28 15:41

Message:
Logged In: YES
user_id=21627

Can you give an example demonstrating this problem, please? A Python script
with a small embedded HTML file, and a PASS/FAIL condition would be best.

----------------------------------------------------------------------

Comment By: Luke Bradley (neptune235)
Date: 2004-10-28 04:31

Message:
Logged In: YES
user_id=178561

I also reported bug 1051840. I discovered this when I was looking for a
universal way to handle all the weird things people do with their script
sections on HTML/XHTML pages on the net. I've ended up modifying
HTMLParser.py so that the HTMLParser class has an extra attribute called
last_match, which is the exact string of HTML for which the current handler
event is being called...So that putting:

    sys.stdout.write(self.last_match)

or

    sys.stdout.write(self.get_last_match())

in every handler event (except handle_data, whose data can be output
directly) will output the page exactly as it was input. This allows me to
handle all oddities in people's code at the level of my handler, without
changing HTMLParser in any other way...

Here's the code, attached. Not that you care, but on the off chance that
you guys might want to think about doing something like this....:)

----------------------------------------------------------------------
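
For illustration, here is a minimal sketch of the kind of PASS/FAIL script
requested in the thread. It is not the attached tidytest2.py, which is not
reproduced in this message. It assumes the Python 3 html.parser module rather
than the Python 2 HTMLParser module the report was filed against, and the
DataCollector class, the collect helper, and the sample document are
hypothetical names introduced only for this example. The PASS condition
encodes the reporter's reading of the XHTML spec (entities inside 'script'
are converted), which is the point under discussion rather than settled
behavior.

```python
from html.parser import HTMLParser


class DataCollector(HTMLParser):
    """Record the text passed to handle_data, grouped by enclosing tag."""

    def __init__(self):
        super().__init__()
        self.open_tags = []  # stack of currently open tag names
        self.text = {}       # tag name -> concatenated character data

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()

    def handle_data(self, data):
        if self.open_tags:
            tag = self.open_tags[-1]
            self.text[tag] = self.text.get(tag, "") + data


def collect(document):
    """Parse the document and return the collected text per tag."""
    parser = DataCollector()
    parser.feed(document)
    parser.close()
    return parser.text


# Hypothetical sample: the same entity in ordinary text and in a script.
DOC = "<p>1 &lt; 2</p><script>if (a &lt; b) { run(); }</script>"

text = collect(DOC)

# What the reporter expects for an XHTML document (an assumption under test).
xhtml_expected = "if (a < b) { run(); }"

print("paragraph:", repr(text["p"]))
print("script:   ", repr(text["script"]))
print("PASS" if text["script"] == xhtml_expected else "FAIL")
```

The paragraph text has its entity converted, while the script text is handed
to handle_data unchanged, so the check fails under the XHTML interpretation
described in the initial comment.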