[ python-Bugs-1055864 ] HTMLParser not compliant to XHTML spec

SourceForge.net Tue, 02 May 2006 13:47:58 -0700

Bugs item #1055864, was opened at 2004-10-28 00:59
Message generated for change (Settings changed) made by fdrake
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1055864&group_id=5470


Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
>Category: Python Library
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Luke Bradley (neptune235)
>Assigned to: Fred L. Drake, Jr. (fdrake)
Summary: HTMLParser not compliant to XHTML spec

Initial Comment:
HTMLParser has a problem related to the fact that it
doesn't seem to comply with the spec for XHTML. What I am
referring to can be read about here:
http://www.w3.org/TR/xhtml1/#h-4.8
In a nutshell, HTMLParser doesn't treat data inside
'script' or 'style' elements as #PCDATA, but rather
behaves like an HTML 4 parser even for XHTML documents,
parsing only end tags. As a result, entity references
in javascript are not converted as they should be.
XHTML authors writing to spec can expect entities in
script sections of XHTML documents to be converted if
the script is not explicitly escaped as a CDATA
section. That brings up problem two: sections
explicitly escaped as CDATA are also parsed as HTML 4
'script' and 'style' sections, so end tags are parsed.
My understanding is that this is bad as well:
http://www.w3.org/TR/2004/REC-xml-20040204/#dt-cdsection
because CDEnd is the only thing that's supposed to be
recognized in a CDATA section for all XML documents?
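
A minimal sketch of the first problem, written against the Python 3
module html.parser (the report itself concerns the Python 2 module
HTMLParser, so the import and the convert_charrefs argument are
assumptions about a modern interpreter). The class name
ScriptDataCollector and the sample document are hypothetical. PASS means
the entity inside 'script' was converted, as an XHTML #PCDATA reading
would expect; FAIL means the HTML 4 style behaviour described above is
present.

```python
from html.parser import HTMLParser  # Python 3 name of the module


class ScriptDataCollector(HTMLParser):
    """Hypothetical helper: collect character data per enclosing element."""

    def __init__(self):
        # convert_charrefs=True asks the parser to convert references
        # in ordinary character data before calling handle_data().
        super().__init__(convert_charrefs=True)
        self.stack = []
        self.chunks = {}

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

    def handle_data(self, data):
        if self.stack:
            self.chunks.setdefault(self.stack[-1], []).append(data)


DOCUMENT = (
    "<html><head>"
    "<script>if (a &lt; b) { run(); }</script>"
    "</head><body><p>a &lt; b</p></body></html>"
)

parser = ScriptDataCollector()
parser.feed(DOCUMENT)
parser.close()

script_text = "".join(parser.chunks.get("script", []))
para_text = "".join(parser.chunks.get("p", []))

# PASS if the entity in the script was converted, FAIL if left as-is.
print("PASS" if "&lt;" not in script_text else "FAIL", repr(script_text))
print("paragraph data:", repr(para_text))
```

The paragraph is included only for contrast: its entity reference goes
through the normal character-data path, while the script content is
handled by the parser's separate CDATA mode.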



----------------------------------------------------------------------

Comment By: Luke Bradley (neptune235)
Date: 2004-10-28 18:23

Message:
Logged In: YES 
user_id=178561

Sure. I'll attach it as a file: tidytest2.py

btw: I'm no guru, so tell me if I'm misinterpreting the W3C.
I'm just trying to use HTMLParser in such a way that it
won't mangle anybody's script sections, and I want to have
all my bases covered.

----------------------------------------------------------------------

Comment By: Martin v. Löwis (loewis)
Date: 2004-10-28 15:41

Message:
Logged In: YES 
user_id=21627

Can you give an example demonstrating this problem, please?
A Python script with a small embedded HTML file, and a
PASS/FAIL condition would be best.

----------------------------------------------------------------------

Comment By: Luke Bradley (neptune235)
Date: 2004-10-28 04:31

Message:
Logged In: YES 
user_id=178561

I also reported bug 1051840. I discovered this when I was
looking for a universal way to handle all the weird things
people do with their script sections on HTML/XHTML pages on
the net. I've ended up modifying HTMLParser.py so that the
HTMLParser class has an extra attribute called last_match,
which is the exact string of HTML that triggered whatever
handler event is being called. So putting:
sys.stdout.write(self.last_match) 
or
sys.stdout.write(self.get_last_match())
for every handler event (except handle_data, which can be
output directly) will output the page exactly as it was
input. This allows me to handle all oddities in people's
code at the level of my handler, without changing HTMLParser
in any other way...
Here's the code, attached. Not that you care, but on the off
chance that you guys might want to think about doing
something like this....:)
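
The attached patch is not reproduced in this thread. As a rough
illustration of the same round-trip idea using only the stock parser,
the sketch below relies on get_starttag_text(), which returns the source
text of the most recently opened start tag, and rebuilds everything else
from the handler arguments. It is not the last_match implementation
described above: EchoParser and SOURCE are hypothetical names, the code
targets the Python 3 module html.parser, and end tags, comments and
declarations are reconstructed, so unusual spelling or whitespace in
those would not survive.

```python
import io
from html.parser import HTMLParser  # Python 3 name of the module


class EchoParser(HTMLParser):
    """Hypothetical sketch: write each parsed event back out as markup."""

    def __init__(self, out):
        # convert_charrefs=False keeps references as separate events,
        # so they can be written back in their original form.
        super().__init__(convert_charrefs=False)
        self.out = out

    def handle_starttag(self, tag, attrs):
        # Exact source text of the start tag, as seen by the parser.
        self.out.write(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        self.out.write(self.get_starttag_text())

    def handle_endtag(self, tag):
        # Reconstructed: original case and whitespace are not kept.
        self.out.write("</%s>" % tag)

    def handle_data(self, data):
        self.out.write(data)

    def handle_entityref(self, name):
        self.out.write("&%s;" % name)

    def handle_charref(self, name):
        self.out.write("&#%s;" % name)

    def handle_comment(self, data):
        self.out.write("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.out.write("<!%s>" % decl)


SOURCE = '<p class="x">a &amp; b</p><script>if (a &lt; b) {}</script>'

out = io.StringIO()
echo = EchoParser(out)
echo.feed(SOURCE)
echo.close()

print(out.getvalue() == SOURCE)
```

A handler that wants to special-case script sections can do so inside
handle_data while everything else is passed through unchanged.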

----------------------------------------------------------------------
