New submission from Arman <arman.hunan...@gmail.com>:

When HTMLParser reaches a CDATA element, it enters CDATA mode by calling set_cdata_mode (html/parser.py, line 270). This method assigns the regular expression r'<(/|\Z)' to the self.interesting member. This is not correct, because the pattern matches any "</" sequence, not only the closing tag of the element that started CDATA mode. Consider the following case:

<script language="javascript">
<!--
if (window.adgroupid == undefined) {
    window.adgroupid = Math.round(Math.random() * 1000);
}
document.write('<scr'+'ipt language="javascript1.1" src="http://adserver.adtech.de/addyn|3.0|876|2378574|0|225|ADTECH;loc=100;target=_blank;key=;grp='+window.adgroupid+';misc='+new Date().getTime()+'"></scri'+'pt>');
//-->
</script>

The fragment </scri'+'pt> matches r'<(/|\Z)', so the parser gets confused and produces wrong results. Real pages containing such HTML can be found at:

www.ahram.org.eg
www.chefkoch.de
www.chemieonline.de
www.eip.gov.eg
www.rezepte.li
www.scienceworld.com

One solution is to keep

interesting_cdata_script = re.compile(r'<(/|\Z)script')
interesting_cdata_style = re.compile(r'<(/|\Z)style')

instead of

interesting_cdata = re.compile(r'<(/|\Z)')

so that set_cdata_mode can assign the correct regexp to the self.interesting member, depending on which tag was opened (script or style).

Please contact me via email if you need more details.
arman.hunan...@gmail.com

----------
components: Library (Lib)
messages: 113688
nosy: Hunanyan
priority: normal
severity: normal
status: open
title: html parser bug related with CDATA sections
type: behavior
versions: Python 3.1

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue9577>
_______________________________________
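
For illustration, the mismatch described in this report can be reproduced with the re module alone, without running HTMLParser. The following is only a minimal sketch: the two patterns are the ones quoted in the report, while the variable names body, generic and specific, and the shortened src value "x.js", are assumptions made here to keep the example small.

```python
import re

# Pattern the report quotes as the CDATA-mode value of self.interesting.
interesting_cdata = re.compile(r'<(/|\Z)')

# Tag-specific pattern along the lines proposed in the report.
interesting_cdata_script = re.compile(r'<(/|\Z)script')

# Shortened stand-in for the script body from the report (hypothetical data).
body = "document.write('<scr'+'ipt src=\"x.js\"></scri'+'pt>'); //--> </script>"

# Find the first position each pattern considers "interesting".
generic = interesting_cdata.search(body)
specific = interesting_cdata_script.search(body)

# Show the text from each match position to the end of the body.
print(body[generic.start():])
print(body[specific.start():])
```

In this sketch the generic pattern stops at the "</scri'" fragment inside the JavaScript string, while the tag-specific pattern skips it and stops only at the real "</script>" closing tag.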