Sérgio Monteiro Basto wrote:
> Stefan Behnel wrote:
>
> > Sérgio Monteiro Basto wrote:
> >> but is one single error that blocks this.
> >> Finally I found it , it is :
> >> <td colspan="2"align="center"
> >> if I put :
> >> <td colspan="2" align="center"
> >>
> >> p = re.compile('"align')
> >> content = p.sub('" align', content)
> >>
> >> I can parse the html
> >> I don't know if it a bug of HTMLParser
> >
> > Sure, and next time your key doesn't open your neighbours house, please
> > report to the building company to have them fix the door.
> >
>
> The question, here, is if
> <td colspan="2"align="center"
> is valid HTML or not ?
> I think is valid , if so it's a bug on HTMLParser

According to the HTML 4.01 specification this is *not valid* HTML.

"""
Elements may have associated properties, called attributes, which may
have values (by default, or set by authors or scripts). Attribute/value
pairs appear before the final ">" of an element's start tag. Any number
of (legal) attribute value pairs, separated by spaces, may appear in an
element's start tag.
"""

> if not, we still have a very bad message error (EOF in middle of
> construct !?)

HTMLParser can cope with some errors, e.g. missing end tags, but there
are many other problems it can't handle.

> I have to use HTMLParser because I want use only python 2.4 standard , I
> have to install the scripts in many machines.
> And I have to parse many different sites, I just want extract the links, so
> with a clean up before parse solve very quickly my problem.

In Python 2.4 you have to use some third-party module. There is no other
option for _invalid_ HTML. IMHO BeautifulSoup is the best among them.

-- 
HTH,
Rob
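
For reference, the "clean up before parse" workaround discussed above could be
generalised along these lines. This is only a minimal sketch, written for
current Python 3, where the module is html.parser (in Python 2 it is named
HTMLParser). The names add_missing_attribute_spaces, LinkCollector and
extract_links are made up for illustration. The regex is an assumption about
what "missing space between attributes" looks like: it may also touch similar
text outside of tags, and it will not repair other kinds of invalid markup.

```python
import re
from html.parser import HTMLParser  # Python 2: from HTMLParser import HTMLParser

# A quoted attribute value immediately followed by a letter, e.g. "2"align.
# Hypothetical pattern for illustration; it does not check that the match
# really sits inside a start tag.
_MISSING_SPACE = re.compile(r'''(=\s*(?:"[^"<>]*"|'[^'<>]*'))(?=[A-Za-z])''')


def add_missing_attribute_spaces(html):
    """Insert the space the HTML 4.01 spec requires between attributes."""
    return _MISSING_SPACE.sub(r'\1 ', html)


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> start tag that is fed in."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased.
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.append(value)


def extract_links(html):
    """Clean up the markup first, then parse it and return the links found."""
    parser = LinkCollector()
    parser.feed(add_missing_attribute_spaces(html))
    parser.close()
    return parser.links
```

Compared with replacing only '"align', the pattern is not tied to one
attribute name, so the same clean-up step would also cover cases such as
href="..."class="...". For markup that is broken in other ways, a tolerant
third-party parser such as BeautifulSoup remains the safer choice.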