Rob Wolfe wrote:
>
> Sérgio Monteiro Basto wrote:
>> Stefan Behnel wrote:
>>
>> > Sérgio Monteiro Basto wrote:
>> >> but is one single error that blocks this.
>> >> Finally I found it , it is :
>> >> <td colspan="2"align="center"
>> >> if I put :
>> >> <td colspan="2" align="center"
>> >>
>> >> p = re.compile('"align')
>> >> content = p.sub('" align', content)
>> >>
>> >> I can parse the html
>> >> I don't know if it a bug of HTMLParser
>> >
>> > Sure, and next time your key doesn't open your neighbours house, please
>> > report to the building company to have them fix the door.
>> >
>>
>> The question, here, is if
>> <td colspan="2"align="center"
>> is valid HTML or not ?
>> I think is valid , if so it's a bug on HTMLParser
>
> According to the HTML 4.01 specification this is *not valid* HTML.
>
> """
> Elements may have associated properties, called attributes, which may
> have values (by default, or set by authors or scripts). Attribute/value
> pairs appear before the final ">" of an element's start tag. Any number
> of (legal) attribute value pairs, separated by spaces, may appear in an
> element's start tag.
> """
>
>> if not, we still have a very bad message error (EOF in middle of
>> construct !?)
>
> HTMLParser can deal with some errors e.g. lack of ending tags,
> but it can't handle many other problems.
>
>> I have to use HTMLParser because I want use only python 2.4 standard , I
>> have to install the scripts in many machines.
>> And I have to parse many different sites, I just want extract the links,
>> so with a clean up before parse solve very quickly my problem.
>
> In Python 2.4 you have to use some third party module. There is no
> other option for _invalid_ HTML. IMHO BeautifulSoup is the best among
> them.
>
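
For the archive, here is a minimal sketch of the clean-up-before-parse
approach quoted above, using only the standard library. The names
clean_up, LinkExtractor, extract_links and MISSING_SPACE are made up for
illustration, and the regular expression is my own heuristic
generalisation of the '"align' substitution (it may also touch text that
merely looks like an attribute), so treat it as a starting point rather
than a general fix for invalid HTML. The import is written so that it
works with either module name (HTMLParser in Python 2, html.parser in
Python 3).

```python
import re

try:
    from html.parser import HTMLParser   # Python 3 module name
except ImportError:
    from HTMLParser import HTMLParser    # Python 2 module name

# Heuristic (an assumption, not part of HTMLParser): a double-quoted
# attribute value that is immediately followed by a letter is taken to
# be missing the space that must separate two attributes.
MISSING_SPACE = re.compile(r'(=\s*"[^"<>]*")(?=[A-Za-z])')


def clean_up(content):
    """Insert the missing space between adjacent attributes."""
    return MISSING_SPACE.sub(r'\1 ', content)


class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> start tag."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs with lower-cased names
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.append(value)


def extract_links(content):
    """Clean up the markup first, then parse it and return the links."""
    parser = LinkExtractor()
    parser.feed(clean_up(content))
    parser.close()
    return parser.links
```

Calling extract_links(page_source) on a downloaded page returns the list
of href values; the clean-up runs on the whole document before the
parser ever sees it, which is the same order of operations as the
two-line re.compile / p.sub fix.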
Many thanks Rob, you have been as clear as water.

> --
> HTH,
> Rob

Best regards,
--
Sérgio M. B.