Re: trying to parse non valid html documents with HTMLParser

2005-08-03 Thread Benjamin Niemann
Steve M wrote:
>> You were right, the HTMLParser of htmllib is more permissive. He just ignores the bad tags!
> The HTMLParser on my distribution is a she. But then again, I am using ActivePython on Windows...
Although building parsers is for some strange reason one of my favourite program

Re: trying to parse non valid html documents with HTMLParser

2005-08-03 Thread Steve M
> You were right, the HTMLParser of htmllib is more permissive. He just ignores the bad tags!
The HTMLParser on my distribution is a she. But then again, I am using ActivePython on Windows...
-- http://mail.python.org/mailman/listinfo/python-list

Re: trying to parse non valid html documents with HTMLParser

2005-08-03 Thread florent
> Are you saying that Beautiful Soup can't parse the HTML? If so, I'm sure the author would like an example so he can "fix" it.
I finally used the htmllib module, which is more permissive than the HTMLParser module when parsing bad HTML documents. Anyway, where can I find the author's contact in

Re: trying to parse non valid html documents with HTMLParser

2005-08-03 Thread florent
> AFAIK not with HTMLParser or htmllib. You might try (if you haven't done yet) htmllib and see, which parser is more forgiving.
You were right, the HTMLParser of htmllib is more permissive. He just ignores the bad tags!

Re: trying to parse non valid html documents with HTMLParser

2005-08-03 Thread Benji York
florent wrote:
> True, I just want to extract some data from HTML documents. But the problem is the same. The parser loses the position he was in the string when he encounters a bad tag.
Are you saying that Beautiful Soup can't parse the HTML? If so, I'm sure the author would like an exam

Re: trying to parse non valid html documents with HTMLParser

2005-08-03 Thread florent
> From http://www.crummy.com/software/BeautifulSoup/:
>
>   You didn't write that awful page. You're just trying to get some data out of it. Right now, you don't really care what HTML is supposed to look like.
>
>   Neither does this parser.
True, I just want to extract some dat

Re: trying to parse non valid html documents with HTMLParser

2005-08-03 Thread florent
> AFAIK not with HTMLParser or htmllib. You might try (if you haven't done yet) htmllib and see, which parser is more forgiving.
Thanks, I'll try htmllib. In any case, I found a solution. Feeding data to the HTMLParser in chunks extracted from the string using string.split("<") will allow me
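
For anyone landing on this thread later, the chunked-feeding idea can be sketched roughly as follows. This is only an illustration under assumptions: it uses the Python 3 `html.parser` module rather than the Python 2.3 `HTMLParser` module discussed in the thread, and `TagAndTextCollector` and `feed_in_chunks` are hypothetical names, not code posted by florent. Current versions of `html.parser` are designed to tolerate malformed markup, so the `except` branch is there mainly to show where recovery would happen with a stricter parser.

```python
from html.parser import HTMLParser


class TagAndTextCollector(HTMLParser):
    """Hypothetical helper: records start-tag names and non-blank text."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_data(self, data):
        stripped = data.strip()
        if stripped:
            self.text.append(stripped)


def feed_in_chunks(html):
    """Feed the document one '<'-delimited chunk at a time.

    If a chunk makes the parser raise, it is recorded in `skipped` and
    the parser's internal buffer is reset, so later chunks can still be
    processed instead of losing the whole document.
    """
    parser = TagAndTextCollector()
    skipped = []
    pieces = html.split("<")
    # split() drops the "<" separators, so put them back on every piece
    # except the first (which is the text before the first tag).
    chunks = [pieces[0]] + ["<" + piece for piece in pieces[1:]]
    for chunk in chunks:
        if not chunk:
            continue
        try:
            parser.feed(chunk)
        except Exception:
            # Assumption: dropping the offending chunk is acceptable.
            skipped.append(chunk)
            parser.reset()  # clears buffered input, keeps collected results
    parser.close()
    return parser, skipped


parser, skipped = feed_in_chunks("<p>Hello <b>world</b></p>")
print(parser.tags, parser.text, skipped)
```

Splitting on "<" is a blunt instrument: a "<" inside a script block or an attribute value also starts a new chunk, so this is more a way to localize failures than a real tokenizer.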

Re: trying to parse non valid html documents with HTMLParser

2005-08-02 Thread Benji York
florent wrote:
> I'm trying to parse HTML documents from the web, using the HTMLParser class of the HTMLParser module (Python 2.3), but some web documents are not fully valid.
From http://www.crummy.com/software/BeautifulSoup/: You didn't write that awful page. You're just trying to

Re: trying to parse non valid html documents with HTMLParser

2005-08-02 Thread Benjamin Niemann
florent wrote:
> I'm trying to parse HTML documents from the web, using the HTMLParser class of the HTMLParser module (Python 2.3), but some web documents are not fully valid.
Some?? Most of them :(
> When the parser finds an invalid tag, he raises an exception. Then it seems impossible