Steve M wrote:
>>You were right, the HTMLParser of htmllib is more permissive. He just
> ignores the bad tags!
>
> The HTMLParser on my distribution is a she. But then again, I am using
> ActivePython on Windows...
Although building parsers is for some strange reason one of my favourite
program
>You were right, the HTMLParser of htmllib is more permissive. He just
ignores the bad tags!
The HTMLParser on my distribution is a she. But then again, I am using
ActivePython on Windows...
--
http://mail.python.org/mailman/listinfo/python-list
> Are you saying that Beautiful Soup can't parse the HTML? If so, I'm
> sure the author would like an example so he can "fix" it.
In the end I used the htmllib module, which is more permissive than the
HTMLParser module when parsing bad HTML documents.
Anyway, where can I find the author's contact in
> AFAIK not with HTMLParser or htmllib. You might try (if you haven't done
> yet) htmllib and see, which parser is more forgiving.
You were right, the HTMLParser of htmllib is more permissive. He just
ignores the bad tags!
florent wrote:
> True, I just want to extract some data from HTML documents. But the
> problem is the same. The parser loses the position he was at in the
> string when he encounters a bad tag.
Are you saying that Beautiful Soup can't parse the HTML? If so, I'm
sure the author would like an exam
> From http://www.crummy.com/software/BeautifulSoup/:
>
> You didn't write that awful page. You're just trying to get
> some data out of it. Right now, you don't really care what
> HTML is supposed to look like.
>
> Neither does this parser.
True, I just want to extract some dat
> AFAIK not with HTMLParser or htmllib. You might try (if you haven't done
> yet) htmllib and see, which parser is more forgiving.
Thanks, I'll try htmllib.
Alternatively, I found a workaround. Feeding data to the HTMLParser in
chunks extracted from the string using string.split("<"), will allow me
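
The chunk-by-chunk idea described above can be sketched roughly as follows. This is only an illustration, not the poster's actual code: it uses the Python 3 module name html.parser (the thread is about the Python 2.3 HTMLParser module), and the names TextCollector and feed_in_chunks are made up for the example. Note that current versions of the parser are already lenient and rarely raise, so the except branch mostly matters for older, stricter versions.

```python
from html.parser import HTMLParser  # Python 3 name; assumption, the thread used Python 2.3


class TextCollector(HTMLParser):
    """Hypothetical subclass that collects non-empty text nodes."""

    def __init__(self):
        super().__init__()
        self.texts = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.texts.append(text)


def feed_in_chunks(parser, html):
    """Feed html one '<'-delimited chunk at a time.

    Returns the list of chunks that made the parser raise and were skipped.
    """
    skipped = []
    pieces = html.split("<")
    # pieces[0] is the text before the first "<"; every later piece
    # lost its leading "<" in the split, so put it back.
    chunks = [pieces[0]] + ["<" + piece for piece in pieces[1:]]
    for chunk in chunks:
        if not chunk:
            continue
        try:
            parser.feed(chunk)
        except Exception:
            # Drop the parser's buffered input so the next chunk starts clean.
            parser.reset()
            skipped.append(chunk)
    try:
        parser.close()  # flush anything still buffered
    except Exception:
        parser.reset()
    return skipped


if __name__ == "__main__":
    collector = TextCollector()
    bad_chunks = feed_in_chunks(collector, "<p>Hello <b>world</b></p>")
    print(collector.texts, bad_chunks)
```

One caveat with this approach: a "<" that appears inside text, an attribute value or a script block also becomes a chunk boundary, so the split is only a rough approximation of tag boundaries.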
florent wrote:
> I'm trying to parse html documents from the web, using the HTMLParser
> class of the HTMLParser module (python 2.3), but some web documents are
> not fully valid.
From http://www.crummy.com/software/BeautifulSoup/:
You didn't write that awful page. You're just trying to
florent wrote:
> I'm trying to parse html documents from the web, using the HTMLParser
> class of the HTMLParser module (python 2.3), but some web documents are
> not fully valid.
Some?? Most of them :(
> When the parser finds an invalid tag, he raises an
> exception. Then it seems impossible