subject:"HTMLParser fragility"

Re: HTMLParser fragility

2006-04-10 Thread John J. Lee

"Lawrence D'Oliveiro" <[EMAIL PROTECTED]> writes: > I've been using HTMLParser to scrape Web sites. The trouble with this > is, there's a lot of malformed HTML out there. Real browsers have to be > written to cope gracefully with this, but HTMLParser does not. Not only > does it raise an except

Re: HTMLParser fragility

2006-04-07 Thread Richie Hindle

[Richie] > But Tidy fails on huge numbers of real-world HTML pages. [...] > Is there a Python HTML tidier which will do as good a job as a browser? [Walter] > You can also use the HTML parser from libxml2 [Paul] > libxml2 will attempt to parse HTML if asked to [...] See how it fixes > up the mi

Re: HTMLParser fragility

2006-04-06 Thread Lawrence D'Oliveiro

In article <[EMAIL PROTECTED]>, Rene Pijlman <[EMAIL PROTECTED]> wrote: > 2. Use something more forgiving, like BeautifulSoup. > http://www.crummy.com/software/BeautifulSoup/ That sounds like what I'm after!

Re: HTMLParser fragility

2006-04-06 Thread Paul Boddie

Richie Hindle wrote: > > But Tidy fails on huge numbers of real-world HTML pages. Simple things like > misspelled tags make it fail: > > >>> from mx.Tidy import tidy > >>> results = tidy("Hello world!") [Various error messages] > Is there a Python HTML tidier which will do as good a job as a bro

Re: HTMLParser fragility

2006-04-06 Thread Walter Dörwald

Rene Pijlman wrote: > Lawrence D'Oliveiro: >> I've been using HTMLParser to scrape Web sites. The trouble with this >> is, there's a lot of malformed HTML out there. Real browsers have to be >> written to cope gracefully with this, but HTMLParser does not. > > There are two solutions to this: >

Re: HTMLParser fragility

2006-04-05 Thread Richie Hindle

[Daniel] > You could try HTMLTidy (http://www.egenix.com/files/python/mxTidy.html) > as a first step to get well formed HTML. But Tidy fails on huge numbers of real-world HTML pages. Simple things like misspelled tags make it fail: >>> from mx.Tidy import tidy >>> results = tidy("Hello world!"

Re: HTMLParser fragility

2006-04-05 Thread Daniel Dittmar

Lawrence D'Oliveiro wrote: > I've been using HTMLParser to scrape Web sites. The trouble with this > is, there's a lot of malformed HTML out there. Real browsers have to be > written to cope gracefully with this, but HTMLParser does not. Not only > does it raise an exception, but the parser obje

Re: HTMLParser fragility

2006-04-05 Thread Rene Pijlman

Lawrence D'Oliveiro: >I've been using HTMLParser to scrape Web sites. The trouble with this >is, there's a lot of malformed HTML out there. Real browsers have to be >written to cope gracefully with this, but HTMLParser does not. There are two solutions to this: 1. Tidy the source before parsin

HTMLParser fragility

2006-04-05 Thread Lawrence D'Oliveiro

I've been using HTMLParser to scrape Web sites. The trouble with this is, there's a lot of malformed HTML out there. Real browsers have to be written to cope gracefully with this, but HTMLParser does not. Not only does it raise an exception, but the parser object then gets into a confused state
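For readers on a current Python, the scraping pattern discussed in this thread might be sketched as below. This is a minimal sketch under stated assumptions, not what any poster actually ran: it uses the standard-library html.parser module from Python 3.5 or later, which no longer raises on malformed markup the way the 2006-era HTMLParser did. The names LinkCollector, extract_links and sample are hypothetical and chosen only for illustration. A fresh parser is created for every document, so a confused parser state from one page cannot leak into the next.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper: collect href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def extract_links(html_text):
    """Parse one document with a fresh parser and return its links."""
    parser = LinkCollector()
    try:
        parser.feed(html_text)
        parser.close()
    except Exception:
        # Assumption: if the parser fails anyway, keep what was
        # collected so far instead of aborting the whole scrape.
        pass
    return parser.links


# Malformed input: unclosed <p>, badly nested <b>/<i>, missing </a> and </html>.
sample = (
    '<html><body><p>Unclosed paragraph<a href="/one">one</a>'
    '<b><i>bad nesting</b></i><a href="/two">two</body>'
)

print(extract_links(sample))
```

Building a new parser object per page trades a little allocation for isolation between documents. For markup that needs actual repair rather than tolerant tokenizing, the forgiving parsers suggested elsewhere in the thread (BeautifulSoup, or the HTML parser in libxml2) remain the alternatives to consider.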


9 matches
