[web2py] Re: in trunk - scraping utils

mdipierro Tue, 25 May 2010 19:07:52 -0700

It makes assumptions, such as closing unclosed tags for you, and it fails
whenever Python's HTMLParser fails. For example:

>>> from gluon.html import TAG
>>> print TAG('<a>aaa<b>bbb<c>cccc</b>ddd</a>eee')
<a>aaa<b>bbb<c>cccc</c></b>ddd</a>eee
>>> print TAG('<a>aaa<b>bbb<c>cccc</b>dddeee')
<a>aaa<b>bbb<c>cccc</c></b>dddeee</a>
>>> print TAG('<a>aaa<b x=">bbb<c>cccc</b>dddeee')
<a>aaa</a>
>>> print TAG('<a>aaa<b bbb<c y=">cccc</b>dddeee')
Traceback (most recent call last):
HTMLParser.HTMLParseError: malformed start tag, at line 1, column 13

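For comparison, here is a minimal sketch of tag stripping that uses only the
standard library parser. It assumes Python 3, where the module is html.parser
and HTMLParseError no longer exists (it was removed in 3.5), so malformed input
is parsed leniently instead of raising. TextOnly and flatten are hypothetical
names for illustration; they are not part of gluon.html.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collect text nodes and drop every tag (hypothetical helper, not gluon.html)."""

    def __init__(self):
        HTMLParser.__init__(self, convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called for each run of text found between tags.
        self.chunks.append(data)


def flatten(html):
    """Return the text content of html with all tags removed."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return ''.join(parser.chunks)


print(flatten('<div>hello<span>world</span></div>'))  # helloworld
```

This only mimics the text-extraction side of TAG(html).flatten(); it builds no
tree and does not close or repair tags.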

On May 25, 8:49 pm, Richard <richar...@gmail.com> wrote:
> how robust have you found HTMLParser with badly formed HTML?
>
> On May 26, 1:11 am, mdipierro <mdipie...@cs.depaul.edu> wrote:
>
> > Here is a one liner to remove all tags from some html text:
>
> > >>> html = '<div>hello<span>world</span></div>'
> > >>> print TAG(html).flatten()
>
> > helloworld
>
> > On May 25, 10:02 am, mdipierro <mdipie...@cs.depaul.edu> wrote:
>
> > > yet a better syntax and more API:
>
> > > 1) no more web2pyHTMLParser, use TAG(...) instead, plus flatten() to
> > > remove tags
>
> > > >>> a=TAG('<div>Hello<span>world</span></div>')
> > > >>> print a
>
> > > <div>Hello<span>world</span></div>
> > > >>> print a.element('span')
> > > <span>world</span>
> > > >>> print a.flatten()
>
> > > Helloworld
>
> > > 2) search by multiple conditions, including regex
> > > for example, find all external links in a page
>
> > > >>> import re, urllib
> > > >>> html = urllib.urlopen('http://web2py.com').read()
> > > >>> elements = TAG(html).elements('a',_href=re.compile('^http'))
> > > >>> for e in elements: print e['_href']
>
> > > http://web2py.com/book
> > > http://www.python.org
> > > http://mycti.cti.depaul.ed...
> > > ....
>
> > > I think we just blew BeautifulSoup out of the water.
>
> > > Massimo
>
> > > On May 25, 7:59 am, mdipierro <mdipie...@cs.depaul.edu> wrote:
>
> > > > The entire code is 40 lines and uses the python built-in html parser.
> > > > It will not be a problem to maintain it. Actually we could even use
> > > > this to simplify both XML(...,sanitize) and gluon.contrib.markdown.WIKI
>
> > > > On May 25, 12:50 am, Thadeus Burgess <thade...@thadeusb.com> wrote:
>
> > > > > > So why our own?
>
> > > > > Because it converts it into web2py helpers.
>
> > > > > And you don't have to deal with installing anything other than web2py.
>
> > > > > --
> > > > > Thadeus
>
> > > > > On Tue, May 25, 2010 at 12:14 AM, Kevin Bowling 
> > > > > <kevin.bowl...@gmail.com> wrote:
> > > > > > Hmm, I wonder if this is worth the possible maintenance cost?  It also
> > > > > > transcends the role of a web framework and now you are getting into
> > > > > > network programming.
>
> > > > > > I have a currently deployed screen scraping app and found PyQuery to
> > > > > > be more than adequate.  There is also lxml directly, or Beautiful
> > > > > > Soup.  A simple import away and they integrate with web2py or anything
> > > > > > else just fine.  So why our own?
>
> > > > > > Regards,
> > > > > > Kevin
>
> > > > > > On May 24, 9:35 pm, mdipierro <mdipie...@cs.depaul.edu> wrote:
> > > > > >> New in trunk. Screen scraping capabilities.
>
> > > > > > >> Example:
> > > > > > >> >>> import re
> > > > > >> >>> from gluon.html import web2pyHTMLParser
> > > > > >> >>> from urllib import urlopen
> > > > > >> >>> html=urlopen('http://nobelprize.org/nobel_prizes/physics/laureates/1921/einstein-bi...()
> > > > > >> >>> tree=web2pyHTMLParser(html).tree  ### NEW!!
> > > > > >> >>> elements=tree.elements('div') # search by tag type
> > > > > > >> >>> elements=tree.elements(_id="Einstein") # search by attribute value (id for example)
> > > > > >> >>> elements=tree.elements(find='Einstein') # text search NEW!!
> > > > > > >> >>> elements=tree.elements(find=re.compile('Einstein')) # search via regex NEW!!
> > > > > >> >>> print elements[0]
>
> > > > > > >> <title>Albert Einstein - Biography</title>
> > > > > > >> >>> print elements[0][0]
>
> > > > > > >> Albert Einstein - Biography
> > > > > > >> >>> elements[0].append(SPAN(' modified'))
>
> > > > > > >> <title>Albert Einstein - Biography<span>modified</span></title>
> > > > > > >> >>> print tree
>
> > > > > > >> <html lang="en" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
> > > > > >> <head>
> > > > > > >>   <title>Albert Einstein - Biography<span>modified</span></title>
> > > > > >> ...

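The external-link search quoted in the thread
(elements('a', _href=re.compile('^http'))) can be approximated with the
standard library alone. This is a rough Python 3 sketch, not how gluon.html
does it: LinkFinder and external_links are hypothetical names, and the sample
markup is made up rather than fetched from web2py.com.

```python
import re
from html.parser import HTMLParser


class LinkFinder(HTMLParser):
    """Collect href values of <a> tags that match a regex (hypothetical helper)."""

    def __init__(self, pattern):
        HTMLParser.__init__(self)
        self.pattern = pattern
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes.
        if tag != 'a':
            return
        for name, value in attrs:
            if name == 'href' and value and self.pattern.search(value):
                self.links.append(value)


def external_links(html):
    """Return hrefs that start with 'http', in document order."""
    finder = LinkFinder(re.compile('^http'))
    finder.feed(html)
    finder.close()
    return finder.links


# Made-up sample markup, used only to exercise the sketch.
sample = '<p><a href="/book">local</a> <a href="http://www.python.org">python</a></p>'
for link in external_links(sample):
    print(link)  # http://www.python.org
```

Unlike TAG(html).elements(...), this returns plain strings rather than helper
objects, so the matches cannot be modified and serialized back.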