Re: [web2py] Re: in trunk - scraping utils

Thadeus Burgess Mon, 24 May 2010 22:50:40 -0700

> So why our own?

Because it converts the parsed HTML into web2py helpers, so you can
search and modify the tree the same way you would any other helper.


And you don't have to deal with installing anything other than web2py.
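
To give a rough idea of how little is needed for this kind of search
without a third-party package, here is a minimal sketch using only the
Python 3 standard library. It is not the code in trunk: Node, TreeBuilder
and this elements() signature are hypothetical names made up for
illustration, attributes are matched by their plain name (id=...) rather
than web2py's _id=..., and it assumes well-formed, properly nested HTML.

```python
import re
from html.parser import HTMLParser  # Python 3 standard library


class Node:
    """Hypothetical minimal element: a tag, its attributes, its children."""

    def __init__(self, tag, attrs=None):
        self.tag = tag
        self.attrs = dict(attrs or {})
        self.children = []  # child Node objects or plain text strings

    def text(self):
        # Concatenate all text found at or below this node.
        return "".join(c if isinstance(c, str) else c.text()
                       for c in self.children)

    def elements(self, tag=None, find=None, **attrs):
        """Return descendants matching a tag name, attributes and/or text."""
        found = []
        for child in self.children:
            if isinstance(child, str):
                continue
            ok = tag is None or child.tag == tag
            ok = ok and all(child.attrs.get(k) == v for k, v in attrs.items())
            if ok and find is not None:
                # Accept either a compiled regex or a literal string.
                pattern = find if hasattr(find, "search") else re.compile(re.escape(find))
                ok = pattern.search(child.text()) is not None
            if ok:
                found.append(child)
            found.extend(child.elements(tag, find, **attrs))
        return found


class TreeBuilder(HTMLParser):
    """Builds a Node tree; assumes every opened tag is explicitly closed."""

    def __init__(self, html):
        super().__init__()
        self.tree = Node("root")
        self._stack = [self.tree]
        self.feed(html)
        self.close()

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs)
        self._stack[-1].children.append(node)
        self._stack.append(node)

    def handle_endtag(self, tag):
        if len(self._stack) > 1:
            self._stack.pop()

    def handle_data(self, data):
        self._stack[-1].children.append(data)


# Made-up sample page, not the real nobelprize.org markup.
html = ('<html><head><title>Albert Einstein - Biography</title></head>'
        '<body><div id="Einstein">Physics 1921</div></body></html>')
tree = TreeBuilder(html).tree
by_tag = tree.elements('title')                    # search by tag type
by_id = tree.elements(id='Einstein')               # search by attribute value
by_regex = tree.elements('title', find=re.compile('Ein.*in'))  # regex search
```

What a sketch like this does not give you is the conversion into web2py
helpers, which is the reason for having our own parser.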

--
Thadeus





On Tue, May 25, 2010 at 12:14 AM, Kevin Bowling <[email protected]> wrote:
> Hmm, I wonder if this is worth the possible maintenance cost?  It also
> transcends the role of a web framework and now you are getting into
> network programming.
>
> I have a currently deployed screen scraping app and found PyQuery to
> be more than adequate.  There is also lxml directly, or Beautiful
> Soup.  A simple import away and they integrate with web2py or anything
> else just fine.  So why our own?
>
> Regards,
> Kevin
>
> On May 24, 9:35 pm, mdipierro <[email protected]> wrote:
>> New in trunk. Screen scraping capabilities.
>>
>> Example:
>>
>> >>> import re
>> >>> from gluon.html import web2pyHTMLParser
>> >>> from urllib import urlopen
>> >>> html=urlopen('http://nobelprize.org/nobel_prizes/physics/laureates/1921/einstein-bi...()
>> >>> tree=web2pyHTMLParser(html).tree  ### NEW!!
>> >>> elements=tree.elements('div') # search by tag type
>> >>> elements=tree.elements(_id="Einstein") # search by attribute value (id for example)
>> >>> elements=tree.elements(find='Einstein') # text search NEW!!
>> >>> elements=tree.elements(find=re.compile('Einstein')) # search via regex NEW!!
>> >>> print elements[0]
>> <title>Albert Einstein - Biography</title>
>> >>> print elements[0][0]
>> Albert Einstein - Biography
>> >>> elements[0].append(SPAN(' modified'))
>> <title>Albert Einstein - Biography<span>modified</span></title>
>> >>> print tree
>>
>> <html lang="en" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
>> <head>
>>   <title>Albert Einstein - Biography<span>modified</span></title>
>> ...
>
