[web2py] Re: in trunk - scraping utils

2010-05-25 Thread mdipierro
There are docstrings. I will write something more asap.

On May 25, 10:28 pm, weheh wrote:
> This is very nice. I think Thadeus' point is well made. I agree it's
> useful. It is fringe, but I absolutely need this and will be using it
> on my current project. Where's the doc?

[web2py] Re: in trunk - scraping utils

2010-05-25 Thread weheh
This is very nice. I think Thadeus' point is well made. I agree it's useful. It is fringe, but I absolutely need this and will be using it on my current project. Where's the doc?

Re: [web2py] Re: in trunk - scraping utils

2010-05-25 Thread Álvaro Justen
On Tue, May 25, 2010 at 12:11, mdipierro wrote:
> Here is a one liner to remove all tags from some html text:
>
> >>> html = 'helloworld'
> >>> print TAG(html).flatten()
> helloworld

Very good!

--
Álvaro Justen - Turicas
http://blog.justen.eng.br/
21 9898-0141

[web2py] Re: in trunk - scraping utils

2010-05-25 Thread mdipierro
It makes assumptions. It fails if Python's HTMLParser fails. For example:

>>> from gluon.html import TAG
>>> print TAG('aaabbbdddeee')
aaabbbdddeee
>>> print TAG('aaabbbdddeee')
aaabbbdddeee
>>> print TAG('aaadddeee')
Traceback (most recent call last):
HTMLParser.HTMLParseError:
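(The angle-bracket markup in the archived session above was evidently stripped by the list renderer, which is why the three inputs look identical.) The point being made is that TAG delegates to Python's standard-library HTMLParser, which in Python 2 could raise HTMLParser.HTMLParseError on malformed markup. A minimal sketch of the underlying event-based parser, using a hypothetical mismatched-nesting input; note that Python 3's html.parser is lenient and reports events as seen rather than raising:

```python
from html.parser import HTMLParser

# Minimal event collector. In Python 2 (which web2py targeted in 2010),
# HTMLParser could raise HTMLParser.HTMLParseError on bad markup; the
# Python 3 parser instead emits events and leaves balancing to the caller.
class Collector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(('start', tag))

    def handle_endtag(self, tag):
        self.events.append(('end', tag))

    def handle_data(self, data):
        self.events.append(('data', data))

p = Collector()
# Hypothetical malformed input: <b> is never closed before </div>.
p.feed('<div>aaa<b>bbb</div>eee')
p.close()  # flush any buffered trailing text
print(p.events)
```

This is why a tree-building wrapper such as TAG has to make assumptions: the parser hands it an unbalanced event stream and the wrapper must decide how (or whether) to repair it.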

[web2py] Re: in trunk - scraping utils

2010-05-25 Thread Richard
How robust have you found HTMLParser with badly formed HTML?

On May 26, 1:11 am, mdipierro wrote:
> Here is a one liner to remove all tags from some html text:
>
> >>> html = 'helloworld'
> >>> print TAG(html).flatten()
> helloworld

[web2py] Re: in trunk - scraping utils

2010-05-25 Thread Richard
Was going to say "web2pyHTMLParser" is too cumbersome - glad you changed it to "TAG". I do some scraping with lxml, so am also wary about including this, but the examples look very convenient.

On May 26, 1:11 am, mdipierro wrote:
> Here is a one liner to remove all tags from some html text:
>
> >>> html = 'helloworld'
> >>> print TAG(html).flatten()
> helloworld

[web2py] Re: in trunk - scraping utils

2010-05-25 Thread mdipierro
Here is a one liner to remove all tags from some html text:

>>> html = 'helloworld'
>>> print TAG(html).flatten()
helloworld

On May 25, 10:02 am, mdipierro wrote:
> yet a better syntax and more API:
>
> 1) no more web2pyHTMLParser, use TAG(...) instead, and flatten (remove
> tags)
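(The archive has stripped the markup from the one-liner above; the input must have contained tags whose text content is "helloworld".) The idea of flatten() - parse the markup and keep only the text nodes - can be sketched with the standard-library parser alone. This is a rough analogue under that assumption, not web2py's actual implementation:

```python
from html.parser import HTMLParser

class TextOnly(HTMLParser):
    """Collect only text nodes, discarding every tag."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def flatten(html):
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return ''.join(parser.parts)

# Hypothetical input reconstructing the stripped example:
print(flatten('<b>hello</b><i>world</i>'))  # helloworld
```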

[web2py] Re: in trunk - scraping utils

2010-05-25 Thread mdipierro
Yet a better syntax and more API:

1) no more web2pyHTMLParser; use TAG(...) instead, and flatten (remove tags):

>>> a = TAG('Helloworld')
>>> print a
Helloworld
>>> print a.element('span')
world
>>> print a.flatten()
Helloworld

2) search by multiple conditions, including regex - for example, find all
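The element() and flatten() helpers above are web2py's own server-side DOM API (and the archive has again dropped the tags from the example input). For well-formed markup, the same two operations can be sketched with the standard library's xml.etree; the input here is an assumption reconstructing the stripped example:

```python
import xml.etree.ElementTree as ET

# Assumed input: text 'Helloworld' split across a <div> and a nested <span>.
root = ET.fromstring('<div>Hello<span>world</span></div>')

# rough analogue of a.element('span'): first matching descendant
span = root.find('.//span')
print(span.text)                  # world

# rough analogue of a.flatten(): concatenated text with tags removed
print(''.join(root.itertext()))   # Helloworld
```

Unlike web2py's TAG, ElementTree requires well-formed XML and will raise on sloppy HTML, which is exactly the robustness concern raised elsewhere in this thread.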

[web2py] Re: in trunk - scraping utils

2010-05-25 Thread mdipierro
The entire code is 40 lines and uses the Python built-in html parser. It will not be a problem to maintain it. Actually, we could even use this to simplify both XML(..., sanitize) and gluon.contrib.markdown.WIKI.

On May 25, 12:50 am, Thadeus Burgess wrote:
> > So why our own?
>
> Because it converts it into web2py helpers.

Re: [web2py] Re: in trunk - scraping utils

2010-05-24 Thread Thadeus Burgess
> So why our own?

Because it converts it into web2py helpers. And you don't have to deal with installing anything other than web2py.

--
Thadeus

On Tue, May 25, 2010 at 12:14 AM, Kevin Bowling wrote:
> Hmm, I wonder if this is worth the possible maintenance cost? It also
> transcends the role of a web framework

[web2py] Re: in trunk - scraping utils

2010-05-24 Thread Kevin Bowling
Hmm, I wonder if this is worth the possible maintenance cost? It also transcends the role of a web framework, and now you are getting into network programming. I have a currently deployed screen-scraping app and found PyQuery to be more than adequate. There is also lxml directly, or Beautiful Soup.