Here is a one-liner to remove all tags from some HTML text:

>>> html = '<div>hello<span>world</span></div>'
>>> print TAG(html).flatten()
helloworld
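For comparison, the same tag-stripping idea can be sketched with nothing but the standard-library HTML parser (which is what the thread says web2py builds on). This is a hypothetical illustration, not web2py's actual code: the `TagStripper` class name is made up, and it uses the modern `html.parser` module name rather than the Python 2 `HTMLParser` of this thread's era.

```python
# Stdlib-only sketch of what TAG(...).flatten() does: parse the HTML
# and keep only the character data, discarding the tags.
# (Hypothetical example; web2py's real implementation lives in gluon.html.)
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for text nodes only; start/end tags are simply ignored.
        self.parts.append(data)

    def flatten(self):
        return ''.join(self.parts)


stripper = TagStripper()
stripper.feed('<div>hello<span>world</span></div>')
print(stripper.flatten())  # -> helloworld
```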
On May 25, 10:02 am, mdipierro <mdipie...@cs.depaul.edu> wrote:
> Yet a better syntax and more API:
>
> 1) No more web2pyHTMLParser; use TAG(...) instead, and flatten() (remove tags):
>
> >>> a = TAG('<div>Hello<span>world</span></div>')
> >>> print a
> <div>Hello<span>world</span></div>
> >>> print a.element('span')
> <span>world</span>
> >>> print a.flatten()
> Helloworld
>
> 2) Search by multiple conditions, including regex.
> For example, find all external links in a page:
>
> >>> import re, urllib
> >>> html = urllib.urlopen('http://web2py.com').read()
> >>> elements = TAG(html).elements('a', _href=re.compile('^http'))
> >>> for e in elements: print e['_href']
> http://web2py.com/book
> http://www.python.org
> http://mycti.cti.depaul.edu/people/facultyInfo_mycti.asp?id=343
> ....
>
> I think we just blew BeautifulSoup out of the water.
>
> Massimo
>
> On May 25, 7:59 am, mdipierro <mdipie...@cs.depaul.edu> wrote:
> > The entire code is 40 lines and uses the Python built-in HTML parser.
> > It will not be a problem to maintain it. Actually, we could even use
> > this to simplify both XML(..., sanitize) and gluon.contrib.markdown.WIKI.
> >
> > On May 25, 12:50 am, Thadeus Burgess <thade...@thadeusb.com> wrote:
> > > > So why our own?
> > >
> > > Because it converts it into web2py helpers.
> > > And you don't have to deal with installing anything other than web2py.
> > >
> > > --
> > > Thadeus
> > >
> > > On Tue, May 25, 2010 at 12:14 AM, Kevin Bowling <kevin.bowl...@gmail.com> wrote:
> > > > Hmm, I wonder if this is worth the possible maintenance cost? It also
> > > > transcends the role of a web framework, and now you are getting into
> > > > network programming.
> > > >
> > > > I have a currently deployed screen-scraping app and found PyQuery to
> > > > be more than adequate. There is also lxml directly, or Beautiful
> > > > Soup. A simple import away, and they integrate with web2py or anything
> > > > else just fine. So why our own?
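The regex link search above can also be sketched with only the standard library, to show what `elements('a', _href=re.compile('^http'))` is doing under the hood: walk the start tags, pull out each `href`, and keep the ones the pattern matches. The `LinkFinder` name and the sample HTML snippet are made up for illustration; web2py's real matching also supports nested attribute conditions.

```python
# Hypothetical stdlib sketch of TAG(html).elements('a', _href=re.compile('^http')):
# collect href attributes of <a> tags that match a compiled regex.
import re
from html.parser import HTMLParser


class LinkFinder(HTMLParser):
    def __init__(self, pattern):
        super().__init__()
        self.pattern = pattern
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href')
            # Keep only hrefs matching the pattern (e.g. absolute links).
            if href and self.pattern.search(href):
                self.links.append(href)


finder = LinkFinder(re.compile(r'^http'))
finder.feed('<a href="/book">relative</a> <a href="http://www.python.org">absolute</a>')
print(finder.links)  # -> ['http://www.python.org']
```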
> > > > Regards,
> > > > Kevin
> > > >
> > > > On May 24, 9:35 pm, mdipierro <mdipie...@cs.depaul.edu> wrote:
> > > >> New in trunk: screen-scraping capabilities.
> > > >>
> > > >> Example:
> > > >> >>> import re
> > > >> >>> from gluon.html import web2pyHTMLParser
> > > >> >>> from urllib import urlopen
> > > >> >>> html = urlopen('http://nobelprize.org/nobel_prizes/physics/laureates/1921/einstein-bi...()
> > > >> >>> tree = web2pyHTMLParser(html).tree ### NEW!!
> > > >> >>> elements = tree.elements('div') # search by tag type
> > > >> >>> elements = tree.elements(_id="Einstein") # search by attribute value (id for example)
> > > >> >>> elements = tree.elements(find='Einstein') # text search NEW!!
> > > >> >>> elements = tree.elements(find=re.compile('Einstein')) # search via regex NEW!!
> > > >> >>> print elements[0]
> > > >> <title>Albert Einstein - Biography</title>
> > > >> >>> print elements[0][0]
> > > >> Albert Einstein - Biography
> > > >> >>> elements[0].append(SPAN(' modified'))
> > > >> <title>Albert Einstein - Biography<span>modified</span></title>
> > > >> >>> print tree
> > > >> <html lang="en" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
> > > >> <head>
> > > >> <title>Albert Einstein - Biography<span>modified</span></title>
> > > >> ...
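The `find=re.compile(...)` text search in the original announcement can be mimicked with the standard library too. The sketch below is only an approximation of the idea, not web2py's API: it parses a small, well-formed snippet with `xml.etree.ElementTree` (real pages usually need a forgiving HTML parser) and filters elements whose text matches a compiled regex, the way `tree.elements(find=re.compile('Einstein'))` does.

```python
# Hedged stdlib sketch of tree.elements(find=re.compile('Einstein')):
# walk every element in a parsed tree and keep those whose text matches.
# The sample markup is invented; ElementTree only accepts well-formed XML.
import re
import xml.etree.ElementTree as ET

html = '<html><head><title>Albert Einstein - Biography</title></head></html>'
tree = ET.fromstring(html)

pattern = re.compile('Einstein')
matches = [el for el in tree.iter() if el.text and pattern.search(el.text)]

print(matches[0].tag)   # -> title
print(matches[0].text)  # -> Albert Einstein - Biography
```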