It makes assumptions. It fails if Python HTMLParser fails. For example:

>>> from gluon.html import TAG
>>> print TAG('<a>aaa<b>bbb<c>cccc</b>ddd</a>eee')
<a>aaa<b>bbb<c>cccc</c></b>ddd</a>eee
>>> print TAG('<a>aaa<b>bbb<c>cccc</b>dddeee')
<a>aaa<b>bbb<c>cccc</c></b>dddeee</a>
>>> print TAG('<a>aaa<b x=">bbb<c>cccc</b>dddeee')
<a>aaa</a>
>>> print TAG('<a>aaa<b bbb<c y=">cccc</b>dddeee')
Traceback (most recent call last):
HTMLParser.HTMLParseError: malformed start tag, at line 1, column 13
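For anyone curious how this kind of repair can be done on top of the built-in parser, here is a minimal sketch. It is not the gluon.html code. It assumes Python 3's html.parser, which in recent versions tends to tolerate malformed tags rather than raise, and it drops attributes for brevity. The names Repairer and repair are made up for illustration.

```python
from html.parser import HTMLParser


class Repairer(HTMLParser):
    """Re-serialize HTML, closing tags that were left open.

    Illustrative only: attributes are dropped and void elements
    such as <br> are treated like any other tag.
    """

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []    # output fragments, joined at the end
        self.stack = []  # names of tags currently open

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)
        self.out.append('<%s>' % tag)

    def handle_endtag(self, tag):
        if tag not in self.stack:
            return  # stray closing tag: ignore it
        # Close everything opened after `tag`, then `tag` itself.
        while self.stack:
            open_tag = self.stack.pop()
            self.out.append('</%s>' % open_tag)
            if open_tag == tag:
                break

    def handle_data(self, data):
        self.out.append(data)


def repair(html):
    """Return html with every unclosed tag closed (hypothetical helper)."""
    parser = Repairer()
    parser.feed(html)
    parser.close()  # flush any buffered text
    while parser.stack:
        parser.out.append('</%s>' % parser.stack.pop())
    return ''.join(parser.out)


print(repair('<a>aaa<b>bbb<c>cccc</b>ddd</a>eee'))
# <a>aaa<b>bbb<c>cccc</c></b>ddd</a>eee
print(repair('<a>aaa<b>bbb<c>cccc</b>dddeee'))
# <a>aaa<b>bbb<c>cccc</c></b>dddeee</a>
```

The stack is the "assumption": a closing tag is taken to close everything opened after its matching start tag, and whatever is still open at the end of the input is closed in reverse order.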

On May 25, 8:49 pm, Richard <richar...@gmail.com> wrote:
> how robust have you found HTMLParser with badly formed HTML?
>
> On May 26, 1:11 am, mdipierro <mdipie...@cs.depaul.edu> wrote:
> > Here is a one liner to remove all tags from some html text:
> >
> > >>> html = '<div>hello<span>world</span></div>'
> > >>> print TAG(html).flatten()
> > helloworld
> >
> > On May 25, 10:02 am, mdipierro <mdipie...@cs.depaul.edu> wrote:
> > > yet a better syntax and more API:
> > >
> > > 1) no more web2pyHTMLParser, use TAG(...) instead. and flatten (remove tags)
> > >
> > > >>> a=TAG('<div>Hello<span>world</span></div>')
> > > >>> print a
> > > <div>Hello<span>world</span></div>
> > > >>> print a.element('span')
> > > <span>world</span>
> > > >>> print a.flatten()
> > > Helloworld
> > >
> > > 2) search by multiple conditions, including regex
> > > for example, find all external links in a page
> > >
> > > >>> import re, urllib
> > > >>> html = urllib.urlopen('http://web2py.com').read()
> > > >>> elements = TAG(html).elements('a',_href=re.compile('^http'))
> > > >>> for e in elements: print e['_href']
> > > http://web2py.com/book
> > > http://www.python.org
> > > http://mycti.cti.depaul.ed...
> > > ....
> > >
> > > I think we just blew BeautifulSoup out of the water.
> > >
> > > Massimo
> > >
> > > On May 25, 7:59 am, mdipierro <mdipie...@cs.depaul.edu> wrote:
> > > > The entire code is 40 lines and uses the python built-in html parser.
> > > > It will not be a problem to maintain it. Actually we could even use
> > > > this to simplify both XML(...,sanitize) and gluon.contrib.markdown.WIKI
> > > >
> > > > On May 25, 12:50 am, Thadeus Burgess <thade...@thadeusb.com> wrote:
> > > > > > So why our own?
> > > > >
> > > > > Because it converts it into web2py helpers.
> > > > >
> > > > > And you don't have to deal with installing anything other than web2py.
> > > > >
> > > > > --
> > > > > Thadeus
> > > > >
> > > > > On Tue, May 25, 2010 at 12:14 AM, Kevin Bowling
> > > > > <kevin.bowl...@gmail.com> wrote:
> > > > > > Hmm, I wonder if this is worth the possible maintenance cost? It also
> > > > > > transcends the role of a web framework and now you are getting into
> > > > > > network programming.
> > > > > >
> > > > > > I have a currently deployed screen scraping app and found PyQuery to
> > > > > > be more than adequate. There is also lxml directly, or Beautiful
> > > > > > Soup. A simple import away and they integrate with web2py or anything
> > > > > > else just fine. So why our own?
> > > > > >
> > > > > > Regards,
> > > > > > Kevin
> > > > > >
> > > > > > On May 24, 9:35 pm, mdipierro <mdipie...@cs.depaul.edu> wrote:
> > > > > >> New in trunk. Screen scraping capabilities.
> > > > > >>
> > > > > >> Example:
> > > > > >> >>> import re
> > > > > >> >>> from gluon.html import web2pyHTMLParser
> > > > > >> >>> from urllib import urlopen
> > > > > >> >>> html=urlopen('http://nobelprize.org/nobel_prizes/physics/laureates/1921/einstein-bi...()
> > > > > >> >>> tree=web2pyHTMLParser(html).tree ### NEW!!
> > > > > >> >>> elements=tree.elements('div') # search by tag type
> > > > > >> >>> elements=tree.elements(_id="Einstein") # search by attribute value (id for example)
> > > > > >> >>> elements=tree.elements(find='Einstein') # text search NEW!!
> > > > > >> >>> elements=tree.elements(find=re.compile('Einstein')) # search via regex NEW!!
> > > > > >> >>> print elements[0]
> > > > > >> <title>Albert Einstein - Biography</title>
> > > > > >> >>> print elements[0][0]
> > > > > >> Albert Einstein - Biography
> > > > > >> >>> elements[0].append(SPAN(' modified'))
> > > > > >> <title>Albert Einstein - Biography<span>modified</span></title>
> > > > > >> >>> print tree
> > > > > >> <html lang="en" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
> > > > > >> <head>
> > > > > >> <title>Albert Einstein - Biography<span>modified</span></title>
> > > > > >> ...