[web2py] Re: parsehtml

mdipierro Tue, 25 May 2010 09:35:08 -0700

I cannot push it until tonight but I have this:

>>> a=TAG('<h1>Header</h1><p>this is a     test</p>')
>>> print a
<h1>Header</h1><p>this is a test</p>
>>> a.flatten()
'Headerthis is a     test'
>>> a.flatten(filter=lambda x: re.sub('\s+',' ',x))
'Headerthis is a test'
>>> a.flatten(filter=lambda x: re.sub('\s+','-',x))
'Headerthis-is-a-test'
>>> a.flatten(render=dict(h1=lambda x: '#'+x+'\n\n'),
...           filter=lambda x: x.replace(' ','-'))
'#Header\n\nthis-is-a-test'


filter is applied to text and render is applied to tags.
so your

   result = web2pyHTMLParser(form.vars.input).tree

could be written as

   result = TAG(form.vars.input).flatten(
       filter=lambda x: re.sub('\s+',' ',x),
       render=dict(br=lambda x:'\n',p=lambda x: x+'\n'))
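
For anyone who wants to try the filter/render idea before the push, here is a
minimal standalone sketch in plain Python 3. It is not the gluon.html code:
Node, TreeBuilder, parse and VOID_TAGS are hypothetical names made up for this
illustration, and the order in which text filtering and tag rendering happen
is an assumption on my part.

```python
import re
from html.parser import HTMLParser

# Tags that never have a closing tag (assumed short list, for illustration only).
VOID_TAGS = {'br', 'hr', 'img', 'input'}


class Node:
    """Hypothetical stand-in for a web2py helper: a tag name plus children."""

    def __init__(self, tag=None):
        self.tag = tag
        self.children = []  # each child is either a Node or a plain str

    def flatten(self, filter=None, render=None):
        # 'filter' shadows the builtin on purpose, to mirror the proposed API.
        render = render or {}
        parts = []
        for child in self.children:
            if isinstance(child, Node):
                parts.append(child.flatten(filter=filter, render=render))
            else:
                # filter is applied to text nodes only
                parts.append(filter(child) if filter else child)
        text = ''.join(parts)
        # render is applied per tag, to the already flattened content
        if self.tag in render:
            text = render[self.tag](text)
        return text


class TreeBuilder(HTMLParser):
    """Builds a Node tree using the standard library parser."""

    def __init__(self):
        super().__init__()
        self.root = Node()
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag)
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Only pop when the closing tag matches the innermost open tag.
        if len(self.stack) > 1 and self.stack[-1].tag == tag:
            self.stack.pop()

    def handle_data(self, data):
        self.stack[-1].children.append(data)


def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


a = parse('<h1>Header</h1><p>this is a     test</p>')
print(repr(a.flatten()))
print(repr(a.flatten(filter=lambda x: re.sub(r'\s+', ' ', x))))
print(repr(a.flatten(render=dict(h1=lambda x: '#' + x + '\n\n'),
                     filter=lambda x: re.sub(r'\s+', '-', x))))
```

The sketch only tracks tag names and text, so attributes are dropped; it is
meant to show where each callback hooks in, nothing more.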

Can somebody propose better names for "filter" and "render"? I could
not come up with anything better.

Massimo





On May 25, 10:24 am, Iceberg <iceb...@21cn.com> wrote:
> Hi Massimo, Good to know you finally made it! :-)
>
> Albeit not knowing where and when to use this new feature, I came up
> with an HTML Optimizer such as [1], in a dozen lines of web2py code.
>
> [1]http://www.iwebtool.com/html_optimizer
>
> [2] Put this inside your controller.
>
> def easter(): # This code is released into the public domain
>     from gluon.html import web2pyHTMLParser
>     form = FORM(
>         TEXTAREA(_name='input'), BR(),
>         INPUT(_type='submit', _value='Optimize!'), )
>     result = ''
>     if form.accepts(request.vars, keepvalues=True):
>         result = web2pyHTMLParser(form.vars.input).tree
>     return {'':DIV(
>         'Insert your HTML code to optimize:',
>         form,
>         FIELDSET(PRE(str(result))),)}
>
> Well, not exactly an html optimizer, because our version does not
> strip spaces inside text content. Just for fun.
>
> Regards,
> Iceberg
>
> On May25, 4:27am, mdipierro <mdipie...@cs.depaul.edu> wrote:
>
> > Good suggestion. Now you can do
>
> >     >>> from gluon.html import web2pyHTMLParser
> >     >>> tree = web2pyHTMLParser('hello<div a="b">world</div>').tree
> >     >>> tree.element(_a='b')['_c']=5
> >     >>> str(tree)
> >     'hello<div a="b" c="5">world</div>'
>
> > works great!
>
> > On May 24, 5:11 am, Iceberg <iceb...@21cn.com> wrote:
>
> > > I did not try but I assume the builtin python module HTMLParser
> > > already handle at least (1) tags like <input />, not sure about (2)
> > > and (3).
>
> > > On May24, 4:32am, mdipierro <mdipie...@cs.depaul.edu> wrote:
>
> > > > hmmm.... somehow I did not save comments in the file.
>
> > > > This does not handle well:
>
> > > > 1) tags like <input />
> > > > 2) attributes that contain > in quotes <a onclick="if(a>b)alert()">
> > > > 3) attributes that contain escaped quotes <a onclick="var a=\"x\"">
>
> > > > On May 23, 10:46 am, Massimo Di Pierro <mdipie...@cs.depaul.edu>
> > > > wrote:
>
> > > > > Anybody interested in helping with this?
>
> > > > > It scrapes an HTML file and converts it into a tree hierarchy of web2py
> > > > > helpers
>
> > > > > '<div>xxx</div>' -> DIV('xxx')
>
> > > > > It kind of works but fails at three exceptions described in the file.
>
> > > > > Massimo
>
> > > > >  parsehtml.py
