Re: Buffering HTML as HTMLParser reads it?

Bruno Desthuilliers Tue, 07 Aug 2007 14:48:29 -0700

[EMAIL PROTECTED] wrote:
> Hello,
> 
> I am working on a project where I'm using python to parse HTML pages,
> transforming data between certain tags. Currently the HTMLParser class
> is being used for this. In a nutshell, it's pretty simple -- I'm
> feeding the contents of the HTML page to HTMLParser, then I am
> overriding the appropriate handle_ method to handle this extracted
> data. In that method, I take the found data and I transform it into
> another string based on some logic.
> 
> Now, what I would like to do here is take that transformed string and
> put it "back into" the HTML document. Has anybody ever implemented
> something like this with HTMLParser?


This works the same way with any SAX-style (event-based) parser. First
subclass the parser, giving it a 'buffer' attribute (preferably a
file-like object, so you can write to a stream, a file, a cStringIO,
etc.) and making every handler write to this buffer. Then subclass your
customized parser and override only the handlers you need.

Q&D example implementation:

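try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser   # Python 2

def format_attrs(attrs):
   # attrs is a list of (name, value) pairs; quote the values
   return ' '.join('%s="%s"' % attr for attr in attrs)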

def format_tag(tag, attrs, formats):
   attrs = format_attrs(attrs)
   return formats[bool(attrs)] % dict(tag=tag, attrs=attrs)

class BufferedHTMLParser(HTMLParser):
   START_TAG_FORMATS = ('<%(tag)s>', '<%(tag)s %(attrs)s>')
   STARTEND_TAG_FORMATS = ('<%(tag)s />', '<%(tag)s %(attrs)s />')

   def __init__(self, buffer):
     # the base class must be initialised before feed() is called
     HTMLParser.__init__(self)
     self.buffer = buffer

   def handle_starttag(self, tag, attrs):
      self.buffer.write(format_tag(tag,attrs,self.START_TAG_FORMATS))
                
   def handle_startendtag(self, tag, attrs):
     self.buffer.write(format_tag(tag,attrs,self.STARTEND_TAG_FORMATS))

   def handle_endtag(self, tag):
     self.buffer.write('</%s>' % tag)

   def handle_data(self, data):
     self.buffer.write(data)

   # etc for all handlers


class MyParser(BufferedHTMLParser):
    def handle_data(self, data):
       data = data.replace(
         'Ni',
         "Ekky-ekky-ekky-ekky-z'Bang, zoom-Boing, z'nourrrwringmm"
         )
       BufferedHTMLParser.handle_data(self, data)
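
For completeness, here is a minimal sketch of how such a parser might be
driven end to end. It assumes Python 3 (html.parser, io.StringIO) rather
than the cStringIO mentioned above, and it is a condensed variant of the
classes shown here, not a drop-in copy. The transform() helper, the
_write_tag() method and the shortened replacement string are names made up
for this illustration. Passing convert_charrefs=False is also an assumption
of this sketch: it keeps references such as &amp; out of handle_data so
they can be written back unchanged.

```python
import io
from html.parser import HTMLParser


def format_attrs(attrs):
    # Valueless attributes (e.g. "disabled") arrive with value None.
    return ' '.join(
        name if value is None else '%s="%s"' % (name, value)
        for name, value in attrs
    )


class BufferedHTMLParser(HTMLParser):
    def __init__(self, buffer):
        # convert_charrefs=False routes &amp; and &#233; to their own handlers
        HTMLParser.__init__(self, convert_charrefs=False)
        self.buffer = buffer

    def _write_tag(self, tag, attrs, closing):
        # hypothetical helper: rebuilds "<tag attrs>" or "<tag attrs />"
        attrs = format_attrs(attrs)
        self.buffer.write(
            '<%s%s%s>' % (tag, ' ' + attrs if attrs else '', closing)
        )

    def handle_starttag(self, tag, attrs):
        self._write_tag(tag, attrs, '')

    def handle_startendtag(self, tag, attrs):
        self._write_tag(tag, attrs, ' /')

    def handle_endtag(self, tag):
        self.buffer.write('</%s>' % tag)

    def handle_data(self, data):
        self.buffer.write(data)

    def handle_entityref(self, name):
        self.buffer.write('&%s;' % name)

    def handle_charref(self, name):
        self.buffer.write('&#%s;' % name)


class MyParser(BufferedHTMLParser):
    def handle_data(self, data):
        # transform the text, then let the base class write it out
        BufferedHTMLParser.handle_data(self, data.replace('Ni', 'Ekky'))


def transform(html):
    # hypothetical helper: parse html and return the rewritten document
    out = io.StringIO()
    parser = MyParser(out)
    parser.feed(html)
    parser.close()
    return out.getvalue()


result = transform('<p class="knight">Ni! &amp; <br /> Ni!</p>')
print(result)  # <p class="knight">Ekky! &amp; <br /> Ekky!</p>
```

Note that the parser may deliver one run of text in several handle_data
calls when the input is fed in chunks, so a replacement that spans a chunk
boundary would be missed by this sketch. Comments, doctype declarations and
processing instructions are not echoed here either; they would need their
own handlers in the same style.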

HTH
-- 
http://mail.python.org/mailman/listinfo/python-list
