[EMAIL PROTECTED] wrote:
> Hello,
>
> I am working on a project where I'm using Python to parse HTML pages,
> transforming data between certain tags. Currently the HTMLParser class
> is being used for this. In a nutshell, it's pretty simple -- I'm
> feeding the contents of the HTML page to HTMLParser, then I am
> overriding the appropriate handle_ method to handle this extracted
> data. In that method, I take the found data and I transform it into
> another string based on some logic.
>
> Now, what I would like to do here is take that transformed string and
> put it "back into" the HTML document. Has anybody ever implemented
> something like this with HTMLParser?
This works the same way with any SAX-style (event-based) parser. First
subclass the parser, adding a 'buffer' attribute (ideally a file-like
object, so you can write to a stream, a file, a cStringIO, etc.) and
making every handler write to this buffer. Then subclass your customized
parser and override only the handlers you need.

Q&D example implementation:

    from HTMLParser import HTMLParser  # Python 3: from html.parser import HTMLParser

    def format_attrs(attrs):
        # attrs is a list of (name, value) pairs; quote the values
        return ' '.join('%s="%s"' % attr for attr in attrs)

    def format_tag(tag, attrs, formats):
        attrs = format_attrs(attrs)
        # formats[0] is used without attributes, formats[1] with them
        return formats[bool(attrs)] % dict(tag=tag, attrs=attrs)

    class BufferedHTMLParser(HTMLParser):
        START_TAG_FORMATS = ('<%(tag)s>', '<%(tag)s %(attrs)s>')
        STARTEND_TAG_FORMATS = ('<%(tag)s />', '<%(tag)s %(attrs)s />')

        def __init__(self, buffer):
            HTMLParser.__init__(self)  # the base class needs its own setup
            self.buffer = buffer

        def handle_starttag(self, tag, attrs):
            self.buffer.write(format_tag(tag, attrs, self.START_TAG_FORMATS))

        def handle_startendtag(self, tag, attrs):
            self.buffer.write(format_tag(tag, attrs, self.STARTEND_TAG_FORMATS))

        def handle_endtag(self, tag):
            self.buffer.write('</%s>' % tag)

        def handle_data(self, data):
            self.buffer.write(data)

        # etc for all handlers

    class MyParser(BufferedHTMLParser):
        def handle_data(self, data):
            data = data.replace(
                'Ni',
                "Ekky-ekky-ekky-ekky-z'Bang, zoom-Boing, z'nourrrwringmm"
            )
            BufferedHTMLParser.handle_data(self, data)

HTH
--
http://mail.python.org/mailman/listinfo/python-list
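
For what it's worth, here is a rough Python 3 sketch of the same "echo everything to a buffer, then override one handler" idea, with the "etc for all handlers" part partly filled in. The names EchoHTMLParser, KnightsParser and transform are made up for illustration and are not from the post above. It assumes the standard html.parser module, echoes start tags verbatim via get_starttag_text() instead of rebuilding them from attrs, and passes convert_charrefs=False so that entity and character references reach their own handlers. Processing instructions and unknown declarations are not echoed here, so this is a starting point rather than a complete round-trip.

```python
import io
from html.parser import HTMLParser  # Python 3 location of HTMLParser


class EchoHTMLParser(HTMLParser):
    """Hypothetical base class: writes each parse event back to `out`."""

    def __init__(self, out):
        # convert_charrefs=False keeps &amp; and &#38; as separate events,
        # so they can be written back instead of being decoded into data.
        super().__init__(convert_charrefs=False)
        self.out = out

    def handle_starttag(self, tag, attrs):
        # get_starttag_text() returns the most recent start tag as it
        # appeared in the input, attributes and quoting included.
        self.out.write(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        self.out.write(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.out.write('</%s>' % tag)

    def handle_data(self, data):
        self.out.write(data)

    def handle_entityref(self, name):
        self.out.write('&%s;' % name)

    def handle_charref(self, name):
        self.out.write('&#%s;' % name)

    def handle_comment(self, data):
        self.out.write('<!--%s-->' % data)

    def handle_decl(self, decl):
        self.out.write('<!%s>' % decl)


class KnightsParser(EchoHTMLParser):
    """Hypothetical subclass: only the text handler is overridden."""

    def handle_data(self, data):
        super().handle_data(data.replace('Ni', 'Ekky'))


def transform(html):
    # Feed the whole document and return the rewritten markup as a string.
    out = io.StringIO()
    parser = KnightsParser(out)
    parser.feed(html)
    parser.close()
    return out.getvalue()
```

Calling transform() on a page returns the markup with only the text nodes rewritten; tags, comments and entity references should pass through as they came in.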