Hi MRAB, I was trying to avoid regex because my poor old brain has trouble with it. I have to admit though, that line is slick! I'll have to go through my regex documentation to try and figure out what it actually means.
Thanks!

-----Original Message-----
From: python-list-bounces+joe=goldthwaites....@python.org [mailto:python-list-bounces+joe=goldthwaites....@python.org] On Behalf Of MRAB
Sent: Thursday, November 25, 2010 9:03 PM
To: python-list@python.org
Subject: Re: Parsing markup.

On 26/11/2010 03:28, Joe Goldthwaite wrote:
> I'm attempting to parse some basic tagged markup. The output of the
> TinyMCE editor returns a string that looks something like this;
>
> <p>This is a paragraph with <b>bold</b> and <i>italic</i> elements in
> it</p><p>It can be made up of multiple lines separated by pagagraph
> tags.</p>
>
> I'm trying to render the paragraph into a bit mapped image. I need
> to parse it out into the various paragraph and bold/italic pieces.
> I'm not sure the best way to approach it. Elementree and lxml seem
> to want a full formatted page, not a small segment like this one.
> When I tried to feed a line similar to the above to lxml I got an
> error; "XMLSyntaxError: Extra content at the end of the document".
>

I'd probably use a regex:

>>> import re
>>> text = "<p>This is a paragraph with <b>bold</b> and <i>italic</i> elements in it</p><p>It can be made up of multiple lines separated by pagagraph tags.</p>"
>>> re.findall(r"</?\w+>|[^<>]+", text)
['<p>', 'This is a paragraph with ', '<b>', 'bold', '</b>', ' and ', '<i>', 'italic', '</i>', ' elements in it', '</p>', '<p>', 'It can be made up of multiple lines separated by pagagraph tags.', '</p>']

--
http://mail.python.org/mailman/listinfo/python-list
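
For anyone else puzzling over what that pattern means, here is a rough sketch of the same regex written with re.VERBOSE so each piece can carry its own comment. The name TOKEN and the shortened sample string are only for illustration and are not from the original post.

```python
import re

# Same pattern as the one-liner in the thread, spread out with re.VERBOSE.
# TOKEN is a hypothetical name chosen for this sketch.
TOKEN = re.compile(r"""
    </?\w+>     # a tag: '<', an optional '/', one or more word characters, '>'
  |             # or
    [^<>]+      # a run of text that contains no angle brackets
""", re.VERBOSE)

# Shortened sample input, assumed for illustration.
text = "<p>Some <b>bold</b> text</p>"
tokens = TOKEN.findall(text)
print(tokens)
# ['<p>', 'Some ', '<b>', 'bold', '</b>', ' text', '</p>']
```

One thing to watch: the tag half of the pattern only matches bare tags such as <p> or </b>. A tag that carries attributes would not come back as a single token, so this is probably only suitable for simple markup like the sample in the question.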