A colleague has asked me this and I don't know the answer. Can anyone here help with this? Thanks in advance.
Here is his email:

I am trying to parse an HTML document using the xml.dom.minidom parser and then outputting a valid HTML document, all using the ISO-8859-1 charset. For example:

My input:

    <?xml version="1.0" encoding="ISO-8859-1"?>
    <html>
    <head>
    <title></title>
    <meta content="text/html; charset=iso-8859-1" http-equiv="Content-Type" />
    </head>
    <body>
    €
    </body>
    </html>

Desired output:

    <?xml version="1.0" encoding="ISO-8859-1"?>
    <html>
    <head>
    <title></title>
    <meta content="text/html; charset=iso-8859-1" http-equiv="Content-Type" />
    </head>
    <body>
    €
    </body>
    </html>

Note that it doesn't matter if the '<?xml version="1.0" encoding="ISO-8859-1"?>' header gets stripped. What does matter is that the input document has the 'ISO-8859-1' charset and is an ANSI encoded file.

The problem I get is that when I run, for example:

    from xml.dom.minidom import parseString
    output = parseString(strHTML).toxml()

The output is:

    <?xml version="1.0" encoding="iso-8859-1"?>
    <html>
    <head>
    <title/>
    <meta content="text/html; charset=iso-8859-1" http-equiv="Content-Type"/>
    </head>
    <body>
    €
    </body>
    </html>

So it encodes the entity reference to € (Euro sign). I need it to remain as € so that the resulting HTML can render properly in a browser.

Is there a way to make the parser not convert the entity references? Or is there a convenient post processing function that will do the conversion?

--
Dale Strickland-Clark
Riverhall Systems
www.riverhall.co.uk
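On the post-processing question, one possible approach is sketched below. It is not a way to stop the parser from resolving references; instead it re-escapes on output by encoding the serialized text to ASCII with Python's built-in "xmlcharrefreplace" error handler, so every non-ASCII character becomes a numeric character reference. The helper name to_ascii_html and the sample document are made up for illustration, and the sketch assumes Python 3 and that numeric references (rather than named ones) are acceptable in the output. Since ASCII is a subset of ISO-8859-1, the resulting bytes are still valid under the declared charset.

```python
from xml.dom.minidom import parseString


def to_ascii_html(markup):
    """Hypothetical helper: parse markup with minidom and re-serialize it
    with every non-ASCII character written as a numeric character reference."""
    doc = parseString(markup)
    try:
        # Serializing the root element (not the Document) omits the
        # <?xml ...?> declaration, which the email says may be stripped.
        text = doc.documentElement.toxml()
    finally:
        doc.unlink()
    # "xmlcharrefreplace" turns each unencodable character into &#NNNN;
    return text.encode("ascii", "xmlcharrefreplace")


# Illustrative input only; the character references here are assumptions,
# not taken from the original document.
sample = (
    b'<?xml version="1.0" encoding="ISO-8859-1"?>'
    b"<html><body>&#8364; &#128;</body></html>"
)
print(to_ascii_html(sample))
```

The parser still resolves the references on input; they are only restored on the way out, so any character that was literally non-ASCII in the source comes back as a reference too.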