Re: minidom and unicode errors

Abhimanyu Seth Mon, 06 Mar 2006 22:22:13 -0800

On 3/7/06, Fredrik Lundh <[EMAIL PROTECTED]> wrote:

Abhimanyu Seth wrote:

> I'm trying to parse and modify an XML document using xml.dom.minidom module
> and Python 2.4.2
>
> >> from xml.dom import minidom
> >> dom = minidom.parse ("c:/test.txt")
>
> If the xml file contains a non-ascii character, then i get a parse error.
> I have the following line in my xml file:
> <target>Exception beim Löschen des Audit-Moduls aufgetreten. Exception Stack
> lautet: %1.</target>
> ExpatError: not well-formed (invalid token): line 8, column 27
>
> If I remove the ö character, then it works fine. I'm guessing this has to do
> with the default encoding which is ascii. I guess i can change the encoding
> by modifying a file on my machine that the interpreter reads while loading,
> but then how do I get my program to work on different machines?

the default encoding for XML is UTF-8.  If you're using any other encoding
in your XML file, you have to specify that in the file itself, by putting an
<?xml?> construct at the top of the file.  e.g.

    <?xml version="1.0" encoding="ISO-8859-1"?>
    ... rest of XML file follows ...
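
A minimal sketch of the point above, not taken from the original post. It assumes a current Python (the b'' literals need 2.6 or later) and uses a made-up one-element document. The same Latin-1 bytes are rejected without a declaration and accepted once the encoding is declared:

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

# Hypothetical sample element, stored as ISO-8859-1 bytes (the o-umlaut is the single byte 0xF6).
latin1_body = u"<target>L\xf6schen</target>".encode("iso-8859-1")

# Without a declaration the parser assumes UTF-8, where a lone 0xF6 byte is invalid.
try:
    minidom.parseString(latin1_body)
    parsed_without_declaration = True
except ExpatError:
    parsed_without_declaration = False

# With the declaration, the parser decodes the bytes as ISO-8859-1.
declared = b'<?xml version="1.0" encoding="ISO-8859-1"?>' + latin1_body
dom = minidom.parseString(declared)
value = dom.documentElement.firstChild.data
```

The parser gets raw bytes here and does the decoding itself, so the declaration has to match how the file was actually saved.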

> Also, while writing such a special character to the file, I get an error.
> >> document.writexml (file (myFile, "w"), encoding='utf-8')
>
> UnicodeEncodeError: 'ascii' codec can't encode character u'\xf6' in position
> 16: ordinal not in range(128)

not sure; maybe you've added byte strings (encoded strings instead of Unicode
strings) to the document, or maybe there's a bug in minidom.  What happens if
you remove the encoding argument?  If you still get the same error after doing
that, make sure you use only Unicode strings when you add stuff to the document.

hope this helps!

</F>

--
http://mail.python.org/mailman/listinfo/python-list

I've specified utf-8 in the xml header
<?xml version="1.0" encoding="utf-8"?>

In writexml (), even without specifying the encoding, I get the same error. That's why I tried manually specifying the encoding.

But I managed to find a workaround.
I got some clues from http://evanjones.ca/python-utf8.html

According to the site,

import codecs
fileObj = codecs.open( "someFile", "r", "utf-8" )
u = fileObj.read() # Returns a Unicode string from the UTF-8 bytes in the file

should return me a unicode string. But I still get an error.
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 407-410: invalid data

I can't figure out why! Why can't it parse ö character as unicode?
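
One explanation that fits this error, assuming the file was actually saved as Latin-1 (or cp1252) even though the header says utf-8: the ö is then stored as the single byte 0xF6, which is not valid UTF-8. A small sketch with a made-up sample word, written for a current Python:

```python
text = u"L\xf6schen"  # hypothetical sample word containing the o-umlaut

latin1_bytes = text.encode("latin-1")  # o-umlaut becomes one byte: 0xF6
utf8_bytes = text.encode("utf-8")      # o-umlaut becomes two bytes: 0xC3 0xB6

# Decoding the Latin-1 bytes as UTF-8 fails, because 0xF6 cannot start a UTF-8 sequence.
try:
    latin1_bytes.decode("utf-8")
    decoded_as_utf8 = True
except UnicodeDecodeError:
    decoded_as_utf8 = False

# Decoding with the encoding the bytes were really written in works.
recovered = latin1_bytes.decode("latin-1")
```

If that assumption holds, the fix is either to save the file as real UTF-8 or to make the header name the encoding the file is really in.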

Anyway,
>> f = codecs.open ("c:/test.txt", "r", "latin-1")
>> dom = minidom.parseString (codecs.encode (f.read(), "utf-8"))

works. But I don't know whether this will work for Chinese or other Unicode characters.
How do I make my code read unicode files?

Also, while writing the xml file, I now use codecs.open ()
>> document.writexml (codecs.open (mFile, "w", "utf-8"), encoding="utf-8")

IMHO, writexml should be taking care of this, instead of me having to use codecs. I guess this is a bug.
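
For comparison, a sketch of another way to write the file, assuming a current Python rather than the 2.4.2 used in this thread. It lets toxml() do the encoding and writes the resulting bytes in binary mode, so no codecs wrapper is needed. The document content and the temporary path are made up for illustration:

```python
import os
import tempfile
from xml.dom import minidom

# Hypothetical one-element document containing a non-ASCII character.
doc = minidom.parseString(u"<target>L\xf6schen</target>".encode("utf-8"))

# With an encoding argument, toxml() returns encoded bytes that start with
# an XML declaration naming that encoding.
data = doc.toxml(encoding="utf-8")

# Write the bytes unchanged; binary mode avoids any second encoding step.
path = os.path.join(tempfile.mkdtemp(), "out.xml")
with open(path, "wb") as fh:
    fh.write(data)

# Read the file back to check that the character survived the round trip.
reparsed = minidom.parse(path)
round_tripped = reparsed.documentElement.firstChild.data
```

Since the declaration and the bytes come from the same call, the header cannot disagree with the actual encoding of the file.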

--
Regards,
Abhimanyu
