On 07/28/2010 08:32 PM, Joe Goldthwaite wrote:
> Hi,
>
> I've got an ASCII file with some Latin characters. Specifically \xe1 and
> \xfc. I'm trying to import it into a PostgreSQL database that's running in
> Unicode mode. The Unicode converter chokes on those two characters.
>
> I could just manually replace those two characters with something valid,
> but if any other invalid characters show up in later versions of the file,
> I'd like to handle them correctly.
>
> I've been playing with the Unicode stuff and I found out that I could
> convert both those characters correctly using the latin1 codec, like this:
>
>     import unicodedata
>
>     s = '\xe1\xfc'
>     print unicode(s, 'latin1')
>
> The above works. When I try to convert my file, however, I still get an
> error:
>
>     import unicodedata
>
>     input = file('ascii.csv', 'r')
>     output = file('unicode.csv', 'w')
output is still a binary file - there are no unicode files. You need to
encode the text somehow.

>     Traceback (most recent call last):
>       File "C:\Users\jgold\CloudmartFiles\UnicodeTest.py", line 10, in __main__
>         output.write(unicode(line,'latin1'))
>     UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in
>     position 295: ordinal not in range(128)

By default, Python tries to encode strings using ASCII. This, obviously,
won't work here. Do you know which encoding your database expects? I'd
assume it understands UTF-8 - everybody uses UTF-8.

>     for line in input.xreadlines():
>         output.write(unicode(line,'latin1'))

unicode(line, 'latin1') is unicode; you need it to be a UTF-8 bytestring:

    unicode(line, 'latin1').encode('utf-8')

or:

    line.decode('latin1').encode('utf-8')

-- 
http://mail.python.org/mailman/listinfo/python-list
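[Editor's note] The decode-then-encode round trip suggested above can be sketched as follows. The two byte values are the ones from the original post; the `latin1_to_utf8` helper and its filenames are illustrative, not from the thread (it uses `codecs.open`, which handles the decoding and encoding transparently and exists in both Python 2 and 3):

```python
import codecs

# Round-trip the two problem bytes: latin-1 in, UTF-8 out.
raw = b'\xe1\xfc'             # latin-1 bytes for 'á' and 'ü'
text = raw.decode('latin1')   # bytes -> unicode text
utf8 = text.encode('utf-8')   # unicode text -> UTF-8 bytes
print(utf8)                   # b'\xc3\xa1\xc3\xbc'

# The same idea applied line by line to a whole file, as in the
# poster's loop (filenames are placeholders):
def latin1_to_utf8(src, dst):
    with codecs.open(src, 'r', encoding='latin1') as infile, \
         codecs.open(dst, 'w', encoding='utf-8') as outfile:
        for line in infile:
            outfile.write(line)
```

Any byte is valid latin-1, so the decode step never fails; unmappable characters would only surface if the source file were actually in some other encoding.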