Re: Ascii to Unicode.

2010-07-30 Thread John Machin
On Jul 30, 4:18 am, Carey Tilden wrote: > In this case, you've been able to determine the > correct encoding (latin-1) for those errant bytes, so the file itself > is thus known to be in that encoding. The most probable "correct" encoding, as already stated and as agreed by the OP, is cp1252.
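As a rough illustration of why the latin-1 versus cp1252 distinction can matter, here is a small Python 3 sketch. It is not from the thread: the two bytes \xe1 and \xfc are the ones the original poster named, while the byte \x93 is a hypothetical addition chosen because the two codecs disagree on it.

```python
# Bytes named in the thread: both codecs agree on these.
raw = b"\xe1\xfc"
# Hypothetical extra byte (an assumption, not data from the thread):
# latin-1 and cp1252 decode it differently.
smart_quote = b"\x93"

same_for_thread_bytes = raw.decode("latin-1") == raw.decode("cp1252")
as_latin1 = smart_quote.decode("latin-1")   # a C1 control character, U+0093
as_cp1252 = smart_quote.decode("cp1252")    # left double quotation mark, U+201C

print(same_for_thread_bytes, repr(as_latin1), repr(as_cp1252))
```

For the two bytes mentioned in the thread either codec gives the same text, so the choice only shows up for bytes in the 0x80-0x9F range.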

RE: Ascii to Unicode.

2010-07-30 Thread Lawrence D'Oliveiro
Joe Goldthwaite wrote: > Next I tried to write the unicodestring object to a file thusly; > > output.write(unicodestring) > > I would have expected the write function to request the byte string from > the unicodestring object and simply write that byte string to a file. Encoded ac
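The point that a Unicode string has no single "byte string" until an encoding is chosen can be sketched in Python 3 as follows. This is an illustration only, not code from the thread (which used Python 2, where the failure mode differs); the in-memory io.BytesIO object stands in for a file opened in binary mode.

```python
import io

text = "\u00e1\u00fc"      # the characters for bytes \xe1 and \xfc in latin-1
binary_file = io.BytesIO()  # stand-in for a file opened in binary mode

# A binary stream accepts bytes only, so writing text directly fails.
try:
    binary_file.write(text)
    rejected = False
except TypeError:
    rejected = True

# Choosing an encoding explicitly yields the bytes to write.
binary_file.write(text.encode("utf-8"))
written = binary_file.getvalue()
print(rejected, written)
```

The same text would produce different bytes under a different encoding, which is why the write call cannot pick one on the caller's behalf.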

Re: Ascii to Unicode.

2010-07-30 Thread Lawrence D'Oliveiro
In message <4c51d3b6$0$1638$742ec...@news.sonic.net>, John Nagle wrote: > UTF-8 is a stream format for Unicode. It's slightly compressed ... “Variable-length” is not the same as “compressed”. Particularly if you’re mainly using non-Roman scripts...
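To make the "variable-length is not compressed" remark concrete, the following Python 3 sketch compares encoded sizes of the same text in UTF-8 and UTF-16. The sample strings are arbitrary assumptions chosen for illustration, not data from the thread.

```python
# Sample strings are illustrative assumptions, not data from the thread.
samples = {
    "ascii": "hello",
    "latin": "h\u00e9llo",  # contains U+00E9
    "devanagari": "\u0928\u092e\u0938\u094d\u0924\u0947",  # 6 code points
}


def encoded_sizes(text):
    """Return (code points, UTF-8 bytes, UTF-16 bytes without BOM)."""
    return (
        len(text),
        len(text.encode("utf-8")),
        len(text.encode("utf-16-le")),
    )


for name, text in samples.items():
    print(name, encoded_sizes(text))
```

For the Devanagari sample UTF-8 needs three bytes per code point while UTF-16 needs two, so UTF-8 is the larger of the two there.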

RE: Ascii to Unicode.

2010-07-30 Thread Lawrence D'Oliveiro
Joe Goldthwaite wrote: > Ascii.csv isn't really a latin-1 encoded file. It's an ascii file with a > few characters above the 128 range that are causing Postgresql Unicode > errors. Those characters work fine in the Windows world but they're not > the correct byte representation for

Re: Ascii to Unicode.

2010-07-29 Thread Nobody
On Thu, 29 Jul 2010 23:49:40 +, Steven D'Aprano wrote: > It looks to me like Python uses a 16-bit implementation internally, It typically uses the platform's wchar_t, which is 16-bit on Windows and (typically) 32-bit on Unix. IIRC, it's possible to build Python with 32-bit Unicode on Windows

Re: Ascii to Unicode.

2010-07-29 Thread Mark Tolonen
"Joe Goldthwaite" wrote in message news:5a04846ed83745a8a99a944793792...@newmbp... Hi Steven, I read through the article you referenced. I understand Unicode better now. I wasn't completely ignorant of the subject. My confusion is more about how Python is handling Unicode than Unicode its

Re: Ascii to Unicode.

2010-07-29 Thread Steven D'Aprano
On Thu, 29 Jul 2010 11:14:24 -0700, Ethan Furman wrote: > Don't think of unicode as a byte stream. It's a bunch of numbers that > map to a bunch of symbols. Not only are Unicode strings a bunch of numbers ("code points", in Unicode terminology), but the numbers are not necessarily all the same
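A minimal Python 3 sketch of the "bunch of numbers" view: each character maps to a code point, and an encoding such as UTF-8 may spend a different number of bytes on each one. The three sample characters are assumptions picked for illustration.

```python
# Three sample characters (an illustrative assumption):
# 'a' (U+0061), a-acute (U+00E1), euro sign (U+20AC).
text = "a\u00e1\u20ac"

# The abstract numbers behind the characters.
code_points = [ord(ch) for ch in text]

# How many bytes UTF-8 uses for each of those numbers.
utf8_lengths = [len(ch.encode("utf-8")) for ch in text]

print(code_points)
print(utf8_lengths)
```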

Re: Ascii to Unicode.

2010-07-29 Thread MRAB
John Nagle wrote: On 7/28/2010 3:58 PM, Joe Goldthwaite wrote: This still seems odd to me. I would have thought that the unicode function would return a properly encoded byte stream that could then simply be written to disk. Instead it seems like you have to re-encode the byte stream to some

Re: Ascii to Unicode.

2010-07-29 Thread John Nagle
On 7/28/2010 3:58 PM, Joe Goldthwaite wrote: This still seems odd to me. I would have thought that the unicode function would return a properly encoded byte stream that could then simply be written to disk. Instead it seems like you have to re-encode the byte stream to some kind of escaped Ascii

Re: Ascii to Unicode.

2010-07-29 Thread Ethan Furman
Joe Goldthwaite wrote: Hi Ulrich, Ascii.csv isn't really a latin-1 encoded file. It's an ascii file with a few characters above the 128 range . . . It took me a while to get this point too (if you already have "gotten it", I apologize, but the above comment leads me to believe you haven't).

Re: Ascii to Unicode.

2010-07-29 Thread Carey Tilden
On Thu, Jul 29, 2010 at 10:59 AM, Joe Goldthwaite wrote: > Hi Ulrich, > > Ascii.csv isn't really a latin-1 encoded file.  It's an ascii file with a > few characters above the 128 range that are causing Postgresql Unicode > errors.  Those characters work fine in the Windows world but they're not th

Re: Ascii to Unicode.

2010-07-29 Thread Ethan Furman
Joe Goldthwaite wrote: Hi Steven, I read through the article you referenced. I understand Unicode better now. I wasn't completely ignorant of the subject. My confusion is more about how Python is handling Unicode than Unicode itself. I guess I'm fighting my own misconceptions. I do that a lot

RE: Ascii to Unicode.

2010-07-29 Thread Joe Goldthwaite
Hi Ulrich, Ascii.csv isn't really a latin-1 encoded file. It's an ascii file with a few characters above the 128 range that are causing Postgresql Unicode errors. Those characters work fine in the Windows world but they're not the correct byte representation for Unicode. What I'm attempting to d

RE: Ascii to Unicode.

2010-07-29 Thread Joe Goldthwaite
Hi Steven, I read through the article you referenced. I understand Unicode better now. I wasn't completely ignorant of the subject. My confusion is more about how Python is handling Unicode than Unicode itself. I guess I'm fighting my own misconceptions. I do that a lot. It's hard for me to un

Re: Ascii to Unicode.

2010-07-29 Thread Ulrich Eckhardt
Joe Goldthwaite wrote: > import unicodedata > > input = file('ascii.csv', 'rb') > output = file('unicode.csv','wb') > > for line in input.xreadlines(): > unicodestring = unicode(line, 'latin1') > output.write(unicodestring.encode('utf-8')) # This second encode >
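For readers on Python 3, the decode-then-encode loop quoted above can be approximated as below. This is a sketch under stated assumptions, not the code from the thread: the function name convert_to_utf8, the temporary directory and the sample row are hypothetical, and the source encoding defaults to latin-1 only because the quoted code uses it (cp1252 can be passed instead).

```python
import os
import tempfile


def convert_to_utf8(src_path, dst_path, src_encoding="latin-1"):
    """Re-encode a text file to UTF-8, one line at a time.

    src_encoding is an assumption the caller has to supply; the bytes
    themselves do not say which encoding produced them.
    """
    # newline="" turns off newline translation on both sides, so the
    # original line endings are copied through unchanged.
    with open(src_path, "r", encoding=src_encoding, newline="") as src, \
            open(dst_path, "w", encoding="utf-8", newline="") as dst:
        for line in src:
            dst.write(line)


# Hypothetical sample data: one CSV row containing the byte \xe1.
workdir = tempfile.mkdtemp()
src_path = os.path.join(workdir, "ascii.csv")
dst_path = os.path.join(workdir, "unicode.csv")
with open(src_path, "wb") as f:
    f.write(b"Bogot\xe1,12345\r\n")

convert_to_utf8(src_path, dst_path)
with open(dst_path, "rb") as f:
    converted = f.read()
print(converted)
```

Opening both files in text mode with explicit encodings lets the file objects do the decoding and encoding, so no separate encode call is needed in the loop.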

Re: Ascii to Unicode.

2010-07-28 Thread Steven D'Aprano
On Wed, 28 Jul 2010 15:58:01 -0700, Joe Goldthwaite wrote: > This still seems odd to me. I would have thought that the unicode > function would return a properly encoded byte stream that could then > simply be written to disk. Instead it seems like you have to re-encode > the byte stream to some

RE: Ascii to Unicode.

2010-07-28 Thread Joe Goldthwaite
> Hello hello ... you are running on Windows; the likelihood that you > actually have data encoded in latin1 is very very small. Follow MRAB's > answer but replace "latin1" by "cp1252". I think you're right. The database I'm working with is a US zip code database. It gets updated monthly. The p

Re: Ascii to Unicode.

2010-07-28 Thread John Machin
On Jul 29, 4:32 am, "Joe Goldthwaite" wrote: > Hi, > > I've got an Ascii file with some latin characters. Specifically \xe1 and > \xfc.  I'm trying to import it into a Postgresql database that's running in > Unicode mode. The Unicode converter chokes on those two characters. > > I could just manua

Re: Ascii to Unicode.

2010-07-28 Thread Thomas Jollans
On 07/28/2010 09:29 PM, John Nagle wrote: > for rawline in input : > unicodeline = unicode(line,'latin1')# Latin-1 to Unicode > output.write(unicodeline.encode('utf-8')) # Unicode to as UTF-8 you got your blocks wrong.

Re: Ascii to Unicode.

2010-07-28 Thread John Nagle
On 7/28/2010 11:32 AM, Joe Goldthwaite wrote: Hi, I've got an Ascii file with some latin characters. Specifically \xe1 and \xfc. I'm trying to import it into a Postgresql database that's running in Unicode mode. The Unicode converter chokes on those two characters. I could just manually replac

Re: Ascii to Unicode.

2010-07-28 Thread Thomas Jollans
On 07/28/2010 08:32 PM, Joe Goldthwaite wrote: > Hi, > > I've got an Ascii file with some latin characters. Specifically \xe1 and > \xfc. I'm trying to import it into a Postgresql database that's running in > Unicode mode. The Unicode converter chokes on those two characters. > > I could just ma

Re: Ascii to Unicode.

2010-07-28 Thread MRAB
Joe Goldthwaite wrote: Hi, I've got an Ascii file with some latin characters. Specifically \xe1 and \xfc. I'm trying to import it into a Postgresql database that's running in Unicode mode. The Unicode converter chokes on those two characters. I could just manually replace those to character

Re: ascii to unicode line endings

2007-05-03 Thread Marc 'BlackJack' Rintsch
In <[EMAIL PROTECTED]>, fidtz wrote: import codecs testASCII = file("c:\\temp\\test1.txt",'w') testASCII.write("\n") testASCII.close() testASCII = file("c:\\temp\\test1.txt",'r') testASCII.read() > '\n' > Bit pattern on disk : \0x0D\0x0A testASCII.seek(0) te

Re: ascii to unicode line endings

2007-05-03 Thread fidtz
On 3 May, 13:39, "Jerry Hill" <[EMAIL PROTECTED]> wrote: > On 2 May 2007 09:19:25 -0700, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote: > > > The code: > > > import codecs > > > udlASCII = file("c:\\temp\\CSVDB.udl",'r') > > udlUNI = codecs.open("c:\\temp\\CSVDB2.udl",'w',"utf_16") > > udlUNI.write(u

Re: ascii to unicode line endings

2007-05-03 Thread fidtz
On 3 May, 13:00, Jean-Paul Calderone <[EMAIL PROTECTED]> wrote: > On 3 May 2007 04:30:37 -0700, [EMAIL PROTECTED] wrote: > > > > >On 2 May, 17:29, Jean-Paul Calderone <[EMAIL PROTECTED]> wrote: > >> On 2 May 2007 09:19:25 -0700, [EMAIL PROTECTED] wrote: > > >> >The code: > > >> >import codecs > > >

Re: ascii to unicode line endings

2007-05-03 Thread Jerry Hill
On 2 May 2007 09:19:25 -0700, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote: > The code: > > import codecs > > udlASCII = file("c:\\temp\\CSVDB.udl",'r') > udlUNI = codecs.open("c:\\temp\\CSVDB2.udl",'w',"utf_16") > udlUNI.write(udlASCII.read()) > udlUNI.close() > udlASCII.close() > > This doesn't se

Re: ascii to unicode line endings

2007-05-03 Thread Jean-Paul Calderone
On 3 May 2007 04:30:37 -0700, [EMAIL PROTECTED] wrote: >On 2 May, 17:29, Jean-Paul Calderone <[EMAIL PROTECTED]> wrote: >> On 2 May 2007 09:19:25 -0700, [EMAIL PROTECTED] wrote: >> >> >> >> >The code: >> >> >import codecs >> >> >udlASCII = file("c:\\temp\\CSVDB.udl",'r') >> >udlUNI = codecs.open("c

Re: ascii to unicode line endings

2007-05-03 Thread fidtz
On 2 May, 17:29, Jean-Paul Calderone <[EMAIL PROTECTED]> wrote: > On 2 May 2007 09:19:25 -0700, [EMAIL PROTECTED] wrote: > > > > >The code: > > >import codecs > > >udlASCII = file("c:\\temp\\CSVDB.udl",'r') > >udlUNI = codecs.open("c:\\temp\\CSVDB2.udl",'w',"utf_16") > > >udlUNI.write(udlASCII.read

Re: ascii to unicode line endings

2007-05-02 Thread Jean-Paul Calderone
On 2 May 2007 09:19:25 -0700, [EMAIL PROTECTED] wrote: >The code: > >import codecs > >udlASCII = file("c:\\temp\\CSVDB.udl",'r') >udlUNI = codecs.open("c:\\temp\\CSVDB2.udl",'w',"utf_16") > >udlUNI.write(udlASCII.read()) > >udlUNI.close() >udlASCII.close() > >This doesn't seem to generate the corre
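One documented detail that bears on this question is that codecs.open opens the underlying file in binary mode and does no automatic newline conversion. As a hedged Python 3 sketch (not the fix proposed in the thread, whose replies are truncated here), the built-in open can be asked for both the UTF-16 encoding and CRLF line endings; the function name write_utf16_crlf and the temporary path are hypothetical.

```python
import codecs
import os
import tempfile


def write_utf16_crlf(path, text):
    """Write text as UTF-16 (with BOM), translating each '\\n' to '\\r\\n'."""
    # newline="\r\n" makes the text layer translate '\n' before encoding.
    with open(path, "w", encoding="utf-16", newline="\r\n") as out:
        out.write(text)


# Hypothetical output location; only the file name echoes the thread.
path = os.path.join(tempfile.mkdtemp(), "CSVDB2.udl")
write_utf16_crlf(path, "first line\nsecond line\n")

with open(path, "rb") as f:
    data = f.read()

has_bom = data[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)
round_trip = data.decode("utf-16")
print(has_bom, repr(round_trip))
```

Because the newline translation happens on the text before it is encoded, each line ending is written as a properly encoded CR LF pair rather than as raw single bytes.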