Unicode string handling problem

2006-09-05 Thread Richard Schulman
The following program fragment works correctly with an ASCII input
file.

But the file I actually want to process is Unicode (utf-16 encoding).
The file must be Unicode rather than ASCII or Latin-1 because it
contains mixed Chinese and English characters.

When I run the program below I get an attribute_count of zero, which
is incorrect for this input file: it should give a value of fifteen
or sixteen. In other words, the count function isn't recognizing the
", character pairs in the line being read. Here's the program:

in_file = open("c:\\pythonapps\\in-graf1.my","rU")
try:
    # Skip the first line; make the second available for processing
    in_file.readline()
    in_line = in_file.readline()
    attribute_count = in_line.count('",')
    print attribute_count
finally:
    in_file.close()

Any suggestions?

Richard Schulman


Unicode string handling problem (revised)

2006-09-05 Thread Richard Schulman
The appended program fragment works correctly with an ASCII input
file. But the file I actually want to process is Unicode (UTF-16
encoding). This file must be Unicode rather than ASCII or Latin-1
because it contains mixed Chinese and English characters.

When I run the program I get an attribute_count of zero. This
is incorrect for the input file, which should give a value of fifteen
or sixteen. In other words, the count function isn't recognizing the

",

character pairs to be counted in the line read.

Here's the program:

in_file = open("c:\\pythonapps\\in-graf1.my","rU")
try:
    # Skip the first line; make the second available for processing
    in_file.readline()
    in_line = in_file.readline()
    attribute_count = in_line.count('",')
    print attribute_count
finally:
    in_file.close()

Any suggestions?

Richard Schulman


Re: Unicode string handling problem

2006-09-05 Thread Richard Schulman
Thanks for your excellent debugging suggestions, John. See below for
my follow-up:

Richard Schulman:
>> The following program fragment works correctly with an ASCII input
>> file.
>>
>> But the file I actually want to process is Unicode (utf-16 encoding).
>> The file must be Unicode rather than ASCII or Latin-1 because it
>> contains mixed Chinese and English characters.
>>
>> When I run the program below I get an attribute_count of zero, which
>> is incorrect for this input file: it should give a value of fifteen
>> or sixteen. In other words, the count function isn't recognizing the
>> ", character pairs in the line being read. Here's the program:
>>...

John Machin:
>Insert
>print type(in_line)
>print repr(in_line)
>here [also make the appropriate changes to get the same info from the
>first line], run it again, copy/paste what you get, show us what you
>see.

Here's the revised program, per your suggestion:

=========================================================

# This program processes a UTF-16 input file that is
# to be loaded later into a mySQL table. The input file
# is not yet ready for prime time. The purpose of this
#  program is to ready it.

in_file = open("c:\\pythonapps\\in-graf1.my","rU")
try:
    # The first line read is a SQL INSERT statement; no
    # processing will be required.
    in_line = in_file.readline()
    print type(in_line)  # For debugging
    print repr(in_line)  # For debugging

    # The second line read is the first data row.
    in_line = in_file.readline()
    print type(in_line)  # For debugging
    print repr(in_line)  # For debugging

    # For this and subsequent rows, we must count all
    # the < ", > character-pairs in a given line/row.
    # This will provide an n-1 measure of the attributes
    # for a SQL insert of this row. All rows must have
    # sixteen attributes, but some don't yet.
    attribute_count = in_line.count('",')
    print attribute_count
finally:
    in_file.close()

=========================================================

The output of this program, which I ran at the command line, had to
be copied by hand and abridged, but I think I have included the
relevant information:

C:\pythonapps>python graf_correction.py

'\xff\xfeI\x00N\x00S...   [the beginning of a SQL INSERT statement]
...\x00U\x00E\x00S\x00\n' [the VALUES keyword at the end of the row,
   followed by an end-of-line]

'\x00\n'  [oh-oh! For the second row, all we're seeing
   is an end-of-line character. Is that from
   the first row? Wasn't the "rU" mode
   supposed to handle that?]
0 [the counter value. It's hardly surprising
   it's only zero, given that most of the row
   never got loaded, just an eol mark]

J.M.:
>If you're coy about that, then you'll have to find out yourself if it
>has a BOM at the front, and if not whether it's little/big/endian.

The BOM is little-endian, I believe.
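
A quick way to verify that, I gather, is to dump the first two bytes
in binary mode:

first_two = open("c:\\pythonapps\\in-graf1.my", "rb").read(2)
print repr(first_two)   # '\xff\xfe' means UTF-16, little-endian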

R.S.:
>> Any suggestions?

J.M.:
>1. Read the Unicode HOWTO.
>2. Read the docs on the codecs module ...
>
>You'll need to use
>
>in_file = codecs.open(filepath, mode, encoding="utf16???")

Right you are. Here is the output produced by so doing:


u'\ufeffINSERT INTO [...] VALUES\n'

u'\n' 
0   [The counter value]

>It would also be a good idea to get into the habit of using unicode
>constants like u'",'

Right.

>HTH,
>John

Yes, it did. Many thanks! Now I've got to figure out the best way to
handle that \n\n at the end of each row, which the program is
interpreting as two rows. That represents two surprises: first, I
thought that Microsoft files ended as \n\r ; second, I thought that
Python mode "rU" was supposed to be the universal eol handler and
would handle the \n\r as one mark.

Richard Schulman
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Unicode string handling problem

2006-09-05 Thread Richard Schulman
On 5 Sep 2006 19:50:27 -0700, "John Roth" <[EMAIL PROTECTED]>
wrote:

>> [T]he file I actually want to process is Unicode (utf-16 encoding).
>>...
>> in_file = open("c:\\pythonapps\\in-graf1.my","rU")
>>...

John Roth:
>You're not detecting the file encoding and then
>using it in the open statement. If you know this is
>utf-16le or utf-16be, you need to say so in the
>open. If you don't, then you should read it into
>a string, go through some autodetect logic, and
>then decode it with the .decode(encoding)
>method.
>
>A clue: a properly formatted utf-16 or utf-32
>file MUST have a BOM as the first character.
>That's mandated in the unicode standard. If
>it doesn't have a BOM, then try ascii and
>utf-8 in that order.  The first
>one that succeeds is correct. If neither succeeds,
>you're on your own in guessing the file encoding.

Thanks for this further information. I'm now using the codec with
improved results, but am still puzzled as to how to handle the row
termination of \n\n, which is being interpreted as two rows instead of
one.
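
For reference, here is my reading of the autodetect logic John
describes -- only a rough sketch, with the candidate encodings
hard-wired:

import codecs

def guess_encoding(path):
    head = open(path, "rb").read(2)
    if head in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE):
        return "utf_16"   # the utf_16 codec consumes the BOM itself
    data = open(path, "rb").read()
    for enc in ("ascii", "utf_8"):
        try:
            data.decode(enc)
            return enc
        except UnicodeDecodeError:
            pass
    raise ValueError("cannot guess encoding of %r" % path)

in_file = codecs.open("c:\\pythonapps\\in-graf1.my",
                      encoding=guess_encoding("c:\\pythonapps\\in-graf1.my"))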


Re: Unicode string handling problem

2006-09-05 Thread Richard Schulman
On Wed, 06 Sep 2006 03:55:18 GMT, Richard Schulman
<[EMAIL PROTECTED]> wrote:

>...I'm now using the codec with
>improved results, but am still puzzled as to how to handle the row
>termination of \n\n, which is being interpreted as two rows instead of
>one.

Of course, I could do a double read on each row and ignore the second
read, which merely fetches the final of the two u'\n' characters. But
that's not very elegant, and I'm sure there's a better way to do it
(hint, hint someone).
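
One alternative I can think of is to skip the phantom empty rows as
they come up -- something like this sketch, where process_row stands
in for the real work:

for line in in_file:
    line = line.rstrip(u"\r\n")
    if not line:
        continue          # discard the phantom empty row
    process_row(line)

But that still feels like treating the symptom rather than the cause.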

Richard Schulman


Re: Unicode string handling problem

2006-09-07 Thread Richard Schulman
Many thanks for your help, John, in giving me the tools to work
successfully in Python with Unicode from here on out.

It turns out that the Unicode input files I was working with (from MS
Word and MS Notepad) were indeed creating eol sequences of \r\n, not
\n\n as I had originally thought. The file reading statement that I
was using, with unpredictable results, was

in_file = codecs.open("c:\\pythonapps\\in-graf2.my","rU",encoding="utf-16LE")

This was reading to the \n on first read (outputting the whole line,
including the \n but, weirdly, not the preceding \r). Then, also
weirdly, the next readline would read the same \n again, interpreting
that as the entirety of a phantom second line. So each input file line
ended up producing two output lines.

Once the mode string "rU" was dropped, as in

in_file = codecs.open("c:\\pythonapps\\in-graf2.my",encoding="utf-16LE")

all suddenly became well: no more doubled readlines, and one could see
the \r\n termination of each line.
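
That is, a minimal check along these lines now shows each row arriving
whole:

import codecs
in_file = codecs.open("c:\\pythonapps\\in-graf2.my", encoding="utf-16LE")
in_file.readline()                # the SQL INSERT line
print repr(in_file.readline())    # one whole data row, u'...\r\n' and all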

This behavior of "rU" was not at all what I had expected from the
brief discussion of it in _Python Cookbook_. Which all goes to point
out how difficult it is to cook challenging dishes with sketchy
recipes alone. There is no substitute for the helpful advice of an
experienced chef.

-Richard Schulman
 (remove "xx" for email reply)

On 5 Sep 2006 22:29:59 -0700, "John Machin" <[EMAIL PROTECTED]>
wrote:

>Richard Schulman wrote:
>[big snip]
>>
>> The BOM is little-endian, I believe.
>Correct.
>
>> >in_file = codecs.open(filepath, mode, encoding="utf16???")
>>
>> Right you are. Here is the output produced by so doing:
>
>You don't say which encoding you used, but I guess that you used
>utf_16_le.
>
>>
>> 
>> u'\ufeffINSERT INTO [...] VALUES\n'
>
>Use utf_16 -- it will strip off the BOM for you.
>
>> 
>> u'\n'
>> 0   [The counter value]
>>
>[snip]
>> Yes, it did. Many thanks! Now I've got to figure out the best way to
>> handle that \n\n at the end of each row, which the program is
>> interpreting as two rows.
>
>Well we don't know yet exactly what you have there. We need a byte dump
>of the first few bytes of your file. Get into the interactive
>interpreter and do this:
>
>open('yourfile', 'rb').read(200)
>(the 'b' is for binary, in case you are on Windows)
>That will show us exactly what's there, without *any* EOL
>interpretation at all.
>
>
>> That represents two surprises: first, I
>> thought that Microsoft files ended as \n\r ;
>
>Nah. Wrong on two counts. In text mode, Microsoft *lines* end in \r\n
>(not \n\r); *files* may end in ctrl-Z aka chr(26) -- an inheritance
>from CP/M.
>
>Uh ... are you saying the file has \n\r at the end of each row?? How
>did you know that if you didn't know what (if any) BOM it had??? Who
>created the file?
>
>> second, I thought that
>> Python mode "rU" was supposed to be the universal eol handler and
>> would handle the \n\r as one mark.
>
>Nah again. It contemplates only \n, \r, and \r\n as end of line. See
>the docs. Thus \n\r becomes *two* newlines when read with "rU".
>
>Having "\n\r" at the end of each row does fit with your symptoms:
>
>| >>> bom = u"\ufeff"
>| >>> guff = '\n\r'.join(['abc', 'def', 'ghi'])
>| >>> guffu = unicode(guff)
>| >>> import codecs
>| >>> f = codecs.open('guff.utf16le', 'wb', encoding='utf_16_le')
>| >>> f.write(bom+guffu)
>| >>> f.close()
>
>| >>> open('guff.utf16le', 'rb').read()  # see exactly what we've got
>
>| '\xff\xfea\x00b\x00c\x00\n\x00\r\x00d\x00e\x00f\x00\n\x00\r\x00g\x00h\x00i\x00'
>
>| >>> codecs.open('guff.utf16le', 'r', encoding='utf_16').read()
>| u'abc\n\rdef\n\rghi'  # Look, Mom, no BOM!
>
>| >>> codecs.open('guff.utf16le', 'rU', encoding='utf_16').read()
>| u'abc\n\ndef\n\nghi'  # U means \r -> \n
>
>| >>> codecs.open('guff.utf16le', 'rU', encoding='utf_16_le').read()
>| u'\ufeffabc\n\ndef\n\nghi'  # reproduces your second experience
>
>| >>> open('guff.utf16le', 'rU').readlines()
>| ['\xff\xfea\x00b\x00c\x00\n', '\x00\n', '\x00d\x00e\x00f\x00\n',
>|  '\x00\n', '\x00g\x00h\x00i\x00']
>| >>> f = open('guff.utf16le', 'rU')
>| >>> f.readline()
>| '\xff\xfea\x00b\x00c\x00\n'
>| >>> f.readline()
>| '\x00\n' # reproduces your first experience
>| >>> f.readline()
>| '\x00d\x00e\x00f\x00\n'
>| >>>
>
>If that file is a one-off, you can obviously fix it by
>throwing away every second line. Otherwise, if it's an ongoing
>exercise, you need to talk sternly to the file's creator :-)
>
>HTH,
>John


Re: Convert to big5 to unicode

2006-09-07 Thread Richard Schulman
On 7 Sep 2006 01:27:55 -0700, "GM" <[EMAIL PROTECTED]> wrote:

>Could you all give me some guide on how to convert my big5 string to
>unicode using python? I already knew that I might use cjkcodecs or
>python 2.4 but I still don't have idea on what exactly I should do.
>Please give me some sample code if you could. Thanks a lot

Gary, I used this Java program quite a few years ago to convert
various Big5 files to UTF-16. (Sorry it's Java not Python, but I'm a
very recent convert to the latter.) My newsgroup reader has messed the
formatting up somewhat. If this causes a problem, email me and I'll
send you the source directly.

-Richard Schulman

/*  This program converts an input file of one encoding format to
 *  an output file of another format. It will be mainly used to
 *  convert Big5 text files to Unicode text files.
 */

import java.io.*;

public class ConvertEncoding
{
    public static void main(String[] args)
    {
        try
        {
            convert(args[0], args[1], "BIG5", "UTF-16LE");
        }
        //  Or, for example:
        //      convert(args[0], args[1], "GB2312", "UTF8");
        //  or numerous variations thereon. Among possible choices for
        //  input or output encodings: "GB2312", "BIG5", "UTF8",
        //  "UTF-16LE". The last named is MS UCS-2 format.
        //  I.e., convert("input file", "output file",
        //                "input encoding", "output encoding")
        catch (Exception e)
        {
            System.out.print(e.getMessage());
            System.exit(1);
        }
    }

    public static void convert(String infile, String outfile,
                               String from, String to)
        throws IOException, UnsupportedEncodingException
    {
        // Set up byte streams
        InputStream in;
        if (infile != null)
            in = new FileInputStream(infile);
        else
            in = System.in;

        OutputStream out;
        if (outfile != null)
            out = new FileOutputStream(outfile);
        else
            out = System.out;

        // Set up character streams
        Reader r = new BufferedReader(new InputStreamReader(in, from));
        Writer w = new BufferedWriter(new OutputStreamWriter(out, to));

        w.write("\ufeff"); // This character signals Unicode in the NT environment
        char[] buffer = new char[4096];
        int len;
        while ((len = r.read(buffer)) != -1)
            w.write(buffer, 0, len);
        r.close();
        w.flush();
        w.close();
    }
}
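
Since this is a Python list, here is what I take to be a rough Python
equivalent (an untested sketch; the file names are placeholders):

import codecs

in_f  = codecs.open("input.big5",   "r", encoding="big5")
out_f = codecs.open("output.utf16", "w", encoding="utf-16le")
out_f.write(u"\ufeff")        # BOM, as in the Java version
while True:
    chunk = in_f.read(4096)
    if not chunk:
        break
    out_f.write(chunk)
in_f.close()
out_f.close()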


cx_Oracle question

2006-09-08 Thread Richard Schulman
I'm having trouble getting started using Python's cx_Oracle binding to
Oracle XE. In forthcoming programs, I need to set variables within sql
statements based on values read in from flat files. But I don't seem
to be able to get even the following stripped-down test program to
work:

import cx_Oracle
connection = cx_Oracle.connect("username", "password")
cursor = connection.cursor()

arg_1 = 2  # later, arg_1, arg_2, etc. will be read in from files

cursor.execute("""select mean_eng_txt from mean
                  where mean_id=:arg_1""",arg_1)
for row in cursor.fetchone():
    print row
cursor.close()
connection.close()

The program above produces the following error message:

C:\pythonapps>python oracle_test.py
Traceback (most recent call last):
  File "oracle_test.py", line 7, in ?
    cursor.execute('select mean_eng_txt from mean where
                    mean_id=:arg_1',arg_1)
TypeError: expecting a dictionary, sequence or keyword args

What do I need to do to get this sort of program working?

TIA,
Richard Schulman


Re: cx_Oracle question

2006-09-08 Thread Richard Schulman
Richard Schulman:
>> cursor.execute("""select mean_eng_txt from mean
>>   where mean_id=:arg_1""",arg_1)

Uwe Hoffman:
>cursor.execute("""select mean_eng_txt from mean
>where mean_id=:arg_1""",{"arg_1":arg_1})

R.S.'s error message:
>> Traceback (most recent call last):
>>File "oracle_test.py", line 7, in ?
>>   cursor.execute('select mean_eng_txt from mean where
>>   mean_id=:arg_1',arg_1)
>> TypeError: expecting a dictionary, sequence or keyword args

Excellent! Vielen Dank, Uwe (and Diez).

This also turned out to work:

cursor.execute("""select mean_eng_txt from mean
  where mean_id=:arg_1""",arg_1=arg_1)
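
(Presumably a plain sequence with a positional placeholder would work
as well -- e.g. cursor.execute("select mean_eng_txt from mean where
mean_id = :1", [arg_1]) -- though I have only tried the forms above.)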

Richard Schulman


Unicode / cx_Oracle problem

2006-09-08 Thread Richard Schulman
Sorry to be back at the goodly well so soon, but...

...when I execute the following -- the variable mean holding UTF-16LE
text bound for the column mean_eng_txt, whose datatype is nvarchar2(79)
in Oracle:

 cursor.execute("""INSERT INTO mean (mean_id,mean_eng_txt)
 VALUES (:id,:mean)""",id=id,mean=mean)

I not surprisingly get this error message:

 "cx_Oracle.NotSupportedError: Variable_TypeByValue(): unhandled data
type unicode"

But when I try putting a codecs.BOM_UTF16_LE in various plausible
places, I just end up generating different errors.

Recommendations, please?

TIA,
Richard Schulman


Re: Unicode / cx_Oracle problem

2006-09-09 Thread Richard Schulman
>>  cursor.execute("""INSERT INTO mean (mean_id,mean_eng_txt)
>>  VALUES (:id,:mean)""",id=id,mean=mean)
>>...
>>  "cx_Oracle.NotSupportedError: Variable_TypeByValue(): unhandled data
>> type unicode"
>> 
>> But when I try putting a codecs.BOM_UTF16_LE in various plausible
>> places, I just end up generating different errors.

Diez:
>Show us the alleged plausible places, and the different errors. 
>Otherwise it's crystal ball time again.

More usefully, let's just try to fix the code above. Here's the error
message I get:

NotSupportedError: Variable_TypeByValue(): unhandled data type unicode

Traceback (innermost last):

File "c:\pythonapps\LoadMeanToOra.py", line 1, in ?
  # LoadMeanToOra reads a UTF-16LE input file one record at a time
File "c:\pythonapps\LoadMeanToOra.py", line 23, in ?
  cursor.execute("""INSERT INTO mean (mean_id,mean_eng_txt)

What I can't figure out is whether cx_Oracle is saying it can't handle
Unicode for an Oracle nvarchar2 data type or whether it can handle the
input but that it needs to be in a specific format that I'm not
supplying.

- Richard Schulman


Re: Unicode / cx_Oracle problem

2006-09-10 Thread Richard Schulman
On Sun, 10 Sep 2006 11:42:26 +0200, "Diez B. Roggisch"
<[EMAIL PROTECTED]> wrote:

>What does print repr(mean) give you?

That is a useful suggestion.

For context, I reproduce the source code:

import codecs
import cx_Oracle

in_file = codecs.open("c:\\pythonapps\\mean.my",encoding="utf_16_LE")
connection = cx_Oracle.connect("username", "password")
cursor = connection.cursor()
for row in in_file:
    id = row[0]
    mean = row[1]
    print "Value of row is ", repr(row)                    # debug line
    print "Value of the variable 'id' is ", repr(id)       # debug line
    print "Value of the variable 'mean' is ", repr(mean)   # debug line
    cursor.execute("""INSERT INTO mean (mean_id,mean_eng_txt)
                      VALUES (:id,:mean)""",id=id,mean=mean)

Here is the result from the print repr() statements:

Value of row is  u"\ufeff(3,'sadness, lament; sympathize with,
pity')\r\n"
Value of the variable 'id' is  u'\ufeff'
Value of the variable 'mean' is  u'('

Clearly, the values loaded into the 'id' and 'mean' variables are not
satisfactory: each is a single character rather than a whole field,
and 'id' has picked up the BOM besides.
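
Presumably, per John's earlier advice, I need to open the file with
encoding="utf_16" (so the codec strips the BOM) and then actually parse
each row into fields, since indexing a string merely yields single
characters. Something like this sketch, which naively assumes every
row looks like the one above:

import codecs
in_file = codecs.open("c:\\pythonapps\\mean.my", encoding="utf_16")
for row in in_file:
    row = row.rstrip(u"\r\n")
    num, rest = row.split(u",", 1)        # u"(3" and u"'sadness, ...')"
    id = int(num.lstrip(u"("))            # 3
    mean = rest.rstrip(u")").strip(u"'")  # the quoted text itself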

>... 
>The oracle NLS is a sometimes tricky beast: when it sets the encoding, it
>tries to be clever and assigns an existing connection some encoding,
>based on the users/machines locale. Which can yield unexpected results, 
>such as "Dusseldorf" instead of "Düsseldorf" when querying a german city 
>list with an english locale.

Agreed.

>So - you have to figure out, what encoding your db-connection expects. 
>You can do so by issuing some queries against the session tables I 
>believe - I don't have my oracle resources at home, but googling will 
>bring you there, the important oracle term is NLS.

It's very hard to figure out what to do on the basis of complexities
on the order of

http://download-east.oracle.com/docs/cd/B25329_01/doc/appdev.102/b25108/xedev_global.htm#sthref1042

(tiny equivalent: http://tinyurl.com/fnc54)

But I'm not even sure I got that far. My problems seem to lie earlier,
in Python or in Python's cx_Oracle driver. To be candid, I'm very
tempted at this point to abandon the Python effort and revert to an
all-UCS-2 environment, much as I dislike Java's and C#'s complexities
and the poor support available for all-Java databases.

>Then you need to encode the unicode string before passing it - something 
>like this:
>
>mean = mean.encode("latin1")

I don't see how the Chinese characters embedded in the English text
will carry over if I do that.
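
(If the session character set were, say, AL32UTF8, I suppose the
analogous step would be mean.encode("utf-8"), which at least can
represent the Chinese characters; but I haven't verified how cx_Oracle
treats that.)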

In any case, thanks for your patient and generous help.

Richard Schulman


Re: Unicode / cx_Oracle problem

2006-09-16 Thread Richard Schulman
On 10 Sep 2006 15:27:17 -0700, "John Machin" <[EMAIL PROTECTED]>
wrote:

>...
>Encode each Unicode text field in UTF-8. Write the file as a CSV file
>using Python's csv module. Read the CSV file using the same module.
>Decode the text fields from UTF-8.
>
>You need to parse the incoming line into column values (the csv module
>does this for you) and then convert each column value from
>string/Unicode to a Python type that is compatible with the Oracle type
>for that column.
>...
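
If I follow, the round trip would look something like this sketch (the
file name is made up), with UTF-8 byte strings keeping the csv module
happy:

import csv

# Writing: encode each unicode field to a UTF-8 byte string first
out = open("mean.csv", "wb")
writer = csv.writer(out)
writer.writerow([3, u'sadness, lament; sympathize with, pity'.encode("utf-8")])
out.close()

# Reading: decode each text field back from UTF-8
for fields in csv.reader(open("mean.csv", "rb")):
    id = int(fields[0])
    mean = fields[1].decode("utf-8")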

John, how am I to reconcile your suggestions above with my
ActivePython 2.4 documentation, which states:

<<12.20 csv -- CSV File Reading and Writing

This version of the csv module doesn't support Unicode input. Also,
there are currently some issues regarding ASCII NUL characters.>>

Regards,
Richard Schulman