Unicode string handling problem
The following program fragment works correctly with an ASCII input file. But the file I actually want to process is Unicode (UTF-16 encoding). The file must be Unicode rather than ASCII or Latin-1 because it contains mixed Chinese and English characters. When I run the program below I get an attribute_count of zero, which is incorrect for this input file; it should give a value of fifteen or sixteen. In other words, the count function isn't recognizing the ", characters in the line being read. Here's the program:

in_file = open("c:\\pythonapps\\in-graf1.my", "rU")
try:
    # Skip the first line; make the second available for processing
    in_file.readline()
    in_line = in_file.readline()
    attribute_count = in_line.count('",')
    print attribute_count
finally:
    in_file.close()

Any suggestions?

Richard Schulman
(For email reply, delete the 'xx' characters)
Re: Unicode string handling problem
Thanks for your excellent debugging suggestions, John. See below for my follow-up:

Richard Schulman:
>> The following program fragment works correctly with an ASCII input
>> file.
>>
>> But the file I actually want to process is Unicode (utf-16 encoding).
>> The file must be Unicode rather than ASCII or Latin-1 because it
>> contains mixed Chinese and English characters.
>>
>> When I run the program below I get an attribute_count of zero, which
>> is incorrect for the input file, which should give a value of fifteen
>> or sixteen. In other words, the count function isn't recognizing the
>> ", characters in the line being read. Here's the program:
>> ...

John Machin:
> Insert
>     print type(in_line)
>     print repr(in_line)
> here [also make the appropriate changes to get the same info from the
> first line], run it again, copy/paste what you get, show us what you
> see.

Here's the revised program, per your suggestion:

=
# This program processes a UTF-16 input file that is
# to be loaded later into a mySQL table. The input file
# is not yet ready for prime time. The purpose of this
# program is to ready it.

in_file = open("c:\\pythonapps\\in-graf1.my", "rU")
try:
    # The first line read is a SQL INSERT statement; no
    # processing will be required.
    in_line = in_file.readline()
    print type(in_line)   # For debugging
    print repr(in_line)   # For debugging

    # The second line read is the first data row.
    in_line = in_file.readline()
    print type(in_line)   # For debugging
    print repr(in_line)   # For debugging

    # For this and subsequent rows, we must count all
    # the < ", > character-pairs in a given line/row.
    # This will provide an n-1 measure of the attributes
    # for a SQL insert of this row. All rows must have
    # sixteen attributes, but some don't yet.
    attribute_count = in_line.count('",')
    print attribute_count
finally:
    in_file.close()
=

The output of this program, which I ran at the command line, had to be copied by hand and abridged, but I think I have included the relevant information:

C:\pythonapps>python graf_correction.py
'\xff\xfeI\x00N\x00S...     [the beginning of a SQL INSERT statement]
...\x00U\x00E\x00S\x00\n'   [the VALUES keyword at the end of the row,
                             followed by an end-of-line]
'\x00\n'   [oh-oh! For the second row, all we're seeing is an
            end-of-line character. Is that from the first row? Wasn't
            the "rU" mode supposed to handle that?]
0          [the counter value. It's hardly surprising it's only zero,
            given that most of the row never got loaded, just an eol
            mark]

J.M.:
> If you're coy about that, then you'll have to find out yourself if it
> has a BOM at the front, and if not whether it's little/big/endian.

The BOM is little-endian, I believe.

R.S.:
>> Any suggestions?

J.M.:
> 1. Read the Unicode HOWTO.
> 2. Read the docs on the codecs module ...
>
> You'll need to use
>
> in_file = codecs.open(filepath, mode, encoding="utf16???")

Right you are. Here is the output produced by so doing:

u'\ufeffINSERT INTO [...] VALUES\n'
u'\n'
0    [the counter value]

> It would also be a good idea to get into the habit of using unicode
> constants like u'",'

Right.

> HTH,
> John

Yes, it did. Many thanks! Now I've got to figure out the best way to handle that \n\n at the end of each row, which the program is interpreting as two rows. That represents two surprises: first, I thought that Microsoft files ended as \n\r; second, I thought that Python mode "rU" was supposed to be the universal eol handler and would handle the \n\r as one mark.

Richard Schulman
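For reference, the codecs-based fragment that produced the output above would look something like this (a sketch reconstructed from John's advice, not the exact code that was run; it uses utf_16 rather than utf_16_le, per John's later suggestion, so the codec consumes the BOM):

import codecs

in_file = codecs.open("c:\\pythonapps\\in-graf1.my", encoding="utf_16")
try:
    in_file.readline()                      # skip the SQL INSERT header line
    in_line = in_file.readline()            # first data row, now a unicode string
    attribute_count = in_line.count(u'",')  # unicode constant, per John's advice
    print attribute_count
finally:
    in_file.close()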
Re: Unicode string handling problem
On 5 Sep 2006 19:50:27 -0700, "John Roth" <[EMAIL PROTECTED]> wrote:

>> [T]he file I actually want to process is Unicode (utf-16 encoding).
>> ...
>> in_file = open("c:\\pythonapps\\in-graf1.my","rU")
>> ...

John Roth:
> You're not detecting the file encoding and then
> using it in the open statement. If you know this is
> utf-16le or utf-16be, you need to say so in the
> open. If you don't, then you should read it into
> a string, go through some autodetect logic, and
> then decode it with the .decode(encoding) method.
>
> A clue: a properly formatted utf-16 or utf-32
> file MUST have a BOM as the first character.
> That's mandated in the unicode standard. If
> it doesn't have a BOM, then try ascii and
> utf-8 in that order. The first one that succeeds
> is correct. If neither succeeds, you're on your
> own in guessing the file encoding.

Thanks for this further information. I'm now using the codec with improved results, but am still puzzled as to how to handle the row termination of \n\n, which is being interpreted as two rows instead of one.
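John's autodetect suggestion, rendered as code (a sketch of the logic he describes, limited to UTF-16 since that is the case at hand; none of this code appears in the original thread):

import codecs

def guess_and_decode(path):
    raw = open(path, "rb").read()
    # A well-formed UTF-16 file starts with a byte order mark
    if raw[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE):
        return raw.decode("utf_16")  # the utf_16 codec reads and strips the BOM
    # No BOM: try ascii, then utf-8, in John's suggested order
    for encoding in ("ascii", "utf_8"):
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            pass
    raise ValueError("cannot guess the encoding of %r" % path)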
Re: Unicode string handling problem
On Wed, 06 Sep 2006 03:55:18 GMT, Richard Schulman <[EMAIL PROTECTED]> wrote:

> ...I'm now using the codec with improved results, but am still puzzled
> as to how to handle the row termination of \n\n, which is being
> interpreted as two rows instead of one.

Of course, I could do a double read on each row and ignore the second read, which merely fetches the latter of the two u'\n' characters. But that's not very elegant, and I'm sure there's a better way to do it (hint, hint someone).

Richard Schulman
(for email, drop the 'xx' in the reply-to)
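One generic possibility, pending the real diagnosis in the follow-ups below: skip any line that consists of a bare newline instead of pairing the reads. A sketch only; process_row is a hypothetical handler, and the real cure turned out to be dropping the "rU" mode:

for in_line in in_file:
    if in_line == u"\n":
        continue          # phantom empty "row" -- skip it
    process_row(in_line)  # hypothetical handler for a real data row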
Re: Unicode string handling problem
Many thanks for your help, John, in giving me the tools to work successfully in Python with Unicode from here on out.

It turns out that the Unicode input files I was working with (from MS Word and MS Notepad) were indeed creating eol sequences of \r\n, not \n\n as I had originally thought. The file-reading statement that I was using, with unpredictable results, was

#in_file = codecs.open("c:\\pythonapps\\in-graf2.my","rU",encoding="utf-16LE")

This was reading to the \n on the first read (outputting the whole line, including the \n but, weirdly, not the preceding \r). Then, also weirdly, the next readline would read the same \n again, interpreting that as the entirety of a phantom second line. So each input-file line ended up producing two output lines. Once the mode string "rU" was dropped, as in

in_file = codecs.open("c:\\pythonapps\\in-graf2.my",encoding="utf-16LE")

all suddenly became well: no more doubled readlines, and one could see the \r\n termination of each line.

This behavior of "rU" was not at all what I had expected from the brief discussion of it in _Python Cookbook_. Which all goes to point out how difficult it is to cook challenging dishes from sketchy recipes alone. There is no substitute for the helpful advice of an experienced chef.

-Richard Schulman
(remove "xx" for email reply)

On 5 Sep 2006 22:29:59 -0700, "John Machin" <[EMAIL PROTECTED]> wrote:

> Richard Schulman wrote:
> [big snip]
>>
>> The BOM is little-endian, I believe.
>
> Correct.
>
>>> in_file = codecs.open(filepath, mode, encoding="utf16???")
>>
>> Right you are. Here is the output produced by so doing:
>
> You don't say which encoding you used, but I guess that you used
> utf_16_le.
>
>> u'\ufeffINSERT INTO [...] VALUES\n'
>
> Use utf_16 -- it will strip off the BOM for you.
>
>> u'\n'
>> 0    [the counter value]
>
> [snip]
>
>> Yes, it did. Many thanks! Now I've got to figure out the best way to
>> handle that \n\n at the end of each row, which the program is
>> interpreting as two rows.
>
> Well, we don't know yet exactly what you have there. We need a byte
> dump of the first few bytes of your file. Get into the interactive
> interpreter and do this:
>
>     open('yourfile', 'rb').read(200)
>
> (the 'b' is for binary, in case you are on Windows)
> That will show us exactly what's there, without *any* EOL
> interpretation at all.
>
>> That represents two surprises: first, I thought that Microsoft files
>> ended as \n\r ;
>
> Nah. Wrong on two counts. In text mode, Microsoft *lines* end in \r\n
> (not \n\r); *files* may end in ctrl-Z aka chr(26) -- an inheritance
> from CP/M.
>
> U ... are you saying the file has \n\r at the end of each row?? How
> did you know that if you didn't know what if any BOM it had??? Who
> created the file?
>
>> second, I thought that Python mode "rU" was supposed to be the
>> universal eol handler and would handle the \n\r as one mark.
>
> Nah again. It contemplates only \n, \r, and \r\n as end of line. See
> the docs. Thus \n\r becomes *two* newlines when read with "rU".
>
> Having "\n\r" at the end of each row does fit with your symptoms:
>
> | >>> bom = u"\ufeff"
> | >>> guff = '\n\r'.join(['abc', 'def', 'ghi'])
> | >>> guffu = unicode(guff)
> | >>> import codecs
> | >>> f = codecs.open('guff.utf16le', 'wb', encoding='utf_16_le')
> | >>> f.write(bom+guffu)
> | >>> f.close()
>
> | >>> open('guff.utf16le', 'rb').read()   # see exactly what we've got
> | '\xff\xfea\x00b\x00c\x00\n\x00\r\x00d\x00e\x00f\x00\n\x00\r\x00g\x00h\x00i\x00'
>
> | >>> codecs.open('guff.utf16le', 'r', encoding='utf_16').read()
> | u'abc\n\rdef\n\rghi'   # Look, Mom, no BOM!
>
> | >>> codecs.open('guff.utf16le', 'rU', encoding='utf_16').read()
> | u'abc\n\ndef\n\nghi'   # U means \r -> \n
>
> | >>> codecs.open('guff.utf16le', 'rU', encoding='utf_16_le').read()
> | u'\ufeffabc\n\ndef\n\nghi'   # reproduces your second experience
>
> | >>> open('guff.utf16le', 'rU').readlines()
> | ['\xff\xfea\x00b\x00c\x00\n', '\x00\n', '\x00d\x00e\x00f\x00\n',
> |  '\x00\n', '\x00g\x00h\x00i\x00']
> | >>> f = open('guff.utf16le', 'rU')
> | >>> f.readline()
> | '\xff\xfea\x00b\x00c\x00\n'
> | >>> f.readline()
> | '\x00\n'   # reproduces your first experience
> | >>> f.readline()
> | '\x00d\x00e\x00f\x00\n'
> | >>>
>
> If that file is a one-off, you can obviously fix it by throwing away
> every second line. Otherwise, if it's an ongoing exercise, you need
> to talk sternly to the file's creator :-)
>
> HTH,
> John
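Whichever mode ends up being used, a line-by-line reader can be made indifferent to the exact terminator by stripping trailing CR/LF characters itself. A short sketch, assuming the utf_16 codec so the BOM is consumed:

import codecs

for line in codecs.open("guff.utf16le", encoding="utf_16"):
    line = line.rstrip(u"\r\n")  # tolerates \n, \r, and \r\n terminators
    # ... process the cleaned line ...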
Re: Convert to big5 to unicode
On 7 Sep 2006 01:27:55 -0700, "GM" <[EMAIL PROTECTED]> wrote:

> Could you all give me some guidance on how to convert my big5 string
> to unicode using python? I already knew that I might use cjkcodecs or
> python 2.4 but I still don't have any idea of what exactly I should
> do. Please give me some sample code if you could. Thanks a lot.

Gary, I used this Java program quite a few years ago to convert various Big5 files to UTF-16. (Sorry it's Java, not Python, but I'm a very recent convert to the latter.) My newsgroup reader has messed up the formatting somewhat. If this causes a problem, email me and I'll send you the source directly.

-Richard Schulman

/* This program converts an input file of one encoding format to an
 * output file of another format. It will be mainly used to convert
 * Big5 text files to Unicode text files.
 */
import java.io.*;

public class ConvertEncoding {
    public static void main(String[] args) {
        try {
            convert(args[0], args[1], "BIG5", "UTF-16LE");
            // Or, at the command line:
            //   convert(args[0], args[1], "GB2312", "UTF8");
            // or numerous variations thereon. Among possible choices
            // for input or output: "GB2312", "BIG5", "UTF8",
            // "UTF-16LE". The last named is MS UCS-2 format.
            // I.e., "input file", "output file", "input encoding",
            // "output encoding".
        }
        catch (Exception e) {
            System.out.print(e.getMessage());
            System.exit(1);
        }
    }

    public static void convert(String infile, String outfile,
                               String from, String to)
        throws IOException, UnsupportedEncodingException
    {
        // Set up byte streams
        InputStream in;
        if (infile != null) in = new FileInputStream(infile);
        else in = System.in;
        OutputStream out;
        if (outfile != null) out = new FileOutputStream(outfile);
        else out = System.out;

        // Set up character streams
        Reader r = new BufferedReader(new InputStreamReader(in, from));
        Writer w = new BufferedWriter(new OutputStreamWriter(out, to));

        // This character signals Unicode in the NT environment
        w.write("\ufeff");
        char[] buffer = new char[4096];
        int len;
        while ((len = r.read(buffer)) != -1)
            w.write(buffer, 0, len);
        r.close();
        w.flush();
        w.close();
    }
}
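And since the original question asked for Python: the Java program above reduces to a few lines with the standard codecs module (a sketch, untested against real Big5 data; the file names are placeholders):

import codecs

def convert(infile, outfile, from_enc="big5", to_enc="utf-16-le"):
    # Decode the whole input file to a unicode string
    f = codecs.open(infile, "rb", encoding=from_enc)
    try:
        text = f.read()
    finally:
        f.close()
    # Write a BOM, then the text, in the target encoding --
    # mirroring the w.write("\ufeff") in the Java version
    g = codecs.open(outfile, "wb", encoding=to_enc)
    try:
        g.write(u"\ufeff")
        g.write(text)
    finally:
        g.close()

convert("in-big5.txt", "out-utf16.txt")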
cx_Oracle question
I'm having trouble getting started using Python's cx_Oracle binding to Oracle XE. In forthcoming programs, I need to set variables within SQL statements based on values read in from flat files. But I don't seem to be able to get even the following stripped-down test program to work:

import cx_Oracle

connection = cx_Oracle.connect("username", "password")
cursor = connection.cursor()
arg_1 = 2   # later, arg_1, arg_2, etc. will be read in from files
cursor.execute("""select mean_eng_txt from mean
                  where mean_id=:arg_1""", arg_1)
for row in cursor.fetchone():
    print row
cursor.close()
connection.close()

The program above produces the following error message:

C:\pythonapps>python oracle_test.py
Traceback (most recent call last):
  File "oracle_test.py", line 7, in ?
    cursor.execute('select mean_eng_txt from mean where mean_id=:arg_1', arg_1)
TypeError: expecting a dictionary, sequence or keyword args

What do I need to do to get this sort of program working?

TIA,
Richard Schulman
(For email reply, remove the xx characters)
Re: cx_Oracle question
Richard Schulman:
>> cursor.execute("""select mean_eng_txt from mean
>>                   where mean_id=:arg_1""", arg_1)

Uwe Hoffman:
> cursor.execute("""select mean_eng_txt from mean
>                   where mean_id=:arg_1""", {"arg_1": arg_1})

R.S.'s error message:
>> Traceback (most recent call last):
>>   File "oracle_test.py", line 7, in ?
>>     cursor.execute('select mean_eng_txt from mean where
>>                    mean_id=:arg_1', arg_1)
>> TypeError: expecting a dictionary, sequence or keyword args

Excellent! Vielen Dank, Uwe (and Diez). This also turned out to work:

cursor.execute("""select mean_eng_txt from mean
                  where mean_id=:arg_1""", arg_1=arg_1)

Richard Schulman
(For email reply, remove the xx part)
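The TypeError quoted above actually lists the accepted forms: bind values may be supplied as a dictionary, a sequence, or keyword arguments; a bare integer is none of these. A minimal illustration of all three against the thread's table (a sketch, not from the original posts):

cursor.execute("select mean_eng_txt from mean where mean_id=:arg_1",
               {"arg_1": 2})   # dictionary
cursor.execute("select mean_eng_txt from mean where mean_id=:arg_1",
               arg_1=2)        # keyword argument
cursor.execute("select mean_eng_txt from mean where mean_id=:arg_1",
               [2])            # sequence, matched to :arg_1 by position
print cursor.fetchone()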
Unicode / cx_Oracle problem
Sorry to be back at the goodly well so soon, but...

...when I execute the following -- the bind variable mean holding UTF-16LE text and the target column mean_eng_txt having datatype nvarchar2(79) in Oracle:

cursor.execute("""INSERT INTO mean (mean_id, mean_eng_txt)
                  VALUES (:id, :mean)""", id=id, mean=mean)

I not surprisingly get this error message:

cx_Oracle.NotSupportedError: Variable_TypeByValue(): unhandled data type unicode

But when I try putting a codecs.BOM_UTF16_LE in various plausible places, I just end up generating different errors.

Recommendations, please?

TIA,
Richard Schulman
(Remove xx for email reply)
Re: Unicode / cx_Oracle problem
>> cursor.execute("""INSERT INTO mean (mean_id, mean_eng_txt)
>>                   VALUES (:id, :mean)""", id=id, mean=mean)
>> ...
>> "cx_Oracle.NotSupportedError: Variable_TypeByValue(): unhandled data
>> type unicode"
>>
>> But when I try putting a codecs.BOM_UTF16_LE in various plausible
>> places, I just end up generating different errors.

Diez:
> Show us the alleged plausible places, and the different errors.
> Otherwise it's crystal-ball time again.

More usefully, let's just try to fix the code above. Here's the error message I get:

NotSupportedError: Variable_TypeByValue(): unhandled data type unicode

Traceback (innermost last):
  File "c:\pythonapps\LoadMeanToOra.py", line 1, in ?
    # LoadMeanToOra reads a UTF-16LE input file one record at a time
  File "c:\pythonapps\LoadMeanToOra.py", line 23, in ?
    cursor.execute("""INSERT INTO mean (mean_id,mean_eng_txt)

What I can't figure out is whether cx_Oracle is saying it can't handle Unicode for an Oracle nvarchar2 data type, or whether it can handle the input but needs it in a specific format that I'm not supplying.

-Richard Schulman
Re: Unicode / cx_Oracle problem
On Sun, 10 Sep 2006 11:42:26 +0200, "Diez B. Roggisch" <[EMAIL PROTECTED]> wrote:

> What does print repr(mean) give you?

That is a useful suggestion. For context, I reproduce the source code:

in_file = codecs.open("c:\\pythonapps\\mean.my", encoding="utf_16_LE")
connection = cx_Oracle.connect("username", "password")
cursor = connection.cursor()
for row in in_file:
    id = row[0]
    mean = row[1]
    print "Value of row is ", repr(row)                   # debug line
    print "Value of the variable 'id' is ", repr(id)      # debug line
    print "Value of the variable 'mean' is ", repr(mean)  # debug line
    cursor.execute("""INSERT INTO mean (mean_id, mean_eng_txt)
                      VALUES (:id, :mean)""", id=id, mean=mean)

Here is the result from the print repr() statements:

Value of row is  u"\ufeff(3,'sadness, lament; sympathize with, pity')\r\n"
Value of the variable 'id' is  u'\ufeff'
Value of the variable 'mean' is  u'('

Clearly, the values loaded into the 'id' and 'mean' variables are not satisfactory: row[0] and row[1] are merely the first two *characters* of the line (the BOM and an opening parenthesis), not its first two fields.

> ...
> The oracle NLS is a sometimes tricky beast: as it sets the encoding,
> it tries to be clever and assigns an existing connection some
> encoding, based on the user's/machine's locale. Which can yield
> unexpected results, such as "Dusseldorf" instead of "Düsseldorf" when
> querying a german city list with an english locale.

Agreed.

> So - you have to figure out what encoding your db-connection expects.
> You can do so by issuing some queries against the session tables, I
> believe - I don't have my oracle resources at home, but googling will
> bring you there; the important oracle term is NLS.

It's very hard to figure out what to do on the basis of complexities on the order of
http://download-east.oracle.com/docs/cd/B25329_01/doc/appdev.102/b25108/xedev_global.htm#sthref1042
(tiny equivalent: http://tinyurl.com/fnc54 )

But I'm not even sure I got that far. My problems so far seem prior: in Python or Python's cx_Oracle driver. To be candid, I'm very tempted at this point to abandon the Python effort and revert to an all-UCS2 environment, much as I dislike Java and C#'s complexities and the poor support available for all-Java databases.

> Then you need to encode the unicode string before passing it --
> something like this:
>
> mean = mean.encode("latin1")

I don't see how the Chinese characters embedded in the English text will carry over if I do that.

In any case, thanks for your patient and generous help.

Richard Schulman
(Delete the antispamming 'xx' characters for email reply)
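For what it's worth, cx_Oracle builds of that era would bind plain byte strings but not unicode objects, so a commonly suggested workaround follows Diez's encode advice but with UTF-8 rather than Latin-1, so the Chinese characters survive. A sketch only, not the thread's confirmed solution: it assumes the client character set is forced to UTF8 via NLS_LANG before connecting, and that the fields have first been parsed out correctly (the values below are hypothetical):

import os
os.environ["NLS_LANG"] = ".UTF8"  # assumption: must be set before connecting

import cx_Oracle

connection = cx_Oracle.connect("username", "password")
cursor = connection.cursor()

mean_id = 3  # hypothetical values, as if parsed from the input line
mean = u"sadness, lament; sympathize with, pity"

cursor.execute("""INSERT INTO mean (mean_id, mean_eng_txt)
                  VALUES (:id, :mean)""",
               id=mean_id,
               mean=mean.encode("utf_8"))  # byte string, not unicode
connection.commit()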
Re: Unicode / cx_Oracle problem
On 10 Sep 2006 15:27:17 -0700, "John Machin" <[EMAIL PROTECTED]> wrote:

> ...
> Encode each Unicode text field in UTF-8. Write the file as a CSV file
> using Python's csv module. Read the CSV file using the same module.
> Decode the text fields from UTF-8.
>
> You need to parse the incoming line into column values (the csv
> module does this for you) and then convert each column value from
> string/Unicode to a Python type that is compatible with the Oracle
> type for that column.
> ...

John, how am I to reconcile your suggestions above with my ActivePython 2.4 documentation, which states:

<< 12.20 csv -- CSV File Reading and Writing
...
This version of the csv module doesn't support Unicode input. Also,
there are currently some issues regarding ASCII NUL characters.
Accordingly, all input should be UTF-8 or printable ASCII to be safe. >>

Regards,
Richard Schulman
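The two are reconcilable precisely because of that caveat: the recipe John describes never hands the csv module unicode objects, only UTF-8 byte strings, which is exactly what the documentation says is safe. A sketch of the reading side (the helper name and usage are hypothetical, not from the thread):

import csv
import codecs

def unicode_csv_rows(path, encoding="utf_16"):
    # csv can't take unicode, so re-encode each line to UTF-8 bytes,
    # let csv parse the fields, then decode each field back to unicode.
    in_file = codecs.open(path, encoding=encoding)
    utf8_lines = (line.encode("utf_8") for line in in_file)
    for row in csv.reader(utf8_lines):
        yield [field.decode("utf_8") for field in row]

# hypothetical usage against the thread's file:
# for fields in unicode_csv_rows("c:\\pythonapps\\mean.my"):
#     print repr(fields)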