On Wed, 10 Oct 2007 22:19:49 -0500, Robert Dailey wrote
> Hi,
>
> Thanks for responding. I apologize about my lack of details, I was in a hurry when I wrote the initial question. I'll provide more details.
>
> Basically, I'm attempting to write out unicode strings (16 bits per character) to a file. Before each string, I write out 4 bytes containing the number of characters (NOT BYTES) the string contains. I suppose the confusion comes in because I'm writing out both text information AND binary data at the same time. I suppose the consistent thing to do would be to write out the strings as binary instead of as text? I'm originally a C++ programmer and I'm still learning Python, so figuring out this problem is a little difficult for me.
>
> In my initial inquiry, I was writing out 5000 as an example, however this number would realistically be the number of characters in the string: len( u"Hello World" ). Once I write out these 4 bytes, I then write out the string "Hello World" immediately after the 4 bytes. You may be wondering why the crazy file format. The reason is because this python script is writing out data that will later be read in by a C++ application.
>
> The following works fine for ASCII strings:
>
> mystring = "Hello World"
> file = open( "somefile.txt", "wb" )
> file.write( struct.pack ( "I", len(mystring) ) )
> file.write( mystring )
>
> Again I do apologize for the lack of detail. If I've still been unclear please don't hesitate to ask for more details.

This is much clearer, and it explains why you need to mix arbitrary binary data with unicode text. Because of this mixing, as you have surmised, you're going to have to treat the file as a binary file in Python. In other words, don't open the file with the codecs module; open it in binary mode and do the encoding yourself, like so:

    import struct

    mystring = u"Hello World"
    file = open( "somefile.txt", "wb" )
    file.write( struct.pack ( "I", len(mystring) ) )
    file.write( mystring.encode("utf-16-le") )
    file.close()

(Note that I've guessed that you want little-endian byte-order in the encoding. Without that indication, that is, with plain "utf-16", encode() would put a byte order mark at the beginning of the string, which you probably don't want.)

Hope this helps,

Carsten.
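
P.S. If it helps to check the layout before writing the C++ reader, here is a minimal round-trip sketch. It rests on a few assumptions that are mine, not yours: the helper names write_string and read_string are made up for illustration, the count is packed as "<I" (explicitly little-endian, always 4 bytes) rather than the native "I", and the count is taken as the number of 16-bit units in the encoded data, which equals the character count as long as every character fits in 16 bits.

```python
import io
import struct


def write_string(stream, text):
    """Write a 4-byte little-endian count, then the text as UTF-16-LE."""
    data = text.encode("utf-16-le")
    # Assumption: the reader expects the number of 16-bit units, not bytes.
    stream.write(struct.pack("<I", len(data) // 2))
    stream.write(data)


def read_string(stream):
    """Read back one string written by write_string."""
    (count,) = struct.unpack("<I", stream.read(4))
    # Each 16-bit unit occupies two bytes in the stream.
    return stream.read(count * 2).decode("utf-16-le")


# An in-memory buffer stands in for a file opened with mode "wb".
buf = io.BytesIO()
write_string(buf, u"Hello World")
buf.seek(0)
roundtrip = read_string(buf)
```

The same two functions should work unchanged on a real file object opened in binary mode. If the C++ side reads the count with the machine's native byte order on a big-endian platform, the "<" in the format string would have to change to match.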