On 09/03/2016 02:18, Steven D'Aprano wrote:
> On Wed, 9 Mar 2016 12:28 pm, BartC wrote:
>
>> (Which wasn't as painful as I'd expected. However the next project I
>> have in mind is 20K lines rather than 0.7K. For that I'm looking at some
>> mechanical translation I think. And probably some library to wrap around
>> Python's i/o.)
>
> You almost certainly don't need another wrapper around Python's I/O, making
> it slower still. You need to understand what Python's I/O is doing.

Well, the original project uses its own file i/o library. So the translated code will keep calling the same interface, which will be reimplemented on top of Python's i/o.

And input operations mainly consist of grabbing an entire file at once. Output is a little more mixed.
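
A thin layer of that kind might look roughly like the sketch below. The function names are hypothetical and are not the interface of the original library; the sketch only assumes that whole files are read and written as bytes.

```python
import os
import tempfile


def read_file(path):
    """Return the whole file as a bytes object, unmodified."""
    with open(path, "rb") as f:
        return f.read()


def write_file(path, data):
    """Write a bytes object to path, replacing any existing file."""
    with open(path, "wb") as f:
        return f.write(data)  # number of bytes written


# Round trip through a temporary file
path = os.path.join(tempfile.mkdtemp(), "sample.bin")
written = write_file(path, b"line one\r\nline two\r\n")
data = read_file(path)
print(written, len(data))
```

Because both helpers use binary mode, line endings and byte values pass through untouched.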

> If you open a file in binary mode, Python will give you a stream of bytes
> (ordinal values 0 through 255 inclusive). Python won't modify or change
> those bytes in any way. Whatever it reads from disk, it will give to you.
>
> If you open a file in text mode, Python 3 will give you a stream of Unicode
> code points (ordinal values 0 through 0x10FFFF). Earlier versions of Python
> 3 may behave somewhat strangely with so-called "astral characters": I
> recommend that you avoid anything below version 3.3. Unless you are
> including (e.g.) Chinese or ancient Phoenician in your text file, you
> probably won't care.
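
A minimal sketch of the binary/text distinction described above, assuming Python 3.3 or later. The helper name and demo file are hypothetical. The encoding is passed explicitly because the default used by text mode depends on the platform.

```python
import os
import tempfile


def read_both_ways(path, encoding="utf-8"):
    """Return (raw bytes, decoded text) for the same file."""
    with open(path, "rb") as f:
        raw = f.read()
    # newline="" turns off newline translation in text mode
    with open(path, "r", encoding=encoding, newline="") as f:
        text = f.read()
    return raw, text


# Demo file holding one ASCII letter and one euro sign, encoded as UTF-8
demo_path = os.path.join(tempfile.mkdtemp(), "demo.txt")
with open(demo_path, "wb") as f:
    f.write("A\u20ac".encode("utf-8"))

raw, text = read_both_ways(demo_path)
print(list(raw))               # byte values, each in 0..255
print([ord(c) for c in text])  # code points, which can exceed 255
```

The euro sign takes three bytes on disk but is a single code point (8364) once decoded.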

I've just tried a UTF-8 file and am getting some odd results. With a file containing [three euro symbols]:

€€€

(including a 3-byte UTF-8 marker at the start), and opened in text mode, Python 3 gives me this series of values (i.e. the ord() of each character):

239
187
191
226
8218
172
226
8218
172
226
8218
172

And prints the resulting string as: ï»¿â‚¬â‚¬â‚¬. Although the latter might depend on my console's code page setting. Changing it to UTF-8, however (CHCP 65001 in Windows), gives me this error when I run the program again:

----------
Fatal Python error: Py_Initialize: can't initialize sys standard streams
LookupError: unknown encoding: cp65001

This application has requested the Runtime to terminate it in an unusual way.
Please contact the application's support team for more information.
----------

(That was with 3.1; 3.4 gives the same set of characters as above, and shows the string differently, but still wrong. PyPy 3.2.4 gives a different set of byte values, all 0..255, and a different string again, although it now contains some actual € characters.)
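
The ord() values listed earlier are exactly what results when the UTF-8 bytes of the file, marker included, are decoded as cp1252. That the file was opened with a cp1252 default is an assumption about this setup, not something the output proves. The sketch below reproduces the numbers and shows the "utf-8-sig" codec, which decodes UTF-8 and drops the leading marker.

```python
import codecs

# Bytes of a UTF-8 file: 3-byte marker (BOM) followed by three euro signs
data = codecs.BOM_UTF8 + "\u20ac\u20ac\u20ac".encode("utf-8")

# Assumed cause: text mode fell back to cp1252, so every byte became
# its own character
as_cp1252 = [ord(c) for c in data.decode("cp1252")]
print(as_cp1252)

# Naming the encoding gives three code points and no marker
as_utf8 = [ord(c) for c in data.decode("utf-8-sig")]
print(as_utf8)

# Values that all fall in 0..255 would be consistent with a Latin-1 decode
as_latin1 = [ord(c) for c in data.decode("latin-1")]
print(as_latin1)
```

When reading from a file rather than a bytes object, the same effect should come from open(path, encoding="utf-8-sig").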

So I think I'll skip Unicode handling to start off with! (I've already had plenty of fun and games with it in the past.)

--
Bartc


