On Thu, 10 Mar 2016 01:03 am, BartC wrote:

> On 09/03/2016 02:18, Steven D'Aprano wrote:
>> On Wed, 9 Mar 2016 12:28 pm, BartC wrote:
>>
>>> (Which wasn't as painful as I'd expected. However the next project I
>>> have in mind is 20K lines rather than 0.7K. For that I'm looking at some
>>> mechanical translation I think. And probably some library to wrap around
>>> Python's i/o.)
>>
>> You almost certainly don't need another wrapper around Python's I/O,
>> making it slower still. You need to understand what Python's I/O is
>> doing.
>
> Well, the original project will be using its file i/o library. So it'll
> use the same interface that will be reimplemented on top of Python i/o.

Just don't complain that it's slow :-)

> And input operations mainly consist of grabbing an entire file at once.

    with open(pathname) as f:
        data = f.read()

> Output is a little more mixed.

It often is.

> I've just tried a UTF-8 file and getting some odd results. With a file
> containing [three euro symbols]:
>
> €€€
>
> (including a 3-byte utf-8 marker at the start), and opened in text mode,
> Python 3 gives me this series of bytes (ie. the ord() of each character):
>
> 239
> 187
> 191
> 226
> 8218
> 172
> 226
> 8218
> 172
> 226
> 8218
> 172

Er, do you think that 8218 is a *byte*?

(Hint: 1 byte = 8 bits, at least on any platform you are likely to be
running.)

Bart, you have a bad habit of giving us the output of your code, with an
implied "explain this", but without showing us the code you used to
generate the output. Without seeing the code you used, I have *no idea*
how you could get that result.

If you read the file in binary, you should get this:

b'\xef\xbb\xbf\xe2\x82\xac\xe2\x82\xac\xe2\x82\xac'

Or in decimal:

239, 187, 191, 226, 130, 172, 226, 130, 172, 226, 130, 172

How you are getting 8218 instead of 130, I have no idea!

If you read the file as text, but using the wrong encoding, say Latin-1,
you would get this:

'ï»¿â\x82¬â\x82¬â\x82¬'

or in decimal:

239, 187, 191, 226, 130, 172, 226, 130, 172, 226, 130, 172

Without seeing your code, I cannot possibly diagnose what you are doing.

> And prints the resulting string as: €€€. Although this latter
> might depend on my console's code page setting.

That is very likely to be the reason for printing strange things. Life is
much easier on Linux and OS X, where the console works with UTF-8 by
default.

> Changing it to UTF-8
> however (CHCP 65001 in Windows) gives me this error when I run the
> program again:
>
> ----------
> Fatal Python error: Py_Initialize: can't initialize sys standard streams
> LookupError: unknown encoding: cp65001
>
> This application has requested the Runtime to terminate it in an unusual
> way.
> Please contact the application's support team for more information.
> ----------

I'm afraid I don't know how to deal with that. It's a Windows-specific
issue.

-- 
Steven
-- 
https://mail.python.org/mailman/listinfo/python-list
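
The decoding cases discussed in this message can be checked with a minimal sketch like the one below. It assumes the file holds exactly the twelve bytes shown in the binary dump (a UTF-8 BOM followed by three euro signs). The cp1252 case is only a guess at what the unposted code may have hit, based on the file being read on Windows. It does reproduce the 8218 values, because cp1252 maps byte 0x82 to U+201A. The helper name code_points is made up for illustration.

```python
# Raw file contents as given in the message: UTF-8 BOM plus three euro signs.
raw = b'\xef\xbb\xbf\xe2\x82\xac\xe2\x82\xac\xe2\x82\xac'


def code_points(data, encoding):
    """Decode data with the given encoding; return ord() of each character."""
    return [ord(ch) for ch in data.decode(encoding)]


# Binary view: each item is a byte value in range(256).
as_bytes = list(raw)

# Latin-1 maps every byte to the code point with the same number,
# so the decimal values match the raw bytes exactly.
as_latin1 = code_points(raw, 'latin-1')

# Assumption: a Windows default of cp1252. Byte 0x82 becomes U+201A (8218).
as_cp1252 = code_points(raw, 'cp1252')

# 'utf-8-sig' strips the BOM and decodes the euro signs (U+20AC = 8364).
as_utf8_sig = code_points(raw, 'utf-8-sig')

for label, values in [('bytes', as_bytes), ('latin-1', as_latin1),
                      ('cp1252', as_cp1252), ('utf-8-sig', as_utf8_sig)]:
    print(label, values)
```

If that guess is right, passing an explicit encoding, for example open(pathname, encoding='utf-8-sig'), would avoid depending on the platform default.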