On Mon, Feb 13, 2012 at 11:03 AM, Dave Angel <d...@davea.name> wrote:
> On 02/12/2012 06:29 PM, Steven D'Aprano wrote:
>> I think you mean 4 times as many bytes as characters. Unless you have 32
>> bit bytes :)
>
> Until you have 32 bit bytes, you'll continue to have encodings, even if only
> a couple of them.

The advantage of a fixed-width encoding, though, is that you always know how many bytes to read for X characters. In ASCII, you allocate 80 bytes of storage and you can store 80 characters. In UTF-8, if you want an 80-character buffer, you can probably get away with allocating 240 bytes... but maybe not, since characters outside the BMP take four bytes each. In UTF-32, it's easy - just allocate 320 bytes and you know you can store them. Also, you know exactly where the 17th character is; in UTF-8, you have to count.

That's a huge advantage for in-memory strings; but is it useful on disk, where (as likely as not) you're actually looking for lines, which you still have to scan for? I'm thinking not, so it makes sense to use a smaller disk image than UTF-32 - fewer total bytes means fewer sectors to read/write, which translates fairly directly into performance.

ChrisA
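
P.S. A rough sketch of the arithmetic above, assuming Python 3 (where iterating over bytes yields ints). The sample string and the helper names (encoded_sizes, nth_char_utf32, nth_char_utf8) are made up for illustration, not taken from any library. It shows that UTF-32 lets you index straight to character n, while UTF-8 has to be scanned from the start.

```python
# Hypothetical sample: one character each of 1, 2, 3 and 4 bytes in UTF-8.
sample = "a\u00e9\u20ac\U0001d11e"


def encoded_sizes(text):
    """Return (utf8_bytes, utf32_bytes) needed to store text, without a BOM."""
    return len(text.encode("utf-8")), len(text.encode("utf-32-le"))


def nth_char_utf32(data, n):
    """Fetch character n (0-based) from UTF-32-LE bytes by direct indexing."""
    return data[4 * n:4 * n + 4].decode("utf-32-le")


def nth_char_utf8(data, n):
    """Fetch character n (0-based) from UTF-8 bytes by counting lead bytes."""
    count = -1
    for i, byte in enumerate(data):
        if byte & 0xC0 != 0x80:  # not a continuation byte: a new character starts
            count += 1
            if count == n:
                end = i + 1
                # Extend over the continuation bytes belonging to this character.
                while end < len(data) and data[end] & 0xC0 == 0x80:
                    end += 1
                return data[i:end].decode("utf-8")
    raise IndexError(n)
```

For an 80-character buffer, len(("\U0001d11e" * 80).encode("utf-8")) is the worst case to plan for in UTF-8, which is why 240 bytes is only "probably" enough.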