In article <mailman.5750.1329094801.27778.python-l...@python.org>, Chris Angelico <ros...@gmail.com> wrote:
> The advantage, though, is that you can always know how many bytes to
> read for X characters. In ASCII, you allocate 80 bytes of storage and
> you can store 80 characters. In UTF-8, if you want an 80-character
> buffer, you can probably get away with allocating 240 bytes...
> but maybe not. In UTF-32, it's easy - just allocate 320 bytes and you
> know you can store them. Also, you know exactly where the 17th
> character is; in UTF-8, you have to count. That's a huge advantage for
> in-memory strings; but is it useful on disk, where (as likely as not)
> you're actually looking for lines, which you still have to scan for?
> I'm thinking not, so it makes sense to use a smaller disk image than
> UTF-32 - fewer total bytes means fewer sectors to read/write, which
> translates fairly directly into performance.

You might just write the files compressed. My guess is that a typical
gzipped UTF-32 text file will be smaller than the same data stored as
uncompressed UTF-8.
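
For anyone who wants to test that guess rather than take it on faith, here is a minimal sketch using only the standard library. The helper name `encoded_sizes` and the sample string are made up for illustration, and the sample is highly repetitive ASCII, so it compresses far better than real prose would. Treat the numbers as a starting point and feed it your own files before drawing conclusions.

```python
import gzip


def encoded_sizes(text):
    """Return byte counts for text in UTF-8 and UTF-32, raw and gzipped."""
    sizes = {}
    # "utf-32-le" is used instead of "utf-32" so no BOM is prepended.
    for codec in ("utf-8", "utf-32-le"):
        raw = text.encode(codec)
        sizes[codec] = len(raw)
        sizes[codec + "+gzip"] = len(gzip.compress(raw))
    return sizes


# Hypothetical sample: repetitive ASCII, NOT representative of real text.
sample = "The quick brown fox jumps over the lazy dog.\n" * 200
sizes = encoded_sizes(sample)
for name, size in sizes.items():
    print(f"{name:16} {size:8} bytes")

# Fixed-width indexing: in UTF-32 the 17th character (index 16) always
# sits at byte offset 16 * 4, so it can be sliced out without scanning.
raw32 = sample.encode("utf-32-le")
seventeenth = raw32[16 * 4:17 * 4].decode("utf-32-le")
```

The slice at the end is the "you know exactly where the 17th character is" point from the quoted post: with four bytes per code point the offset is plain arithmetic, whereas in UTF-8 you would have to walk the bytes from the start.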