In article <mailman.5750.1329094801.27778.python-l...@python.org>,
 Chris Angelico <ros...@gmail.com> wrote:

> The advantage, though, is that you can always know how many bytes to
> read for X characters. In ASCII, you allocate 80 bytes of storage and
> you can store 80 characters. In UTF-8, if you want an 80-character
> buffer, you can probably get away with allocating 240 bytes...
> but maybe not. In UTF-32, it's easy - just allocate 320 bytes and you
> know you can store them. Also, you know exactly where the 17th
> character is; in UTF-8, you have to count. That's a huge advantage for
> in-memory strings; but is it useful on disk, where (as likely as not)
> you're actually looking for lines, which you still have to scan for?
> I'm thinking not, so it makes sense to use a smaller disk image than
> UTF-32 - fewer total bytes means fewer sectors to read/write, which
> translates fairly directly into performance.
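
The byte counts above are easy to check. A minimal sketch, assuming
Python 3 (where len() of a str counts code points); encoded_sizes and
the sample strings are names I made up for illustration:

```python
def encoded_sizes(text):
    """Return the character count and encoded byte counts for text."""
    return {
        "chars": len(text),
        "utf-8": len(text.encode("utf-8")),
        # "utf-32-le" writes no byte-order mark, so this is exactly
        # 4 bytes per code point.
        "utf-32": len(text.encode("utf-32-le")),
    }

ascii_line = "a" * 80           # 1 byte each in UTF-8
euro_line = "\u20ac" * 80       # EURO SIGN, 3 bytes each in UTF-8
emoji_line = "\U0001F600" * 80  # outside the BMP, 4 bytes each in UTF-8

for line in (ascii_line, euro_line, emoji_line):
    print(encoded_sizes(line))

# Fixed width makes indexing trivial: the 17th character of a UTF-32
# buffer always occupies bytes 64..67.
raw = euro_line.encode("utf-32-le")
seventeenth = raw[16 * 4:17 * 4].decode("utf-32-le")
```

So 240 bytes of UTF-8 is enough for 80 characters from the Basic
Multilingual Plane, but 80 code points outside it need 320 bytes, the
same as UTF-32.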

You might just write files compressed.  My guess is that a typical 
gzipped UTF-32 text file will be smaller than the same data stored as 
uncompressed UTF-8.
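
That is only a guess, but it is cheap to test. A rough sketch, assuming
Python 3 and the standard gzip module; compare_sizes is a hypothetical
helper and the sample text is invented (and very repetitive, which
flatters the compressor), so try it on your own files before trusting
the numbers:

```python
import gzip

def compare_sizes(text):
    """Return raw and gzipped byte counts for text in UTF-8 and UTF-32."""
    utf8 = text.encode("utf-8")
    utf32 = text.encode("utf-32-le")  # no BOM, 4 bytes per code point
    return {
        "utf-8": len(utf8),
        "utf-8 gzipped": len(gzip.compress(utf8)),
        "utf-32": len(utf32),
        "utf-32 gzipped": len(gzip.compress(utf32)),
    }

# Made-up sample; real text will compress less well than this.
sample = "The quick brown fox jumps over the lazy dog.\n" * 200
sizes = compare_sizes(sample)
for name, size in sizes.items():
    print(f"{name:>15}: {size} bytes")
```

The comparison that matters for the guess is "utf-32 gzipped" against
plain "utf-8". Whether gzipped UTF-32 also holds up against gzipped
UTF-8 depends on the data.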