On 2014-06-04 00:58, Paul Rubin wrote:
> Steven D'Aprano <st...@pearwood.info> writes:
>> Maybe there's a use-case for a microcontroller that works in
>> ISO-8859-5 natively, thus using only eight bits per character,
>
> That won't even make the Russians happy, since in Russia there
> are multiple incompatible legacy encodings.
>
> I've never understood why not use UTF-8 for everything.
If you use UTF-8 for everything, then you end up in a world where
string indexing (see ChrisA's other side thread on this topic) is no
longer an O(1) operation but an O(N) one.  Some of us slice strings
for a living. ;-)

I understand that using UTF-32 would let us keep O(1) indexing at the
cost of every string occupying 4 bytes per character.  The FSR
(again, as I understand it) lets strings that fit in one byte per
character use that, scaling up to wider characters internally only as
they're actually needed/used.

At the cost of complexity and non-constant memory overhead, that O(N)
indexing could be tweaked down to O(log N) with an internal balanced
tree of offsets-to-chunks, where the chunk size is the block size
below which a linear scan is faster than navigating the tree.  One
might even endow the algorithm with FSR smarts, so each
chunk/fragment could use a different encoding in memory, and linearly
iterating over the string would walk the tree, returning each decoded
piece.  (A couple of rough sketches of these ideas are tacked on
below.)

</random_ramblings>

-tkc
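
P.S. To make the O(1)-vs-O(N) point concrete, here's a rough,
untested sketch of what indexing into raw UTF-8 bytes involves: you
can't jump straight to character n, you have to scan past
continuation bytes from the front.  The function name and every
detail here are made up purely for illustration.

def nth_char_utf8(data, n):
    """O(N) scan: return the n-th code point in UTF-8 bytes.

    Continuation bytes look like 0b10xxxxxx, so only bytes that
    *start* a code point are counted.
    """
    count = -1
    for i, byte in enumerate(data):
        if byte & 0xC0 != 0x80:        # start of a new code point
            count += 1
            if count == n:
                end = i + 1
                while end < len(data) and data[end] & 0xC0 == 0x80:
                    end += 1
                return data[i:end].decode("utf-8")
    raise IndexError(n)

print(nth_char_utf8("naïve щит 🐍".encode("utf-8"), 10))  # -> '🐍'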
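
The FSR's width-scaling is also easy to watch from the outside on
CPython 3.3+: all of these 1000-character strings index in O(1), but
sys.getsizeof shows the per-character storage widening only when the
content demands it (the exact byte counts vary by version and build).

import sys

for s in ("a" * 1000,            # ASCII: 1 byte/char internally
          "\xe9" * 1000,         # Latin-1 range: still 1 byte/char
          "\u0429" * 1000,       # Cyrillic, in the BMP: 2 bytes/char
          "\U0001F40D" * 1000):  # outside the BMP: 4 bytes/char
    print(len(s), sys.getsizeof(s))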
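
And a toy sketch of the offsets-to-chunks idea.  I've flattened the
"balanced tree" down to a sorted table of cumulative offsets plus
bisect, which is still O(log number-of-chunks) per lookup, and each
chunk keeps whatever encoding the caller picked for it.  The class
name, the (text, encoding) pairs and the rest are invented on the
spot, illustration only, not how the FSR actually does anything.

from bisect import bisect_right

class ChunkedText:
    """Text stored as separately-encoded chunks, indexed through a
    cumulative-offset table (a stand-in for the balanced tree), so
    s[i] costs O(log chunks) plus decoding a single chunk."""

    def __init__(self, pieces):
        # pieces: iterable of (text, encoding) pairs
        self.chunks = []
        self.offsets = []              # cumulative character counts
        total = 0
        for text, enc in pieces:
            self.chunks.append((text.encode(enc), enc))
            total += len(text)
            self.offsets.append(total)

    def __len__(self):
        return self.offsets[-1] if self.offsets else 0

    def __getitem__(self, i):
        if not 0 <= i < len(self):
            raise IndexError(i)
        k = bisect_right(self.offsets, i)      # O(log #chunks)
        start = self.offsets[k - 1] if k else 0
        data, enc = self.chunks[k]
        return data.decode(enc)[i - start]     # decode one chunk only

    def __iter__(self):
        # linear iteration walks the chunks, yielding each decoded piece
        for data, enc in self.chunks:
            yield from data.decode(enc)

s = ChunkedText([("hello world ", "latin-1"),
                 ("щит ", "utf-16-le"),
                 ("🐍", "utf-32-le")])
print(len(s), s[13], "".join(s))    # 17 и hello world щит 🐍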