On Thu, 01 May 2014 18:38:35 -0400, Terry Reedy wrote:

> "strange beasties like python's FSR"
>
> Have you really let yourself be poisoned by JMF's bizarre rants? The FSR
> is an *internal optimization* that benefits most unicode operations that
> people actually perform. It uses UTF-32 by default but adapts to the
> strings users create by compressing the internal format. The compression
> is trivial -- simply dropping leading null bytes common to all
> characters -- so each character is still readable as is.
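A rough way to see what that compression buys you: on a 3.3 or later
interpreter, strings built from wider character ranges take proportionally
more memory. (The exact figures are a CPython implementation detail and
vary between versions and builds, but the ordering should hold.)

py> import sys
py> a = 'a' * 1000            # all Latin-1 range: about 1 byte per character
py> b = '\u0100' * 1000       # BMP but not Latin-1: about 2 bytes per character
py> c = '\U00010000' * 1000   # astral characters: 4 bytes per character
py> assert sys.getsizeof(a) < sys.getsizeof(b) < sys.getsizeof(c)
py>

The sizes aren't exactly 1000, 2000 and 4000 bytes because the string
object carries some header overhead on top of the character data.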
For anyone who, like me, wasn't convinced that Unicode worked that way, you
can see for yourself that it does. You don't need Python 3.3; any version
of 3.x will work. In Python 2.7, it should work if you just change the
calls from "chr()" to "unichr()". Note that Python 3 refuses to encode lone
surrogates, so the surrogate range 0xD800-0xDFFF has to be skipped:

py> for i in range(256):
...     c = chr(i)
...     u = c.encode('utf-32-be')
...     assert u[:3] == b'\0\0\0'
...     assert u[3:] == c.encode('latin-1')
...
py> for i in range(256, 0xFFFF+1):
...     if 0xD800 <= i <= 0xDFFF:
...         continue  # lone surrogates: not encodable in Python 3
...     c = chr(i)
...     u = c.encode('utf-32-be')
...     assert u[:2] == b'\0\0'
...     assert u[2:] == c.encode('utf-16-be')
...
py>

So Terry is correct: dropping leading zeroes, and treating the remainder as
either Latin-1 or UTF-16, works fine, and potentially saves a lot of memory.


-- 
Steven D'Aprano
http://import-that.dreamwidth.org/