> 1. FSR *is* UTF-32 so it is as unicode compliant as UTF-32, since UTF-32
> is an official encoding.  FSR only differs from UTF-32 in that the
> padding zeros are stripped off such that it is stored in the most
> compact form that can handle all the characters in the string, which is
> always known at string creation time.  Now you can argue many things,
> but to say FSR is not unicode compliant is quite a stretch!  What
> unicode entities or characters cannot be stored in strings using FSR?
> What sequences of bytes in FSR result in invalid Unicode entities?
> 2. Strings in Python *never change*.  They are immutable.  The +
> operator always copies strings character by character into a new string
> object, even if Python had used UTF-8 internally.  If you're doing a lot
> of string concatenations, perhaps you're using the wrong data type.  A
> byte buffer might be better for you, where you can stuff utf-8 sequences
> into it to your heart's content.
> 3. UTF-8 and UTF-16 encodings, being variable width encodings, mean that
> slicing a string would be very very slow, and that's unacceptable for
> the use cases of python strings.  I'm assuming you understand big O
> notation, as you talk of experience in many languages over the years.
> FSR and UTF-32 are both O(1) for slicing and lookups.  UTF-8, UTF-16 and
> any other variable-width encoding are O(n) for the same operations.  A
> lot slower!
> 4. Unicode is, well, unicode.  You seem to hop all over the place from
> talking about code points to bytes to bits, using them all
> interchangeably.  And now you seem to be claiming that a particular byte
> encoding standard is by definition unicode (UTF-8).  Or at least that's
> how it sounds.  And also claim FSR is not compliant with unicode
> standards, which appears to me to be completely false.
> Is my understanding of these things wrong?
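
To make point 3 of the quoted text concrete, here is a rough sketch of
what a lookup by code point costs on raw UTF-8 bytes. The helper name
`utf8_char_at` is hypothetical (nothing in CPython is called that), and
it assumes well-formed UTF-8 input. It only illustrates that finding the
n-th code point means scanning from the start, while indexing a `str` is
a direct lookup.

```python
def utf8_char_at(data, index):
    """Return code point number `index` from UTF-8 bytes by linear scan."""
    if index < 0:
        raise IndexError("negative index not supported in this sketch")
    seen = -1      # number of the code point currently being scanned
    start = None   # byte offset where the wanted code point begins
    for pos, byte in enumerate(data):
        # Continuation bytes look like 0b10xxxxxx; any other byte
        # starts a new code point.
        if byte & 0xC0 != 0x80:
            seen += 1
            if seen == index:
                start = pos
            elif seen == index + 1:
                return data[start:pos].decode("utf-8")
    if start is None:
        raise IndexError("index out of range")
    # The wanted code point was the last one in the buffer.
    return data[start:].decode("utf-8")


text = "hundre\u20acx"            # 'hundre' + euro sign + 'x'
encoded = text.encode("utf-8")
same = utf8_char_at(encoded, 6) == text[6]
```

Here `text[6]` is a single indexed read, while `utf8_char_at(encoded, 6)`
has to walk over every byte before the euro sign.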


Compare these (a BDFL example, where I'm using a non-ASCII char)

Py 3.2 (narrow build)

>>> import sys, timeit
>>> timeit.timeit("a = 'hundred'; 'x' in a")
>>> timeit.timeit("a = 'hundre€'; 'x' in a")
>>> sys.getsizeof('d')
>>> sys.getsizeof('€')
>>> sys.getsizeof('dd')
>>> sys.getsizeof('d€')

Py 3.3

>>> import sys, timeit
>>> timeit.timeit("a = 'hundred'; 'x' in a")
>>> timeit.timeit("a = 'hundre€'; 'x' in a")
>>> sys.getsizeof('d')
>>> sys.getsizeof('€')
>>> sys.getsizeof('dd')
>>> sys.getsizeof('d€')

Tell me, which one seems to be more "Unicode compliant"?
The goal of Unicode is to handle every char "equally".
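
To rerun the comparison on whatever interpreter is at hand, a small
script along these lines could be used. The names `SAMPLES`,
`time_membership` and `report`, and the repetition count, are choices
made for this sketch only. No figures are given here, since they depend
on the Python version and the machine.

```python
import sys
import timeit

# Same two strings as in the sessions above; the second ends in a euro sign.
SAMPLES = ["hundred", "hundre\u20ac"]


def time_membership(text, number=100000):
    """Time the assignment plus the `'x' in a` test for one string."""
    stmt = "a = {!r}; 'x' in a".format(text)
    return timeit.timeit(stmt, number=number)


def report():
    """Return (string, seconds, size in bytes) for each sample."""
    return [(text, time_membership(text), sys.getsizeof(text))
            for text in SAMPLES]


if __name__ == "__main__":
    print(sys.version)
    for text, seconds, size in report():
        print("{!r}: {:.4f} s, {} bytes".format(text, seconds, size))
```

Running the same file under each interpreter gives two directly
comparable listings.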

Now, the problem: memory. Do not forget that an "FSR"-style
mechanism is *irrelevant* for a non-ASCII user. As soon as
one uses a single non-ASCII character, the ASCII advantage
is lost. (That is why we have all these dedicated coding
schemes, the UTFs included.)

>>> sys.getsizeof('abc' * 1000 + 'z')
>>> sys.getsizeof('abc' * 1000 + '\U00010010')

A little secret: the larger a repertoire of characters
is, the more bits you need.
Secret #2: you cannot escape from this.
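
The per-character cost can be estimated without relying on absolute
sizes. The sketch below assumes CPython 3.3 or later (where PEP 393
applies); `bytes_per_char` is a hypothetical helper, and it differences
two string sizes so that the fixed object header cancels out.

```python
import sys


def bytes_per_char(ch, n=1000):
    """Estimate storage per code point for a string made only of `ch`."""
    small = ch * n
    large = ch * (2 * n)
    # The object header is identical for both, so the difference is
    # the storage used by the n extra characters.
    return (sys.getsizeof(large) - sys.getsizeof(small)) // n


widths = {
    "ascii z": bytes_per_char("z"),
    "euro sign": bytes_per_char("\u20ac"),
    "U+10010": bytes_per_char("\U00010010"),
}

# The two strings from the session above: 3001 characters each,
# differing only in the last one.
plain = "abc" * 1000 + "z"
mixed = "abc" * 1000 + "\U00010010"
growth = sys.getsizeof(mixed) / sys.getsizeof(plain)

if __name__ == "__main__":
    for label, width in widths.items():
        print(label, width)
    print("mixed/plain size ratio: {:.2f}".format(growth))
```

Under that assumption the widths come out as 1, 2 and 4 bytes, and the
single character outside the BMP makes the whole `mixed` string use the
widest storage.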



