On 28-07-13 21:23, wxjmfa...@gmail.com wrote:
On Sunday, July 28, 2013 at 17:52:47 UTC+2, Michael Torrie wrote:
On 07/27/2013 12:21 PM, wxjmfa...@gmail.com wrote:
Good point. FSR is a nice tool for those who wish to teach
Unicode. It is not every day one has such an opportunity.
I had a long e-mail composed, but decided to chop it down; it was still
too long, so I ditched a lot of the context, which jmf also seems to do.
Apologies.
1. FSR *is* UTF-32, so it is as unicode compliant as UTF-32, since UTF-32
is an official encoding. FSR only differs from UTF-32 in that the
padding zeros are stripped off, so that each string is stored in the most
compact form that can handle all of its characters, which is
always known at string creation time (the first sketch after this list
shows the widths being picked). Now you can argue many things,
but to say FSR is not unicode compliant is quite a stretch! What
unicode entities or characters cannot be stored in strings using FSR?
What sequences of bytes in FSR result in invalid Unicode entities?
2. Strings in Python *never change*. They are immutable. The +
operator always copies strings character by character into a new string
object, and it would do so even if Python used UTF-8 internally. If
you're doing a lot of string concatenation, perhaps you're using the
wrong data type. A byte buffer might be better for you, where you can
stuff utf-8 sequences into it to your heart's content (the second
sketch below shows the idea).
3. UTF-8 and UTF-16, being variable-width encodings, would make string
indexing and slicing very, very slow, and that's unacceptable for
the use cases of python strings. I'm assuming you understand big O
notation, as you talk of experience in many languages over the years.
FSR and UTF-32 are both O(1) for lookups; with UTF-8, UTF-16, or any
variable-width encoding, finding the nth character is O(n). A lot
slower! (The third sketch below walks through why.)
4. Unicode is, well, unicode. You seem to hop all over the place, from
talking about code points to bytes to bits, using them all
interchangeably. And now you seem to be claiming that a particular byte
encoding standard (UTF-8) is by definition unicode. Or at least that's
how it sounds. You also claim FSR is not compliant with unicode
standards, which appears to me to be completely false.
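To illustrate point 1, you can watch the FSR pick its width (a minimal
sketch; the helper name is mine, and the exact getsizeof overheads vary
across CPython builds, though the 1/2/4-byte per-character widths are
fixed by PEP 393):

import sys

def per_char(ch):
    # Bytes added per extra character in a string containing ch.
    return sys.getsizeof(ch * 2) - sys.getsizeof(ch)

print(per_char('d'))           # 1: latin-1 range, 1 byte per char
print(per_char('€'))           # 2: BMP char, 2 bytes per char
print(per_char('\U00010010'))  # 4: astral char, 4 bytes per char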
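For point 2, the byte-buffer alternative could look something like this
(a rough sketch, not a benchmark; bytearray is one of several suitable
buffer types):

buf = bytearray()
for word in ('hundre', '€', '!'):
    buf += word.encode('utf-8')   # append UTF-8 sequences in place
text = buf.decode('utf-8')        # decode once, when a str is needed
print(text)                       # hundre€!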
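And for point 3, here is what a lookup into raw UTF-8 would have to do
(a toy illustration, not anything CPython actually does; the linear
scan is precisely why it doesn't):

def nth_codepoint(utf8_bytes, n):
    # Walk the variable-width data from the front: O(n), not O(1).
    count = 0
    for i, b in enumerate(utf8_bytes):
        if b & 0xC0 != 0x80:        # a lead byte starts a character
            if count == n:
                j = i + 1           # collect its continuation bytes
                while j < len(utf8_bytes) and utf8_bytes[j] & 0xC0 == 0x80:
                    j += 1
                return utf8_bytes[i:j].decode('utf-8')
            count += 1
    raise IndexError(n)

print(nth_codepoint('hundre€'.encode('utf-8'), 6))  # '€', only after scanning the prefix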
Is my understanding of these things wrong?
------
Compare these (a BDFL example, where I'm using a non-ascii char):
Py 3.2 (narrow build)
>>> timeit.timeit("a = 'hundred'; 'x' in a")
0.09897159682121348
>>> timeit.timeit("a = 'hundre€'; 'x' in a")
0.09079501961732461
>>> sys.getsizeof('d')
32
>>> sys.getsizeof('€')
32
>>> sys.getsizeof('dd')
34
>>> sys.getsizeof('d€')
34
Py 3.3
>>> timeit.timeit("a = 'hundred'; 'x' in a")
0.12183182740848858
>>> timeit.timeit("a = 'hundre€'; 'x' in a")
0.2365732969632326
>>> sys.getsizeof('d')
26
>>> sys.getsizeof('€')
40
>>> sys.getsizeof('dd')
27
>>> sys.getsizeof('d€')
42
Tell me which one seems to be more "unicode compliant"?
Can't tell; you give no relevant information on which one could decide
this question.
The goal of Unicode is to handle every char "equally".
Not at this level of detail; you are looking at irrelevant
implementation details.
Now, the problem: memory. Do not forget that an "FSR"-like
mechanism is *irrelevant* for a non-ascii user. As
soon as one uses a single non-ascii char, your ascii feature
is lost. (That is why we have all these dedicated coding
schemes, utfs included.)
So? Why should that trouble me? As far as I understand,
whether I have an ascii string or not is totally irrelevant
to the application programmer. Within the application I
just process strings and let the programming environment
keep track of these details in a transparent way, unless
you start looking at things like getsizeof, which gives
you implementation details that are mostly irrelevant
in deciding whether the behaviour is compliant or not.
>>> sys.getsizeof('abc' * 1000 + 'z')
3026
>>> sys.getsizeof('abc' * 1000 + '\U00010010')
12044
A bit of a secret: the larger a repertoire of characters
is, the more bits you need.
Secret #2: you cannot escape from this.
And totally unimportant for deciding compliance.
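The arithmetic behind those "secrets" is not deep, by the way; one line
covers it (assuming nothing beyond the published Unicode range of 0 to
0x10FFFF):

import math
# Code points run from 0 to 0x10FFFF, so a fixed-width cell needs:
print(math.ceil(math.log2(0x110000)))  # 21 bits, hence 4-byte units in practice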
--
Antoon Pardon