On 28-07-13 21:23, wxjmfa...@gmail.com wrote:
On Sunday, July 28, 2013 at 17:52:47 UTC+2, Michael Torrie wrote:
On 07/27/2013 12:21 PM, wxjmfa...@gmail.com wrote:
Good point. FSR is a nice tool for those who wish to teach
Unicode. It is not every day one has such an opportunity.
I had a long e-mail composed, but decided to chop it down; it was still
too long, so I ditched a lot of the context, which jmf also seems to do.
Apologies.
1. FSR *is* UTF-32, so it is as unicode compliant as UTF-32, since UTF-32
is an official encoding. FSR only differs from UTF-32 in that the
padding zeros are stripped off, so that each string is stored in the most
compact form that can handle all of its characters, which is
always known at string creation time (the first sketch after this list
shows the widths being picked). Now you can argue many things,
but to say FSR is not unicode compliant is quite a stretch! What
unicode entities or characters cannot be stored in strings using FSR?
What sequences of bytes in FSR result in invalid Unicode entities?
2. Strings in Python *never change*. They are immutable. The +
operator always copies strings character by character into a new string
object, and it would do so even if Python used UTF-8 internally. If
you're doing a lot of string concatenation, perhaps you're using the
wrong data type. A byte buffer might be better for you, where you can
stuff utf-8 sequences into it to your heart's content (the second
sketch below shows the idea).
3. UTF-8 and UTF-16, being variable-width encodings, would make string
indexing and slicing very, very slow, and that's unacceptable for
the use cases of python strings. I'm assuming you understand big O
notation, as you talk of experience in many languages over the years.
FSR and UTF-32 are both O(1) for lookups; with UTF-8, UTF-16, or any
variable-width encoding, finding the nth character is O(n). A lot
slower! (The third sketch below walks through why.)
4. Unicode is, well, unicode. You seem to hop all over the place, from
talking about code points to bytes to bits, using them all
interchangeably. And now you seem to be claiming that a particular byte
encoding standard (UTF-8) is by definition unicode. Or at least that's
how it sounds. You also claim FSR is not compliant with unicode
standards, which appears to me to be completely false.
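To illustrate point 1, you can watch the FSR pick its width (a minimal
sketch; the helper name is mine, and the exact getsizeof overheads vary
across CPython builds, though the 1/2/4-byte per-character widths are
fixed by PEP 393):

import sys

def per_char(ch):
    # Bytes added per extra character in a string containing ch.
    return sys.getsizeof(ch * 2) - sys.getsizeof(ch)

print(per_char('d'))           # 1: latin-1 range, 1 byte per char
print(per_char('€'))           # 2: BMP char, 2 bytes per char
print(per_char('\U00010010'))  # 4: astral char, 4 bytes per char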
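For point 2, the byte-buffer alternative could look something like this
(a rough sketch, not a benchmark; bytearray is one of several suitable
buffer types):

buf = bytearray()
for word in ('hundre', '€', '!'):
    buf += word.encode('utf-8')   # append UTF-8 sequences in place
text = buf.decode('utf-8')        # decode once, when a str is needed
print(text)                       # hundre€!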
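And for point 3, here is what a lookup into raw UTF-8 would have to do
(a toy illustration, not anything CPython actually does; the linear
scan is precisely why it doesn't):

def nth_codepoint(utf8_bytes, n):
    # Walk the variable-width data from the front: O(n), not O(1).
    count = 0
    for i, b in enumerate(utf8_bytes):
        if b & 0xC0 != 0x80:        # a lead byte starts a character
            if count == n:
                j = i + 1           # collect its continuation bytes
                while j < len(utf8_bytes) and utf8_bytes[j] & 0xC0 == 0x80:
                    j += 1
                return utf8_bytes[i:j].decode('utf-8')
            count += 1
    raise IndexError(n)

print(nth_codepoint('hundre€'.encode('utf-8'), 6))  # '€', only after scanning the prefix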
Is my understanding of these things wrong?
------
Compare these (a BDFL example, where I'm using a non-ascii char):
Py 3.2 (narrow build)
>>> timeit.timeit("a = 'hundred'; 'x' in a")
0.09897159682121348
>>> timeit.timeit("a = 'hundre€'; 'x' in a")
0.09079501961732461
>>> sys.getsizeof('d')
32
>>> sys.getsizeof('€')
32
>>> sys.getsizeof('dd')
34
>>> sys.getsizeof('d€')
34
Py 3.3
>>> timeit.timeit("a = 'hundred'; 'x' in a")
0.12183182740848858
>>> timeit.timeit("a = 'hundre€'; 'x' in a")
0.2365732969632326
>>> sys.getsizeof('d')
26
>>> sys.getsizeof('€')
40
>>> sys.getsizeof('dd')
27
>>> sys.getsizeof('d€')
42
Tell me which one seems to be more "unicode compliant"?
Can't tell; you give no relevant information on which one could decide
this question.
The goal of Unicode is to handle every char "equally".
Not at this level of detail; you are looking at irrelevant
implementation details.
Now, the problem: memory. Do not forget that an "FSR"-like
mechanism is *irrelevant* for a non-ascii user. As
soon as one uses a single non-ascii char, your ascii feature
is lost. (That is why we have all these dedicated coding
schemes, utfs included.)
So? Why should that trouble me? As far as I understand,
whether I have an ascii string or not is totally irrelevant
to the application programmer. Within the application I
just process strings and let the programming environment
keep track of these details in a transparent way, unless
you start looking at things like getsizeof, which gives
you implementation details that are mostly irrelevant
in deciding whether the behaviour is compliant or not.
>>> sys.getsizeof('abc' * 1000 + 'z')
3026
>>> sys.getsizeof('abc' * 1000 + '\U00010010')
12044
A bit of a secret: the larger a repertoire of characters
is, the more bits you need.
Secret #2: you cannot escape from this.
And totally unimportant for deciding compliance.
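The arithmetic behind those "secrets" is not deep, by the way; one line
covers it (assuming nothing beyond the published Unicode range of 0 to
0x10FFFF):

import math
# Code points run from 0 to 0x10FFFF, so a fixed-width cell needs:
print(math.ceil(math.log2(0x110000)))  # 21 bits, hence 4-byte units in practice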
--
Antoon Pardon