On 14/6/2013 9:00 AM, Zero Piraeus wrote:
:
On 14 June 2013 01:34, Nick the Gr33k <supp...@superhost.gr> wrote:
Why doesn't it work like this?
leading 0 = 1 byte flag
leading 1 = 2 bytes flag
leading 00 = 3 bytes flag
leading 01 = 4 bytes flag
leading 10 = 5 bytes flag
leading 11 = 6 bytes flag
Wouldn't it be more logical?
Think about it. Let's say that, as per your scheme, a leading 0
indicates "1 byte" (as is indeed the case in UTF8). What things could
follow that leading 0? How does that impact your choice of a leading
00 or 01 for other numbers of bytes?
... okay, you're obviously going to need to be spoon-fed a little more
than that. Here's a byte:
01010101
Is that a single byte representing a code point in the 0-127 range, or
the first of 4 bytes representing something else, in your proposed
scheme? How can you tell?
Indeed.
You cannot tell whether it stands for a 1-byte or a 4-byte sequence:
0 + 1010101 = leading 0 stands for a 1-byte representation of a code point
01 + 010101 = leading 01 stands for a 4-byte representation of a code point
The problem with my scheme for how UTF-8 encoding would work is that you
cannot tell whether the flag is '0' or '01'.
The same happens with leading '1' and '11'. You cannot tell what the flag is,
so you cannot know whether the Unicode code point is being represented as a
2-byte sequence or a 6-byte sequence.
Understood
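The ambiguity can be checked mechanically. What follows is only a minimal sketch of the proposed scheme, not of real UTF-8; the names PROPOSED_FLAGS and matching_lengths are made up for illustration. It lists every sequence length whose flag matches the start of a given first byte:

```python
# Hypothetical flag table for the scheme proposed earlier in this thread.
# This is NOT how UTF-8 works; it only restates the proposal as data.
PROPOSED_FLAGS = {
    "0": 1,
    "1": 2,
    "00": 3,
    "01": 4,
    "10": 5,
    "11": 6,
}


def matching_lengths(byte):
    """Return every sequence length whose flag is a prefix of this byte."""
    bits = format(byte, "08b")  # e.g. 85 -> "01010101"
    return sorted(
        length
        for flag, length in PROPOSED_FLAGS.items()
        if bits.startswith(flag)
    )


print(matching_lengths(0b01010101))  # [1, 4]
print(matching_lengths(0b11000000))  # [2, 6]
```

Every possible byte matches two flags at once, because each one-bit flag is a prefix of two of the two-bit flags. A decoder would have to guess.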
Now look at the way UTF8 does it:
<http://en.wikipedia.org/wiki/Utf-8#Description>
Really, follow the link and study the table carefully. Don't continue
reading this until you believe you understand the choices that the
designers of UTF8 made, and why they made them.
Pay particular attention to the possible values for byte 1. Do you
notice the difference between that scheme, and yours:
0xxxxxxx
1xxxxxxx
00xxxxxx
01xxxxxx
10xxxxxx
11xxxxxx
If you don't see it, keep looking until you do ... this email gives
you more than enough hints to work it out. Don't ask someone here to
explain it to you. If you want to become competent, you must use your
brain.
0xxxxxxx
110xxxxx 10xxxxxx
1110xxxx 10xxxxxx 10xxxxxx
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
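For comparison, the four patterns in the table can be tested against Python's own encoder. This is a rough sketch, assuming only the four rows shown above; utf8_sequence_length is an illustrative name, not a standard library function:

```python
def utf8_sequence_length(first_byte):
    """Read the sequence length off the leading bits of a first byte.

    Returns None for a byte that starts with 10 (a continuation byte)
    or that matches none of the four patterns in the table.
    """
    if first_byte & 0b10000000 == 0b00000000:  # 0xxxxxxx
        return 1
    if first_byte & 0b11100000 == 0b11000000:  # 110xxxxx
        return 2
    if first_byte & 0b11110000 == 0b11100000:  # 1110xxxx
        return 3
    if first_byte & 0b11111000 == 0b11110000:  # 11110xxx
        return 4
    return None


# "A", Greek small letter pi, and the euro sign
for ch in ("A", "\u03c0", "\u20ac"):
    encoded = ch.encode("utf-8")
    bits = " ".join(format(b, "08b") for b in encoded)
    print(repr(ch), utf8_sequence_length(encoded[0]), bits)
```

Here each first byte matches exactly one pattern, and the bytes after the first all begin with 10, which matches none of the four first-byte patterns.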
I did read the link but I still cannot see:
1. why '110' is the flag for a 2-byte code point
2. why the leading flag in the 2nd byte and every subsequent byte has to
be '10'
--
What is now proved was at first only imagined!