On 14 June 2013 01:34, Nick the Gr33k <supp...@superhost.gr> wrote:
> Why doesn't it work like this?
>
> leading 0 = 1 byte flag
> leading 1 = 2 bytes flag
> leading 00 = 3 bytes flag
> leading 01 = 4 bytes flag
> leading 10 = 5 bytes flag
> leading 11 = 6 bytes flag
>
> Wouldn't it be more logical?

Think about it. Let's say that, as per your scheme, a leading 0 indicates "1 byte" (as is indeed the case in UTF8). What things could follow that leading 0? How does that impact your choice of a leading 00 or 01 for other numbers of bytes?

... okay, you're obviously going to need to be spoon-fed a little more than that. Here's a byte:

    01010101

Is that a single byte representing a code point in the 0-127 range, or the first of 4 bytes representing something else, in your proposed scheme? How can you tell?

Now look at the way UTF8 does it:

<http://en.wikipedia.org/wiki/Utf-8#Description>

Really, follow the link and study the table carefully. Don't continue reading this until you believe you understand the choices that the designers of UTF8 made, and why they made them. Pay particular attention to the possible values for byte 1. Do you notice the difference between that scheme, and yours:

    0xxxxxxx
    1xxxxxxx
    00xxxxxx
    01xxxxxx
    10xxxxxx
    11xxxxxx

If you don't see it, keep looking until you do ... this email gives you more than enough hints to work it out. Don't ask someone here to explain it to you. If you want to become competent, you must use your brain.

 -[]z.
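
For anyone who wants to check the Wikipedia table against real data, here is a minimal sketch, assuming Python 3. The helper name `utf8_bits` is made up for illustration; the only library call is the standard `str.encode("utf-8")`. It prints the bits of each encoded byte for one code point of each length. The first code point is U+0055, which encodes to the byte 01010101 discussed above. Comparing the leading bits of byte 1 across the rows is left to the reader.

```python
def utf8_bits(ch):
    """Return the UTF-8 encoding of one character as a list of 8-bit strings."""
    return [format(byte, "08b") for byte in ch.encode("utf-8")]


# One sample code point per encoded length (1, 2, 3 and 4 bytes).
for code_point in (0x55, 0xE9, 0x20AC, 0x1F600):
    bits = utf8_bits(chr(code_point))
    print(f"U+{code_point:04X}: {' '.join(bits)}")

# Prints:
# U+0055: 01010101
# U+00E9: 11000011 10101001
# U+20AC: 11100010 10000010 10101100
# U+1F600: 11110000 10011111 10011000 10000000
```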