Hi,

So in the regex we have to determine whether we are unencoding a
single-byte or multi-byte character.

Right now we read in a single byte and pass it to chr().  I do not have
enough experience with multi-byte characters to know when a byte can be
recognized as the first byte of a multi-byte character, so that the
following bytes can be grabbed as well before passing to chr().

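From RFC 2279 [1], with my comments (the second line of each entry gives
the decimal range and the hex digits of each byte; *n means the
continuation byte occurs n times):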

0000 0000-0000 007F   0xxxxxxx
        0-127         [0-7][0-9A-F]

0000 0080-0000 07FF   110xxxxx 10xxxxxx
      128-2047        [C-D][0-9A-F] [8-B][0-9A-F]*1

0000 0800-0000 FFFF   1110xxxx 10xxxxxx 10xxxxxx
     2048-65535       [E][0-9A-F] [8-B][0-9A-F]*2

0001 0000-001F FFFF   11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    65536-2097151     [F][0-7] [8-B][0-9A-F]*3

0020 0000-03FF FFFF   111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
  2097152-67108863    [F][8-B] [8-B][0-9A-F]*4

0400 0000-7FFF FFFF   1111110x 10xxxxxx 10xxxxxx [..] 10xxxxxx
 67108864-2147483647  [F][C-D] [8-B][0-9A-F]*5

That is what we should write an algorithm for. Character boundaries can
be detected easily: a single-byte character is one byte between 0x00
and 0x7F, while a multi-byte UTF-8 character always starts with a byte
between 0xC0 and 0xFD, followed by one to five bytes between 0x80 and
0xBF.
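
To make the boundary rule concrete, here is a rough sketch of it in
Python (not the code under discussion, just an illustration). The names
sequence_length, decode_code_points and unescape are made up for this
example. It follows the RFC 2279 table above, including the 5- and
6-byte forms, and assumes the escaped string is plain ASCII with %XX
escapes. It does not reject overlong encodings, which a real
implementation probably should.

```python
import re


def sequence_length(lead):
    """Total bytes in the sequence started by `lead` (RFC 2279 table).

    Returns 0 if `lead` cannot start a character, i.e. it is a
    continuation byte (0x80-0xBF) or 0xFE/0xFF.
    """
    if lead <= 0x7F:
        return 1
    if 0xC0 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF7:
        return 4
    if 0xF8 <= lead <= 0xFB:
        return 5
    if 0xFC <= lead <= 0xFD:
        return 6
    return 0


def decode_code_points(data):
    """Split raw bytes at character boundaries; return the code points."""
    points = []
    i = 0
    while i < len(data):
        lead = data[i]
        n = sequence_length(lead)
        if n == 0 or i + n > len(data):
            raise ValueError("bad or truncated sequence at offset %d" % i)
        if n == 1:
            cp = lead
        else:
            # Keep only the payload bits of the lead byte:
            # 110xxxxx -> 0x1F, 1110xxxx -> 0x0F, and so on.
            cp = lead & (0x7F >> n)
            for b in data[i + 1:i + n]:
                if not 0x80 <= b <= 0xBF:
                    raise ValueError("bad continuation byte at offset %d" % i)
                cp = (cp << 6) | (b & 0x3F)
        points.append(cp)
        i += n
    return points


def unescape(text):
    """Turn %XX escapes into bytes, then decode the bytes as UTF-8.

    Assumption: `text` is ASCII. Python's chr() only accepts code points
    up to 0x10FFFF, so larger values from 5/6-byte forms raise ValueError.
    """
    raw = re.sub(
        rb"%([0-9A-Fa-f]{2})",
        lambda m: bytes([int(m.group(1), 16)]),
        text.encode("ascii"),
    )
    return "".join(chr(cp) for cp in decode_code_points(raw))
```

For example, unescape("caf%C3%A9") gives "café": 0xC3 is recognized as
the first byte of a two-byte character, so 0xA9 is consumed with it
before anything is passed to chr().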

Bye,
  Andras

[1] http://www.faqs.org/rfcs/rfc2279.html
