Hi,
So in the regex we have to determine whether we are unencoding a single-byte or multi-byte character.
read in a single byte and pass it to chr(). I do not have enough experience with multi-byte characters to know when a byte can be recognized as the first byte of a multi-byte character, and thus grab the next byte before passing to chr().
From RFC-2279 [1], with my comments:
UCS-4 range (hex)    UTF-8 octet sequence (binary)                 Decimal range        First byte (hex)  Continuation bytes (hex) * count
0000 0000-0000 007F  0xxxxxxx                                      0-127                [0-7][0-9A-F]     (none)
0000 0080-0000 07FF  110xxxxx 10xxxxxx                             128-2047             [C-D][0-9A-F]     [8-B][0-9A-F]*1
0000 0800-0000 FFFF  1110xxxx 10xxxxxx 10xxxxxx                    2048-65535           [E][0-9A-F]       [8-B][0-9A-F]*2
0001 0000-001F FFFF  11110xxx 10xxxxxx 10xxxxxx 10xxxxxx           65536-2097151        [F][0-7]          [8-B][0-9A-F]*3
0020 0000-03FF FFFF  111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx  2097152-67108863     [F][8-B]          [8-B][0-9A-F]*4
0400 0000-7FFF FFFF  1111110x 10xxxxxx 10xxxxxx [..] 10xxxxxx      67108864-2147483647  [F][C-D]          [8-B][0-9A-F]*5
That is what we should write an algorithm for. Character boundaries can be detected easily: a byte between 0x00 and 0x7F is a complete single-byte character, while a multi-byte UTF-8 character always starts with a byte between 0xC0 and 0xFD, followed by one to five bytes between 0x80 and 0xBF.
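
A minimal sketch of such an algorithm, written in Python purely for illustration. The names sequence_length and decode_code_points are hypothetical, and the sketch assumes the %XX escapes have already been converted to raw bytes. It follows the RFC 2279 ranges in the table, so it accepts five- and six-byte sequences, and it does not reject overlong encodings.

```python
def sequence_length(first_byte):
    """Return how many bytes the UTF-8 sequence starting with first_byte occupies."""
    if first_byte <= 0x7F:
        return 1  # 0xxxxxxx: single-byte character
    if 0xC0 <= first_byte <= 0xDF:
        return 2  # 110xxxxx
    if 0xE0 <= first_byte <= 0xEF:
        return 3  # 1110xxxx
    if 0xF0 <= first_byte <= 0xF7:
        return 4  # 11110xxx
    if 0xF8 <= first_byte <= 0xFB:
        return 5  # 111110xx
    if 0xFC <= first_byte <= 0xFD:
        return 6  # 1111110x
    # 0x80-0xBF is a continuation byte; 0xFE and 0xFF never appear in UTF-8.
    raise ValueError("0x%02X cannot start a UTF-8 sequence" % first_byte)


def decode_code_points(data):
    """Split raw bytes at character boundaries and return the code points as integers."""
    code_points = []
    i = 0
    while i < len(data):
        first = data[i]
        length = sequence_length(first)
        if i + length > len(data):
            raise ValueError("truncated sequence at offset %d" % i)
        if length == 1:
            value = first
        else:
            # Keep only the payload bits (the x's) of the first byte.
            value = first & (0xFF >> (length + 1))
            for b in data[i + 1:i + length]:
                if not 0x80 <= b <= 0xBF:
                    raise ValueError("bad continuation byte 0x%02X" % b)
                # Each continuation byte contributes its low six bits.
                value = (value << 6) | (b & 0x3F)
        code_points.append(value)
        i += length
    return code_points
```

Each returned integer can then be passed to chr(), as long as it is within the range that chr() supports on the platform in question.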
Bye, Andras
[1] http://www.faqs.org/rfcs/rfc2279.html