Just to apologize to the developers here: I'm still not skilled enough to make a proper patch or a clear bug report (I'm on chapter 2 of K&R :-)). I hope that with time I'll learn how to do it. I came to the ksh utf8 discussion because I've been playing with a mail MIME encoder just to learn C, and recognizing valid utf-8 was the first challenge I encountered.
The code pasted below is what I've got so far for recognizing valid utf-8. I'm showing it to make my point: I realize it isn't easy, and with my poor C I'm not able to figure out how you can do such a test byte by byte while the user is typing at the command line. (Don't bother explaining how to me, I know this is not the place to take C lessons.)

By the way, something in the last paragraph of the new utf8(7) man page isn't clear enough (I mentioned this to tedu@).

Thanks to all of you for your work. Now I know how hard it is.

#include <stdio.h>

#define ASCII 0x7f
#define YES 1
#define NO 0

int
main(void)
{
    /* ch: position of the current byte within its character,
       wd: expected width of the character in bytes */
    int c, ch, wd, ln, col, isutf8;

    ch = wd = ln = col = 1;
    isutf8 = YES;
    while ((c = getchar()) != EOF) {
        if (c > ASCII) {
            /* bad lead byte, or bad continuation byte */
            if ((ch == 1 && (c < 0xc2 || c > 0xf7)) ||
                ((ch > 1 && ch <= 4) && ch <= wd &&
                (c < 0x80 || c > 0xbf)))
                isutf8 = NO;
            else if (ch == 1 && c >= 0xc2 && c <= 0xdf) {
                /* 110..... */
                wd = 2;
                ++ch;
            } else if (ch == 1 && c >= 0xe0 && c <= 0xef) {
                /* 1110.... */
                wd = 3;
                ++ch;
            } else if (ch == 1 && c >= 0xf0 && c <= 0xf7) {
                /* 11110... */
                wd = 4;
                ++ch;
            } else if (ch > 1 && ch <= 4 && ch == wd &&
                c >= 0x80 && c <= 0xbf)
                ch = 1;     /* last continuation byte */
            else if (ch > 1 && ch <= 4 && ch < wd &&
                c >= 0x80 && c <= 0xbf)
                ++ch;       /* more continuation bytes to come */
            else
                ++ch;
        } else if (ch > 1 && ch <= 4 && ch <= wd)
            isutf8 = NO;    /* ASCII byte in the middle of a character */
        else
            ch = 1;
        if (isutf8 == NO) {
            printf("Invalid utf-8 character");
            printf(" at line %d col %d.\n", ln, col);
            return 1;
        }
        if (c == '\n') {
            col = 1;
            ++ln;
        } else
            ++col;
    }
    return 0;
}