Just to apologize to the developers here,

I'm still not skilled enough to make a proper patch or a clear bug
report (I'm on chapter 2 of K&R :-)).  I hope that with time I'll learn
how to do it.  I came to the ksh utf8 discussion because I've been
playing with a MIME mail encoder just to learn C, and recognizing valid
UTF-8 was the first challenge I encountered.

The code pasted below is what I have so far for recognizing valid UTF-8.
I'm showing it to make my point: I realize it isn't easy, and with my
poor C I can't figure out how you can do such a test byte by byte
while the user is typing at the command line.  (Don't bother explaining
how to me; I know this is not the place to take C lessons.)

By the way, something in the last paragraph of the new utf8(7) man page
isn't clear enough (I mentioned this to tedu@).

Thanks to all of you for your work.  Now I know how hard it is.


#include <stdio.h>

#define ASCII   0x7f
#define YES     1
#define NO      0

int
main()
{
        int c, ch, wd, ln, col, isutf8;

        ch = wd = ln = col = 1;
        isutf8 = YES;

        while ((c = getchar()) != EOF) {
                if (c > ASCII) {
                        /* bad lead byte, or bad continuation byte */
                        if ((ch == 1 && (c < 0xc2 || c > 0xf4)) ||
                            (ch > 1 && (c < 0x80 || c > 0xbf)))
                                isutf8 = NO;
                        /* 110..... starts a 2-byte sequence */
                        else if (ch == 1 && c >= 0xc2 && c <= 0xdf) {
                                wd = 2;
                                ++ch;
                        /* 1110.... starts a 3-byte sequence */
                        } else if (ch == 1 && c >= 0xe0 && c <= 0xef) {
                                wd = 3;
                                ++ch;
                        /* 11110... starts a 4-byte sequence (0xf4 max) */
                        } else if (ch == 1 && c >= 0xf0 && c <= 0xf4) {
                                wd = 4;
                                ++ch;
                        /* 10...... last byte of the sequence */
                        } else if (ch == wd)
                                ch = 1;
                        /* 10...... more bytes still to come */
                        else
                                ++ch;
                /* ASCII in the middle of a multibyte sequence */
                } else if (ch > 1)
                        isutf8 = NO;
                else
                        ch = 1;

                if (isutf8 == NO) {
                        printf("Invalid utf-8 character");
                        printf(" at line %d col %d.\n", ln, col);
                        return 1;
                }
                if (c == '\n') {
                        col = 1;
                        ++ln;
                } else
                        ++col;
        }

        return 0;
}
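
For anyone curious about the byte-by-byte question, here is a rough
sketch of one way the same test could be fed a single byte at a time,
keeping only a little state between calls.  It is an illustration under
assumptions, not how ksh does it: the names utf8_state, utf8_feed and
utf8_valid are invented here, and the second-byte ranges are the ones
from RFC 3629, so unlike the program above it also rejects overlong
forms, surrogates and anything above U+10FFFF.

```c
#include <stddef.h>

/*
 * Hypothetical names, for illustration only; nothing here comes from
 * ksh or from any library.
 */
struct utf8_state {
        int need;               /* continuation bytes still expected */
        unsigned char lo, hi;   /* allowed range for the next byte */
};

/* Feed one byte.  Returns 1 if it is acceptable so far, 0 if invalid. */
static int
utf8_feed(struct utf8_state *s, unsigned char c)
{
        if (s->need > 0) {
                if (c < s->lo || c > s->hi)
                        return 0;
                /* only the second byte has a restricted range */
                s->lo = 0x80;
                s->hi = 0xbf;
                s->need--;
                return 1;
        }

        s->lo = 0x80;
        s->hi = 0xbf;
        if (c <= 0x7f)
                return 1;               /* plain ASCII */
        if (c >= 0xc2 && c <= 0xdf)
                s->need = 1;
        else if (c >= 0xe0 && c <= 0xef) {
                s->need = 2;
                if (c == 0xe0)
                        s->lo = 0xa0;   /* reject overlong forms */
                else if (c == 0xed)
                        s->hi = 0x9f;   /* reject surrogates */
        } else if (c >= 0xf0 && c <= 0xf4) {
                s->need = 3;
                if (c == 0xf0)
                        s->lo = 0x90;   /* reject overlong forms */
                else if (c == 0xf4)
                        s->hi = 0x8f;   /* reject above U+10FFFF */
        } else
                return 0;               /* 0x80..0xc1 or 0xf5..0xff */
        return 1;
}

/* Check a whole buffer; a sequence cut short at the end is invalid. */
static int
utf8_valid(const unsigned char *buf, size_t len)
{
        struct utf8_state s = { 0, 0x80, 0xbf };
        size_t i;

        for (i = 0; i < len; i++)
                if (!utf8_feed(&s, buf[i]))
                        return 0;
        return s.need == 0;
}
```

A caller reading keystrokes would keep one utf8_state, initialized to
{ 0, 0x80, 0xbf }, and call utf8_feed() on each byte as it arrives;
need == 0 after a call means a character has just been completed.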
