Re: Fix broken UTF-8 decoding

Steffen Nurpmeso Sat, 25 Feb 2023 11:30:26 -0800

Crystal Kolipe wrote in
 <Y/pgctbyhmrx+...@exoticsilicon.com>:
 |Currently it is not possible to use unicode codepoints > 0xFF on the
 |console, because our UTF-8 decoding logic is badly broken.
 |
 |The code in question is in wsemul_subr.c, wsemul_getchar().
 |
 |The problem is that we calculate the number of bytes in a multi-byte
 |sequence by just looking at the high bits in turn:
 ...
 |This is wrong, for several reasons.


Just to note that there are also holes: UTF-8 sequences are not
necessarily well-formed per se (maybe they are when you control
their generation, of course).  It is actually a real mess:

        if(LIKELY(x <= 0x7Fu))
                c = x;
        /* 0xF8, but Unicode guarantees maximum of 0x10FFFFu -> F4 8F BF BF.
         * Unicode 9.0, 3.9, UTF-8, Table 3-7. Well-Formed UTF-8 Byte Sequences */
        else if(LIKELY(x >= 0xC2u && x <= 0xF4u)){ /* C0, C1: only overlong */
                if(LIKELY(x < 0xE0u)){
                        if(UNLIKELY(l < 1))
                                goto jenobuf;
                        --l;

                        c = (x &= 0x1Fu);
                }else if(LIKELY(x < 0xF0u)){
                        if(UNLIKELY(l < 2))
                                goto jenobuf;
                        l -= 2;

                        x1 = x;
                        c = (x &= 0x0Fu);

                        /* Second byte constraints */
                        x = S(u8,*cp++);
                        switch(x1){
                        case 0xE0u:
                                if(UNLIKELY(x < 0xA0u || x > 0xBFu))
                                        goto jerr;
                                break;
                        case 0xEDu:
                                if(UNLIKELY(x < 0x80u || x > 0x9Fu))
                                        goto jerr;
                                break;
                        default:
                                if(UNLIKELY((x & 0xC0u) != 0x80u))
                                        goto jerr;
                                break;
                        }
                        c <<= 6;
                        c |= (x &= 0x3Fu);
                }else{
                        if(UNLIKELY(l < 3))
                                goto jenobuf;
                        l -= 3;

                        x1 = x;
                        c = (x &= 0x07u);

                        /* Second byte constraints */
                        x = S(u8,*cp++);
                        switch(x1){
                        case 0xF0u:
                                if(UNLIKELY(x < 0x90u || x > 0xBFu))
                                        goto jerr;
                                break;
                        case 0xF4u:
                                if(UNLIKELY((x & 0xF0u) != 0x80u)) /* 80..8F */
                                        goto jerr;
                                break;
                        default:
                                if(UNLIKELY((x & 0xC0u) != 0x80u))
                                        goto jerr;
                                break;
                        }
                        c <<= 6;
                        c |= (x &= 0x3Fu);

                        x = S(u8,*cp++);
                        if(UNLIKELY((x & 0xC0u) != 0x80u))
                                goto jerr;
                        c <<= 6;
                        c |= (x &= 0x3Fu);
                }

                x = S(u8,*cp++);
                if(UNLIKELY((x & 0xC0u) != 0x80u))
                        goto jerr;
                c <<= 6;
                c |= x & 0x3Fu;
        }else
                goto jerr;
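
For anyone who wants to play with the Table 3-7 constraints outside of
that fragment, here is a minimal self-contained sketch.  The function
name utf8_decode_one and its return convention are made up for
illustration; they are not taken from the code above nor from
wsemul_subr.c.  It tracks the allowed range of the second byte
depending on the lead byte, which is where the holes are:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (illustration only): decode one code point from
 * buf[0..len).  Returns the number of bytes consumed (1..4) and stores
 * the code point in *out; returns 0 if more input is needed and -1 if
 * the sequence is ill-formed per Unicode Table 3-7. */
static int
utf8_decode_one(const unsigned char *buf, size_t len, uint32_t *out)
{
        unsigned char lo = 0x80u, hi = 0xBFu; /* range of the 2nd byte */
        uint32_t c;
        int n, i;

        if(len == 0)
                return 0;
        c = buf[0];

        if(c <= 0x7Fu){
                *out = c;
                return 1;
        }else if(c >= 0xC2u && c <= 0xDFu){
                n = 2;
                c &= 0x1Fu;
        }else if(c >= 0xE0u && c <= 0xEFu){
                n = 3;
                if(c == 0xE0u)
                        lo = 0xA0u; /* below that: overlong */
                else if(c == 0xEDu)
                        hi = 0x9Fu; /* above that: surrogates */
                c &= 0x0Fu;
        }else if(c >= 0xF0u && c <= 0xF4u){
                n = 4;
                if(c == 0xF0u)
                        lo = 0x90u; /* below that: overlong */
                else if(c == 0xF4u)
                        hi = 0x8Fu; /* above that: > 0x10FFFF */
                c &= 0x07u;
        }else
                return -1; /* 80..C1 and F5..FF never start a sequence */

        for(i = 1; i < n; ++i){
                if((size_t)i >= len)
                        return 0;
                if(buf[i] < lo || buf[i] > hi)
                        return -1;
                c = (c << 6) | (buf[i] & 0x3Fu);
                /* Bytes after the second are plain continuation bytes */
                lo = 0x80u;
                hi = 0xBFu;
        }
        *out = c;
        return n;
}
```

With this, F4 8F BF BF yields 0x10FFFF, whereas C1 80, E0 80 80,
ED A0 80 and F4 90 80 80 are all rejected, and a truncated E2 82
reports that more input is needed.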

--steffen
|
|Der Kragenbaer,                The moon bear,
|der holt sich munter           he cheerfully and one by one
|einen nach dem anderen runter  wa.ks himself off
|(By Robert Gernhardt)
