Crystal Kolipe wrote in
 <Y/pgctbyhmrx+...@exoticsilicon.com>:
 |Currently it is not possible to use unicode codepoints > 0xFF on the \
 |console,
 |because our UTF-8 decoding logic is badly broken.
 |
 |The code in question is in wsemul_subr.c, wsemul_getchar().
 |
 |The problem is that we calculate the number of bytes in a multi-byte
 |sequence by just looking at the high bits in turn:
 ...
 |This is wrong, for several reasons.
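
As a rough illustration of the length calculation discussed above (a minimal sketch only, not the wsemul_subr.c code; utf8_seq_len is a hypothetical name), the lead byte alone can at best say how long a sequence is supposed to be, and several lead byte values can never start a well-formed sequence at all:

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper, for illustration only: return the total length in
 * bytes of the UTF-8 sequence introduced by lead byte b, or 0 if b cannot
 * start a well-formed sequence (Unicode, UTF-8, Table 3-7).
 */
static size_t
utf8_seq_len(uint8_t b)
{
	if (b <= 0x7F)
		return 1;	/* ASCII */
	if (b < 0xC2)
		return 0;	/* 80..BF: continuation; C0, C1: overlong only */
	if (b <= 0xDF)
		return 2;	/* C2..DF */
	if (b <= 0xEF)
		return 3;	/* E0..EF */
	if (b <= 0xF4)
		return 4;	/* F0..F4 */
	return 0;		/* F5..FF would exceed U+10FFFF */
}
```
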
Just to note that there are also holes: UTF-8 sequences are not
necessarily well-formed (per se -- maybe they are when you control
their generation, of course).  It is actually a real mess:

   if(LIKELY(x <= 0x7Fu))
      c = x;
   /* 0xF8, but Unicode guarantees maximum of 0x10FFFFu -> F4 8F BF BF.
    * Unicode 9.0, 3.9, UTF-8, Table 3-7. Well-Formed UTF-8 Byte Sequences.
    * Lead bytes C0 and C1 could only start overlong forms, so start at C2 */
   else if(LIKELY(x >= 0xC2u && x <= 0xF4u)){
      if(LIKELY(x < 0xE0u)){
         if(UNLIKELY(l < 1))
            goto jenobuf;
         --l;
         c = (x &= 0x1Fu);
      }else if(LIKELY(x < 0xF0u)){
         if(UNLIKELY(l < 2))
            goto jenobuf;
         l -= 2;
         x1 = x;
         c = (x &= 0x0Fu);
         /* Second byte constraints (three-byte sequences) */
         x = S(u8,*cp++);
         switch(x1){
         case 0xE0u:
            if(UNLIKELY(x < 0xA0u || x > 0xBFu))
               goto jerr;
            break;
         case 0xEDu:
            if(UNLIKELY(x < 0x80u || x > 0x9Fu))
               goto jerr;
            break;
         default:
            if(UNLIKELY((x & 0xC0u) != 0x80u))
               goto jerr;
            break;
         }
         c <<= 6;
         c |= (x &= 0x3Fu);
      }else{
         if(UNLIKELY(l < 3))
            goto jenobuf;
         l -= 3;
         x1 = x;
         c = (x &= 0x07u);
         /* Second byte constraints (four-byte sequences) */
         x = S(u8,*cp++);
         switch(x1){
         case 0xF0u:
            if(UNLIKELY(x < 0x90u || x > 0xBFu))
               goto jerr;
            break;
         case 0xF4u:
            if(UNLIKELY((x & 0xF0u) != 0x80u)) /* 80..8F */
               goto jerr;
            break;
         default:
            if(UNLIKELY((x & 0xC0u) != 0x80u))
               goto jerr;
            break;
         }
         c <<= 6;
         c |= (x &= 0x3Fu);
         /* Third byte: plain continuation */
         x = S(u8,*cp++);
         if(UNLIKELY((x & 0xC0u) != 0x80u))
            goto jerr;
         c <<= 6;
         c |= (x &= 0x3Fu);
      }
      /* Last byte of any multi-byte sequence: plain continuation */
      x = S(u8,*cp++);
      if(UNLIKELY((x & 0xC0u) != 0x80u))
         goto jerr;
      c <<= 6;
      c |= x & 0x3Fu;
   }else
      goto jerr;

--steffen
|
|Der Kragenbaer,                The moon bear,
|der holt sich munter           he cheerfully and one by one
|einen nach dem anderen runter  wa.ks himself off
|(By Robert Gernhardt)