2017-07-24 22:50 GMT+02:00 Philippe Verdy <verd...@wanadoo.fr>:

> 2017-07-24 21:12 GMT+02:00 J Decker via Unicode <unicode@unicode.org>:
>
>> On Mon, Jul 24, 2017 at 10:57 AM, Costello, Roger L. via Unicode <
>> unicode@unicode.org> wrote:
>>
>>> Hi Folks,
>>>
>>> 2. (Bug) The sending application performs the folding process - inserts
>>> CRLF plus white space characters - and the receiving application does the
>>> unfolding process but doesn't properly delete all of them.
>>
>> The RFC doesn't say 'characters' but either a space or a tab character
>> (singular)
>>
>> back scanning is simple enough
>>
>> while( ( from[0] & 0xC0 ) == 0x80 )
>>     from--;
>
> Certainly not like this! Backscanning should only directly use a single
> assignment to the last known start position, no loop at all! UTF-8
> security is based on the fact that its sequences are strictly limited in
> length, so that you will never have more than 3 trailing bytes.
>
> If you don't have that last position in a variable, just use 3 tests but
> NO loop at all: if all 3 tests fail, you know the input was not valid
> at all, and this error will not be handled by using a very insecure
> unbounded loop like the one above, but by exiting and returning an error
> immediately, or by throwing an exception.
>
> The code should rather be:
>
> if ((from[0] & 0xC0) == 0x80) from--;
> else if ((from[-1] & 0xC0) == 0x80) from -= 2;
> else if ((from[-2] & 0xC0) == 0x80) from -= 3;
> if ((from[0] & 0xC0) == 0x80) throw (some exception);
> // continue here with character encoded as UTF-8 starting at "from"
> // (an ASCII byte or a UTF-8 leading byte)

Sorry, sent too fast. I should not have copy-pasted lines while trying to adapt your loop; the correct code uses no "else" at all:
> if ((from[0] & 0xC0) == 0x80) from--;
> if ((from[0] & 0xC0) == 0x80) from--;
> if ((from[0] & 0xC0) == 0x80) from--;
> if ((from[0] & 0xC0) == 0x80) throw (some exception);
> // continue here with character encoded as UTF-8 starting at "from"
> // (an ASCII byte or a UTF-8 leading byte)
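
For anyone who wants something compilable, here is a minimal C sketch of the same three-test idea, with one addition: it also refuses to step back past the start of the buffer. The names utf8_char_start, is_continuation and UTF8_INVALID are hypothetical, chosen only for this illustration, and the function works on an index rather than on a raw pointer. It only locates a candidate start byte; it does not check that the lead byte announces the right number of trailing bytes.

```c
#include <stddef.h>

/* Hypothetical sentinel for this sketch: no valid start byte was found. */
#define UTF8_INVALID ((size_t)-1)

/* A UTF-8 trailing (continuation) byte has the bit pattern 10xxxxxx. */
static int is_continuation(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

/*
 * Return the index of the first byte of the UTF-8 sequence that contains
 * buf[pos], or UTF8_INVALID if no start byte is found within 3 steps back
 * or before the beginning of the buffer.
 * No loop: a valid sequence has at most 3 trailing bytes, so 3 tests suffice.
 */
size_t utf8_char_start(const unsigned char *buf, size_t len, size_t pos)
{
    if (buf == NULL || pos >= len)
        return UTF8_INVALID;

    if (is_continuation(buf[pos]) && pos > 0) pos--;
    if (is_continuation(buf[pos]) && pos > 0) pos--;
    if (is_continuation(buf[pos]) && pos > 0) pos--;

    /* Still on a trailing byte: the input is not valid UTF-8. */
    if (is_continuation(buf[pos]))
        return UTF8_INVALID;

    return pos; /* an ASCII byte or a UTF-8 leading byte */
}
```

Called on the bytes 61 E2 82 AC 62 ("a", U+20AC, "b") with pos pointing at AC, this steps back twice and returns the index of E2; called on a run of more than 3 trailing bytes, it returns UTF8_INVALID instead of scanning further back.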