Re: Split a UTF-8 multi-octet sequence such that it cannot be unambiguously restored?

Philippe Verdy via Unicode Mon, 24 Jul 2017 13:52:48 -0700

2017-07-24 21:12 GMT+02:00 J Decker via Unicode <[email protected]>:

>
>
> On Mon, Jul 24, 2017 at 10:57 AM, Costello, Roger L. via Unicode <
> [email protected]> wrote:
>
>> Hi Folks,
>>
>> 2. (Bug) The sending application performs the folding process - inserts
>> CRLF plus white space characters - and the receiving application does the
>> unfolding process but doesn't properly delete all of them.
>>
>> The RFC doesn't say 'characters' but either a space or a tab character
> (singular)
>
>  back scanning is simple enough
>
> while( ( from[0] & 0xC0 ) == 0x80 )
> from--;
>

Certainly not like this! Backscanning should use only a single
assignment to the last known start position, with no loop at all! UTF-8
security relies on the fact that its sequences are strictly limited in
length, so you will never have more than 3 trailing bytes.

If you don't have that last position in a variable, just use 3 tests but NO
loop at all: if you are still on a trailing byte after those 3 tests, you
know the input was not valid at all. The way to handle this error is not to
use a very insecure unbounded loop like the one above, but to exit and
return an error immediately, or to throw an exception.

The code should rather be:

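    // each test steps back over at most one trailing byte (10xxxxxx);
    // note the parentheses: == binds tighter than & in C
    if ((from[0] & 0xC0) == 0x80) from--;
    if ((from[0] & 0xC0) == 0x80) from--;
    if ((from[0] & 0xC0) == 0x80) from--;
    // still on a trailing byte after 3 steps: the input is not valid UTF-8
    if ((from[0] & 0xC0) == 0x80) throw (some exception);
    // continue here with the character encoded as UTF-8 starting at "from"
    // (an ASCII byte or a UTF-8 leading byte)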
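
For plain C, where there is no "throw", the same three tests could be
wrapped in a function that returns NULL on error. This is only a sketch:
the name utf8_seq_start and the "buf" parameter (start of the buffer) are
my own assumptions, not something from the earlier messages. It checks
only the trailing-byte pattern, not full UTF-8 validity.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (name and signature are assumptions).
   Returns the start of the UTF-8 sequence that contains *pos, or NULL
   if a trailing byte is still found after stepping back 3 times or
   after reaching the start of the buffer. No loop: exactly 3 tests. */
static const unsigned char *utf8_seq_start(const unsigned char *buf,
                                           const unsigned char *pos)
{
    assert(buf != NULL && pos != NULL && pos >= buf);

    /* Step back over at most 3 trailing bytes (bit pattern 10xxxxxx),
       never moving before the start of the buffer. */
    if ((*pos & 0xC0) == 0x80 && pos > buf) pos--;
    if ((*pos & 0xC0) == 0x80 && pos > buf) pos--;
    if ((*pos & 0xC0) == 0x80 && pos > buf) pos--;

    /* Still on a trailing byte: the input is not valid UTF-8. */
    if ((*pos & 0xC0) == 0x80)
        return NULL;

    return pos; /* an ASCII byte or a UTF-8 leading byte */
}
```

The caller then treats a NULL result as invalid input and stops, instead
of scanning further back.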

It should also be secured with a guard byte at the start of the buffer that
the "from" pointer points into, so that the backscan never reads anything
before the buffer and can report an error instead.
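
One possible way to set up such a guard byte, as a rough sketch only. The
guard value 0xFF is my assumption: it is not a trailing byte, so the tests
stop on it, and it never occurs in valid UTF-8, so landing on it can be
reported as an error. The function names are hypothetical as well.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define UTF8_GUARD 0xFF /* assumed guard value: never valid in UTF-8 */

/* Hypothetical helper: copies len bytes into a new buffer preceded by
   one guard byte. Returns a pointer to the data (guard is at [-1]),
   or NULL on allocation failure. Free with free(result - 1). */
static unsigned char *utf8_guarded_copy(const unsigned char *src, size_t len)
{
    unsigned char *raw = (unsigned char *)malloc(len + 1);
    if (raw == NULL)
        return NULL;
    raw[0] = UTF8_GUARD;
    memcpy(raw + 1, src, len);
    return raw + 1;
}

/* Hypothetical helper: "from" must point at a data byte of a guarded
   buffer. Uses the 3 tests with no bounds check: the guard stops the
   backscan before it can read anything in front of the buffer. */
static unsigned char *utf8_backscan_guarded(unsigned char *from)
{
    if ((from[0] & 0xC0) == 0x80) from--;
    if ((from[0] & 0xC0) == 0x80) from--;
    if ((from[0] & 0xC0) == 0x80) from--;

    /* More than 3 trailing bytes, or we ran into the guard. */
    if ((from[0] & 0xC0) == 0x80 || from[0] == UTF8_GUARD)
        return NULL;

    return from;
}
```

With this layout the backscan does not need to know where the buffer
starts: a sequence of trailing bytes at the very beginning of the data
ends on the guard and is rejected.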
