On Mon, Jul 24, 2017 at 1:50 PM, Philippe Verdy <verd...@wanadoo.fr> wrote:
> 2017-07-24 21:12 GMT+02:00 J Decker via Unicode <unicode@unicode.org>:
>
> If you don't have that last position in a variable, just use 3 tests but
> NO loop at all: if all 3 tests are failing, you know the input was not
> valid at all, and the way to handle this error will not be solved simply by
> using a very unsecure unbound loop like above but by exiting and returning
> an error immediately, or throwing an exception.
>
> The code should better be:
>
> if (from[0]&0xC0 == 0x80) from--;
> else if (from[-1]&0xC0 == 0x80) from -=2;
> else if (from[-2]&0xC0 == 0x80) from -=3;
> if (from[0]&0xC0 == 0x80) throw (some exception);
> // continue here with character encoded as UTF-8 starting at "from"
> (an ASCII byte or an UTF-8 leading byte)

I generally accepted any UTF-8 encoding up to 31 bits, though (since I was going from the original spec, and not from the effective limit imposed by the Unicode code point space). The while loop is more terse, but it is less optimal because of pipeline flushing from the backward jump; so yes, the if series is much better :)

(The original code also has the start of the string, and strings are effectively prefixed with a 0 byte anyway because of a long little-endian size.) You'd probably be tracking an output offset also, so it becomes a little longer than the above.

> And it should be secured using a guard byte at start of your buffer in
> which the "from" pointer was pointing, so that it will never read something
> else and can generate an error.
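
For reference, here is a minimal sketch in C of the loop-free backward scan discussed above. It is not the code from either message. The names utf8_is_continuation and utf8_char_start are hypothetical. It assumes the buffer start is known, so an explicit index check stands in for the guard byte, and it assumes sequences of at most 4 bytes (the current Unicode limit, not the 31-bit original spec). Note the parentheses around the mask: in C, == binds tighter than &, so an unparenthesized `b & 0xC0 == 0x80` would evaluate as `b & (0xC0 == 0x80)`, which is always 0.

```c
#include <stddef.h>

/* Returns 1 if b is a UTF-8 continuation byte (bit pattern 10xxxxxx). */
static int utf8_is_continuation(unsigned char b)
{
    /* Parentheses are required: == has higher precedence than & in C. */
    return (b & 0xC0) == 0x80;
}

/*
 * Hypothetical helper: given a valid index pos into buf, return the index
 * of the first byte of the character that contains buf[pos], or -1 if no
 * lead byte is found within 3 bytes before pos.
 *
 * No loop is used: at most 4 fixed tests. The "pos >= n" checks ensure
 * we never read before buf[0], so no guard byte is needed (assumption:
 * the caller knows where the buffer starts).
 *
 * This only locates a non-continuation byte; it does not verify that the
 * lead byte announces the right sequence length.
 */
static ptrdiff_t utf8_char_start(const unsigned char *buf, size_t pos)
{
    if (!utf8_is_continuation(buf[pos]))
        return (ptrdiff_t)pos;
    if (pos >= 1 && !utf8_is_continuation(buf[pos - 1]))
        return (ptrdiff_t)(pos - 1);
    if (pos >= 2 && !utf8_is_continuation(buf[pos - 2]))
        return (ptrdiff_t)(pos - 2);
    if (pos >= 3 && !utf8_is_continuation(buf[pos - 3]))
        return (ptrdiff_t)(pos - 3);
    return -1; /* four continuation bytes in a row, or ran off the start */
}
```

A caller would treat -1 as "invalid input" and return an error immediately rather than keep scanning. Accepting the older 5- and 6-byte forms would need two more tests of the same shape.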