Re: Small patch to improve safety of utf8_to_unicode().

Chao Li Wed, 24 Jun 2026 21:11:19 -0700

> On Jun 25, 2026, at 05:57, Jeff Davis <[email protected]> wrote:
> 
> On Wed, 2026-06-24 at 16:44 +0800, Chao Li wrote:
>> There is a compile warning against pg_wchar.h in 0004:
> 
> Fixed. I also used a loop in utf8decode() which is slightly smaller,
> which is good if we intend it to be inlined by a lot of callers.
> 
> Regards,
> Jeff Davis
> 
> <v5-0001-unicode_case.c-defend-against-invalid-UTF8.patch><v5-0002-pg_unicode_fast-fix-final-sigma-logic.patch><v5-0003-unicode_case.c-change-API-to-signal-UTF8-decoding.patch><v5-0004-Validating-iterator-friendly-UTF8-encoder-decoder.patch><v5-0005-unicode_case.c-use-new-utf8encode-utf8decode-APIs.patch>

I just reviewed v5-0001 and got one concern.

In initcap_wbnext(), the new check only verifies that the input has enough 
bytes:
```
if (wbstate->offset + ulen > wbstate->len)
```

What about an invalid continuation byte, for example "\xCE "? In this case, 
pg_utf_mblen() sees \xCE, so ulen will be 2. Since there is still one more 
byte, the length check won't catch the invalid continuation byte \x20, and the 
code will proceed to utf8_to_unicode().

Looking at utf8_to_unicode():
```
static inline char32_t
utf8_to_unicode(const unsigned char *c)
{
        if ((*c & 0x80) == 0)
                return (char32_t) c[0];
        else if ((*c & 0xe0) == 0xc0)
                return (char32_t) (((c[0] & 0x1f) << 6) |
                                                   (c[1] & 0x3f));
        else if ((*c & 0xf0) == 0xe0)
                return (char32_t) (((c[0] & 0x0f) << 12) |
                                                   ((c[1] & 0x3f) << 6) |
                                                   (c[2] & 0x3f));
        else if ((*c & 0xf8) == 0xf0)
                return (char32_t) (((c[0] & 0x07) << 18) |
                                                   ((c[1] & 0x3f) << 12) |
                                                   ((c[2] & 0x3f) << 6) |
                                                   (c[3] & 0x3f));
        else
                return PG_INVALID_CODEPOINT;
}
```

For "\xCE ", it will take this branch:
```
        else if ((*c & 0xe0) == 0xc0)
                return (char32_t) (((c[0] & 0x1f) << 6) |
                                                   (c[1] & 0x3f));
```

This uses the second byte, \x20, without validating. So it looks like the patch 
prevents reading past the end of the string, but it may not fully defend 
against invalid UTF-8 sequences.

Am I missing anything?

(I will continue to review 0002 tomorrow.)

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/
Re: Small patch to improve safety of utf8_to_unicode().

Reply via email to