Re: encoding affects ICU regex character classification

2023-12-18 Thread Jeff Davis
On Fri, 2023-12-15 at 16:48 -0800, Jeremy Schneider wrote: > This goes back to my other thread (which sadly got very little > discussion): PostgreSQL really needs to be safe by /default/ Doesn't a built-in provider help create a safer option? The built-in provider's version of Unicode will be cons

Re: encoding affects ICU regex character classification

2023-12-15 Thread Thomas Munro
On Sat, Dec 16, 2023 at 1:48 PM Jeremy Schneider wrote: > On 12/14/23 7:12 AM, Jeff Davis wrote: > > The concern over unassigned code points is misplaced. The application > > may be aware of newly-assigned code points, and there's no way they > > will be mapped correctly in Postgres if the provide

Re: encoding affects ICU regex character classification

2023-12-15 Thread Jeremy Schneider
On 12/14/23 7:12 AM, Jeff Davis wrote: > The concern over unassigned code points is misplaced. The application > may be aware of newly-assigned code points, and there's no way they > will be mapped correctly in Postgres if the provider is not aware of > those code points. The user can either procee

Re: encoding affects ICU regex character classification

2023-12-14 Thread Jeff Davis
On Tue, 2023-12-12 at 14:35 -0800, Jeremy Schneider wrote: > Is someone able to test out upper & lower functions on U+A7BA ... > U+A7BF > across a few libs/versions? Those code points are unassigned in Unicode 11.0 and assigned in Unicode 12.0. In ICU 63-2 (based on Unicode 11.0), they just get m
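For anyone wanting to run the kind of check asked for above outside the database, here is a minimal Python sketch. It is only an illustration: it reports what Python's bundled Unicode tables say about U+A7BA through U+A7BF, not what any particular libc or ICU version does, and the helper names (`fmt`, `case_report`) are made up for this example. The result depends on the Unicode version of the Python build, which the script prints first.

```python
import unicodedata


def fmt(s):
    """Format every character of s as U+XXXX (a case mapping may yield more than one)."""
    return " ".join(f"U+{ord(c):04X}" for c in s)


def case_report(first=0xA7BA, last=0xA7BF):
    """Return (code point, assigned?, lower, upper) for each code point in the range."""
    rows = []
    for cp in range(first, last + 1):
        ch = chr(cp)
        # General category "Cn" means "unassigned" in the tables this Python ships with.
        assigned = unicodedata.category(ch) != "Cn"
        rows.append((fmt(ch), assigned, fmt(ch.lower()), fmt(ch.upper())))
    return rows


if __name__ == "__main__":
    print("Unicode tables in this Python:", unicodedata.unidata_version)
    for code_point, assigned, lower, upper in case_report():
        print(code_point, "assigned" if assigned else "unassigned",
              "lower:", lower, "upper:", upper)
```

Running the same script under Python builds with different `unidata_version` values gives a rough picture of how the mappings change between Unicode versions.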

Re: encoding affects ICU regex character classification

2023-12-12 Thread Jeremy Schneider
On 12/12/23 1:39 PM, Jeff Davis wrote: > On Sun, 2023-12-10 at 10:39 +1300, Thomas Munro wrote: >> Unless you also >> implement built-in case mapping, you'd still have to call libc or ICU >> for that, right? > > We can do built-in case mapping, see: > > https://postgr.es/m/ff4c2f2f9c8fc7ca27c1c24

Re: encoding affects ICU regex character classification

2023-12-12 Thread Jeff Davis
On Sun, 2023-12-10 at 10:39 +1300, Thomas Munro wrote: > > How would you specify what you want? One proposal would be to have a builtin collation provider: https://postgr.es/m/9d63548c4d86b0f820e1ff15a83f93ed9ded4543.ca...@j-davis.com I don't think there are very many ctype options, but they c

Re: encoding affects ICU regex character classification

2023-12-09 Thread Thomas Munro
On Sat, Dec 2, 2023 at 9:49 AM Jeff Davis wrote: > Your definition is too wide in my opinion, because it mixes together > different sources of variation that are best left separate: > a. region/language > b. technical requirements > c. versioning > d. implementation variance > > (a) is not a t

Re: encoding affects ICU regex character classification

2023-11-29 Thread Thomas Munro
On Thu, Nov 30, 2023 at 1:23 PM Jeff Davis wrote: > Character classification is not localized at all in libc or ICU as far > as I can tell. Really? POSIX isalpha()/isalpha_l() and friends clearly depend on a locale. See eg d522b05c for a case where that broke something. Perhaps you mean glibc w

Re: encoding affects ICU regex character classification

2023-11-29 Thread Tom Lane
Jeff Davis writes: > The problem seems to be confusion between pg_wchar and a unicode code > point in pg_wc_isalpha() and related functions. Yeah, that's an ancient sore spot: we don't really know what the representation of wchar is. We assume it's Unicode code points for UTF8 locales, but libc

encoding affects ICU regex character classification

2023-11-29 Thread Jeff Davis
The following query: SELECT U&'\017D' ~ '[[:alpha:]]' collate "en-US-x-icu"; returns true if the server encoding is UTF8, and false if the server encoding is LATIN9. That's a bug -- any behavior involving ICU should be encoding-independent. The problem seems to be confusion between pg_wchar
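One way such a confusion could produce the reported result, sketched in Python. This is an illustration under an assumption not stated in the message: that with a single-byte server encoding such as LATIN9 the wide-character value handed to the classifier is the raw byte value rather than the Unicode code point. Python's `str.isalpha()` stands in for the ICU classification here, and `byte_value_as_code_point` is a hypothetical helper, not a PostgreSQL function.

```python
CHAR = "\u017D"  # LATIN CAPITAL LETTER Z WITH CARON, the character in the query


def byte_value_as_code_point(ch, encoding):
    """Misread ch's single-byte encoded value as if it were a Unicode code point."""
    raw = ch.encode(encoding)
    if len(raw) != 1:
        raise ValueError(f"{encoding} does not encode {ch!r} as a single byte")
    return chr(raw[0])


if __name__ == "__main__":
    # ISO 8859-15 is the encoding PostgreSQL calls LATIN9.
    misread = byte_value_as_code_point(CHAR, "iso8859-15")
    print(f"real code point: U+{ord(CHAR):04X} isalpha={CHAR.isalpha()}")
    print(f"misread value:   U+{ord(misread):04X} isalpha={misread.isalpha()}")
```

In LATIN9 the character is the single byte 0xB4, and U+00B4 is ACUTE ACCENT, which is not alphabetic. That would be consistent with the query returning false under LATIN9 and true under UTF8.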