Re: Add CASEFOLD() function.

2025-06-19 Thread Thom Brown
On Thu, 19 Jun 2025 at 18:39, David E. Wheeler wrote:
>
> On Jun 19, 2025, at 12:59, Thom Brown wrote:
>
> > No. But given the options, I would personally choose nondeterministic
> > collations now that they are available. I just wish they were more
> > user-friendly as I suspect the majority o

Re: Add CASEFOLD() function.

2025-06-19 Thread David E. Wheeler
On Jun 19, 2025, at 12:59, Thom Brown wrote:
> No. But given the options, I would personally choose nondeterministic
> collations now that they are available. I just wish they were more
> user-friendly as I suspect the majority of people either won't know about
> them, or won't know how to use

Re: Add CASEFOLD() function.

2025-06-19 Thread Thom Brown
On Thu, 19 Jun 2025, 17:33 Jeff Davis wrote:
> On Thu, 2025-06-19 at 16:36 +0100, Thom Brown wrote:
> > Ease of use, perhaps. It seems easier to use:
> >
> > column_name cftext
> >
> > rather than:
> >
> > CREATE COLLATION case_insensitive_collation (
> > PROVIDER = icu,
> > LOCALE = 'un

Re: Add CASEFOLD() function.

2025-06-19 Thread Jeff Davis
On Thu, 2025-06-19 at 18:21 +0200, Vik Fearing wrote:
> >
> > The SQL standard also says in a few other places that normalization
> > should be applied, and we do none of those, so this is probably not
> > a
> > reason to change CASEFOLD at this point.
> >
>
> Works for me.

Sounds good. We can

Re: Add CASEFOLD() function.

2025-06-19 Thread Robert Treat
On Thu, Jun 19, 2025 at 12:33 PM Jeff Davis wrote:
>
> On Thu, 2025-06-19 at 16:36 +0100, Thom Brown wrote:
> > Ease of use, perhaps. It seems easier to use:
> >
> > column_name cftext
> >
> > rather than:
> >
> > CREATE COLLATION case_insensitive_collation (
> > PROVIDER = icu,
> > LOCALE

Re: Add CASEFOLD() function.

2025-06-19 Thread Jeff Davis
On Thu, 2025-06-19 at 16:36 +0100, Thom Brown wrote:
> Ease of use, perhaps. It seems easier to use:
>
> column_name cftext
>
> rather than:
>
> CREATE COLLATION case_insensitive_collation (
>     PROVIDER = icu,
>     LOCALE = 'und-u-ks-level2',
>     DETERMINISTIC = FALSE
> );

We could auto-c

Re: Add CASEFOLD() function.

2025-06-19 Thread Vik Fearing
On 19/06/2025 16:47, Peter Eisentraut wrote:
> On 17.06.25 17:37, Vik Fearing wrote:
> > For (which includes LOWER() and UPPER()), the text says in Section 6.35 GR 7.e:
> >
> > If the character set of is UTF8, UTF16, or UTF32, then FR is replaced by
> > Case:
> > i) If the S IS NORMALIZED eval

Re: Add CASEFOLD() function.

2025-06-19 Thread Robert Treat
On Thu, Jun 19, 2025 at 11:37 AM Thom Brown wrote:
> On Thu, 19 Jun 2025 at 15:51, Peter Eisentraut wrote:
> > On 19.06.25 06:03, Thom Brown wrote:
> > > Late to the party, but is there an argument for porting this to the
> > > citext type? Or supplementing the extension with an additional type
>

Re: Add CASEFOLD() function.

2025-06-19 Thread Thom Brown
On Thu, 19 Jun 2025 at 15:51, Peter Eisentraut wrote:
>
> On 19.06.25 06:03, Thom Brown wrote:
> > Late to the party, but is there an argument for porting this to the
> > citext type? Or supplementing the extension with an additional type
> > ("cftext"? *shrug*). It currently uses lower(), so our

Re: Add CASEFOLD() function.

2025-06-19 Thread Peter Eisentraut
On 19.06.25 06:03, Thom Brown wrote:
> Late to the party, but is there an argument for porting this to the
> citext type? Or supplementing the extension with an additional type
> ("cftext"? *shrug*). It currently uses lower(), so our current
> recommendation for dealing with all unicode characters is t

Re: Add CASEFOLD() function.

2025-06-19 Thread Peter Eisentraut
On 17.06.25 17:37, Vik Fearing wrote:
> For (which includes LOWER() and UPPER()), the text says in Section 6.35 GR 7.e:
>
> If the character set of is UTF8, UTF16, or UTF32, then FR is replaced by
>     Case:
>     i) If the S IS NORMALIZED evaluates to True, then NORMALIZE (FR)
>     ii

Re: Add CASEFOLD() function.

2025-06-18 Thread Jeff Davis
On Thu, 2025-06-19 at 05:03 +0100, Thom Brown wrote:
> Late to the party, but is there an argument for porting this to the
> citext type? Or supplementing the extension with an additional type
> ("cftext"? *shrug*).

CASEFOLD() addresses a lot of the problems with using LOWER(), so that sounds like

Re: Add CASEFOLD() function.

2025-06-18 Thread Thom Brown
On Thu, 19 Jun 2025, 03:53 Jeff Davis wrote:
> On Wed, 2025-06-18 at 19:09 +0200, Vik Fearing wrote:
> > I don't know. I am just pointing out what the Standard says. I
> > think
> > we should either comply, or say that we don't do it for LOWER and
> > UPPER
> > so let's keep things implementat

Re: Add CASEFOLD() function.

2025-06-18 Thread Jeff Davis
On Wed, 2025-06-18 at 19:09 +0200, Vik Fearing wrote:
> I don't know. I am just pointing out what the Standard says. I
> think
> we should either comply, or say that we don't do it for LOWER and
> UPPER
> so let's keep things implementation-consistent.

For the standard, I see two potential phi

Re: Add CASEFOLD() function.

2025-06-18 Thread Vik Fearing
On 17/06/2025 20:14, Jeff Davis wrote:
> On Tue, 2025-06-17 at 17:37 +0200, Vik Fearing wrote:
> > If the character set of is UTF8, UTF16, or UTF32, then FR is replaced by
> > Case:
> > i) If the S IS NORMALIZED evaluates to True, then NORMALIZE (FR)
> > ii) Otherwise, FR.
>
> I read th

Re: Add CASEFOLD() function.

2025-06-17 Thread Jeff Davis
On Tue, 2025-06-17 at 17:37 +0200, Vik Fearing wrote:
> If the character set of is UTF8, UTF16, or UTF32,
> then FR is replaced by
> Case:
> i) If the S IS NORMALIZED evaluates to
> True, then NORMALIZE (FR)
> ii) Otherwise, FR.

I read that as "if the input is normalized,
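
The quoted rule can be sketched in Python to make one reading of it concrete. This is only an illustration of the quoted text, not PostgreSQL's behaviour: the helper names `is_nfc` and `fold_with_conditional_normalize` are hypothetical, NFC is assumed as the normal form that IS NORMALIZED tests, and Python's `str.lower` stands in for the fold operation that produces FR.

```python
import unicodedata


def is_nfc(s: str) -> bool:
    """Return True if s is already in NFC (assumed default normal form)."""
    return unicodedata.normalize("NFC", s) == s


def fold_with_conditional_normalize(s: str, fold=str.lower) -> str:
    """Sketch of the quoted rule: normalize the fold result FR only
    when the input S was itself normalized; otherwise return FR as is."""
    fr = fold(s)
    if is_nfc(s):
        return unicodedata.normalize("NFC", fr)
    return fr


# Normalized input: the result is normalized as well.
normalized_result = fold_with_conditional_normalize("\u00c1")  # A with acute

# Non-normalized input ("A" + combining acute): the result is left alone.
unnormalized_result = fold_with_conditional_normalize("A\u0301")
```

Under this reading, a non-normalized input is never silently normalized; only inputs that were already normalized are guaranteed a normalized result.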

Re: Add CASEFOLD() function.

2025-06-17 Thread Vik Fearing
On 16/12/2024 18:49, Jeff Davis wrote:
> One question I have is whether we want this function to normalize the output.

Yes, we do. I am sorry that I am so late to the party, but I am currently writing the Change Proposal for the SQL Standard for this function. For (which includes LOWER()

Re: Add CASEFOLD() function.

2025-01-25 Thread Jeff Davis
On Sat, 2025-01-25 at 00:00 -0500, Tom Lane wrote:
> Found characters that cannot be output in the PDF document; see
> README.non-ASCII

Thank you, fixed.

> Not sure about a good workaround for this. Are there any characters
> within LATIN-1 that have interesting case-folding behavior?

I just r

Re: Add CASEFOLD() function.

2025-01-24 Thread Tom Lane
Jeff Davis writes:
> v6 attached. I plan to commit this soon.

The documentation for this function is giving the PDF docs build indigestion:

[WARN] FOUserAgent - Glyph "?" (0x3a3, Sigma) not available in font "Courier".
[WARN] FOUserAgent - Glyph "?" (0x3c3, sigma) not available in font "Courier"

Re: Add CASEFOLD() function.

2025-01-23 Thread Jeff Davis
On Fri, 2025-01-17 at 16:34 -0800, Jeff Davis wrote:
> v5 attached.

v6 attached. I plan to commit this soon.

A couple things to note:

* The ICU API for lower/title/uppercasing is slightly different from folding. The former accept a locale, while the latter just has an option which is relevant on

Re: Add CASEFOLD() function.

2025-01-08 Thread Jeff Davis
On Thu, 2024-12-19 at 09:51 -0800, Jeff Davis wrote:
> But there's a problem: full case folding doesn't preserve the normal
> form, so even if the input is NFC normalized, the output might not
> be.
> If we solve this problem, then we can just say that CASEFOLD()
> preserves the normal form, consis
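
A small Python check illustrates the problem described in the quote. It is only a sketch: Python's `str.casefold()` is used as a stand-in for Unicode full case folding, the helper `is_nfc` is hypothetical, and U+0390 (Greek small letter iota with dialytika and tonos) is one example character chosen for illustration.

```python
import unicodedata


def is_nfc(s: str) -> bool:
    """Return True if s is already in NFC."""
    return unicodedata.normalize("NFC", s) == s


original = "\u0390"           # single precomposed code point, already NFC
folded = original.casefold()  # full case folding expands it to three code points

input_was_nfc = is_nfc(original)
output_is_nfc = is_nfc(folded)

# Re-normalizing after folding restores a normalized result.
renormalized = unicodedata.normalize("NFC", folded)
```

Here the input is NFC but the folded output is not, which is why a separate normalization step after folding would be needed to guarantee a normalized result.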

Re: Add CASEFOLD() function.

2024-12-19 Thread Jeff Davis
On Thu, 2024-12-19 at 17:18 +0100, Peter Eisentraut wrote:
> Can you explain this in further detail? I don't quite follow why
> this
> would be required.

I am unsure now. My initial reasoning was based on the idea that users would want to use CASEFOLD(t) in a unique expression index as an impro

Re: Add CASEFOLD() function.

2024-12-19 Thread Peter Eisentraut
On 16.12.24 18:49, Jeff Davis wrote:
> One question I have is whether we want this function to normalize the
> output. I believe most usecases would want the output normalized,
> because normalization differences (e.g. "a" U+0061 followed by
> "combining acute" U+0301 vs "a with acute" U+00E1) are more
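
The normalization difference mentioned in the quote can be shown with a short Python sketch. The helper `fold_key` is hypothetical, and combining `str.casefold()` with NFC normalization is an assumption made for illustration, not a description of what CASEFOLD() does.

```python
import unicodedata

decomposed = "a\u0301"   # "a" U+0061 followed by combining acute U+0301
precomposed = "\u00e1"   # "a with acute" U+00E1


def fold_key(s: str, form: str = "NFC") -> str:
    """Hypothetical comparison key: case fold, then normalize."""
    return unicodedata.normalize(form, s.casefold())


# Case folding alone leaves the two spellings different...
folded_only_equal = decomposed.casefold() == precomposed.casefold()

# ...while folding plus normalization makes them compare equal.
folded_and_normalized_equal = fold_key(decomposed) == fold_key(precomposed)
```

The two strings render identically, so a case-insensitive match that ignores normalization would treat visually equal values as different.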

Re: Add CASEFOLD() function.

2024-12-18 Thread Jeff Davis
On Mon, 2024-12-16 at 16:27 -0500, Joe Conway wrote:
>
> SQL 2023 seems to include the NORMALIZE syntax, but the only case
> folding considered is UPPER and LOWER. As such, I think it ought to
> be a
> function but not part of the grammar.

Should the standard support something like the Unicode

Re: Add CASEFOLD() function.

2024-12-17 Thread Andreas Karlsson
On 12/12/24 10:00 AM, Jeff Davis wrote:
> Patch attached.

I have not looked at the patch yet but +1 to the idea. I am leaning towards thinking that having the function also optionally normalize the codepoints would be handy too, since I think that is what most usecases want. Otherwise people would have to al

Re: Add CASEFOLD() function.

2024-12-16 Thread Joe Conway
On 12/16/24 12:49, Jeff Davis wrote:
> One question I have is whether we want this function to normalize the
> output. I believe most usecases would want the output normalized,
> because normalization differences (e.g. "a" U+0061 followed by
> "combining acute" U+0301 vs "a with acute" U+00E1) are more

Re: Add CASEFOLD() function.

2024-12-12 Thread Joe Conway
On 12/12/24 13:30, Jeff Davis wrote:
> On Thu, 2024-12-12 at 21:52 +0900, Ian Lawrence Barwick wrote:
> > and it seems to work as advertised, except the function is named
> > "FOLDCASE()" in the patch, so I'm wondering which is intended?
>
> Thank you for looking into this, I went back and forth on the name

Re: Add CASEFOLD() function.

2024-12-12 Thread Jeff Davis
On Thu, 2024-12-12 at 21:52 +0900, Ian Lawrence Barwick wrote:
> and it seems to work as advertised, except the function is named
> "FOLDCASE()"
> in the patch, so I'm wondering which is intended?

Thank you for looking into this, I went back and forth on the name, and mistyped it a few times. ICU

Re: Add CASEFOLD() function.

2024-12-12 Thread Ian Lawrence Barwick
Hi

On Thu, 12 Dec 2024 at 18:00, Jeff Davis wrote:
>
> Unicode case folding is a way to convert a string to a canonical case
> for the purpose of case-insensitive matching.
>
> Users have long used LOWER() for that purpose, but there are a few edge
> case problems:
>
> * Some characters have more than two cased
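
The edge cases with lowercasing can be illustrated outside the database. The following Python sketch uses `str.lower()` and `str.casefold()` as stand-ins for LOWER() and Unicode full case folding; the sample strings and the helper `distinct_keys` are chosen here for illustration and are not taken from the patch.

```python
# Greek sigma has three cased forms: capital, small, and final small.
sigma_forms = ["\u03a3", "\u03c3", "\u03c2"]

# German sharp s has no single-character uppercase in common use.
strasse_forms = ["STRASSE", "stra\u00dfe"]


def distinct_keys(words, key):
    """Return the set of matching keys produced by the given key function."""
    return {key(w) for w in words}


# Lowercasing leaves more than one key per group, so matches are missed.
lower_sigma = distinct_keys(sigma_forms, str.lower)
lower_strasse = distinct_keys(strasse_forms, str.lower)

# Case folding maps every variant in a group to a single key.
fold_sigma = distinct_keys(sigma_forms, str.casefold)
fold_strasse = distinct_keys(strasse_forms, str.casefold)
```

With lowercasing, each group yields two distinct keys; with case folding, each group collapses to one, which is the property case-insensitive matching needs.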