Hi! The patch makes my tests pass.
I wonder about a few things: - Isn’t there any code that could be re-used for that (the one triggered by ‘a’ < ‘A’ COLLATE ucs_basic)? - For object key members, the standard also refers to unicode code point collation (SQL-2:2016 4.46.3, last paragraph). - I guess it also applies to the “starts with” predicate, but I cannot find this explicitly stated in the standard. My tests check whether those cases do case-sensitive comparisons. With my default collation "en_US.UTF-8” I cannot discover potential issues there. I haven’t played around with nondeterministic ICU collations yet :( -markus ps.: for me, testing the regular expression dialect of like_regex is out of scope > On 8 Aug 2019, at 02:27, Alexander Korotkov <a.korot...@postgrespro.ru> wrote: > > On Thu, Aug 8, 2019 at 3:05 AM Alexander Korotkov > <a.korot...@postgrespro.ru <mailto:a.korot...@postgrespro.ru>> wrote: >> On Thu, Aug 8, 2019 at 12:55 AM Alexander Korotkov >> <a.korot...@postgrespro.ru> wrote: >>> On Wed, Aug 7, 2019 at 4:11 PM Alexander Korotkov >>> <a.korot...@postgrespro.ru> wrote: >>>> On Wed, Aug 7, 2019 at 2:25 PM Markus Winand <markus.win...@winand.at> >>>> wrote: >>>>> I was playing around with JSON path quite a bit and might have found one >>>>> case where the current implementation doesn’t follow the standard. >>>>> >>>>> The functionality in question are the comparison operators except ==. >>>>> They use the database default collation rather then the standard-mandated >>>>> "Unicode codepoint collation” (SQL-2:2016 9.39 General Rule 12 c iii 2 D, >>>>> last sentence in first paragraph). >>>> >>>> Thank you for pointing! Nikita is about to write a patch fixing that. >>> >>> Please, see the attached patch. >>> >>> Our idea is to not sacrifice "==" operator performance for standard >>> conformance. So, "==" remains per-byte comparison. For consistency >>> in other operators we compare code points first, then do per-byte >>> comparison. In some edge cases, when same Unicode codepoints have >>> different binary representations in database encoding, this behavior >>> diverges standard. In future we can implement strict standard >>> conformance by normalization of input JSON strings. >> >> Previous version of patch has buggy implementation of >> compareStrings(). Revised version is attached. > > Nikita pointed me that for UTF-8 strings per-byte comparison result > matches codepoints comparison result. That allows simplify patch a > lot. > > ------ > Alexander Korotkov > Postgres Professional: http://www.postgrespro.com > <http://www.postgrespro.com/> > The Russian Postgres Company > <0001-Use-Unicode-codepoint-collation-in-jsonpath-4.patch>