Re: GUCs to control abbreviated sort keys

Peter Geoghegan Fri, 27 Jan 2023 11:42:14 -0800

On Thu, Jan 26, 2023 at 3:29 PM Jeff Davis <pg...@j-davis.com> wrote:
> On Thu, 2023-01-26 at 22:39 +0100, Peter Eisentraut wrote:
> > Maybe an easier way to enable or disable it in the source code with a
> > #define would serve this.  Making it a GUC right away seems a bit
> > heavy-handed.  Further exploration and tweaking might well require
> > further source code changes, so relying on a source code level toggle
> > would seem appropriate.
>
> I am using these GUCs for testing the various collation paths in my
> collation refactoring branch.


I'm fine with adding the GUC as a developer option. I think that there
is zero chance of the changes to tuplesort.c having appreciable
overhead.

> I find them pretty useful, and when I saw a regression, I thought
> others might think it was useful, too. But if not I'll just leave them
> in my branch and withdraw from this thread.

I cannot recreate the issue you describe. With abbreviated keys, your
exact test case takes 00:16.620 on my system. Without abbreviated
keys, it takes 00:21.255.

To me it appears to be a moderately good case for abbreviated keys,
though certainly not as good as some cases that I've seen -- ~3x
improvements are common enough.

As a point of reference, the same test case with the C collation and
with abbreviated keys takes 00:10.822. When I look at the "trace_sort"
output for the C collation with abbreviated keys, and compare it to
the equivalent "trace_sort" output for the original "en-US-x-icu"
collation from your test case, it is clear that the overhead of
generating collated abbreviated keys within ICU is relatively high --
the initial scan of the table (which is where we generate all
abbreviated keys here) takes 4.45 seconds in the ICU case, and only
1.65 seconds in the "C" locale case. I think that you should look into
that same difference on your own system, so that we can compare and
contrast.

The underlying issue might well have something to do with the ICU
version you're using, or some other detail of your environment. I'm
using Debian unstable here. Postgres links to the system ICU, which is
ICU 72.

It's not impossible that the perl program you wrote produces
non-deterministic output, which should be controlled for, since it
might just be significant. I see this on my system, having run the
perl program as outlined in your test case:

$ ls -l /tmp/strings.txt
-rw-r--r-- 1 pg pg 431886574 Jan 27 11:13 /tmp/strings.txt
$ sha1sum /tmp/strings.txt
22f60dc12527c215c8e3992e49d31dc531261a83  /tmp/strings.txt

Does that match what you see on your system?

--
Peter Geoghegan

Re: GUCs to control abbreviated sort keys

Reply via email to