On Thu, Jan 26, 2023 at 3:29 PM Jeff Davis <pg...@j-davis.com> wrote: > On Thu, 2023-01-26 at 22:39 +0100, Peter Eisentraut wrote: > > Maybe an easier way to enable or disable it in the source code with a > > #define would serve this. Making it a GUC right away seems a bit > > heavy-handed. Further exploration and tweaking might well require > > further source code changes, so relying on a source code level toggle > > would seem appropriate. > > I am using these GUCs for testing the various collation paths in my > collation refactoring branch.
I'm fine with adding the GUC as a developer option. I think that there is zero chance of the changes to tuplesort.c having appreciable overhead. > I find them pretty useful, and when I saw a regression, I thought > others might think it was useful, too. But if not I'll just leave them > in my branch and withdraw from this thread. I cannot recreate the issue you describe. With abbreviated keys, your exact test case takes 00:16.620 on my system. Without abbreviated keys, it takes 00:21.255. To me it appears to be a moderately good case for abbreviated keys, though certainly not as good as some cases that I've seen -- ~3x improvements are common enough. As a point of reference, the same test case with the C collation and with abbreviated keys takes 00:10.822. When I look at the "trace_sort" output for the C collation with abbreviated keys, and compare it to the equivalent "trace_sort" output for the original "en-US-x-icu" collation from your test case, it is clear that the overhead of generating collated abbreviated keys within ICU is relatively high -- the initial scan of the table (which is where we generate all abbreviated keys here) takes 4.45 seconds in the ICU case, and only 1.65 seconds in the "C" locale case. I think that you should look into that same difference on your own system, so that we can compare and contrast. The underlying issue might well have something to do with the ICU version you're using, or some other detail of your environment. I'm using Debian unstable here. Postgres links to the system ICU, which is ICU 72. It's not impossible that the perl program you wrote produces non-deterministic output, which should be controlled for, since it might just be significant. I see this on my system, having run the perl program as outlined in your test case: $ ls -l /tmp/strings.txt -rw-r--r-- 1 pg pg 431886574 Jan 27 11:13 /tmp/strings.txt $ sha1sum /tmp/strings.txt 22f60dc12527c215c8e3992e49d31dc531261a83 /tmp/strings.txt Does that match what you see on your system? -- Peter Geoghegan