> On May 28, 2020, at 8:54 PM, John Naylor <john.nay...@2ndquadrant.com> wrote:
>
> On Fri, May 29, 2020 at 5:59 AM Mark Dilger
> <mark.dil...@enterprisedb.com> wrote:
>>
>>> On May 21, 2020, at 12:12 AM, John Naylor <john.nay...@2ndquadrant.com>
>>> wrote:
>
>>> very picky in general. As a test, it also successfully finds a
>>> function for the OS "words" file, the "D" sets of codepoints, and for
>>> sets of the first n built-in OIDs, where n > 5.
>>
>> Prior to this patch, src/tools/gen_keywordlist.pl is the only script that
>> uses PerfectHash. Your patch adds a second. I'm not convinced that
>> modifying the PerfectHash code directly each time a new caller needs
>> different multipliers is the right way to go.
In my first round of code review I forgot to say: thanks for the patch. I
generally like what you are doing here, and I am trying to review it so that it
gets committed.
> Calling it "each time" with a sample size of two is a bit of a
> stretch. The first implementation made a reasonable attempt to suit
> future uses and I simply made it a bit more robust. In the text quoted
> above you can see I tested some scenarios beyond the current use
> cases, with key set sizes as low as 6 and as high as 250k.
I don't really have an objection to what you did in the patch. I'm not going
to lose any sleep if it gets committed this way.
The reason I gave this feedback is that I saved the *kwlist_d.h files generated
before applying the patch, compared them with the same files generated after
applying it, and noticed a very slight regression. Most of the files changed
without any expansion, but the largest of them, src/common/kwlist_d.h, changed
from

    static const int16 h[901]

to

    static const int16 h[902]
suggesting that even with your reworking of the PerfectHash parameters, you
could not find a single set of numbers that works quite as well for both data
sets as separate sets of numbers, each tailored to a particular data set. I
started to imagine that if we wanted to use PerfectHash for still more things,
the problem of finding numbers that work across all N data sets (even if N is
only 3 or 4) might be harder still. That's all I was referring to.
901 -> 902 is such a small expansion that it might not be worth worrying about.
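For what it's worth, the effect is easy to see with a toy sketch. This is
Python rather than the Perl PerfectHash.pm, and it uses a much cruder scheme
than the real generator (a single modular hash instead of the two-hash graph
construction); the function names, candidate multipliers, and growth policy
below are all invented for illustration. The point it demonstrates: when a
fixed list of candidate multipliers fails to produce a collision-free hash at a
given table size, the only recourse is to grow the table, which is the analogue
of h[901] becoming h[902].

```python
# Toy sketch, NOT PostgreSQL's PerfectHash.pm: search a fixed list of
# candidate multipliers for one that hashes a key set with no collisions,
# growing the table whenever every candidate fails at the current size.

def toy_hash(key, mult, size):
    """Simple multiplicative string hash into a table of `size` slots."""
    h = 0
    for b in key.encode():
        h = (h * mult + b) % size
    return h

def find_perfect_params(keys, multipliers=(31, 37, 257, 1049)):
    """Return (multiplier, table_size) giving a collision-free mapping.

    Starts at the minimal table size and grows by one slot whenever no
    candidate multiplier yields a perfect hash -- mirroring how a fixed
    parameter set shared across data sets can force a slightly larger
    table than parameters tuned per data set.
    """
    size = len(keys)
    while True:
        for mult in multipliers:
            slots = {toy_hash(k, mult, size) for k in keys}
            if len(slots) == len(keys):   # no two keys collide: perfect
                return mult, size
        size += 1   # every candidate failed; pay for it with a bigger table
```

A multiplier list tuned for one key set may succeed at the minimal table size,
while a shared list must sometimes settle for a size or two larger, exactly the
sort of one-slot expansion observed above.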
-- 
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company