On Wed, Feb 10, 2016 at 12:28 PM, Oleg Bartunov <obartu...@gmail.com> wrote:
> It looks like there is a very old bug in the full text parser (somebody
> pointed it out to me), which appeared after tsearch2 was moved into core.
> The problem is in how the full text parser processes hyphenated words. Our
> original idea was to report the hyphenated word itself as well as its
> parts, and to ignore the hyphen. That is how tsearch2 worked.
>
> This behaviour changed after tsearch2 was moved into core:
> 1. The hyphen is now reported by the parser, which is useless.
> 2. Hyphenated words containing numbers ('4-dot', 'dot-4') are processed
>    differently from ones made of plain words like 'four-dot': the
>    hyphenated word itself is not reported.
>
> I think we should consider this a bug and produce a fix for all supported
> versions.
>
> After investigation we found this commit:
>
> commit 73e6f9d3b61995525785b2f4490b465fe860196b
> Author: Tom Lane <t...@sss.pgh.pa.us>
> Date:   Sat Oct 27 19:03:45 2007 +0000
>
>     Change text search parsing rules for hyphenated words so that digit strings
>     containing decimal points aren't considered part of a hyphenated word.
>     Sync the hyphenated-word lookahead states with the subsequent part-by-part
>     reparsing states so that we don't get different answers about how much text
>     is part of the hyphenated word.  Per my gripe of a few days ago.
>
> 8.2.23
>
> select tok_type, description, token from ts_debug('dot-four');
>   tok_type   |          description          |  token
> -------------+-------------------------------+----------
>  lhword      | Latin hyphenated word         | dot-four
>  lpart_hword | Latin part of hyphenated word | dot
>  lpart_hword | Latin part of hyphenated word | four
> (3 rows)
>
> select tok_type, description, token from ts_debug('dot-4');
>   tok_type   |          description          | token
> -------------+-------------------------------+-------
>  hword       | Hyphenated word               | dot-4
>  lpart_hword | Latin part of hyphenated word | dot
>  uint        | Unsigned integer              | 4
> (3 rows)
>
> select tok_type, description, token from ts_debug('4-dot');
>  tok_type |   description    | token
> ----------+------------------+-------
>  uint     | Unsigned integer | 4
>  lword    | Latin word       | dot
> (2 rows)
>
> 8.3.23
>
> select alias, description, token from ts_debug('dot-four');
>       alias      |           description           |  token
> -----------------+---------------------------------+----------
>  asciihword      | Hyphenated word, all ASCII      | dot-four
>  hword_asciipart | Hyphenated word part, all ASCII | dot
>  blank           | Space symbols                   | -
>  hword_asciipart | Hyphenated word part, all ASCII | four
> (4 rows)
>
> select alias, description, token from ts_debug('dot-4');
>    alias   |   description   | token
> -----------+-----------------+-------
>  asciiword | Word, all ASCII | dot
>  int       | Signed integer  | -4
> (2 rows)
>
> select alias, description, token from ts_debug('4-dot');
>    alias   |   description    | token
> -----------+------------------+-------
>  uint      | Unsigned integer | 4
>  blank     | Space symbols    | -
>  asciiword | Word, all ASCII  | dot
> (3 rows)
>

Oh, one more bug, which existed even in tsearch2.

select tok_type, description, token from ts_debug('4-dot');
 tok_type |   description    | token
----------+------------------+-------
 uint     | Unsigned integer | 4
 lword    | Latin word       | dot
(2 rows)

>
> Regards,
> Oleg
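
To make the intended behaviour concrete (report the hyphenated word itself,
then its parts, and never the hyphen), here is a rough model in Python. It is
only an illustrative sketch, not the actual parser code: the function name
tokenize() and the labels "hword", "part" and "word" are made up for this
example, and it assumes that any run of ASCII letters or digits counts as a
part.

```python
import re

# Hypothetical model of the intended behaviour; NOT the real parser.
# Assumption: a "part" is any run of ASCII letters or digits.
HWORD = re.compile(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)+")
PART = re.compile(r"[A-Za-z0-9]+")


def tokenize(text):
    """Return (kind, token) pairs; hyphens themselves are never reported."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = HWORD.match(text, pos)
        if m:
            # Report the whole hyphenated word first, then each of its parts.
            tokens.append(("hword", m.group()))
            tokens.extend(("part", p) for p in m.group().split("-"))
            pos = m.end()
            continue
        m = PART.match(text, pos)
        if m:
            tokens.append(("word", m.group()))
            pos = m.end()
            continue
        pos += 1  # skip separators: spaces, stray hyphens, punctuation
    return tokens


if __name__ == "__main__":
    for s in ("dot-four", "dot-4", "4-dot"):
        print(s, tokenize(s))
```

Under this model 'dot-four', 'dot-4' and '4-dot' are all handled the same
way: one token for the whole hyphenated word followed by one token per part,
with no token for the hyphen.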