Re: [HACKERS] old bug in full text parser

Oleg Bartunov Wed, 10 Feb 2016 02:05:32 -0800

On Wed, Feb 10, 2016 at 12:28 PM, Oleg Bartunov <obartu...@gmail.com> wrote:


> It looks like there is a very old bug in the full text parser (somebody
> pointed it out to me), which appeared after tsearch2 was moved into core. The
> problem is in how the full text parser processes hyphenated words. Our original
> idea was to report the hyphenated word itself as well as its parts, and to
> ignore the hyphen. That is how tsearch2 worked.
>
> This behaviour changed after tsearch2 was moved into core:
> 1. The hyphen is now reported by the parser, which is useless.
> 2. Hyphenated words containing numbers ('4-dot', 'dot-4') are processed
> differently from ones made of plain text words like 'four-dot': the
> hyphenated word itself is not reported.
>
> I think we should consider this a bug and produce a fix for all supported
> versions.
>
> After investigation we found this commit:
>
> commit 73e6f9d3b61995525785b2f4490b465fe860196b
> Author: Tom Lane <t...@sss.pgh.pa.us>
> Date:   Sat Oct 27 19:03:45 2007 +0000
>
>     Change text search parsing rules for hyphenated words so that digit strings
>     containing decimal points aren't considered part of a hyphenated word.
>     Sync the hyphenated-word lookahead states with the subsequent part-by-part
>     reparsing states so that we don't get different answers about how much text
>     is part of the hyphenated word.  Per my gripe of a few days ago.
>
>
> 8.2.23
>
> select tok_type, description, token from ts_debug('dot-four');
>   tok_type   |          description          |  token
> -------------+-------------------------------+----------
>  lhword      | Latin hyphenated word         | dot-four
>  lpart_hword | Latin part of hyphenated word | dot
>  lpart_hword | Latin part of hyphenated word | four
> (3 rows)
>
> select tok_type, description, token from ts_debug('dot-4');
>   tok_type   |          description          | token
> -------------+-------------------------------+-------
>  hword       | Hyphenated word               | dot-4
>  lpart_hword | Latin part of hyphenated word | dot
>  uint        | Unsigned integer              | 4
> (3 rows)
>
> select tok_type, description, token from ts_debug('4-dot');
>  tok_type |   description    | token
> ----------+------------------+-------
>  uint     | Unsigned integer | 4
>  lword    | Latin word       | dot
> (2 rows)
>
> 8.3.23
>
> select alias, description, token from ts_debug('dot-four');
>       alias      |           description           |  token
> -----------------+---------------------------------+----------
>  asciihword      | Hyphenated word, all ASCII      | dot-four
>  hword_asciipart | Hyphenated word part, all ASCII | dot
>  blank           | Space symbols                   | -
>  hword_asciipart | Hyphenated word part, all ASCII | four
> (4 rows)
>
> select alias, description, token from ts_debug('dot-4');
>    alias   |   description   | token
> -----------+-----------------+-------
>  asciiword | Word, all ASCII | dot
>  int       | Signed integer  | -4
> (2 rows)
>
> select alias, description, token from ts_debug('4-dot');
>    alias   |   description    | token
> -----------+------------------+-------
>  uint      | Unsigned integer | 4
>  blank     | Space symbols    | -
>  asciiword | Word, all ASCII  | dot
> (3 rows)
>
>

Oh, one more bug, which existed even in tsearch2: '4-dot' is not reported as
a hyphenated word at all (8.2.23 output again):

select tok_type, description, token from ts_debug('4-dot');
 tok_type |   description    | token
----------+------------------+-------
 uint     | Unsigned integer | 4
 lword    | Latin word       | dot
(2 rows)
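
To make the intended behaviour concrete (report the hyphenated word itself,
then its parts, and never the hyphen), here is a rough sketch in Python. It is
only an illustration and not the parser code: the function name
expected_tokens and the labels 'hword', 'part', 'word' and 'uint' are
hypothetical, and it ignores everything the real parser also handles
(non-ASCII text, decimals, signed numbers, mixed alphanumeric types).

```python
import re

# Hypothetical illustration only: not the PostgreSQL parser, and the labels
# below are simplified stand-ins for the real token type aliases.
HYPHENATED = re.compile(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*")


def expected_tokens(text):
    """Return (label, token) pairs: whole hyphenated word, then its parts.

    The hyphen itself is never emitted as a token.
    """
    tokens = []
    for chunk in HYPHENATED.findall(text):
        parts = chunk.split("-")
        is_hyphenated = len(parts) > 1
        if is_hyphenated:
            # Report the hyphenated word itself first.
            tokens.append(("hword", chunk))
        for part in parts:
            if part.isdigit():
                label = "uint"
            elif is_hyphenated:
                label = "part"
            else:
                label = "word"
            tokens.append((label, part))
    return tokens


for sample in ("dot-four", "dot-4", "4-dot"):
    print(sample, expected_tokens(sample))
```

Under these assumptions all three inputs are treated the same way: each yields
the whole hyphenated word followed by its two parts, regardless of whether a
part is a number or where the number appears.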




>
> Regards,
> Oleg
>
