This does not seem right:

regression=# select alias,description,token from ts_debug('foo-8.3beta');
      alias      |             description             |  token  
-----------------+-------------------------------------+---------
 numhword        | Hyphenated word, letters and digits | foo-8.3
 hword_asciipart | Hyphenated word part, all ASCII     | foo
 blank           | Space symbols                       | -
 float           | Decimal notation                    | 8.3
 hword_asciipart | Hyphenated word part, all ASCII     | beta
(5 rows)

(Code from just before my last commit behaves the same, modulo names of
token types, so I didn't break it just now.)

Surely, if "beta" is an hword part here, it should have been reported as
part of the overall hword.  However, this is all pretty inconsistent,
because if "8.3" had been in the first chunk of text then we'd not have
considered it part of an hword at all:

regression=# select alias,description,token from ts_debug('8.3beta-foo');
      alias      |           description           |  token   
-----------------+---------------------------------+----------
 float           | Decimal notation                | 8.3
 asciihword      | Hyphenated word, all ASCII      | beta-foo
 hword_asciipart | Hyphenated word part, all ASCII | beta
 blank           | Space symbols                   | -
 hword_asciipart | Hyphenated word part, all ASCII | foo
(5 rows)

regression=# select alias,description,token from ts_debug('beta8.3-foo');
 alias |    description    |    token    
-------+-------------------+-------------
 file  | File or path name | beta8.3-foo
(1 row)

regression=# select alias,description,token from ts_debug('foo-beta8.3-foo');
      alias      |               description                |   token   
-----------------+------------------------------------------+-----------
 numhword        | Hyphenated word, letters and digits      | foo-beta8
 hword_asciipart | Hyphenated word part, all ASCII          | foo
 blank           | Space symbols                            | -
 hword_numpart   | Hyphenated word part, letters and digits | beta8
 blank           | Space symbols                            | .
 uint            | Unsigned integer                         | 3
 blank           | Space symbols                            | -
 asciiword       | Word, all ASCII                          | foo
(8 rows)

I'm of the opinion that in no circumstance should "." be considered part
of an hword: the definition of word should not be allowed to stretch
beyond letters and digits.  So I think the second and fourth examples
I showed above are correct.  The third (where it concludes it's a
filename) is maybe a bit odd, but in any case it's not an hword so I won't
complain.  I think the first example ought to parse as

        asciiword       foo
        blank           -
        float           8.3
        asciiword       beta

(Or maybe the '-' should fold into the float?  Don't care much...)

This is all a little bit tricky, since this behavior seems reasonable:

regression=# select alias,description,token from ts_debug('foo-83beta');
      alias      |               description                |   token    
-----------------+------------------------------------------+------------
 numhword        | Hyphenated word, letters and digits      | foo-83beta
 hword_asciipart | Hyphenated word part, all ASCII          | foo
 blank           | Space symbols                            | -
 hword_numpart   | Hyphenated word part, letters and digits | 83beta
(4 rows)

regression=# select alias,description,token from ts_debug('83beta-foo');
      alias      |               description                |   token    
-----------------+------------------------------------------+------------
 numhword        | Hyphenated word, letters and digits      | 83beta-foo
 hword_numpart   | Hyphenated word part, letters and digits | 83beta
 blank           | Space symbols                            | -
 hword_asciipart | Hyphenated word part, all ASCII          | foo
(4 rows)

Basically I'm arguing that a string should be considered valid as a
second or subsequent component of an hword if and only if it would be
considered valid as the first component.
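
To make the proposed rule concrete, here is a rough sketch in Python.  It
is not the parser's code and does not mimic its state machine; the function
names are made up for illustration.  It assumes a "word" is a non-empty run
of ASCII letters and digits (the real parser also accepts non-ASCII
letters), and it classifies whole strings only rather than finding the
longest hword inside a larger text.  How all-digit chunks should be treated
is left open here.

```python
import re


def is_valid_first_component(text: str) -> bool:
    """Hypothetical stand-in for "acceptable as the first chunk of an hword".

    Assumption: only ASCII letters and digits count, so "." never does.
    """
    return re.fullmatch(r"[A-Za-z0-9]+", text) is not None


def is_valid_later_component(text: str) -> bool:
    # The proposal: no separate, looser test for second or later chunks.
    return is_valid_first_component(text)


def is_hword(text: str) -> bool:
    """True if the whole string is two or more valid chunks joined by '-'."""
    parts = text.split("-")
    if len(parts) < 2:
        return False
    first, rest = parts[0], parts[1:]
    return is_valid_first_component(first) and all(
        is_valid_later_component(p) for p in rest
    )


if __name__ == "__main__":
    for sample in ("foo-83beta", "83beta-foo", "foo-8.3beta", "8.3beta-foo"):
        print(sample, is_hword(sample))
```

Under these assumptions 'foo-83beta' and '83beta-foo' both qualify as
hwords, while 'foo-8.3beta' and '8.3beta-foo' do not, regardless of which
side of the hyphen the "8.3" appears on.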

Comments?

                        regards, tom lane
