Tom Lane wrote:

> ISTM that perhaps a more generally useful definition would be
>
> lword    Only ASCII letters
> nlword   Entirely letters per iswalpha(), but not lword
> word     Entirely alphanumeric per iswalnum(), but not nlword
>          (hence, includes at least one digit)
>
> However, I am no linguist and maybe I'm missing something.

I tend to agree with the need to redefine the categories. I am not sure
I agree with this particular definition though. I would think that a
"latin word" should include ASCII letters and accented letters, and a
non-latin word would be one that included only non-ASCII chars.

alvherre=# select * from ts_debug('spanish', 'añadido añadió añadidura');
 Alias |  Description  |   Token   |  Dictionaries  |      Lexized token
-------+---------------+-----------+----------------+--------------------------
 word  | Word          | añadido   | {spanish_stem} | spanish_stem: {añad}
 blank | Space symbols |           | {}             |
 word  | Word          | añadió    | {spanish_stem} | spanish_stem: {añad}
 blank | Space symbols |           | {}             |
 word  | Word          | añadidura | {spanish_stem} | spanish_stem: {añadidur}
(5 lignes)

I would think those would all fit in the "latin word" category. This
example is more interesting because it shows a word categorized
differently just because the plural loses the accent:

alvherre=# select * from ts_debug('spanish', 'caracteres carácter');
 Alias |  Description  |   Token    |  Dictionaries  |      Lexized token
-------+---------------+------------+----------------+--------------------------
 lword | Latin word    | caracteres | {spanish_stem} | spanish_stem: {caracter}
 blank | Space symbols |            | {}             |
 word  | Word          | carácter   | {spanish_stem} | spanish_stem: {caract}
(3 lignes)

I am not sure if there are any Western European languages where words
can be formed with only non-ASCII chars. At least in Spanish, accents
tend to be rare. However, I would think this is also wrong:

alvherre=# select * from ts_debug('french', 'à');
 Alias  |  Description   | Token | Dictionaries  |  Lexized token
--------+----------------+-------+---------------+-----------------
 nlword | Non-latin word | à     | {french_stem} | french_stem: {}
(1 ligne)

I don't think this is much of a problem, since this particular word is
(most likely) a stopword.

So, how about

lword    Entirely letters per iswalpha, with at least one ASCII letter
nlword   Entirely letters per iswalpha, but not lword
word     Entirely alphanumeric per iswalnum, but not nlword

-- 
Alvaro Herrera                        http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
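
For comparison, here is a minimal sketch of the two definitions discussed in this message, written in Python rather than against the parser itself. It assumes that str.isalpha() and str.isalnum() can stand in for iswalpha() and iswalnum(); that is only an approximation, since the C functions depend on the current locale. The function names classify_quoted and classify_proposed are made up for illustration, and tokens are assumed to use precomposed (NFC) characters.

```python
def classify_quoted(token: str) -> str:
    """Categories per the quoted definition: lword is ASCII letters only."""
    if token.isalpha() and token.isascii():
        return "lword"
    if token.isalpha():
        return "nlword"   # all letters, but at least one non-ASCII letter
    if token.isalnum():
        return "word"     # alphanumeric, hence contains at least one digit
    return "other"        # not a word token under either definition


def classify_proposed(token: str) -> str:
    """Categories per the counter-proposal: lword needs one ASCII letter."""
    if token.isalpha():
        # The token is all letters, so any ASCII char in it is an ASCII letter.
        if any(ch.isascii() for ch in token):
            return "lword"
        return "nlword"   # letters only, none of them ASCII
    if token.isalnum():
        return "word"
    return "other"


if __name__ == "__main__":
    for tok in ["caracteres", "carácter", "añadido", "à"]:
        print(tok, classify_quoted(tok), classify_proposed(tok))
```

Under these assumptions the two definitions differ only for all-letter tokens that mix ASCII and accented letters, such as "carácter": the quoted definition puts them in nlword, the proposed one in lword. A token like "à" is nlword under both.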