"Heikki Linnakangas" <[EMAIL PROTECTED]> writes:
> Alvaro Herrera wrote:
>> Tom Lane wrote:
>>
>>> ISTM that perhaps a more generally useful definition would be
>>>
>>> lword    Only ASCII letters
>>> nlword   Entirely letters per iswalpha(), but not lword
>>> word     Entirely alphanumeric per iswalnum(), but not nlword
>>>          (hence, includes at least one digit)
>> ...
>> I am not sure if there are any western european languages were words can
>> only be formed with non-ascii chars.
>
> There is at least in Swedish: "ö" (island) and å (river). They're both a
> bit special because they're just one letter each.

For what it's worth, I did the same search last night and found three French
words, including "çà" -- which admittedly is likely to be a noise word. Other
dictionaries, such as Italian and Irish, also have one-letter words like this.
The only other dictionary with multi-letter words is actually Faroese, with
"íð" and "óð".

> I like the "aword" name more than "lword", BTW. If we change the meaning
> of the classes, surely we can change the name as well, right?

I'm not very familiar with the use case here. Is there a good reason to want
to abbreviate these names? I think I would expect "ascii", "word", and "token"
for the three categories Tom describes.

> Note that the default parser is useless for languages like Japanese,
> where words are not separated by whitespace, anyway.

I also wonder about languages like Arabic and Hindi, which do have words, but
I'm not sure whether they use whitespace as simply as Latin-script languages
do.

-- 
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com
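
To make the three classes Tom describes concrete, here is a rough Python
sketch, not the parser's actual code. It assumes str.isalpha() and
str.isalnum() as stand-ins for iswalpha() and iswalnum(); those are
Unicode-based rather than locale-based, so results may differ from the C
functions. The function name classify() and the "other" fallback are
hypothetical, and the examples assume precomposed (NFC) characters.

```python
def classify(token: str) -> str:
    """Assign a token to one of the proposed classes (illustrative only)."""
    if not token:
        return "other"      # hypothetical fallback, not in the proposal
    if token.isascii() and token.isalpha():
        return "lword"      # only ASCII letters
    if token.isalpha():
        return "nlword"     # entirely letters, but not lword
    if token.isalnum():
        return "word"       # entirely alphanumeric, so at least one digit
    return "other"          # anything else (punctuation, hyphens, ...)


if __name__ == "__main__":
    # Sample tokens taken from the words mentioned in this thread.
    for tok in ["island", "ö", "å", "çà", "íð", "óð", "abc123"]:
        print(tok, classify(tok))
```

The order of the checks matters: each class is defined as "but not" the
previous one, so the narrowest test has to come first.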