Robert Haas wrote:
> I'm not real familiar with ts_parse(), but I'm thinking that it
> doesn't have any special casing for email addresses and is just
> intended to parse text for full-text-search - in which case splitting
> on _ is a pretty good algorithm.
>
It is a bug. tsearch claims to identify token types, but it does not correctly identify every valid e-mail address. As Dan stated, ts_parse() fails to recognize some e-mail addresses. For example, foo+...@baz.com is a valid e-mail address, but the function fails to report it as one.
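
To see which type the parser assigns to each token, the output of ts_parse() can be joined with ts_token_type(). This is only a minimal sketch, assuming the default parser and reusing the address from this thread exactly as written above; the aliases p and tt are illustrative.

```sql
-- Pair each token produced by the default parser with its token type.
-- ts_parse() returns (tokid, token); ts_token_type() returns
-- (tokid, alias, description), so the two can be joined on tokid.
SELECT tt.alias, tt.description, p.token
FROM ts_parse('default', 'foo+...@baz.com') AS p
JOIN ts_token_type('default') AS tt USING (tokid);
```

A string that the parser accepts as an address should come back as a single token whose alias is email; rows with other aliases show where the parser split the string.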

Identifying an e-mail address that conforms to the RFC is not that simple. As that code is a state machine, IMHO it decides too early (when it finds _) that the string is not an e-mail address. AFAIR, that's not a one-line fix.

euler=# select distinct token as email from ts_parse('default', 'foo....@baz.com');
      email
─────────────────
 foo....@baz.com
(1 row)

euler=# select distinct token as email from ts_parse('default', 'foo+...@baz.com');
    email
─────────────
 foo
 +
 b...@baz.com
(3 rows)

euler=# select distinct token as email from ts_parse('default', 'foo_...@baz.com');
    email
─────────────
 foo
 b...@baz.com
 _
(3 rows)

--
  Euler Taveira de Oliveira
  http://www.timbira.com/

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs