I agree that it isn't easy to determine whether a given piece of text is a valid email address. Since I couldn't use ts_parse, I ended up using a regex, which worked substantially better at pulling the emails out of the text stream. I haven't looked at the code, but perhaps it is possible to do the same thing here? Even a regex that is 99% correct would be better than the current tokenizer, which is only right about 80-85% of the time.
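
For anyone who wants to try the regex idea outside the database, here is a minimal Python sketch. The pattern and the name extract_emails are my own assumptions for illustration: it is a common, deliberately loose approximation of an address, not an RFC-compliant matcher and not the tokenizer's logic.

```python
import re

# Assumed pattern: a loose approximation of an email address.
# It accepts '_' and '+' in the local part, which is where the
# tokenizer discussed in this thread splits the token.
EMAIL_RE = re.compile(r"[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,4}", re.IGNORECASE)


def extract_emails(text):
    """Return every substring of text that looks like an email address."""
    return EMAIL_RE.findall(text)


if __name__ == "__main__":
    sample = "Contact alice_smith@example.com or bob+jobs@example.org for details."
    print(extract_emails(sample))
```

Note that the {2,4} bound on the last label means top-level domains longer than four letters get cut short, so this is a trade of strictness for simplicity rather than a complete answer.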
My workaround looked something like this:

    select regexp_matches(resumetext,E'[a-z0-9._%+...@[a-z0-9.-]+\\.[a-z]{2,4}','gi') as email
    from "Resume"

cheers
Dan

On Thu, Oct 22, 2009 at 3:39 PM, Euler Taveira de Oliveira <eu...@timbira.com> wrote:
> Robert Haas wrote:
>> I'm not real familiar with ts_parse(), but I'm thinking that it
>> doesn't have any special casing for email addresses and is just
>> intended to parse text for full-text-search - in which case splitting
>> on _ is a pretty good algorithm.
>>
> It is a bug. The tsearch claims to identify types of tokens but it doesn't
> correctly identify any valid e-mail addresses. As Dan stated ts_parse() fails
> to recognize an e-mail address. For example, foo+...@baz.com is a valid e-mail
> but the function fails to report that.
>
> It is not that simple to identify an e-mail address that agrees with RFC. As
> that code is a state machine, IMHO it decides too early (when it finds _) that
> that string is not an e-mail address. AFAIR, that's not an one-line fix.
>
> euler=# select distinct token as email from ts_parse('default',
> 'foo....@baz.com');
>       email
> ─────────────────
>  foo....@baz.com
> (1 row)
>
> euler=# select distinct token as email from ts_parse('default',
> 'foo+...@baz.com');
>     email
> ─────────────
>  foo
>  +
>  ...@baz.com
> (3 rows)
>
> euler=# select distinct token as email from ts_parse('default',
> 'foo_...@baz.com');
>     email
> ─────────────
>  foo
>  ...@baz.com
>  _
> (3 rows)
>
>
> --
>   Euler Taveira de Oliveira
>   http://www.timbira.com/
>

--
-------------------------------------------------------------------
Dan O'Hara
Danara Software Systems, Inc.
danarasoftw...@gmail.com
613 288-8733

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs