Re: [BUGS] BUG #5021: ts_parse doesn't recognize email addresses with underscores

Euler Taveira de Oliveira Thu, 22 Oct 2009 12:40:01 -0700

Robert Haas wrote:
> I'm not real familiar with ts_parse(), but I'm thinking that it
> doesn't have any special casing for email addresses and is just
> intended to parse text for full-text-search - in which case splitting
> on _ is a pretty good algorithm.
> 
It is a bug. tsearch claims to identify token types, but it doesn't
correctly identify every valid e-mail address. As Dan stated, ts_parse() fails
to recognize an e-mail address. For example, [email protected] is a valid e-mail
address, but the function fails to report it as one.


It is not that simple to identify an e-mail address in a way that conforms to
the RFC. As that code is a state machine, IMHO it decides too early (when it
finds _) that the string is not an e-mail address. AFAIR, that's not a
one-line fix.
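
To illustrate what "decides too early" means, here is a toy sketch in Python.
It is NOT the PostgreSQL parser code and it does not implement the RFC; the
function names, the character sets and the example.com addresses are all
assumptions made up for the illustration. The eager scanner gives up on the
e-mail hypothesis at the first non-alphanumeric character before the @, while
the deferred one keeps the hypothesis alive until it has seen the @.

```python
# Toy illustration only; this is not PostgreSQL's parser and not RFC-complete.
import string

ALNUM = set(string.ascii_letters + string.digits)
WORD_CHARS = ALNUM                 # what the eager scanner tolerates before '@'
LOCAL_CHARS = ALNUM | set("._+-")  # simplified local-part alphabet (assumption)
HOST_CHARS = ALNUM | set(".-")     # simplified host alphabet (assumption)


def looks_like_host(s):
    """Very rough host check: at least two non-empty dot-separated labels."""
    labels = s.split(".")
    return len(labels) >= 2 and all(labels) and set(s) <= HOST_CHARS


def eager_is_email(text):
    """Commits to 'not an e-mail' at the first odd character before '@'."""
    for i, ch in enumerate(text):
        if ch == "@":
            return i > 0 and looks_like_host(text[i + 1:])
        if ch not in WORD_CHARS:
            return False  # decided too early: '_' or '+' ends the attempt
    return False


def deferred_is_email(text):
    """Keeps the e-mail hypothesis until '@' or an impossible character."""
    for i, ch in enumerate(text):
        if ch == "@":
            return i > 0 and looks_like_host(text[i + 1:])
        if ch not in LOCAL_CHARS:
            return False
    return False
```

With these definitions, eager_is_email("foo_bar@example.com") is false while
deferred_is_email("foo_bar@example.com") is true. In the real parser the
change is harder than swapping a character set, because the same states are
shared with other token types.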

euler=# select distinct token as email from ts_parse('default',
'[email protected]');
      email
─────────────────
 [email protected]
(1 row)

euler=# select distinct token as email from ts_parse('default',
'[email protected]');
    email
─────────────
 foo
 +
 [email protected]
(3 rows)

euler=# select distinct token as email from ts_parse('default',
'[email protected]');
    email
─────────────
 foo
 [email protected]
 _
(3 rows)


-- 
  Euler Taveira de Oliveira
  http://www.timbira.com/

-- 
Sent via pgsql-bugs mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs
