I agree that it isn't easy to determine whether a given piece of text
is a valid email address.  Since I couldn't use ts_parse, I ended up
using a regex, which worked substantially better at pulling the emails
out of the text stream.  I haven't looked at the code, but perhaps the
same approach is possible here?  Even a regex that is 99% correct would
be better than the current tokenizer, which is right only about 80-85%
of the time.

My workaround looked something like this:

  select regexp_matches(resumetext,
                        E'[a-z0-9._%+...@[a-z0-9.-]+\\.[a-z]{2,4}',
                        'gi') as email
    from "Resume"
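
For anyone who wants to try the idea outside the database, here is a
minimal sketch in Python. The list archive elided part of the pattern
above, so the local-part character class below is an assumption (a
commonly used one), not necessarily the exact pattern I ran. The
function name and the sample addresses are made up for illustration.

```python
import re

# Assumed pattern: the character class before "@" is a common choice,
# filled in only for this sketch; the original was elided by the archive.
EMAIL_RE = re.compile(r"[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,4}", re.IGNORECASE)


def extract_emails(text):
    """Return every substring of text that matches EMAIL_RE, in order."""
    return EMAIL_RE.findall(text)


# Hypothetical sample text containing "_" and "+" in the local part.
sample = "Reach alice_smith@example.com or Bob+jobs@Example.org for details."
print(extract_emails(sample))
```

This is deliberately loose rather than RFC-complete: for instance, the
{2,4} bound on the last label cuts off longer top-level domains, so it
trades some correctness for simplicity.
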
cheers
Dan

On Thu, Oct 22, 2009 at 3:39 PM, Euler Taveira de Oliveira
<eu...@timbira.com> wrote:
> Robert Haas wrote:
>> I'm not real familiar with ts_parse(), but I'm thinking that it
>> doesn't have any special casing for email addresses and is just
>> intended to parse text for full-text-search - in which case splitting
>> on _ is a pretty good algorithm.
>>
> It is a bug. tsearch claims to identify token types, but it doesn't
> correctly identify every valid e-mail address. As Dan stated, ts_parse() fails
> to recognize some e-mail addresses. For example, foo+...@baz.com is a valid
> e-mail address, but the function fails to report it as one.
>
> It is not that simple to identify an e-mail address that conforms to the RFC.
> As that code is a state machine, IMHO it decides too early (when it finds _)
> that the string is not an e-mail address. AFAIR, that's not a one-line fix.
>
> euler=# select distinct token as email from ts_parse('default',
> 'foo....@baz.com');
>      email
> ─────────────────
>  foo....@baz.com
> (1 row)
>
> euler=# select distinct token as email from ts_parse('default',
> 'foo+...@baz.com');
>    email
> ─────────────
>  foo
>  +
>  ...@baz.com
> (3 rows)
>
> euler=# select distinct token as email from ts_parse('default',
> 'foo_...@baz.com');
>    email
> ─────────────
>  foo
>  ...@baz.com
>  _
> (3 rows)
>
>
> --
>  Euler Taveira de Oliveira
>  http://www.timbira.com/
>



-- 
-------------------------------------------------------------------
Dan O'Hara
Danara Software Systems, Inc.
danarasoftw...@gmail.com
613 288-8733

-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs
