Thanks for having a look at this bug.

According to section 12.8.2 of the PostgreSQL manual, ts_parse is
supposed to recognize different types of data, one of which (token
type #4) is an email address.

The list of token types recognized by the default parser can be selected via this query:

 SELECT * FROM ts_token_type('default');

The example in the bug I reported is a valid email address according to
the RFC, but it isn't recognized as such by the full text search in
PostgreSQL.  This bug will have a real impact on anybody using the ts
functions to locate email addresses, as only some of them are found by
the query.
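
To show that tokid is the token type rather than the position of the
token, a query along these lines lists the type name next to each
token.  This is only a sketch: the column names are the documented
output columns of ts_parse and ts_token_type, and the sample address
is made up for illustration.

```sql
-- List every token together with the alias and description of its type.
-- 'some_user@example.com' is a hypothetical address, not one from the bug report.
SELECT p.tokid, t.alias, t.description, p.token
FROM ts_parse('default', 'some_user@example.com') AS p
JOIN ts_token_type('default') AS t USING (tokid);
```

Filtering on tokid = 4 therefore selects tokens of the email type,
wherever they appear in the input.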

Regards
Dan



On Thu, Oct 22, 2009 at 12:29 PM, Robert Haas <robertmh...@gmail.com> wrote:
> On Fri, Aug 28, 2009 at 9:59 AM, Dan O'Hara <danarasoftw...@gmail.com> wrote:
>>
>> The following bug has been logged online:
>>
>> Bug reference:      5021
>> Logged by:          Dan O'Hara
>> Email address:      danarasoftw...@gmail.com
>> PostgreSQL version: 8.3.7
>> Operating system:   win32
>> Description:        ts_parse doesn't recognize email addresses with
>> underscores
>> Details:
>>
>> In the following example,
>>
>> select distinct token as email
>> from ts_parse('default', ' first_l...@yahoo.com '   )
>> where tokid = 4
>>
>> ts_parse returns l...@yahoo.com rather than first_l...@yahoo.com.  It seems
>> that any text prior to the underscore is truncated.  If the portion
>> following the underscore is only numeric, such as this example,
>>
>> select distinct token as email
>> from ts_parse('default', ' bill_2...@yahoo.com '   )
>> where tokid = 4
>>
>> then ts_parse returns nothing at all.
>>
>> Section 3.2.3 of RFC 5322 indicates that underscores are valid characters in
>> an email address.
>>
>> http://tools.ietf.org/html/rfc5322
>
> I don't think this has much to do with email addresses.  If you do:
>
> select token from ts_parse('default', 'a_b');
>
> ...you get three tokens.  In your case you're pulling out the fourth
> token, but some of your examples don't have four tokens, so then you
> get nothing at all.
>
> I'm not real familiar with ts_parse(), but I'm thinking that it
> doesn't have any special casing for email addresses and is just
> intended to parse text for full-text-search - in which case splitting
> on _ is a pretty good algorithm.
>
> ...Robert
>



-- 
-------------------------------------------------------------------
Dan O'Hara
Danara Software Systems, Inc.
danarasoftw...@gmail.com
613 288-8733
