Thanks for having a look at this bug. According to section 12.8.2 of the postgres manual, ts_parse is supposed to recognize different types of data, one of which (#4) is an email address.
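
As a rough illustration (the address below is a made-up placeholder and the table aliases are arbitrary), joining the parser output against the token-type list shows which type each token was assigned. Note that tokid identifies the token type, not the token's position in the string:

```sql
-- Sketch only: 'someone@example.com' is a placeholder address.
-- ts_parse returns (tokid, token); ts_token_type returns (tokid, alias, description).
SELECT p.tokid, t.alias, t.description, p.token
FROM ts_parse('default', 'someone@example.com') AS p
JOIN ts_token_type('default') AS t USING (tokid);
```

Filtering on tokid = 4, as in my original report, therefore keeps only the tokens the parser classified as email addresses.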

The list of token types recognized by the parser can be obtained with this query:

    SELECT * FROM ts_token_type('default');

The example in the bug I reported is a valid email address according to the RFC, but it isn't recognized as such by the full text search in postgres. This bug will have a real impact on anybody using the ts functions to locate email addresses, as only some of the addresses will be found by the query.

Regards
Dan

On Thu, Oct 22, 2009 at 12:29 PM, Robert Haas <robertmh...@gmail.com> wrote:
> On Fri, Aug 28, 2009 at 9:59 AM, Dan O'Hara <danarasoftw...@gmail.com> wrote:
>>
>> The following bug has been logged online:
>>
>> Bug reference:      5021
>> Logged by:          Dan O'Hara
>> Email address:      danarasoftw...@gmail.com
>> PostgreSQL version: 8.3.7
>> Operating system:   win32
>> Description:        ts_parse doesn't recognize email addresses with
>> underscores
>> Details:
>>
>> In the following example,
>>
>> select distinct token as email
>> from ts_parse('default', ' first_l...@yahoo.com ' )
>> where tokid = 4
>>
>> ts_parse returns l...@yahoo.com rather than first_l...@yahoo.com. It seems
>> that any text prior to the underscore is truncated. If the portion
>> following the underscore is only numeric, such as this example,
>>
>> select distinct token as email
>> from ts_parse('default', ' bill_2...@yahoo.com ' )
>> where tokid = 4
>>
>> then ts_parse returns nothing at all.
>>
>> section 3.2.3 of RFC 5322 indicates that underscores are valid characters in
>> an email address.
>>
>> http://tools.ietf.org/html/rfc5322
>
> I don't think this has much to do with email addresses. If you do:
>
> select token from ts_parse('a_b');
>
> ...you get three tokens. In your case you're pulling out the fourth
> token, but some of your examples don't have four tokens, so then you
> get nothing at all.
>
> I'm not real familiar with ts_parse(), but I'm thinking that it
> doesn't have any special casing for email addresses and is just
> intended to parse text for full-text-search - in which case splitting
> on _ is a pretty good algorithm.
>
> ...Robert
>

-- 
-------------------------------------------------------------------
Dan O'Hara
Danara Software Systems, Inc.
danarasoftw...@gmail.com
613 288-8733