On Tue, Jan 03, 2012 at 06:04:23PM +0000, val...@gmail.com wrote:
> The following bug has been logged on the website:
>
> Bug reference:      6375
> Logged by:          Valentine Gogichashvili
> Email address:      val...@gmail.com
> PostgreSQL version: 9.1.1
> Operating system:   Debian 4.4.5-8
> Description:
>
> Hello,
>
> default tsearch parser does not recognize all valid email addresses and
> tokenizes them as text, splitting into tokens.
>
> For example:
>
> postgres=# select to_tsquery('simple', 'nor...@email.com' );
>      to_tsquery
> ────────────────────
>  'nor...@email.com'
> (1 row)
>
> here it behaves ok;
>
> postgres=# select to_tsquery('simple', '-still-nor...@email.com' );
>         to_tsquery
> ──────────────────────────
>  'still-nor...@email.com'
> (1 row)
>
> here it trims '-' from the beginning of an email. This is not correct, but
> will at least find that email.
>
> postgres=# select to_tsquery('simple', '-not-normal-with-da...@email.com' );
>                                 to_tsquery
> ───────────────────────────────────────────────────────────────────────────────
>  'not-normal-with-dash' & 'not' & 'normal' & 'with' & 'dash' & 'email.com'
> (1 row)
>
> and this is now a real problem as it leads to finding emails that are not
> the same, but are "super-sets" of that one.
>
> Valid email characters, that are not correctly treated also are at least '+'
> and '.'

Yep.  :-(  You can see the oddness here:

    test=> SELECT alias, description, token FROM ts_debug('-myn...@gmail.com');
     alias |  description  |      token
    -------+---------------+------------------
     blank | Space symbols | -
     email | Email address | myn...@gmail.com
    (2 rows)

    test=> SELECT alias, description, token FROM ts_debug('-myna...@gmail.com');
     alias |  description  |       token
    -------+---------------+-------------------
     blank | Space symbols | -
     email | Email address | myna...@gmail.com
    (2 rows)

    test=> SELECT alias, description, token FROM ts_debug('-myna-...@gmail.com');
          alias      |           description           |   token
    -----------------+---------------------------------+-----------
     blank           | Space symbols                   | -
     asciihword      | Hyphenated word, all ASCII      | myna-me
     hword_asciipart | Hyphenated word part, all ASCII | myna
     blank           | Space symbols                   | -
     hword_asciipart | Hyphenated word part, all ASCII | me
     blank           | Space symbols                   | -@
     host            | Host                            | gmail.com
    (7 rows)

The first and second show that the leading-dash is separated.  The third
one shows that a trailing dash causes the middle-dash to also be
separated.  This email thread from 2010 has a similar problem:

    http://archives.postgresql.org/pgsql-hackers/2010-10/msg00772.php

What is limiting a fix for this is the breaking of existing behavior,
and the breaking of indexes used during pg_upgrade.  I have added your
email to the existing TODO item:

    http://wiki.postgresql.org/wiki/Todo#Text_Search

    Improve handling of dash and plus signs in email address user names,
    and perhaps improve URL parsing

        http://archives.postgresql.org/pgsql-hackers/2010-10/msg00772.php
        tsearch does not recognize all valid emails

-- 
  Bruce Momjian  <br...@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +

-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
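
The report also mentions '+' and '.' in the user name.  Those cases can
be checked the same way with ts_debug().  What follows is only a sketch:
the addresses are made-up placeholders (example.com), not the ones from
the report, and no output is shown because it may differ between
releases.

```sql
-- Sketch only: the addresses below are hypothetical examples.
-- ts_debug() lists each token the parser produces and its type.  An
-- address that is handled as a whole should show up as a single row
-- with alias 'email'.
SELECT alias, description, token
  FROM ts_debug('simple', 'first.last@example.com');

SELECT alias, description, token
  FROM ts_debug('simple', 'user+tag@example.com');

-- Same check for a leading dash combined with a dash inside the user name.
SELECT alias, description, token
  FROM ts_debug('simple', '-user-name@example.com');
```

Any row whose alias is not 'email' (other than a leading 'blank') would
suggest that the address was split the way the report describes.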