On Tue, Jan 03, 2012 at 06:04:23PM +0000, val...@gmail.com wrote:
> The following bug has been logged on the website:
> 
> Bug reference:      6375
> Logged by:          Valentine Gogichashvili
> Email address:      val...@gmail.com
> PostgreSQL version: 9.1.1
> Operating system:   Debian 4.4.5-8
> Description:        
> 
> Hello, 
> 
> default tsearch parser does not recognize all valid email addresses and
> tokenizes them as text, splitting into tokens. 
> 
> For example:
> 
> postgres=# select to_tsquery('simple', 'nor...@email.com' );
>      to_tsquery     
> ────────────────────
>  'nor...@email.com'
> (1 row)
> 
> here it behaves ok;
> 
> postgres=# select to_tsquery('simple', '-still-nor...@email.com' );
>         to_tsquery        
> ──────────────────────────
>  'still-nor...@email.com'
> (1 row)
> 
> here it trims '-' from the beginning of an email. This is not correct, but
> will at least find that email.
> 
> postgres=# select to_tsquery('simple', '-not-normal-with-da...@email.com');
>                                   to_tsquery
> ───────────────────────────────────────────────────────────────────────────────
>  'not-normal-with-dash' & 'not' & 'normal' & 'with' & 'dash' & 'email.com'
> (1 row)
> 
> and this is now a real problem as it leads to finding emails that are not
> the same, but are "super-sets" of that one.
> 
> Other valid email characters that are not handled correctly include at
> least '+' and '.'.

Yep.  :-(

You can see the oddness here:

        test=> SELECT alias, description, token FROM ts_debug('-myn...@gmail.com');
         alias |  description  |      token
        -------+---------------+------------------
         blank | Space symbols | -
         email | Email address | myn...@gmail.com
        (2 rows)
        
        test=> SELECT alias, description, token FROM ts_debug('-myna...@gmail.com');
         alias |  description  |       token
        -------+---------------+-------------------
         blank | Space symbols | -
         email | Email address | myna...@gmail.com
        (2 rows)
        
        test=> SELECT alias, description, token FROM ts_debug('-myna-...@gmail.com');
              alias      |           description           |   token
        -----------------+---------------------------------+-----------
         blank           | Space symbols                   | -
         asciihword      | Hyphenated word, all ASCII      | myna-me
         hword_asciipart | Hyphenated word part, all ASCII | myna
         blank           | Space symbols                   | -
         hword_asciipart | Hyphenated word part, all ASCII | me
         blank           | Space symbols                   | -@
         host            | Host                            | gmail.com
        (7 rows)

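The first and second examples show that the leading dash is split off as
a separate blank token.  The third one shows that a trailing dash in the
user name causes the middle dash to be separated as well, so the input is
no longer recognized as an email address at all.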
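
To check a particular address without reading the whole ts_debug output,
something like the following might help.  It is only a sketch: the
address is a made-up example, not one from this report, and the test
(exactly one token, with alias 'email') is my assumption of what
"recognized correctly" should mean here.

```sql
-- Hypothetical address; substitute the one you want to check.
-- Returns true only if the default parser emits a single token
-- and classifies that token as an email address.
SELECT count(*) = 1 AND bool_and(alias = 'email') AS parsed_as_one_email
FROM ts_debug('first-last-@example.com');
```

Any extra 'blank', 'asciihword', or 'host' rows like the ones above make
the result false.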

This email thread from 2010 has a similar problem:

        http://archives.postgresql.org/pgsql-hackers/2010-10/msg00772.php

What holds back a fix is that it would change existing behavior, and
would break existing indexes carried over by pg_upgrade.

I have added your email to the existing TODO item:

        http://wiki.postgresql.org/wiki/Todo#Text_Search

        Improve handling of dash and plus signs in email address user names, and
        perhaps improve URL parsing
        
            http://archives.postgresql.org/pgsql-hackers/2010-10/msg00772.php
            tsearch does not recognize all valid emails 

-- 
  Bruce Momjian  <br...@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +

-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs
