There is no Analyzer implementation because no one ever made one :). Copy-pasting StandardAnalyzer and substituting UAX29URLEmailTokenizer wherever StandardTokenizer appears should do the trick.
Because people often want to be able to search against *both* whole email addresses and URLs *and* their components, a UAX29URLEmailAnalyzer would ideally have filter(s) to emit email/URL components at the same position as the full term. Or rather, the reverse: each component would have its own position, and the full term would be positioned at the head component's position.

Steve

> -----Original Message-----
> From: Uwe Schindler [mailto:u...@thetaphi.de]
> Sent: Sunday, February 26, 2012 3:51 AM
> To: java-user@lucene.apache.org
> Subject: RE: StandardAnalyzer and Email Addresses
>
> Hi,
>
> If you want a Tokenizer for your Analyzer that supports eMail detection, use UAX29URLEmailTokenizer (see http://goo.gl/evH97). There is no Analyzer available that uses this Tokenizer, but you can define your own one like StandardAnalyzer, but with this class as Tokenizer (not StandardTokenizer).
> I am not sure why there is no Analyzer implementation already available, maybe Steven Rowe knows more.
>
> The trick with the phrase is of lower performance as it uses a PhraseQuery internally, which is more expensive.
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: u...@thetaphi.de
>
> > -----Original Message-----
> > From: Charlie Hubbard [mailto:charlie.hubb...@gmail.com]
> > Sent: Sunday, February 26, 2012 1:51 AM
> > To: java-user@lucene.apache.org
> > Subject: Re: StandardAnalyzer and Email Addresses
> >
> > I am using StandardAnalyzer in 3.1. I'd been previously using 2.4 and from that documentation it states email address are recognized:
> >
> > http://javasourcecode.org/html/open-source/lucene/lucene-2.4.0/org/apache/lucene/analysis/standard/StandardTokenizer.html
> >
> > It looks like this was changed in 3.x according to this doc now:
> >
> > http://lucene.apache.org/core/old_versioned_docs/versions/3_1_0/api/all/org/apache/lucene/analysis/standard/ClassicTokenizer.html
> >
> > I think I've found a work around in that if I search for email address like:
> >
> > to:"charlie.hubb...@gmail.com"
> >
> > Then it will look for the full email address. What is the draw back of using the quoted version? Is the performance worse doing this? How much worse? I'm not sure how quoted searches are implemented so it's hard for me to gauge what the draw back is.
> >
> > Thanks
> > Charlie
> >
> > On Mon, Feb 20, 2012 at 12:23 PM, Ian Lea <ian....@gmail.com> wrote:
> >
> > > Are you using StandardAnalyzer in 3.1+? You may want to use ClassicAnalyzer instead. I can't see where in the 3.5 javadocs it says that email addresses are recognized, but it does sound vaguely familiar.
> > >
> > > --
> > > Ian.
> > >
> > > On Thu, Feb 16, 2012 at 5:18 PM, Charlie Hubbard <charlie.hubb...@gmail.com> wrote:
> > > > This is a pretty simple question to answer, but I have customers asking me how this is suppose to work and I'm having trouble explaining it. I have an app that indexes emails so there are plenty of email addresses in there. Reading the StandardAnalyzer javadoc it says it "recognizes" email addresses when it is creating the token list. What tokens will it produce exactly? What I'm seeing when I perform searches is the email address looks like its being tokenized into its parts. Searching by an email address like:
> > > >
> > > > to:charlie.hubb...@gmail.com
> > > >
> > > > pulls back more hits that haven't been addressed to charlie.hubb...@gmail.com. Other messages with gmail.com in them are returned. If I use the following:
> > > >
> > > > to:charlie.hubbard
> > > >
> > > > in them. It also finds gmail.com, and other domains. And I can search for strings like
> > > >
> > > > to:"charlie.hubb...@gmail.com"
> > > >
> > > > it will pull back only emails addressed to that address. Further proof it seems to token the parts of an email is if I search for a very specific email address like:
> > > >
> > > > to:"charlie.hubbard+sometag"
> > > >
> > > > That will pull back only emails addressed to that email, but it's not a full email address. Which leads me to think it will parse parts of the email addresses. Can someone explain this a little more?
> > > >
> > > > I'm having trouble with some emails that can't be pulled back using the username like searching for to:chubbard where the email was addressed to chubb...@somedomain.com, but it fails to show up in the search results. I can't explain why that's happening. In all of my tests I can't reproduce it and I think I might have to reindex everything because this was an index built with 2.4 and I upgraded to 3.1 so I'm worried it might be corrupted.
> > > >
> > > > Thoughts?
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > > For additional commands, e-mail: java-user-h...@lucene.apache.org
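
For anyone reading this thread later, the positioning scheme in the reply at the top (each component at its own position, with the full term placed at the head component's position) can be sketched without Lucene. This is only an illustration under stated assumptions: the `EmailPositionSketch` class, its `Token` type, and the `expand` method are hypothetical names, not Lucene API, and the choice to treat the local part and the domain (a split at '@' only) as the "components" is an assumption. The position increment follows the usual convention that 0 means "same position as the previous token".

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, Lucene-free sketch of the token layout discussed in this thread. */
class EmailPositionSketch {

    /** A term plus its position increment (0 = same position as the previous token). */
    static final class Token {
        final String term;
        final int positionIncrement;

        Token(String term, int positionIncrement) {
            this.term = term;
            this.positionIncrement = positionIncrement;
        }

        @Override
        public String toString() {
            return term + "(+" + positionIncrement + ")";
        }
    }

    /**
     * Assumption: the components of an address are its local part and its domain,
     * i.e. the address is split at '@' only.
     */
    static List<Token> expand(String email) {
        List<Token> out = new ArrayList<>();
        String[] components = email.split("@");
        for (int i = 0; i < components.length; i++) {
            // Every component advances the position by one.
            out.add(new Token(components[i], 1));
            if (i == 0 && components.length > 1) {
                // The full address is stacked on the head component's position.
                out.add(new Token(email, 0));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        for (Token t : expand("user@example.com")) {
            System.out.println(t);
        }
    }
}
```

With a layout like this, a single-term lookup on the full address and a lookup on either component should both be able to match, which would avoid relying on the quoted (PhraseQuery) form for exact-address searches. A real implementation would be a TokenFilter placed after UAX29URLEmailTokenizer; that part is not shown here.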