See PerFieldAnalyzerWrapper, which is itself an Analyzer: <http://lucene.apache.org/core/4_4_0/analyzers-common/org/apache/lucene/analysis/miscellaneous/PerFieldAnalyzerWrapper.html>

Steve

On Jul 23, 2014, at 6:00 PM, Milind <mili...@gmail.com> wrote:

> Thanks Steve, that helped. I had forgotten about the URL part of the
> Analyzer since I was using it for the email field. I need to see if it's
> possible to use different analyzers for different fields. If so, then I'll
> use the UAX29URLEmailAnalyzer only for the email field and use
> StandardAnalyzer for everything else. I'm not sure if that would work
> though. Since I'm using the MultiFieldQueryParser and that takes in a
> single Analyzer.
>
>
> On Wed, Jul 23, 2014 at 3:29 PM, Steve Rowe <sar...@gmail.com> wrote:
>
>> Hi Milind,
>>
>> On Jul 23, 2014, at 1:49 PM, Milind <mili...@gmail.com> wrote:
>>
>>> The UAX29URLEmailAnalyzer analyzer in Lucene 4.4 is not working as I
>>> expected. Is this a bug in the analyzer or is this working as designed?
>>>
>>> If I use the UAX29URLEmailAnalyzer, it tokenizes the following strings as
>>> input=bwl-esl2.gbr.hp.com
>>> output=[bwl-esl2.gbr.hp.com]
>>
>> This is the correct tokenization of a valid domain name with token type
>> <URL>: the hyphen ('-') is an allowed character in DNS names. From RFC
>> 1035 Domain Implementation and Specification <
>> http://www.ietf.org/rfc/rfc1035.txt>:
>>
>> <domain> ::= <subdomain> | " "
>> <subdomain> ::= <label> | <subdomain> "." <label>
>> <label> ::= <letter> [ [ <ldh-str> ] <let-dig> ]
>> <ldh-str> ::= <let-dig-hyp> | <let-dig-hyp> <ldh-str>
>> <let-dig-hyp> ::= <let-dig> | "-"
>> <let-dig> ::= <letter> | <digit>
>>
>> <letter> ::= any one of the 52 alphabetic characters A through Z in
>> upper case and a through z in lower case
>>
>> <digit> ::= any one of the ten digits 0 through 9
>>
>> Note that while upper and lower case letters are allowed in domain
>> names, no significance is attached to the case. That is, two names
>> with the same spelling but different case are to be treated as if
>> identical.
>>
>> The labels must follow the rules for ARPANET host names. They must
>> start with a letter, end with a letter or digit, and have as interior
>> characters only letters, digits, and hyphen. There are also some
>> restrictions on the length. Labels must be 63 characters or less.
>>
>> From the Lucene grammar UAX29URLEmailTokenizerImpl.jflex:
>>
>> DomainLabel = [A-Za-z0-9] ([-A-Za-z0-9]* [A-Za-z0-9])?
>> DomainNameStrict = {DomainLabel} ("." {DomainLabel})* {ASCIITLD}
>> URIhostStrict = ("[" {IPv6Address} "]") | {IPv4Address} |
>> {DomainNameStrict}
>> […]
>> {URIhostStrict} / [^-\w] { yybegin(YYINITIAL); return URL_TYPE; }
>>
>>> input=esl2.gbr
>>> output=[esl2.gb][r]
>>
>> This is a bug, which was fixed in Lucene 4.7 - see <
>> https://issues.apache.org/jira/browse/LUCENE-5391>
>>
>> Steve
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>
>>
>
>
> --
> Regards
> Milind
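
As a rough illustration of the label rules quoted in the thread above, here is a minimal Java sketch that uses only java.util.regex. The class and method names (DomainLabelCheck, isRfc1035Label, isJflexDomainLabel, allLabelsMatch) are made up for this example and are not part of Lucene. It checks individual labels only; it does not reproduce the {ASCIITLD} check, the lookahead, or any other part of the tokenizer grammar.

```java
import java.util.regex.Pattern;

class DomainLabelCheck {
    // RFC 1035: <label> ::= <letter> [ [ <ldh-str> ] <let-dig> ]
    // A label starts with a letter, ends with a letter or digit, and may
    // contain hyphens only in the interior.
    private static final Pattern RFC1035_LABEL =
            Pattern.compile("[A-Za-z](?:[-A-Za-z0-9]*[A-Za-z0-9])?");

    // Same shape as the DomainLabel macro quoted from
    // UAX29URLEmailTokenizerImpl.jflex, which also allows a leading digit.
    private static final Pattern JFLEX_DOMAIN_LABEL =
            Pattern.compile("[A-Za-z0-9](?:[-A-Za-z0-9]*[A-Za-z0-9])?");

    // True if s is a label under the RFC 1035 grammar, including the
    // 63-character length limit.
    static boolean isRfc1035Label(String s) {
        return s.length() <= 63 && RFC1035_LABEL.matcher(s).matches();
    }

    // True if s matches the DomainLabel pattern as quoted in the thread.
    static boolean isJflexDomainLabel(String s) {
        return JFLEX_DOMAIN_LABEL.matcher(s).matches();
    }

    // True if every dot-separated label of host matches DomainLabel.
    // The -1 limit keeps empty labels (e.g. from "a..b") so they are rejected.
    static boolean allLabelsMatch(String host) {
        for (String label : host.split("\\.", -1)) {
            if (!isJflexDomainLabel(label)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // The hostname from the original question: the hyphen sits in the
        // interior of the first label, so each label is accepted.
        System.out.println(allLabelsMatch("bwl-esl2.gbr.hp.com"));
        // A label ending in a hyphen is rejected by both patterns.
        System.out.println(allLabelsMatch("bwl-.gbr"));
    }
}
```

The two patterns differ only in the first character: the quoted jflex macro accepts a leading digit, while the RFC 1035 grammar requires a letter.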