Re: Incorrect tokenizing in the UAX29URLEmailAnalyzer analyzer?

Steve Rowe Wed, 23 Jul 2014 15:13:33 -0700

See PerFieldAnalyzerWrapper, which is itself an Analyzer: 
<http://lucene.apache.org/core/4_4_0/analyzers-common/org/apache/lucene/analysis/miscellaneous/PerFieldAnalyzerWrapper.html>
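
A minimal sketch of how that could be wired up against the Lucene 4.4 API, assuming the lucene-core, lucene-analyzers-common and lucene-queryparser jars are on the classpath. The field names ("email", "title", "body") and the class name are hypothetical, chosen only for illustration. Because the wrapper is itself an Analyzer, it can be passed as the single Analyzer that MultiFieldQueryParser expects:

```java
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.miscellaneous.PerFieldAnalyzerWrapper;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.standard.UAX29URLEmailAnalyzer;
import org.apache.lucene.queryparser.classic.MultiFieldQueryParser;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.search.Query;
import org.apache.lucene.util.Version;

public class PerFieldAnalyzerSketch {
    // Hypothetical field names; substitute the fields of the real index.
    static final String[] FIELDS = {"email", "title", "body"};

    // UAX29URLEmailAnalyzer for the "email" field only,
    // StandardAnalyzer as the default for every other field.
    static Analyzer buildAnalyzer() {
        Map<String, Analyzer> perField = new HashMap<String, Analyzer>();
        perField.put("email", new UAX29URLEmailAnalyzer(Version.LUCENE_44));
        return new PerFieldAnalyzerWrapper(
                new StandardAnalyzer(Version.LUCENE_44), perField);
    }

    // The wrapper is handed to the parser as its one Analyzer; the parser
    // then analyzes each field's terms with that field's analyzer.
    static Query parse(String queryText) throws ParseException {
        MultiFieldQueryParser parser =
                new MultiFieldQueryParser(Version.LUCENE_44, FIELDS, buildAnalyzer());
        return parser.parse(queryText);
    }

    public static void main(String[] args) throws ParseException {
        System.out.println(parse("someone@example.com"));
    }
}
```

The same wrapper would normally also be used at index time, so that each field is analyzed the same way when indexing and when searching.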


Steve

On Jul 23, 2014, at 6:00 PM, Milind <mili...@gmail.com> wrote:

> Thanks Steve, that helped.  I had forgotten about the URL part of the
> Analyzer since I was using it for the email field.  I need to see if it's
> possible to use different analyzers for different fields.  If so, then I'll
> use the UAX29URLEmailAnalyzer only for the email field and use
> StandardAnalyzer for everything else.  I'm not sure that would work,
> though, since I'm using the MultiFieldQueryParser, which takes a single
> Analyzer.
> 
> 
> On Wed, Jul 23, 2014 at 3:29 PM, Steve Rowe <sar...@gmail.com> wrote:
> 
>> Hi Milind,
>> 
>> On Jul 23, 2014, at 1:49 PM, Milind <mili...@gmail.com> wrote:
>> 
>>> The UAX29URLEmailAnalyzer analyzer in Lucene 4.4 is not working as I
>>> expected.  Is this a bug in the analyzer or is this working as designed?
>>> 
>>> If I use the UAX29URLEmailAnalyzer, it tokenizes the following strings as
>>>   input=bwl-esl2.gbr.hp.com
>>>   output=[bwl-esl2.gbr.hp.com]
>> 
>> This is the correct tokenization of a valid domain name with token type
>> <URL>: the hyphen ('-') is an allowed character in DNS names.  From RFC
>> 1035, "Domain Names - Implementation and Specification" <
>> http://www.ietf.org/rfc/rfc1035.txt>:
>> 
>>    <domain> ::= <subdomain> | " "
>>    <subdomain> ::= <label> | <subdomain> "." <label>
>>    <label> ::= <letter> [ [ <ldh-str> ] <let-dig> ]
>>    <ldh-str> ::= <let-dig-hyp> | <let-dig-hyp> <ldh-str>
>>    <let-dig-hyp> ::= <let-dig> | "-"
>>    <let-dig> ::= <letter> | <digit>
>> 
>>    <letter> ::= any one of the 52 alphabetic characters A through Z in
>>    upper case and a through z in lower case
>> 
>>    <digit> ::= any one of the ten digits 0 through 9
>> 
>>    Note that while upper and lower case letters are allowed in domain
>>    names, no significance is attached to the case.  That is, two names
>>    with the same spelling but different case are to be treated as if
>>    identical.
>> 
>>    The labels must follow the rules for ARPANET host names.  They must
>>    start with a letter, end with a letter or digit, and have as interior
>>    characters only letters, digits, and hyphen.  There are also some
>>    restrictions on the length.  Labels must be 63 characters or less.
>> 
>> From the Lucene grammar UAX29URLEmailTokenizerImpl.jflex:
>> 
>>    DomainLabel = [A-Za-z0-9] ([-A-Za-z0-9]* [A-Za-z0-9])?
>>    DomainNameStrict = {DomainLabel} ("." {DomainLabel})* {ASCIITLD}
>>    URIhostStrict = ("[" {IPv6Address} "]") | {IPv4Address} | {DomainNameStrict}
>>    […]
>>    {URIhostStrict} / [^-\w] { yybegin(YYINITIAL); return URL_TYPE; }
>> 
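
The label rules quoted above can be tried out with plain java.util.regex. This is only a sketch: the two patterns are hand translations of the quoted JFlex DomainLabel rule and the RFC 1035 <label> production, the class and method names are made up for illustration, and nothing here calls the Lucene tokenizer or checks the TLD part of DomainNameStrict. It does show that a hyphen is accepted inside a label but not at either end, and that the JFlex rule, unlike the RFC production, lets a label start with a digit:

```java
import java.util.regex.Pattern;

public class DomainLabelCheck {
    // Hand translation of: DomainLabel = [A-Za-z0-9] ([-A-Za-z0-9]* [A-Za-z0-9])?
    static final Pattern LUCENE_LABEL =
            Pattern.compile("[A-Za-z0-9](?:[-A-Za-z0-9]*[A-Za-z0-9])?");

    // Hand translation of: <label> ::= <letter> [ [ <ldh-str> ] <let-dig> ]
    static final Pattern RFC1035_LABEL =
            Pattern.compile("[A-Za-z](?:[-A-Za-z0-9]*[A-Za-z0-9])?");

    // True if every dot-separated label matches the JFlex DomainLabel rule.
    static boolean matchesLuceneLabels(String host) {
        // limit -1 keeps trailing empty labels, so "a." is rejected
        for (String label : host.split("\\.", -1)) {
            if (!LUCENE_LABEL.matcher(label).matches()) {
                return false;
            }
        }
        return true;
    }

    // True if every label matches the RFC 1035 production and is at most
    // 63 characters long (the length limit comes from the RFC text).
    static boolean matchesRfc1035Labels(String host) {
        for (String label : host.split("\\.", -1)) {
            if (label.length() > 63 || !RFC1035_LABEL.matcher(label).matches()) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(matchesLuceneLabels("bwl-esl2.gbr.hp.com"));  // true
        System.out.println(matchesRfc1035Labels("bwl-esl2.gbr.hp.com")); // true
        System.out.println(matchesLuceneLabels("2esl.gbr"));             // true
        System.out.println(matchesRfc1035Labels("2esl.gbr"));            // false
        System.out.println(matchesLuceneLabels("-esl2.gbr"));            // false
    }
}
```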
>>>   input=esl2.gbr
>>>   output=[esl2.gb][r]
>> 
>> This is a bug, which was fixed in Lucene 4.7 - see <
>> https://issues.apache.org/jira/browse/LUCENE-5391>
>> 
>> Steve
>> 
>> 
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>> 
>> 
> 
> 
> -- 
> Regards
> Milind


