Hi Milind,

On Jul 23, 2014, at 1:49 PM, Milind <mili...@gmail.com> wrote:

> The UAX29URLEmailAnalyzer analyzer in Lucene 4.4 is not working as I
> expected.  Is this a bug in the analyzer or is this working as designed?
> 
> If I use the UAX29URLEmailAnalyzer, it tokenizes the following strings as
>    input=bwl-esl2.gbr.hp.com
>    output=[bwl-esl2.gbr.hp.com]

This is the correct tokenization of a valid domain name, with token type
<URL>: the hyphen ('-') is an allowed character in DNS names.  From RFC 1035,
"Domain Names - Implementation and Specification"
<http://www.ietf.org/rfc/rfc1035.txt>:

    <domain> ::= <subdomain> | " "
    <subdomain> ::= <label> | <subdomain> "." <label>
    <label> ::= <letter> [ [ <ldh-str> ] <let-dig> ]
    <ldh-str> ::= <let-dig-hyp> | <let-dig-hyp> <ldh-str>
    <let-dig-hyp> ::= <let-dig> | "-"
    <let-dig> ::= <letter> | <digit>

    <letter> ::= any one of the 52 alphabetic characters A through Z in
    upper case and a through z in lower case

    <digit> ::= any one of the ten digits 0 through 9

    Note that while upper and lower case letters are allowed in domain
    names, no significance is attached to the case.  That is, two names with
    the same spelling but different case are to be treated as if identical.

    The labels must follow the rules for ARPANET host names.  They must
    start with a letter, end with a letter or digit, and have as interior
    characters only letters, digits, and hyphen.  There are also some
    restrictions on the length.  Labels must be 63 characters or less.

From the Lucene grammar UAX29URLEmailTokenizerImpl.jflex:

    DomainLabel = [A-Za-z0-9] ([-A-Za-z0-9]* [A-Za-z0-9])?
    DomainNameStrict = {DomainLabel} ("." {DomainLabel})* {ASCIITLD}
    URIhostStrict = ("[" {IPv6Address} "]") | {IPv4Address} | {DomainNameStrict}
    […]
    {URIhostStrict} / [^-\w] { yybegin(YYINITIAL); return URL_TYPE; }
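
If you want to check the label rules yourself, here is a minimal sketch that
transliterates the DomainLabel macro above, and the RFC 1035 <label> rule,
into java.util.regex patterns.  The class and method names are made up for
illustration and are not part of Lucene, and the sketch does not check the
{ASCIITLD} part of DomainNameStrict, so treat it as an approximation of the
grammar rather than the tokenizer itself:

```java
import java.util.regex.Pattern;

// Hypothetical helper, not a Lucene class: checks the dot-separated labels of a host name.
class DomainLabelSketch {
    // Transliteration of the JFlex macro
    //   DomainLabel = [A-Za-z0-9] ([-A-Za-z0-9]* [A-Za-z0-9])?
    static final Pattern LUCENE_LABEL =
        Pattern.compile("[A-Za-z0-9]([-A-Za-z0-9]*[A-Za-z0-9])?");

    // RFC 1035 <label>: starts with a letter, ends with a letter or digit,
    // hyphens only in the interior, at most 63 characters (1 + 61 + 1).
    static final Pattern RFC1035_LABEL =
        Pattern.compile("[A-Za-z]([-A-Za-z0-9]{0,61}[A-Za-z0-9])?");

    // True if every dot-separated label of host matches the given pattern.
    // The -1 limit keeps empty trailing labels, so "a." and "a..b" are rejected.
    static boolean allLabelsMatch(String host, Pattern label) {
        for (String part : host.split("\\.", -1)) {
            if (!label.matcher(part).matches()) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String host = "bwl-esl2.gbr.hp.com";
        System.out.println(allLabelsMatch(host, LUCENE_LABEL));   // true
        System.out.println(allLabelsMatch(host, RFC1035_LABEL));  // true
        // A leading hyphen is rejected by both rules.
        System.out.println(allLabelsMatch("-esl2.gbr", LUCENE_LABEL));  // false
    }
}
```

Note one difference the sketch makes visible: the Lucene macro accepts a label
that starts with a digit, while the RFC 1035 rule requires a leading letter.
Interior hyphens, as in "bwl-esl2", are accepted by both.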

>    input=esl2.gbr
>    output=[esl2.gb][r]

This is a bug, which was fixed in Lucene 4.7 - see 
<https://issues.apache.org/jira/browse/LUCENE-5391>

Steve


