Hi Milind,

On Jul 23, 2014, at 1:49 PM, Milind <mili...@gmail.com> wrote:
> The UAX29URLEmailAnalyzer analyzer in Lucene 4.4 is not working as I
> expected. Is this a bug in the analyzer or is this working as designed?
>
> If I use the UAX29URLEmailAnalyzer, it tokenizes the following strings as
> input=bwl-esl2.gbr.hp.com
> output=[bwl-esl2.gbr.hp.com]

This is the correct tokenization of a valid domain name, with token type <URL>: the hyphen ('-') is an allowed character in DNS names.

From RFC 1035, "Domain Names - Implementation and Specification" <http://www.ietf.org/rfc/rfc1035.txt>:

    <domain> ::= <subdomain> | " "

    <subdomain> ::= <label> | <subdomain> "." <label>

    <label> ::= <letter> [ [ <ldh-str> ] <let-dig> ]

    <ldh-str> ::= <let-dig-hyp> | <let-dig-hyp> <ldh-str>

    <let-dig-hyp> ::= <let-dig> | "-"

    <let-dig> ::= <letter> | <digit>

    <letter> ::= any one of the 52 alphabetic characters A through Z in
    upper case and a through z in lower case

    <digit> ::= any one of the ten digits 0 through 9

    Note that while upper and lower case letters are allowed in domain
    names, no significance is attached to the case. That is, two names with
    the same spelling but different case are to be treated as if identical.

    The labels must follow the rules for ARPANET host names. They must
    start with a letter, end with a letter or digit, and have as interior
    characters only letters, digits, and hyphen. There are also some
    restrictions on the length. Labels must be 63 characters or less.

From the Lucene grammar UAX29URLEmailTokenizerImpl.jflex:

    DomainLabel = [A-Za-z0-9] ([-A-Za-z0-9]* [A-Za-z0-9])?
    DomainNameStrict = {DomainLabel} ("." {DomainLabel})* {ASCIITLD}
    URIhostStrict = ("[" {IPv6Address} "]") | {IPv4Address} | {DomainNameStrict}
    […]
    {URIhostStrict} / [^-\w] { yybegin(YYINITIAL); return URL_TYPE; }

> input=esl2.gbr
> output=[esl2.gb][r]

This is a bug, which was fixed in Lucene 4.7 - see <https://issues.apache.org/jira/browse/LUCENE-5391>

Steve
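
P.S. To see why the hyphenated hostname stays in one piece, the DomainLabel rule quoted above can be approximated with a plain java.util.regex pattern. This is only a rough sketch and not the Lucene tokenizer: the class and method names (DomainLabelSketch, isValidLabel, isValidDomainName) are made up for illustration, the 63-character limit is taken from the RFC text rather than from the grammar, and the {ASCIITLD} check and the trailing-context rule are left out.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene: approximates the quoted
// DomainLabel rule with java.util.regex.
public class DomainLabelSketch {

    // Mirrors: DomainLabel = [A-Za-z0-9] ([-A-Za-z0-9]* [A-Za-z0-9])?
    private static final Pattern LABEL =
            Pattern.compile("[A-Za-z0-9](?:[-A-Za-z0-9]*[A-Za-z0-9])?");

    // A label must match the pattern in full; the 63-character cap is an
    // assumption borrowed from the RFC 1035 passage, not from the grammar.
    static boolean isValidLabel(String label) {
        return label.length() <= 63 && LABEL.matcher(label).matches();
    }

    // Splits on '.' (keeping empty pieces) and checks every label.
    // Unlike DomainNameStrict, this does not require a known TLD at the end.
    static boolean isValidDomainName(String name) {
        for (String label : name.split("\\.", -1)) {
            if (!isValidLabel(label)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Interior hyphens are fine, so the whole name is label-valid.
        System.out.println(isValidDomainName("bwl-esl2.gbr.hp.com")); // true
        // A label may not start or end with a hyphen.
        System.out.println(isValidLabel("-esl2")); // false
        System.out.println(isValidLabel("esl2-")); // false
    }
}
```

Note that the Lucene rule lets a label start with a digit, which is slightly more permissive than the RFC 1035 <label> production quoted above.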