Re: Incorrect tokenizing in the UAX29URLEmailAnalyzer analyzer?

Steve Rowe Wed, 23 Jul 2014 17:21:19 -0700

On Jul 23, 2014, at 7:43 PM, Milind <mili...@gmail.com> wrote:

>>>   input=esl2.gbr
>>>   output=[esl2.gb][r]
>>> 
>>> This is a bug, which was fixed in Lucene 4.7 - see
>>> <https://issues.apache.org/jira/browse/LUCENE-5391>
> 
> BTW, I changed the POM dependency to 4.7.1, but I'm still seeing the same
> output.  I can't go beyond 4.7 since it seems 4.8 onwards, Lucene is being
> compiled against Java 7 and I'm still on Java 6.  Hopefully, this will be
> a non-issue with PerFieldAnalyzerWrapper.  But I just wanted to point that
> out.


I checked out the source code for the 4.7.1 release and added a test for 
“esl2.gbr” to TestUAX29URLEmailAnalyzer.testNoSchemeURLs() 
<http://svn.apache.org/viewvc/lucene/dev/tags/lucene_solr_4_7_1/lucene/analysis/common/src/test/org/apache/lucene/analysis/core/TestUAX29URLEmailAnalyzer.java?view=markup#l262>:

    BaseTokenStreamTestCase.assertAnalyzesTo
        (a, "esl2.gbr", new String[] { "esl2",     "gbr" },
            new String[] { "<ALPHANUM>", "<ALPHANUM>" });

This passes: the string is broken up into “esl2” and “gbr” tokens, both with 
type <ALPHANUM>.

Are you sure that you’re running against the 4.7.1 version for all Lucene 
dependencies (including lucene-analyzers-common)?
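
One way to check this is to ask the JVM where each Lucene class was actually 
loaded from. The following is only a rough sketch: the class name 
LuceneClasspathCheck and the describe() helper are made up for illustration, 
and it assumes that the jars declare an Implementation-Version in their 
manifests (if one does not, the jar location is still printed). It uses 
reflection only, so it compiles without Lucene on the compile classpath.

```java
import java.security.CodeSource;

public class LuceneClasspathCheck {

    /**
     * Hypothetical helper: reports the jar (or directory) a class was loaded
     * from and the Implementation-Version declared for its package, if any.
     */
    public static String describe(String className) {
        try {
            Class<?> c = Class.forName(className);
            CodeSource src = c.getProtectionDomain().getCodeSource();
            String location = (src == null || src.getLocation() == null)
                    ? "(bootstrap or unknown)"
                    : src.getLocation().toString();
            Package p = c.getPackage();
            String version = (p == null) ? null : p.getImplementationVersion();
            return className + " -> " + location
                    + " (Implementation-Version: "
                    + (version == null ? "not declared" : version) + ")";
        } catch (ClassNotFoundException e) {
            return className + " -> not on classpath";
        }
    }

    public static void main(String[] args) {
        String[] names = {
            // lives in lucene-core
            "org.apache.lucene.util.Version",
            // lives in lucene-analyzers-common
            "org.apache.lucene.analysis.standard.UAX29URLEmailAnalyzer"
        };
        for (String name : names) {
            System.out.println(describe(name));
        }
    }
}
```

If it is run with the same classpath as the application and the two lines 
point at jars from different releases, an older lucene-analyzers-common is 
probably still being pulled in, for example as a transitive dependency.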

Also, you need to change the matchVersion argument you pass to the analyzer’s 
constructor to match the version you’re using (for 4.7.1, Version.LUCENE_47). 
Otherwise the analyzer keeps reproducing the tokenization behavior of whichever 
older version matchVersion names, including this bug.

Steve
 
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
