On one of my other open-source projects (SolrTextTagger) I have a test that
deliberately exercises the effect of a very long token with the
StandardTokenizer, and that project is in turn tested against a wide matrix
of Lucene/Solr versions.  Before Lucene 4.9, a token that exceeded
maxTokenLength (255 by default) created a skipped position: basically a
pseudo-stop-word.  Since 4.9 that no longer happens; the JFlex-generated
scanner never reports a token longer than 255 characters.  I checked our
code coverage and, sure enough, the "skippedPositions++" line is never
reached:

https://builds.apache.org/job/Lucene-Solr-Clover-trunk/lastSuccessfulBuild/clover-report/org/apache/lucene/analysis/standard/StandardTokenizer.html?line=167#src-167
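For anyone unfamiliar with the old behavior, here's a minimal,
self-contained sketch of what "skipped position = pseudo-stop-word" means.
This is not Lucene code; the tokenizer, the Token record, and the
maxTokenLength parameter are simplified stand-ins for StandardTokenizer and
its PositionIncrementAttribute.  The point is that dropping an over-long
token used to leave a hole, visible as a position increment greater than 1
on the next token:

```java
import java.util.ArrayList;
import java.util.List;

public class SkippedPositionDemo {
    // A token paired with its position increment, loosely mimicking
    // Lucene's PositionIncrementAttribute.
    record Token(String text, int posInc) {}

    // Split on whitespace; drop any token longer than maxTokenLength,
    // counting the drop as a skipped position (the pre-4.9 behavior).
    static List<Token> tokenize(String input, int maxTokenLength) {
        List<Token> out = new ArrayList<>();
        int posInc = 1;
        for (String word : input.split("\\s+")) {
            if (word.length() > maxTokenLength) {
                posInc++;  // the "skippedPositions++" equivalent
            } else {
                out.add(new Token(word, posInc));
                posInc = 1;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String longWord = "x".repeat(300);  // exceeds the 255 default
        for (Token t : tokenize("foo " + longWord + " bar", 255)) {
            System.out.println(t.text() + " posInc=" + t.posInc());
        }
        // "bar" gets posInc=2: the over-long token left a hole,
        // exactly as a removed stop word would.
    }
}
```

Post-4.9, the scanner never surfaces the over-long token at all, so the
branch that would bump the increment is dead code, which is what the Clover
report above shows.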

Any thoughts on this?  Steve?

~ David Smiley
Freelance Apache Lucene/Solr Search Consultant/Developer
http://www.linkedin.com/in/davidwsmiley