I should add that this is Lucene 4.10.4, but I have also checked it on version 5.2.1 and got the same result.
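To make the effect concrete without needing Lucene on the classpath, here is a minimal plain-Java sketch of what the indexed terms in my report (quoted below) suggest is happening. It only imitates the observed result and is not Lucene's implementation: MaxLengthSketch, splitByMaxLength and repeat are hypothetical names, and cutting a very long word into more than two chunks is my assumption. 255 is the default max token length of StandardAnalyzer.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java illustration only; this is NOT Lucene code.
class MaxLengthSketch {

    // Hypothetical helper: cuts a word into consecutive chunks of at most
    // maxLength characters, imitating the terms observed in the index.
    static List<String> splitByMaxLength(String word, int maxLength) {
        if (maxLength <= 0) {
            throw new IllegalArgumentException("maxLength must be positive");
        }
        List<String> chunks = new ArrayList<>();
        for (int start = 0; start < word.length(); start += maxLength) {
            int end = Math.min(start + maxLength, word.length());
            chunks.add(word.substring(start, end));
        }
        return chunks;
    }

    // Builds a string consisting of n copies of the character c.
    static String repeat(char c, int n) {
        StringBuilder sb = new StringBuilder(n);
        for (int i = 0; i < n; i++) {
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // A 300-character word with a limit of 255 yields two chunks.
        for (String chunk : splitByMaxLength(repeat('a', 300), 255)) {
            System.out.println(chunk.length()); // prints 255, then 45
        }
    }
}
```

This matches the term lengths in the output: a TermQuery for the full 300-character word finds nothing, while the 255- and 45-character pieces each match.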
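Regards
Piotr

On Mon, Jul 20, 2015 at 9:44 AM, Piotr Idzikowski <piotridzikow...@gmail.com> wrote:

> Hello Steve,
> It is always a pleasure to help you develop such a great lib.
> Talking about StandardTokenizer and setMaxTokenLength, I think I have
> found another problem.
> It looks like when a word is longer than the max length, the analyzer adds two
> tokens -> word.substring(0,maxLength) and word.substring(maxLength)
>
> Look at this code (sorry, it is quite ugly):
> // Imports as needed for Lucene 4.10.4.
> import java.io.IOException;
>
> import org.apache.lucene.analysis.standard.StandardAnalyzer;
> import org.apache.lucene.document.Document;
> import org.apache.lucene.document.Field;
> import org.apache.lucene.document.TextField;
> import org.apache.lucene.index.DirectoryReader;
> import org.apache.lucene.index.Fields;
> import org.apache.lucene.index.IndexReader;
> import org.apache.lucene.index.IndexWriter;
> import org.apache.lucene.index.IndexWriterConfig;
> import org.apache.lucene.index.MultiFields;
> import org.apache.lucene.index.Term;
> import org.apache.lucene.index.Terms;
> import org.apache.lucene.index.TermsEnum;
> import org.apache.lucene.search.IndexSearcher;
> import org.apache.lucene.search.TermQuery;
> import org.apache.lucene.search.TopDocs;
> import org.apache.lucene.store.RAMDirectory;
> import org.apache.lucene.util.BytesRef;
> import org.apache.lucene.util.Version;
>
> public class TestMaxLength {
>
>   public static void main(String[] args) throws IOException {
>     // Index one document whose only word is 300 'a' characters long.
>     String str = getString(300);
>     IndexWriterConfig iwc = new IndexWriterConfig(Version.LATEST, new StandardAnalyzer());
>     final RAMDirectory dir = new RAMDirectory();
>     final IndexWriter writer = new IndexWriter(dir, iwc);
>     Document doc = new Document();
>     doc.add(new TextField("", str, Field.Store.NO));
>     writer.addDocument(doc);
>
>     // Search for the full word, then for its first 255 and last 45 characters.
>     IndexReader reader = DirectoryReader.open(writer, false);
>     IndexSearcher indexSearcher = new IndexSearcher(reader);
>     TopDocs td = indexSearcher.search(new TermQuery(new Term("", str)), 1);
>     System.out.println("300*a: " + td.totalHits);
>     td = indexSearcher.search(new TermQuery(new Term("", getString(255))), 1);
>     System.out.println("255*a: " + td.totalHits);
>     td = indexSearcher.search(new TermQuery(new Term("", getString(45))), 1);
>     System.out.println("45*a: " + td.totalHits);
>
>     // Dump every indexed term together with its length.
>     System.out.println("\nTERMS");
>     Fields fields = MultiFields.getFields(reader);
>     for (String field : fields) {
>       Terms terms = fields.terms(field);
>       TermsEnum termsEnum = terms.iterator(null);
>       BytesRef t;
>       while ((t = termsEnum.next()) != null) {
>         final String keyword = t.utf8ToString();
>         System.out.println(keyword.length() + ": " + keyword);
>       }
>     }
>   }
>
>   // Returns a string of n 'a' characters.
>   public static final String getString(int n) {
>     StringBuilder sb = new StringBuilder();
>     for (int i = 0; i < n; i++) {
>       sb.append('a');
>     }
>     return sb.toString();
>   }
> }
>
>
> And here is the output:
> 300*a: 0
> 255*a: 1
> 45*a: 1
>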
> TERMS
> 45: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
> 255: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
>
> Regards
> Piotr
>
>
> On Fri, Jul 17, 2015 at 4:40 PM, Steve Rowe <sar...@gmail.com> wrote:
>
>> Hi Piotr,
>>
>> Thanks for reporting!
>>
>> See https://issues.apache.org/jira/browse/LUCENE-6682
>>
>> Steve
>> www.lucidworks.com
>>
>>
>> > On Jul 16, 2015, at 4:47 AM, Piotr Idzikowski <piotridzikow...@gmail.com> wrote:
>> >
>> > Hello.
>> > I am developing my own analyzer based on StandardAnalyzer.
>> > I realized that tokenizer.setMaxTokenLength is called many times.
>> >
>> > protected TokenStreamComponents createComponents(final String fieldName, final Reader reader) {
>> >   final StandardTokenizer src = new StandardTokenizer(getVersion(), reader);
>> >   src.setMaxTokenLength(maxTokenLength);
>> >   TokenStream tok = new StandardFilter(getVersion(), src);
>> >   tok = new LowerCaseFilter(getVersion(), tok);
>> >   tok = new StopFilter(getVersion(), tok, stopwords);
>> >   return new TokenStreamComponents(src, tok) {
>> >     @Override
>> >     protected void setReader(final Reader reader) throws IOException {
>> >       src.setMaxTokenLength(StandardAnalyzer.this.maxTokenLength);
>> >       super.setReader(reader);
>> >     }
>> >   };
>> > }
>> >
>> > Does this make sense if the length stays the same? I see that it eventually
>> > calls this method (in StandardTokenizerImpl):
>> > public final void setBufferSize(int numChars) {
>> >   ZZ_BUFFERSIZE = numChars;
>> >   char[] newZzBuffer = new char[ZZ_BUFFERSIZE];
>> >   System.arraycopy(zzBuffer, 0, newZzBuffer, 0, Math.min(zzBuffer.length, ZZ_BUFFERSIZE));
>> >   zzBuffer = newZzBuffer;
>> > }
>> > So it just copies the old array's content into the new one.
>> >
>> > Regards
>> > Piotr Idzikowski
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>
>>
>
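
On the original setMaxTokenLength question quoted above: one way to avoid the repeated allocation and copy would be to do nothing when the requested size equals the current one. Below is a minimal plain-Java sketch of that idea, assuming a buffer handled the way zzBuffer is in the quoted setBufferSize. BufferSizeSketch and its reallocation counter are hypothetical names used only for illustration. This is not the Lucene source, and I am not claiming it is how LUCENE-6682 is resolved.

```java
// Plain-Java illustration only; this is NOT the Lucene source.
class BufferSizeSketch {

    private char[] buffer;
    private int reallocations = 0; // how many times a new array was allocated

    BufferSizeSketch(int initialSize) {
        buffer = new char[initialSize];
    }

    // Hypothetical guarded variant of a setBufferSize-style method:
    // it reallocates and copies only when the size actually changes.
    void setBufferSize(int numChars) {
        if (numChars < 0) {
            throw new IllegalArgumentException("numChars must not be negative");
        }
        if (numChars == buffer.length) {
            return; // same size requested: keep the existing array
        }
        char[] newBuffer = new char[numChars];
        System.arraycopy(buffer, 0, newBuffer, 0, Math.min(buffer.length, numChars));
        buffer = newBuffer;
        reallocations++;
    }

    int getBufferSize() {
        return buffer.length;
    }

    int getReallocations() {
        return reallocations;
    }

    public static void main(String[] args) {
        BufferSizeSketch sketch = new BufferSizeSketch(255);
        // Simulates setReader being called repeatedly with an unchanged length.
        for (int i = 0; i < 1000; i++) {
            sketch.setBufferSize(255);
        }
        System.out.println(sketch.getReallocations()); // prints 0
        sketch.setBufferSize(100);
        System.out.println(sketch.getReallocations()); // prints 1
    }
}
```

With a guard like this, calling the setter on every setReader would cost only a comparison unless the max token length really changed.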