This actually rings a bell for me. Have a look at Lucene's JIRA: I think this was reported as a bug once and has perhaps been fixed.
Note that Lucene in Action 2 has a case study that talks about searching source code. You may find that study interesting.

Otis
--
Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch

----- Original Message ----
> From: Stefan Trcek <wzzelfz...@abas.de>
> To: java-user@lucene.apache.org
> Sent: Mon, December 14, 2009 9:39:34 AM
> Subject: NGramTokenizer stops working after about 1000 terms
>
> Hello
>
> For a source code (git repo) search engine I chose to use an ngram
> analyzer for substring search (something like "git blame").
>
> This worked fine except it didn't find some strings. I tracked it down
> to the analyzer. When the ngram analyzer yielded about 1000 terms it
> stopped yielding more terms; the limit seems to be at most
> (1024 - ngram_length) terms. When I use StandardAnalyzer it works as
> expected.
> Is this a bug or did I miss a limit?
>
> Tested with lucene-2.9.1 and 3.0, this is the core routine I use:
>
>     public static class NGramAnalyzer5 extends Analyzer {
>         public TokenStream tokenStream(String fieldName, Reader reader) {
>             return new NGramTokenizer(reader, 5, 5);
>         }
>     }
>
>     public static String[] analyzeString(Analyzer analyzer,
>             String fieldName, String string) throws IOException {
>         List<String> output = new ArrayList<String>();
>         TokenStream tokenStream = analyzer.tokenStream(fieldName,
>                 new StringReader(string));
>         TermAttribute termAtt = (TermAttribute) tokenStream.addAttribute(
>                 TermAttribute.class);
>         tokenStream.reset();
>         while (tokenStream.incrementToken()) {
>             output.add(termAtt.term());
>         }
>         tokenStream.end();
>         tokenStream.close();
>         return output.toArray(new String[0]);
>     }
>
> The complete example is attached. "in.txt" must be in "." and is plain
> ASCII.
>
> Stefan
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
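
One way to confirm how many terms go missing is to compare the tokenizer's output against a plain-Java reference count. The sketch below does not use Lucene at all. The class name NGramCheck and its two methods are made up for illustration. It relies only on the fact that a string of length L contains L - n + 1 substrings of length n. The "capped at 1024" line is an assumption drawn from the numbers in the report (that only the first 1024 characters are being consumed), not something verified against the tokenizer source.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, independent of Lucene: a reference n-gram generator
// to compare against whatever analyzeString() returns for the same input.
public class NGramCheck {

    /** Returns every contiguous substring of length n, ordered by start offset. */
    public static List<String> ngrams(String text, int n) {
        if (n <= 0) {
            throw new IllegalArgumentException("n must be positive");
        }
        List<String> out = new ArrayList<String>();
        for (int start = 0; start + n <= text.length(); start++) {
            out.add(text.substring(start, start + n));
        }
        return out;
    }

    /** Number of n-grams a string of the given length should yield. */
    public static int expectedCount(int length, int n) {
        return Math.max(0, length - n + 1);
    }

    public static void main(String[] args) {
        // Build a 3000-character ASCII test input, well past the suspected limit.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 3000; i++) {
            sb.append((char) ('a' + i % 26));
        }
        String text = sb.toString();
        int n = 5;

        System.out.println("input length:      " + text.length());
        System.out.println("expected 5-grams:  " + expectedCount(text.length(), n));
        System.out.println("reference 5-grams: " + ngrams(text, n).size());
        // Assumption: what the count would be if only 1024 characters were read.
        System.out.println("capped at 1024:    " + expectedCount(1024, n));
    }
}
```

If analyzeString() with NGramAnalyzer5 returns far fewer terms than ngrams(text, 5) for the same input, and the count sits near expectedCount(1024, 5), that would be consistent with the tokenizer only looking at the first 1024 characters. Comparing the two lists element by element also shows exactly where the tokenizer's output stops.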