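Did you really mean to shingle twice? ShingleAnalyzerWrapper already wraps the analyzer with a ShingleFilter, and then your code wraps that stream with another ShingleFilter. The second filter builds its bigrams out of the first filter's output (unigrams plus bigrams), which is why you see words repeated.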
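To illustrate, here is a minimal plain-Java sketch of what I think is happening. It is not Lucene code: `ShingleSketch`, `shingle` and `FILLER` are hypothetical names, and the model simply assumes that each pass pairs up adjacent tokens and that the removed stop words ("and", "in", "no") collapse into a single "_" filler. Under those assumptions, one pass gives the bigrams you want, and two passes reproduce the output from your mail.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ShingleSketch {
    // Hypothetical stand-in for the "_" filler that marks removed stop words.
    static final String FILLER = "_";

    // Simplified model of one shingling pass with min = max = 2.
    // Assumption: every input token is treated as its own position.
    static List<String> shingle(List<String> tokens, boolean outputUnigrams) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            String current = tokens.get(i);
            // Emit the token itself, unless it is only a filler.
            if (outputUnigrams && !current.equals(FILLER)) {
                out.add(current);
            }
            // Emit the bigram formed with the following token, if any.
            if (i + 1 < tokens.size()) {
                out.add(current + " " + tokens.get(i + 1));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // One pass, unigrams off: plain bigrams.
        System.out.println(shingle(Arrays.asList("This", "is", "a", "dog"), false));
        // prints [This is, is a, a dog]

        // Two passes: the first keeps unigrams (as the wrapper appears to do
        // here), the second pairs up everything the first one produced.
        List<String> afterStopWords =
                Arrays.asList("resting", "comfortably", FILLER, "distress");
        List<String> firstPass = shingle(afterStopWords, true);
        for (String s : shingle(firstPass, false)) {
            System.out.println(s);
        }
        // prints:
        // resting resting comfortably
        // resting comfortably comfortably
        // comfortably comfortably _
        // comfortably _ _ distress
        // _ distress distress
    }
}
```

In your code that would mean dropping the extra `new ShingleFilter(tokenStream)` and reading the terms straight from the stream that `shingleAnalyzer.tokenStream(...)` returns. Unigrams then have to be switched off on the wrapper itself rather than on a second filter; I believe ShingleAnalyzerWrapper has a constructor that takes an outputUnigrams flag, but please check the 4.6 javadoc.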
On Wed, Apr 2, 2014 at 1:42 PM, Natalia Connolly <natalia.v.conno...@gmail.com> wrote:
> Hello,
>
> I am very confused about what ShingleFilter seems to be doing in Lucene
> 4.6. What I would like to do is extract all possible bigrams from a
> sentence. So if the sentence is "This is a dog", I want "This is", "is a",
> "a dog".
>
> Here is my code:
>
>     StringTokenizer itr = new StringTokenizer(theText,"\n");
>     Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_46);
>     ShingleAnalyzerWrapper shingleAnalyzer = new ShingleAnalyzerWrapper(analyzer,2,2);
>
>     while (itr.hasMoreTokens()) {
>
>         String theSentence = itr.nextToken();
>         StringReader reader = new StringReader(theSentence);
>         TokenStream tokenStream = shingleAnalyzer.tokenStream("content", reader);
>         ShingleFilter theFilter = new ShingleFilter(tokenStream);
>         theFilter.setOutputUnigrams(false);
>
>         CharTermAttribute charTermAttribute = theFilter.addAttribute(CharTermAttribute.class);
>
>         theFilter.reset();
>
>         while (theFilter.incrementToken()) {
>
>             System.out.println(charTermAttribute.toString());
>
>         }
>
>         theFilter.end();
>         theFilter.close();
>     }
>
> What I see in the output is this: suppose the sentence is "resting
> comfortably and in no distress". I get the following output:
>
>     resting resting comfortably
>     resting comfortably comfortably
>     comfortably comfortably _
>     comfortably _ _ distress
>     _ distress distress
>
> So it looks like not only do I not get bigrams, I get spurious 3-grams
> by repeating words. Could someone please help?
>
> Thanks much,
>
> Natalia Connolly