Using a Lucene ShingleFilter to extract frequencies of bigrams in Lucene
A Lucene ShingleFilter can be used to tokenize a string into shingles, or n-grams, of different sizes, e.g.:

"please divide this sentence into shingles"

becomes the shingles "please divide", "divide this", "this sentence", "sentence into", and "into shingles".

Does anyone know if this can be used in conjunction with other analyzers to return the frequencies of the bigrams or trigrams found, e.g.:

"please divide this please divide sentence into shingles"

would return 2 for "please divide"?

I'm currently using Lucene 3.0.2 to extract frequencies of unigrams from a string using a combination of a TermVectorMapper and Standard/Snowball analyzers.

I should add that my strings are built up from a database and then indexed by Lucene in memory, and are not persisted beyond this. Use of other products like Solr is not intended.

Thanks Mr Morgan.
Re: Using a Lucene ShingleFilter to extract frequencies of bigrams in Lucene
On Tue, Sep 4, 2012 at 12:37 PM, Martin O'Shea wrote:

> Does anyone know if this can be used in conjunction with other analyzers to
> return the frequencies of the bigrams or trigrams found, e.g.:
>
> "please divide this please divide sentence into shingles"
>
> Would return 2 for "please divide"?
>
> I'm currently using Lucene 3.0.2 to extract frequencies of unigrams from a
> string using a combination of a TermVectorMapper and Standard/Snowball
> analyzers.
>
> I should add that my strings are built up from a database and then indexed
> by Lucene in memory and are not persisted beyond this. Use of other products
> like Solr is not intended.

The bigrams etc. generated by shingles are terms just like the unigrams. So you can wrap any other analyzer with a ShingleAnalyzerWrapper if you want the shingles.

If you just want to use Lucene's analyzers to tokenize the text and compute within-document frequencies for a one-off purpose, I think indexing and creating term vectors could be overkill: you could just consume the tokens from the Analyzer and make a hashmap or whatever you need...

There are examples in the org.apache.lucene.analysis package javadocs.

--
lucidworks.com
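As a rough illustration of the "consume the tokens and make a hashmap" idea, here is a minimal sketch in plain Java. It deliberately does not call Lucene: a lowercase-and-split-on-whitespace step stands in for the analyzer, and the class name ShingleCounter and method countShingles are made up for this example. In a real setup the token list would presumably be filled from the TokenStream of whatever analyzer is in use (for instance one wrapped in a ShingleAnalyzerWrapper, in which case the shingling loop below would not be needed).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Lucene.
class ShingleCounter {

    // Counts every word n-gram ("shingle") of exactly n tokens in the text.
    static Map<String, Integer> countShingles(String text, int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be at least 1");
        }

        // Stand-in tokenizer (assumption): lowercase, split on whitespace.
        // With Lucene, these tokens would come from the analyzer instead.
        List<String> tokens = new ArrayList<>();
        String trimmed = text.trim().toLowerCase();
        if (!trimmed.isEmpty()) {
            for (String token : trimmed.split("\\s+")) {
                tokens.add(token);
            }
        }

        // Slide a window of n tokens over the list and tally each shingle.
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (int i = 0; i + n <= tokens.size(); i++) {
            String shingle = String.join(" ", tokens.subList(i, i + n));
            counts.merge(shingle, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        String text = "please divide this please divide sentence into shingles";
        Map<String, Integer> bigrams = countShingles(text, 2);
        // "please divide" occurs twice in this input.
        System.out.println("please divide -> " + bigrams.get("please divide"));
    }
}
```

Passing 3 instead of 2 counts trigrams the same way. Note that a real analyzer may stem tokens or drop stop words, so its counts can differ from what this whitespace-based stand-in produces.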