Re: Using a Lucene ShingleFilter to extract frequencies of bigrams in Lucene

Robert Muir Tue, 04 Sep 2012 17:54:07 -0700

On Tue, Sep 4, 2012 at 12:37 PM, Martin O'Shea <[email protected]> wrote:
>
> Does anyone know if this can be used in conjunction with other analyzers to
> return the frequencies of the bigrams or trigrams found, e.g.:
>
>     "please divide this please divide sentence into shingles"
>
> Would return 2 for "please divide"?
>
> I'm currently using Lucene 3.0.2 to extract frequencies of unigrams from a
> string using a combination of a TermVectorMapper and Standard/Snowball
> analyzers.
>
> I should add that my strings are built up from a database and then indexed
> by Lucene in memory and are not persisted beyond this. Use of other products
> like Solr is not intended.
>


The bigrams, trigrams, etc. generated by shingles are terms just like the
unigrams, so you can wrap any other analyzer in a
ShingleAnalyzerWrapper if you want the shingles.

If you just want to use Lucene's analyzers to tokenize the text and
compute within-document frequencies for a one-off purpose, I think
indexing and creating term vectors could be overkill: you could just
consume the tokens from the Analyzer and count them in a HashMap (or
whatever structure you need).
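
As a rough illustration of the counting step only, here is a minimal
sketch in plain Java. It does not use Lucene: lowercasing and
whitespace splitting stand in for whatever tokens your Analyzer would
produce, and the class and method names (ShingleCounter, count) are
made up for this example. In real code you would feed it the terms
consumed from the Analyzer's TokenStream instead of splitting the
string yourself.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper for illustration; not part of Lucene.
final class ShingleCounter {

    // Counts every run of `size` consecutive tokens in `text`.
    // Assumption: lowercasing plus whitespace splitting is only a
    // stand-in for the token stream a Lucene Analyzer would emit.
    static Map<String, Integer> count(String text, int size) {
        Map<String, Integer> freqs = new HashMap<String, Integer>();
        String normalized = text.trim().toLowerCase(Locale.ROOT);
        if (normalized.isEmpty() || size < 1) {
            return freqs;
        }
        String[] tokens = normalized.split("\\s+");
        // Slide a window of `size` tokens across the token array.
        for (int i = 0; i + size <= tokens.length; i++) {
            StringBuilder shingle = new StringBuilder(tokens[i]);
            for (int j = 1; j < size; j++) {
                shingle.append(' ').append(tokens[i + j]);
            }
            String key = shingle.toString();
            Integer previous = freqs.get(key);
            freqs.put(key, previous == null ? 1 : previous + 1);
        }
        return freqs;
    }
}
```

With the sentence from the question, ShingleCounter.count(text, 2)
maps "please divide" to 2 and every other bigram to 1.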

There are examples in the org.apache.lucene.analysis package javadocs.

-- 
lucidworks.com

