How to get the most frequent words for a set of documents in Lucene?

Gucko Gucko Sun, 09 Jun 2013 02:16:49 -0700

Hello all,

I'm trying to cluster documents that were indexed using Lucene 4.3. The
result of the clustering algorithm is a set of clusters, where each cluster
contains the most similar documents (I only store their docIDs in each
cluster). What I want is to get the most frequent words for each cluster.
So I query the Lucene index for the set of documents, and then I want to get
the most frequent words across these documents. How can I do this in Lucene?
I especially need an efficient way, because I'm clustering tweets in
real-time.
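
To make the goal concrete, here is a minimal sketch of the aggregation I have
in mind. It is plain Java, not Lucene API: the class ClusterTopTerms, the
method topTerms and the map termCountsByDoc are hypothetical names, and the
sketch assumes the per-document term counts are already available. In Lucene
4.x they could possibly be read from stored term vectors via
IndexReader.getTermVector(docID, field), but only if the field was indexed
with term vectors, which is an assumption about the index and is not shown
here.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

// Sketch only: ClusterTopTerms and its inputs are hypothetical, not Lucene API.
class ClusterTopTerms {

    // Returns the k terms with the highest summed frequency over one cluster.
    // termCountsByDoc maps docID -> (term -> count in that document) and is
    // assumed to have been computed beforehand.
    static List<String> topTerms(Iterable<Integer> clusterDocIds,
                                 Map<Integer, Map<String, Integer>> termCountsByDoc,
                                 int k) {
        // Sum the per-document counts of every term in the cluster.
        Map<String, Integer> totals = new HashMap<>();
        for (int docId : clusterDocIds) {
            Map<String, Integer> counts = termCountsByDoc.get(docId);
            if (counts == null) {
                continue; // no counts available for this document
            }
            for (Map.Entry<String, Integer> e : counts.entrySet()) {
                totals.merge(e.getKey(), e.getValue(), Integer::sum);
            }
        }

        // Min-heap ordered "weakest first": lower count is weaker; on equal
        // counts the alphabetically later term is treated as weaker.
        Comparator<Map.Entry<String, Integer>> weakestFirst = (a, b) -> {
            int byCount = Integer.compare(a.getValue(), b.getValue());
            return byCount != 0 ? byCount : b.getKey().compareTo(a.getKey());
        };
        PriorityQueue<Map.Entry<String, Integer>> heap = new PriorityQueue<>(weakestFirst);
        for (Map.Entry<String, Integer> e : totals.entrySet()) {
            heap.offer(e);
            if (heap.size() > k) {
                heap.poll(); // keep at most k entries by dropping the weakest
            }
        }

        // The heap yields the weakest term first, so reverse at the end.
        List<String> result = new ArrayList<>();
        while (!heap.isEmpty()) {
            result.add(heap.poll().getKey());
        }
        Collections.reverse(result);
        return result;
    }
}
```

Calling topTerms once per cluster with that cluster's docIDs would give the
list I am after. The bounded heap keeps the selection at O(n log k) for n
distinct terms, but collecting the per-document counts efficiently is exactly
the part I don't know how to do in Lucene.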


What I was thinking about is to create a RAMDirectory, index each set of
documents into it, and then read the statistics for each term.
However, this is slow and uses a lot of memory!


Thanks in advance!


Gucko
