Hello all, I'm trying to cluster documents that were indexed with Lucene 4.3. The clustering algorithm produces a set of clusters, each containing the most similar documents (I only store their docIDs in each cluster). What I want is the most frequent words for each cluster: given the docIDs in a cluster, I want to look those documents up in the Lucene index and get the terms that occur most often across them. How can I do this in Lucene? I especially need an efficient way, because I'm clustering tweets in real time.
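To make the goal concrete, here is a minimal sketch of the aggregation I'm after, written in plain Java 8 without any Lucene calls. The class `ClusterTopTerms`, the method `topTerms`, and the `termCountsByDoc` map are all hypothetical names for illustration; the map is only a stand-in for whatever per-document term statistics the index could give me, and getting those efficiently is the part I don't know how to do.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ClusterTopTerms {

    /**
     * Sums per-document term counts over one cluster and returns the k most
     * frequent terms. termCountsByDoc is a hypothetical stand-in for the
     * per-document statistics that would really come from the index.
     */
    public static List<String> topTerms(Collection<Integer> clusterDocIds,
                                        Map<Integer, Map<String, Integer>> termCountsByDoc,
                                        int k) {
        // Accumulate the total frequency of every term across the cluster.
        Map<String, Integer> totals = new HashMap<>();
        for (Integer docId : clusterDocIds) {
            Map<String, Integer> counts = termCountsByDoc.get(docId);
            if (counts == null) {
                continue; // no statistics available for this document
            }
            for (Map.Entry<String, Integer> e : counts.entrySet()) {
                totals.merge(e.getKey(), e.getValue(), Integer::sum);
            }
        }

        // Sort by total count (descending); break ties alphabetically so the
        // result is deterministic.
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(totals.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });

        // Keep only the first k terms.
        List<String> result = new ArrayList<>();
        for (int i = 0; i < Math.min(k, entries.size()); i++) {
            result.add(entries.get(i).getKey());
        }
        return result;
    }
}
```

Calling `topTerms` once per cluster with that cluster's docIDs would give me the labels I want. Sorting every distinct term is the simplest thing to write down; for large clusters a bounded priority queue of size k would probably be cheaper.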
My idea so far is to create a RAMDirectory, index each cluster's documents into it, and then read the statistics for each term. However, this is slow and uses a lot of memory!

Thanks in advance!
Gucko