Hi, I've looked through the archives and it looks like this question has been asked in one form or another a few times, but without a satisfactory solution.
I am trying to get the most frequently occurring phrases in a document and in the index as a whole. The goal is compare the two to get something like Amazon's SIPs. This is straightforward for individual words. Get the term frequency of each term in a doc and compare it to the frequency of that term in the index. A high ratio indicates that the term appears in this doc much more than the other docs on average. Does anyone have an idea of how to do this with phrases of say 1 to 3 words? Just to be clear, in this case I am only using Lucene for it's built in frequency analysis, I'm not actually using it to search for anything that is indexed. Thanks, NSA