Adding to this growing thread, there's really no reason to
index all the term bigrams, trigrams, etc. It's not
only slow, it's very memory/disk intensive. All you need
to do is two passes over the collection.
Pass One
Collect counts of bigrams (or trigrams, or whatever -- if
size is an
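
As a rough illustration of the first pass described above, the following Python sketch tallies n-gram counts in one scan over a collection. The names `ngrams` and `count_collection_ngrams`, the whitespace tokenization, and the in-memory list of strings are all assumptions made for illustration. The message is cut off before it finishes describing either pass, so nothing beyond simple counting is attempted here.

```python
from collections import Counter

def ngrams(tokens, n):
    """Yield every run of n consecutive tokens as a tuple."""
    for i in range(len(tokens) - n + 1):
        yield tuple(tokens[i:i + n])

def count_collection_ngrams(docs, n=2):
    """Pass one: scan every document once and tally n-gram counts."""
    counts = Counter()
    for text in docs:
        # Assumption: lowercased whitespace tokens stand in for a real analyzer.
        counts.update(ngrams(text.lower().split(), n))
    return counts

# Hypothetical two-document collection.
docs = ["the quick brown fox", "the quick red fox"]
bigram_counts = count_collection_ngrams(docs, n=2)
```

Only the counts are held in memory; no n-gram is written to an index.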
Nader Akhnoukh wrote:
Yes, Chris is correct, the goal is to determine the most frequently
occurring
phrases in a document compared to the frequency of that phrase in the
index. So there are only output phrases, no inputs.
Also performance is not really an issue, this would take place on an
irre
I may be coming into this thread without knowing enough. I have implemented a
phrase filter, which indexes all token sequences that are 2 to N tokens long.
The n is defined in the constructor.
It takes a stopword Trie as input because the policy I used, based on a published
work I read, was that a
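
To make "all token sequences that are 2 to N tokens long" concrete, here is a minimal Python sketch. The function name `token_sequences` and the parameter `max_len` are hypothetical, and this is not the filter described in the message. The stopword policy is left out entirely because the message is cut off before stating it.

```python
def token_sequences(tokens, max_len):
    """Yield every contiguous token sequence of length 2..max_len as a string."""
    for size in range(2, max_len + 1):
        for start in range(len(tokens) - size + 1):
            yield " ".join(tokens[start:start + size])

# Hypothetical token stream with N = 3.
phrases = list(token_sequences(["new", "york", "city"], max_len=3))
```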
Yes, Chris is correct, the goal is to determine the most frequently occurring
phrases in a document compared to the frequency of that phrase in the
index. So there are only output phrases, no inputs.
Also performance is not really an issue, this would take place on an
irregular basis and could ru
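
One possible way to compare a phrase's frequency in a document with its frequency in the index is to rank phrases by the ratio of the two relative frequencies. The Python sketch below assumes the phrase counts have already been gathered into plain dictionaries. The function name `distinctive_phrases` and the ratio score are illustrative assumptions, not something the thread specifies.

```python
def distinctive_phrases(doc_counts, index_counts, top=3):
    """Rank a document's phrases by how over-represented they are versus the index."""
    doc_total = sum(doc_counts.values())
    index_total = sum(index_counts.values())
    scores = {}
    for phrase, count in doc_counts.items():
        doc_rate = count / doc_total
        # Assumption: the index covers the document, so a missing phrase
        # falls back to its in-document count instead of zero.
        index_rate = index_counts.get(phrase, count) / index_total
        scores[phrase] = doc_rate / index_rate
    return sorted(scores, key=scores.get, reverse=True)[:top]

# Hypothetical counts for one document and for the whole index.
doc_counts = {"neural network": 4, "of the": 4}
index_counts = {"neural network": 10, "of the": 1000}
ranked = distinctive_phrases(doc_counts, index_counts)
```

A phrase that is common everywhere scores near 1, while a phrase concentrated in this document scores much higher.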
Chris Hostetter wrote:
I think either you misunderstood Nader's question or I did: I believe the
goal is to determine what the most frequently occurring phrases are -- not
determine how frequently a particular input phrase appears.
Isn't the latter a prerequisite for the former? ;)
Regardi
: > I am trying to get the most frequently occurring phrases in a document and
: > in the index as a whole. The goal is to compare the two to get something like
: > Amazon's SIPs.
: Other than indexing the phrases directly, you could use a SpanNearQuery
: over the words, use getSpans() on its SpanS
of occurrences of the "phrase" in the index.
Each time doc() on the Spans returns a given document number,
one can increase the phrase frequency count within the document.
A Spans always iterates by nondecreasing document number.
Btw. that is a search.
Regards,
Paul Elschot
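
The counting step described above can be sketched in Python without touching Lucene at all. Here a plain list of integers stands in for the successive document numbers that doc() would return while iterating the Spans, one entry per match; the list and the function name `phrase_freq_per_doc` are assumptions for illustration. Because the numbers arrive in nondecreasing order, equal values are always adjacent and a single pass suffices.

```python
from itertools import groupby

def phrase_freq_per_doc(match_doc_numbers):
    """Count matches per document from a nondecreasing stream of doc numbers."""
    # groupby only merges adjacent equal values, which is enough here
    # because the stream never goes back to an earlier document.
    return {doc: sum(1 for _ in group) for doc, group in groupby(match_doc_numbers)}

# Hypothetical stream: one document number per phrase match.
matches = [0, 0, 3, 7, 7, 7]
per_doc = phrase_freq_per_doc(matches)   # phrase frequency within each document
total = sum(per_doc.values())            # phrase occurrences in the whole index
```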
Hi, I've looked through the archives and it looks like this question has
been asked in one form or another a few times, but without a satisfactory
solution.
I am trying to get the most frequently occurring phrases in a document and
in the index as a whole. The goal is to compare the two to get some
I searched the mail archive for my question and found that what I really want is
a phrase frequency; it is an old question that was never solved well.
I traced the Lucene source code and discovered that I can get a phrase's IDF from
the Hits object
weight= PhraseQuery$PhraseWeight (id=62
.
If I do, I would be happy to share.
Good luck, and feel free to post anything you think might be helpful if
you implement something.
Sean
Fabio Cristiano dos Anjos wrote:
Hi,
How can I get phrase frequency in an index?
Thanks in advance
Hi,
How can I get phrase frequency in an index?
Thanks in advance!!
--
Regards,
Fábio Cristiano dos Anjos
How can I get phrase frequency in an index? termDocs/termPositions in
IndexReader work only with words
Thanks
Ravi.