21 aug 2007 kl. 13.10 skrev Jens Grivolla:

On 8/21/07, Heba Farouk <[EMAIL PROTECTED]> wrote:
the documents are not duplicated, i mean the hits (assume that 2 documents have the same subject but with different authors, so if i'm searching the subject, the returned hits will have duplicates )
i was asking if i can remove duplicates from the hits??

You may not want to work with documents at all (where you have the
duplicates), but rather with the terms in your index directly.  Take a
look at WildcardTermEnum etc.

My favorite solution for this is a stand alone trie, and such a solution is available in LUCENE-625.

Another way is to create an ngram-index.

It is usually a good idea to create an "a priori" corpus with a limited set of data. I prefere common user queries rather than items in the index. Especially if your corpus is large and you have a lot of server load.

Try LUCENE-550 as a priori index. My guess is that it would outperform a RAMDirectory 20x at 25,000 title-sized (40 chars avg) documents.


--
karl

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Reply via email to