[
https://issues.apache.org/jira/browse/LUCENE-8123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Robert Muir resolved LUCENE-8123.
---------------------------------
Resolution: Invalid
Please use the mailing list for questions.
> Question about how to retrieve by TFIDFSimilarity query on lucene
> -----------------------------------------------------------------
>
> Key: LUCENE-8123
> URL: https://issues.apache.org/jira/browse/LUCENE-8123
> Project: Lucene - Core
> Issue Type: Bug
> Components: core/query/scoring
> Affects Versions: 7.2
> Reporter: Wenhai
> Priority: Minor
>
> Hi, all.
> Recently, we have been running experiments on Lucene based on TF-IDF.
> We want to retrieve every document (d) in the corpus whose similarity to
> a given query (q) is no less than a threshold. We use the following
> scoring function:
>   sum_t( tf(t,d) * idf(t) * tf(t,q) * idf(t) ) / ( norm(d) * norm(q) )
> where norm(d) = sqrt( sum_t( tf(t,d) * idf(t) * tf(t,d) * idf(t) ) ),
> and norm(q) is defined analogously.
> We run this query by scanning the postings (docIds) of every term in the
> query, which we obtain via PostingsEnum docEnum =
> MultiFields.getTermDocsEnum(indexReader, "text", term.bytes()). Once the
> inner products for these documents have been accumulated, we compute the
> final similarities by dividing each inner product by the two norms.
> However, as the corpus grows, e.g., to more than ten million titles from
> Twitter's text field, each with 10 terms on average, the runtime becomes
> unacceptable (more than ten seconds), since we always need to merge 0.5 to
> 2 million documents to compute the inner products. Does Lucene provide a
> more efficient interface for generating ranked results based on TF-IDF, or
> for directly filtering out dissimilar documents (in Lucene core) given a
> threshold in the range (0, 1)?
> Best
> Wenhai
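For readers following the quoted question, the scoring scheme it describes
(accumulate per-document inner products term by term, then divide by the
norms and keep documents at or above the threshold) can be sketched roughly
as below. This is only an illustration under stated assumptions: it uses
plain Java collections in place of Lucene postings, the class and method
names (CosineThresholdSketch, addDocument, search) are hypothetical and not
part of Lucene, and the smoothed idf is an assumption, since the question
does not define idf.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical in-memory sketch of the scheme in the question; not Lucene API. */
final class CosineThresholdSketch {
    // term -> (docId -> term frequency); stands in for the postings lists.
    private final Map<String, Map<Integer, Integer>> postings = new HashMap<>();
    private int numDocs = 0;

    /** Indexes one tokenized document and returns its docId. */
    int addDocument(List<String> tokens) {
        int docId = numDocs++;
        for (String t : tokens) {
            postings.computeIfAbsent(t, k -> new HashMap<>()).merge(docId, 1, Integer::sum);
        }
        return docId;
    }

    /** Smoothed idf. Assumption: the question does not say which idf it uses. */
    double idf(String term) {
        Map<Integer, Integer> p = postings.get(term);
        int df = (p == null) ? 0 : p.size();
        return 1.0 + Math.log((numDocs + 1.0) / (df + 1.0));
    }

    /** norm(d) = sqrt(sum_t (tf(t,d) * idf(t))^2) for every document. */
    private double[] computeDocNorms() {
        double[] sumSq = new double[numDocs];
        for (Map.Entry<String, Map<Integer, Integer>> e : postings.entrySet()) {
            double w = idf(e.getKey());
            for (Map.Entry<Integer, Integer> post : e.getValue().entrySet()) {
                double x = post.getValue() * w;
                sumSq[post.getKey()] += x * x;
            }
        }
        for (int d = 0; d < numDocs; d++) {
            sumSq[d] = Math.sqrt(sumSq[d]);
        }
        return sumSq;
    }

    /** Returns docId -> cosine similarity for documents scoring >= threshold. */
    Map<Integer, Double> search(List<String> queryTokens, double threshold) {
        double[] docNorms = computeDocNorms();

        // Query term frequencies tf(t,q).
        Map<String, Integer> qtf = new HashMap<>();
        for (String t : queryTokens) {
            qtf.merge(t, 1, Integer::sum);
        }

        // Term-at-a-time accumulation of the inner products.
        Map<Integer, Double> dot = new HashMap<>();
        double qNormSq = 0.0;
        for (Map.Entry<String, Integer> e : qtf.entrySet()) {
            double w = idf(e.getKey());
            double qWeight = e.getValue() * w;
            qNormSq += qWeight * qWeight;
            Map<Integer, Integer> p = postings.get(e.getKey());
            if (p == null) {
                continue; // term occurs in no document
            }
            for (Map.Entry<Integer, Integer> post : p.entrySet()) {
                dot.merge(post.getKey(), qWeight * post.getValue() * w, Double::sum);
            }
        }

        // Normalize and apply the threshold.
        double qNorm = Math.sqrt(qNormSq);
        Map<Integer, Double> hits = new HashMap<>();
        for (Map.Entry<Integer, Double> e : dot.entrySet()) {
            double sim = e.getValue() / (docNorms[e.getKey()] * qNorm);
            if (sim >= threshold) {
                hits.put(e.getKey(), sim);
            }
        }
        return hits;
    }
}
```

Note that this sketch still visits every posting of every query term, so it
has the same cost profile the question complains about; it only makes the
formula and the threshold step concrete. In a real index the document norms
would be computed once at indexing time rather than on every search.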