Yes, I was referring to how IDF is used in the Highlighter code to
decide how to prioritize fragments of the documents.
My requirement is to show the relevant fragments of the news article for
each company along with the search results. But the Highlighter API
sometimes picks fragments that are not very relevant to the news
article/company. I would like to know if there is any way to modify the
scoring/ranking of these fragments so that news items with the company
name and keywords in the headline are assigned a very strong relevancy
ranking, closely followed by a company name mention in the first
paragraph, and then by multiple mentions within the entire story.
Something like headline = 5 points, first paragraph = 4, etc.
Thanks,
Harini
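A minimal sketch of the weighting described above, written as plain Java
rather than against the Highlighter API. The class and method names are
hypothetical, the headline and first-paragraph weights come from the
message, and the weight of 1 per later mention is an assumption. It
checks the company name only; keyword matches could be added the same way.

```java
import java.util.Locale;

// Hypothetical helper for illustration; not part of the Lucene Highlighter API.
class PositionWeightedScore {
    // Headline and first-paragraph weights are from the message above.
    static final int HEADLINE = 5;
    static final int FIRST_PARAGRAPH = 4;
    // Assumed weight for each mention in later paragraphs.
    static final int BODY_MENTION = 1;

    /** Counts non-overlapping, case-insensitive occurrences of term in text. */
    static int countMentions(String text, String term) {
        if (term.isEmpty()) {
            return 0;
        }
        String haystack = text.toLowerCase(Locale.ROOT);
        String needle = term.toLowerCase(Locale.ROOT);
        int count = 0;
        int from = 0;
        while ((from = haystack.indexOf(needle, from)) >= 0) {
            count++;
            from += needle.length();
        }
        return count;
    }

    /** Scores one news item by where the company name appears. */
    static int score(String headline, String[] paragraphs, String company) {
        int score = 0;
        if (countMentions(headline, company) > 0) {
            score += HEADLINE;
        }
        if (paragraphs.length > 0 && countMentions(paragraphs[0], company) > 0) {
            score += FIRST_PARAGRAPH;
        }
        // Every mention after the first paragraph adds a small amount.
        for (int i = 1; i < paragraphs.length; i++) {
            score += BODY_MENTION * countMentions(paragraphs[i], company);
        }
        return score;
    }

    public static void main(String[] args) {
        String[] body = {
            "Acme said it will outsource its call centre.",
            "Analysts expect Acme to restructure. Acme declined to comment."
        };
        // Headline (5) + first paragraph (4) + two later mentions (2) = 11
        System.out.println(score("Acme announces cost savings", body, "Acme"));
    }
}
```

A score like this could be used to order news items or to pick which
section of an article to take fragments from before highlighting.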
markharw00d wrote:
Sorry to contradict, Erik, but the Highlighter's QueryScorer will make
use of IDF, given a reader, in order to better prioritise which are
the "best" bits of a document.
However, in the particular example given, the criteria include
several non-text fields which are not useful for IDF and general
scoring purposes - these are perhaps better expressed using a filter
of some form. Otherwise, why should the scarcity of a particular date
in the given range boost one matching document above others? These
numeric-type fields are simply mandatory boolean "hygiene factors"
and should ideally play no part in highlight selection or results
ordering in general based on their IDF or TF.
Cheers,
Mark
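One way to picture the split suggested above: only the free-text clauses
would drive scoring and highlighting, while the remaining clauses act as
a yes/no filter. The sketch below is plain Java over clause strings; the
class and its methods are hypothetical and do not use the Lucene API,
and treating Content as the only text field is an assumption based on
the query in this thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration; it does not use or mimic the Lucene API.
class ClauseSplitter {
    // Assumption: Content is the only free-text field in the query from this thread.
    static final Set<String> TEXT_FIELDS =
        new HashSet<String>(Arrays.asList("Content"));

    /** Extracts the field name from a clause written as "Field:value". */
    static String fieldOf(String clause) {
        int colon = clause.indexOf(':');
        return colon < 0 ? "" : clause.substring(0, colon);
    }

    /** Selects clauses on text fields (wantText = true) or on all other fields (false). */
    static List<String> select(List<String> clauses, boolean wantText) {
        List<String> out = new ArrayList<String>();
        for (String clause : clauses) {
            if (TEXT_FIELDS.contains(fieldOf(clause)) == wantText) {
                out.add(clause);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> clauses = Arrays.asList(
            "DocumentType:news",
            "CompanyId:10",
            "FilingDate:[20041201 TO 20051201]",
            "Content:\"cost saving\"",
            "Content:outsource");
        System.out.println("score and highlight with: " + select(clauses, true));
        System.out.println("apply as filter: " + select(clauses, false));
    }
}
```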
Erik Hatcher wrote:
Harini,
I'm not sure I understand what you're asking. IDF doesn't factor
into highlighting.
IDF calculations are useful in scoring documents during a search,
such that the most relevant documents are returned, but again this
is unrelated to highlighting.
Could you elaborate on what you're after?
Erik
On Dec 30, 2005, at 12:02 PM, Harini Raghavan wrote:
Hi,
I have a requirement to highlight search keywords in the results and
display the matching fragment of the text with the results. I am using
the Hits highlighting mentioned in Lucene in Action.
Here is the search query (a BooleanQuery) I am passing to the
IndexSearcher and QueryScorer:
+DocumentType:news
+(CompanyId:10 CompanyId:20 CompanyId:30 CompanyId:40)
+FilingDate:[20041201 TO 20051201]
+(Content:"cost saving" Content:"cost savings" Content:outsource
Content:outsources Content:downsize Content:downsizes
Content:restructuring Content:restructure)
I do not quite understand how the query scoring actually works, or
how Inverse Document Frequency (IDF) calculations are useful. Can
someone shed some light on this using the given query as an example?
Thanks,
Harini
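As a rough illustration of the IDF part of the question: IDF gives a
term more weight the fewer documents contain it, so a match on a rare
term counts for more than a match on a common one. The sketch below uses
the formula ln(numDocs / (docFreq + 1)) + 1, which to my knowledge is
what Lucene's DefaultSimilarity uses in this era; the document counts
are made up for the example and the class name is hypothetical.

```java
// Illustration only; the document counts below are made up.
class IdfSketch {
    /** Classic Lucene-style idf: ln(numDocs / (docFreq + 1)) + 1. */
    static double idf(int docFreq, int numDocs) {
        return Math.log(numDocs / (double) (docFreq + 1)) + 1.0;
    }

    public static void main(String[] args) {
        int numDocs = 100000; // assumed index size
        // Assumed: "restructuring" appears in 5000 documents, "outsources" in 50.
        System.out.println("restructuring: " + idf(5000, numDocs));
        System.out.println("outsources:    " + idf(50, numDocs));
    }
}
```

With these assumed counts the rarer term "outsources" gets the larger
weight, so a document or fragment matching it would tend to rank above
one matching only "restructuring", all else being equal.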
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]