Re: how to control terms to be highlighted?

mark harwood Fri, 02 Dec 2005 04:34:40 -0800

Hi Harini,
I updated QueryTermExtractor in Subversion last night
to support your requirement.


The JUnit test is also updated with a field-specific
example.
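
A rough sketch of the field-specific behaviour, written against plain
Java collections so it runs without Lucene on the classpath. FieldTerm,
FieldFilterSketch and the sample terms are hypothetical stand-ins and
not the classes in Subversion; the empty-string "no filter" convention
is taken from the modified file quoted further down.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for a Lucene Term: a field name plus the term text.
final class FieldTerm {
    final String field;
    final String text;

    FieldTerm(String field, String text) {
        this.field = field;
        this.text = text;
    }
}

final class FieldFilterSketch {
    // Returns the term texts to highlight. An empty fieldName means
    // "no filter", mirroring the "" convention in the quoted file.
    static Set<String> termsToHighlight(List<FieldTerm> queryTerms, String fieldName) {
        Set<String> texts = new LinkedHashSet<>();
        for (FieldTerm term : queryTerms) {
            if (fieldName.isEmpty() || term.field.equals(fieldName)) {
                texts.add(term.text);
            }
        }
        return texts;
    }

    // A few terms from the query in this thread (phrases split into words).
    static List<FieldTerm> sampleQueryTerms() {
        return Arrays.asList(
            new FieldTerm("DocumentType", "news"),
            new FieldTerm("CompanyId", "10"),
            new FieldTerm("CompanyId", "40"),
            new FieldTerm("Content", "cost"),
            new FieldTerm("Content", "saving"),
            new FieldTerm("Content", "outsource"));
    }
}
```

Calling termsToHighlight(sampleQueryTerms(), "Content") keeps only the
Content terms, while passing "" keeps 'news', '10' and '40' as well.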

Cheers,
Mark


--- Harini Raghavan <[EMAIL PROTECTED]> wrote:

> Hi Chris,
> 
> Can we pass one query object to the searcher and a different one
> to the highlighter? I am not sure about that.
> In any case, based on Mark's suggestion I modified the
> QueryTermExtractor class and filtered the query terms by the
> fieldName.
> Attached is the modified file.
> 
> Thanks,
> Harini
> 
> 
> 
> Chris Hostetter wrote:
> 
> >I don't know what your application is, and I have no experience with the
> >Highlighter code, so forgive me if this is a silly suggestion:
> >
> >It looks like you are building a query up programmatically, which
> >contains some words to search on, and some other stuff that's mainly
> >being used to "filter" the results (I'll avoid my usual rant about
> >people underutilizing Filters).  So why not pass the Highlighter just
> >the portion of the Query that you actually want to contribute to the
> >highlighting?  In this query...
> >
> >: >> +DocumentType:news
> >: >> +(CompanyId:10 CompanyId:20 CompanyId:30 CompanyId:40)
> >: >> +FilingDate:[20041201 TO 20051201]
> >: >> +(Content:"cost saving" Content:"cost savings" Content:outsource
> >: >>   Content:outsources Content:downsize Content:downsizes
> >: >>   Content:restructuring Content:restructure)
> >
> >...just give the highlighter...
> >
> >    (Content:"cost saving" Content:"cost savings"
> >     Content:outsource
> >     Content:outsources Content:downsize
> >     Content:downsizes
> >     Content:restructuring Content:restructure)
> >
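
A minimal sketch of that idea: build the clauses once, then derive both
the full search query and the reduced highlighter query from the same
list. Clause, SplitQuerySketch and the highlight flag are hypothetical
names for illustration, and the queries are modelled as plain strings
rather than Lucene Query objects.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of one top-level clause of the query.
final class Clause {
    final String expression;  // e.g. "DocumentType:news"
    final boolean highlight;  // true if its terms should be highlighted

    Clause(String expression, boolean highlight) {
        this.expression = expression;
        this.highlight = highlight;
    }
}

final class SplitQuerySketch {
    // Full query for searching: every clause is required (+).
    static String searchQuery(List<Clause> clauses) {
        List<String> parts = new ArrayList<>();
        for (Clause clause : clauses) {
            parts.add("+" + clause.expression);
        }
        return String.join(" ", parts);
    }

    // Reduced query for the highlighter: only the flagged clauses.
    static String highlightQuery(List<Clause> clauses) {
        List<String> parts = new ArrayList<>();
        for (Clause clause : clauses) {
            if (clause.highlight) {
                parts.add(clause.expression);
            }
        }
        return String.join(" ", parts);
    }
}
```

With the DocumentType, CompanyId and FilingDate clauses left unflagged,
highlightQuery returns only the Content group, which would presumably be
the query handed to the QueryScorer(Query) constructor mentioned
elsewhere in this thread.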
> >
> >: Date: Thu, 01 Dec 2005 10:38:41 +0530
> >: From: Harini Raghavan <[EMAIL PROTECTED]>
> >: Reply-To: java-user@lucene.apache.org
> >: To: java-user@lucene.apache.org
> >: Subject: Re: how to control terms to be highlighted?
> >:
> >: Hi Mark,
> >:
> >: It would be great if you could make this change and send the
> >: QueryTermExtractor class. I am invoking the QueryScorer(Query)
> >: constructor. Should I use QueryScorer(Query query, IndexReader reader,
> >: String fieldName) instead for this to work?
> >:
> >: Thanks,
> >: Harini
> >:
> >: mark harwood wrote:
> >:
> >: >>Is there any way to restrict the highlighter to highlight only
> >: >>the values mentioned for the field 'Content'?
> >: >>
> >: >
> >: >The problem lies in the QueryTermExtractor class, which is
> >: >typically used to provide the Highlighter with the list of
> >: >strings to identify in the text. It currently has no filter
> >: >for fieldname - you could add this without too much effort.
> >: >
> >: >I could make this modification but it may change the behaviour
> >: >of existing applications. Currently the QueryTermExtractor
> >: >method that takes a fieldname uses it only to derive IDF
> >: >weightings; the proposed change would also have the effect of
> >: >filtering out any query terms that weren't for this field.
> >: >Would this change be a problem for anyone?
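
For reference, the IDF weighting mentioned here can be sketched on its
own. The expression is the one in the getIdfWeightedTerms method of the
file quoted further down; IdfSketch and its method names are
hypothetical, and the document counts are passed in directly instead of
being read from an IndexReader.

```java
final class IdfSketch {
    // Same expression as in the quoted file: ln(numDocs / (docFreq + 1)) + 1.
    static float idf(int totalNumDocs, int docFreq) {
        return (float) (Math.log((float) totalNumDocs / (double) (docFreq + 1)) + 1.0);
    }

    // A term's boost scaled by its IDF, as the quoted method does per term.
    static float weight(float boost, int totalNumDocs, int docFreq) {
        return boost * idf(totalNumDocs, docFreq);
    }
}
```

Rare terms get a larger weight than common ones, which is what allows
graded highlighting or better fragment scoring.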
> >: >
> >: >Cheers,
> >: >Mark
> >: >
> >: >--- Harini Raghavan <[EMAIL PROTECTED]> wrote:
> >: >
> >: >>Hi,
> >: >>
> >: >>I have a requirement to highlight search keywords in the results
> >: >>and display the matching fragment of the text with the results.
> >: >>I am using the Hits highlighting described in Lucene in Action.
> >: >>
> >: >>Here is the search query (BooleanQuery) I am passing to the
> >: >>IndexSearcher and QueryScorer:
> >: >> +DocumentType:news
> >: >> +(CompanyId:10 CompanyId:20 CompanyId:30 CompanyId:40)
> >: >> +FilingDate:[20041201 TO 20051201]
> >: >> +(Content:"cost saving" Content:"cost savings" Content:outsource
> >: >>   Content:outsources Content:downsize Content:downsizes
> >: >>   Content:restructuring Content:restructure)
> >: >>
> >: >>My requirement is to highlight only the keywords for the 'Content'
> >: >>field, but the Highlighter API is also highlighting words like
> >: >>'news', '10', '40' etc.
> >: >>Is there any way to restrict the highlighter to highlight only
> >: >>the values mentioned for the field 'Content'?
> >: >>
> >: >>Thanks,
> >: >>Harini
=== message truncated ===

> package org.apache.lucene.search.highlight;
> /**
>  * Copyright 2002-2004 The Apache Software Foundation
>  *
>  * Licensed under the Apache License, Version 2.0 (the "License");
>  * you may not use this file except in compliance with the License.
>  * You may obtain a copy of the License at
>  *
>  *     http://www.apache.org/licenses/LICENSE-2.0
>  *
>  * Unless required by applicable law or agreed to in writing, software
>  * distributed under the License is distributed on an "AS IS" BASIS,
>  * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
>  * See the License for the specific language governing permissions and
>  * limitations under the License.
>  */
> 
> import java.io.IOException;
> import java.util.Collection;
> import java.util.HashSet;
> import java.util.Iterator;
> 
> import org.apache.lucene.index.IndexReader;
> import org.apache.lucene.index.Term;
> import org.apache.lucene.search.BooleanClause;
> import org.apache.lucene.search.BooleanQuery;
> import org.apache.lucene.search.PhraseQuery;
> import org.apache.lucene.search.Query;
> import org.apache.lucene.search.TermQuery;
> import org.apache.lucene.search.spans.SpanNearQuery;
> 
> /**
>  * Utility class used to extract the terms used in a query, plus any weights.
>  * This class will not find terms for MultiTermQuery, RangeQuery and PrefixQuery classes
>  * so the caller must pass a rewritten query (see Query.rewrite) to obtain a list of
>  * expanded terms.
>  *
>  */
> public final class QueryTermExtractor
> {
> 
>       /**
>        * Extracts all terms texts of a given Query into an array of WeightedTerms
>        *
>        * @param query      Query to extract term texts from
>        * @return an array of the terms used in a query, plus their weights.
>        */
>       public static final WeightedTerm[] getTerms(Query query)
>       {
>               return getTerms(query,false,"");
>       }
> 
>       /**
>        * Extracts all terms texts of a given Query into an array of WeightedTerms
>        *
>        * @param query      Query to extract term texts from
>        * @param reader used to compute IDF which can be used to a) score selected fragments better
>        * b) use graded highlights eg changing intensity of font color
>        * @param fieldName the field on which Inverse Document Frequency (IDF) calculations are based;
>        * query terms for any other field are filtered out
>        * @return an array of the terms used in a query, plus their weights.
>        */
>       public static final WeightedTerm[] getIdfWeightedTerms(Query query, IndexReader reader, String fieldName)
>       {
>               WeightedTerm[] terms=getTerms(query,false,fieldName);
>               int totalNumDocs=reader.numDocs();
>               for (int i = 0; i < terms.length; i++)
>               {
>                       try
>                       {
>                               int docFreq=reader.docFreq(new Term(fieldName,terms[i].term));
>                               //IDF algorithm taken from DefaultSimilarity class
>                               float idf=(float)(Math.log((float)totalNumDocs/(double)(docFreq+1)) + 1.0);
>                               terms[i].weight*=idf;
>                       }
>                       catch (IOException e)
>                       {
>                               //ignore
>                       }
>               }
>               return terms;
>       }
> 
>       /**
>        * Extracts all terms texts of a given Query into an array of WeightedTerms
>        *
>        * @param query      Query to extract term texts from
>        * @param prohibited <code>true</code> to extract "prohibited" terms, too
>        * @param fieldName  only terms for this field are extracted; pass "" to extract terms for all fields
>        * @return an array of the terms used in a query, plus their weights.
>        */
>       public static final WeightedTerm[] getTerms(Query query, boolean prohibited, String fieldName)
>       {
>               HashSet terms=new HashSet();
>               getTerms(query,terms,prohibited,fieldName);
>               return (WeightedTerm[]) terms.toArray(new WeightedTerm[0]);
>       }
> 
>       private static final void getTerms(Query query, HashSet terms,boolean prohibited, String fieldName)
>       {
>               if (query instanceof BooleanQuery)
>                       getTermsFromBooleanQuery((BooleanQuery) query, terms, prohibited, fieldName);
>               else if (query instanceof PhraseQuery)
>                       getTermsFromPhraseQuery((PhraseQuery) query, terms, fieldName);
>               else if (query instanceof TermQuery)
>                       getTermsFromTermQuery((TermQuery) query, terms, fieldName);
>               else if (query instanceof SpanNearQuery)
>                       getTermsFromSpanNearQuery((SpanNearQuery) query, terms, fieldName);
>       }
> 
>       private static final void getTermsFromBooleanQuery(BooleanQuery query, HashSet terms, boolean prohibited, String fieldName)
>       {
>               BooleanClause[] queryClauses = query.getClauses();
>               int i;
>
>               for (i = 0; i < queryClauses.length; i++)
>               {
>                       if (prohibited || !queryClauses[i].prohibited)
>                               getTerms(queryClauses[i].query, terms, prohibited, fieldName);
>               }
>       }
> 
>       private static final void getTermsFromPhraseQuery(PhraseQuery query, HashSet terms, String fieldName)
>       {
>               Term[] queryTerms = query.getTerms();
>               int i;
>               String field;
>
>               for (i = 0; i < queryTerms.length; i++)
>               {
>                       if(fieldName.equals(""))
>                               terms.add(new WeightedTerm(query.getBoost(),queryTerms[i].text()));
>                       else {
>                               field = queryTerms[i].field();
>                               if(field.equals(fieldName))
>                                       terms.add(new WeightedTerm(query.getBoost(),queryTerms[i].text()));
>                       }
>               }
>       }
> 
>       private static final void getTermsFromTermQuery(TermQuery query, HashSet terms, String fieldName)
>       {
>               String field = query.getTerm().field();
>               if(fieldName.equals(""))
>                       terms.add(new WeightedTerm(query.getBoost(),query.getTerm().text()));
>               else if(field.equals(fieldName)) {
>                       terms.add(new WeightedTerm(query.getBoost(),query.getTerm().text()));
>               }
>       }
> 
> 
=== message truncated ===
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
