Re: Searching Diacritics

anorman Mon, 27 Aug 2007 11:30:40 -0700

I've tried to implement an analyzer with little different then using:

result = new ISOLatin1AccentFilter(result);  in the TokenStream method.


Everything appears to work, however my search will not work for any word
with diacritics with that change.  Without using it will find words such as
"cèdulas" but not "cedulas", with the change it will find neither, it just
appears to be stripping it out altogether.

Any suggestions?




anorman wrote:
> 
> Can I do this at search time rather than index time?  Below is my code
> that is handling the searching, where would I utilize such a filter?
> 
> Thanks for the help!
> 
> 
> 
> 
> package search.lucene.search;
> import org.apache.lucene.document.Document;
> import java.io.IOException;
> import java.util.ArrayList;
> import java.util.List;
> 
> import org.apache.lucene.analysis.Analyzer;
> import org.apache.lucene.analysis.standard.StandardAnalyzer;
> import org.apache.lucene.analysis.ISOLatin1AccentFilter;
> import org.apache.lucene.queryParser.ParseException;
> import org.apache.lucene.queryParser.QueryParser;
> import org.apache.lucene.search.Hits;
> import org.apache.lucene.search.IndexSearcher;
> import org.apache.lucene.search.Query;
> 
> import search.lucene.index.IndexManager;
> 
> /**
>  * This class is used to search the 
>  * Lucene index and return search results
>  */
> 
> public class SearchManager {
> 
> 
> private String searchWord;
>     
>     private IndexManager indexManager;
>     
>     private Analyzer analyzer;
>     
>     public SearchManager(String searchWord){
>         this.searchWord   = searchWord;
>         this.indexManager = new IndexManager();
>         this.analyzer = new StandardAnalyzer();
>     }
>     
>     /**
>      * do search
>      */
>     public List search(){
>         List searchResult = new ArrayList();
>               
>         IndexSearcher indexSearcher = null;
> 
>         try{
>             indexSearcher = new IndexSearcher(indexManager.getIndexDir());
>         }catch(IOException ioe){
>             ioe.printStackTrace();
>         }
> 
>         QueryParser queryParser = new QueryParser("content",analyzer);
>         Query query = null;
>         try {
>             query = queryParser.parse(searchWord);
>         } catch (ParseException e) {
>           e.printStackTrace();
>         }
>         
>         if(null != query && null != indexSearcher){                   
>             try {
>                 Hits hits = indexSearcher.search(query);
>                 for(int i = 0; i < hits.length(); i ++){
>                                       
>                                       Document doc = hits.doc(i);
>                                       System.out.println(doc.get("filename"));
>                     
>                                       SearchResultBean resultBean = new 
> SearchResultBean();
> 
>                                        
> resultBean.setXMLId(hits.doc(i).get("id"));
>                                       
> resultBean.setXMLTitle(hits.doc(i).get("title"));
>                                       
> resultBean.setXMLAuthor(hits.doc(i).get("author"));
>                                       
> resultBean.setXMLAbstract(hits.doc(i).get("abstract"));
>                                       resultBean.setScore(hits.score(i));
>                                       
>                                       searchResult.add(resultBean);
>                 }
>             } catch (IOException e) {
>                 e.printStackTrace();
>             }
>         }
>         return searchResult;   
>         
>     }
> 
> 
> 
> 
> 
> 
> thomas arni-2 wrote:
>> 
>> You can extend the DefaultAnalyzer.
>> The only thing you have to do, is to rewrite the method tokenStream like 
>> this:
>> 
>>   /** Constructs a [EMAIL PROTECTED] StandardTokenizer} filtered by a [EMAIL 
>> PROTECTED]
>>   StandardFilter}, a [EMAIL PROTECTED] LowerCaseFilter} and a [EMAIL 
>> PROTECTED] StopFilter}. */
>>   public TokenStream tokenStream(String fieldName, Reader reader) {
>>     TokenStream result = new StandardTokenizer(reader);
>>     result = new StandardFilter(result);
>>     result = new LowerCaseFilter(result);
>>     result = new StopFilter(result, stopSet);
>>     result = new ISOLatin1AccentFilter(result);
>>     return result;
>>   }
>> 
>> 
>> anorman wrote:
>>> This looks like exactly what I want.  Would I implement this along with
>>> another analyzer such as the standard or stand alone?  Does anyone have
>>> any
>>> code examples of implementing such a thing?
>>>
>>> Thanks,
>>> Albert
>>>
>>>
>>>
>>>
>>> karl wettin-3 wrote:
>>>   
>>>> 27 aug 2007 kl. 16.03 skrev anorman:
>>>>
>>>>     
>>>>> I have a searchable index of documents which contain french and  
>>>>> spanish
>>>>> diacritics (è, é, À) etc.  I would like to make the content  
>>>>> searchable so
>>>>> that when a user searches for a word such as "Amèrique" or "Amerique"
>>>>> (without diacritic) then it returns the same results.
>>>>>
>>>>> Has anyone set up something similar?
>>>>>       
>>>> ISOLatin1AccentFilter
>>>>
>>>> -- 
>>>> karl
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: [EMAIL PROTECTED]
>>>> For additional commands, e-mail: [EMAIL PROTECTED]
>>>>
>>>>
>>>>
>>>>     
>>>
>>>   
>> 
>> 
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [EMAIL PROTECTED]
>> For additional commands, e-mail: [EMAIL PROTECTED]
>> 
>> 
>> 
> 
> 

-- 
View this message in context: 
http://www.nabble.com/Searching-Diacritics-tf4335454.html#a12354962
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Re: Searching Diacritics

Reply via email to