In my project, I store the user's input keywords in the database, build an
index from that table, and use the index for suggestive search. The code
below started from an example I found through Google; I changed the analyzer
and the query function. I attach it here, but you will have to modify it to
make it run, since it references our project's database classes.

For Chinese words: if you build your dictionary index from the index you
built for your web search or FTP search, make sure that your analyzer
segments Chinese correctly, otherwise the dictionary index will not work
correctly.
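
For illustration, a minimal sketch of a gram analyzer built on the contrib
CJKTokenizer, which segments Chinese text into overlapping bigrams. This is
just a sketch under that assumption, not the analyzer in the attached code:

    // Sketch: CJK-aware gram analyzer. Assumes the Lucene contrib
    // analyzers jar (org.apache.lucene.analysis.cjk.CJKTokenizer).
    public Analyzer getCjkGramAnalyzer()
    {
        return new Analyzer()
        {
            public TokenStream tokenStream(String fieldName, Reader reader)
            {
                // bigram segmentation instead of StandardTokenizer
                TokenStream result = new CJKTokenizer(reader);
                // front grams so every typed prefix can match
                result = new EdgeNGramTokenFilter(result, Side.FRONT, 1, 20);
                return result;
            }
        };
    }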

2009/4/9 王巍巍 <ww.wang...@gmail.com>

> I tested the Lucene spellchecker and it doesn't support Chinese spell
> checking. How can I achieve this the way Google does?
>
> 2009/4/9 Karl Wettin <karl.wet...@gmail.com>
>
>> If you use prefix grams only then you'll get a forward-only suggestion
>> scheme. I've seen several implementations that use that and it works
>> quite well.
>>
>> harry potter: ^ha, ^har, ^harr, ^harry, ^harry p, ^harry po..
>> harry houdini: ^ha, ^har, ^harr, ^harry, ^harry h, ^harry ho..
>>
>> I prefer the trie-pattern though. Just remembered there is an old one in
>> LUCENE-625.
>>
>>     karl
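
A bare-bones in-memory trie along those lines, as a sketch (the class is
invented for this example; it is not the LUCENE-625 implementation):

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Forward-only suggestion trie: add() inserts full phrases, suggest()
    // walks down to the typed prefix and collects phrases in its subtree.
    public class PrefixTrie
    {
        private final Map<Character, PrefixTrie> children = new HashMap<Character, PrefixTrie>();
        private boolean isPhrase; // a complete phrase ends at this node

        public void add(String phrase)
        {
            PrefixTrie node = this;
            for (int i = 0; i < phrase.length(); i++)
            {
                Character c = Character.valueOf(phrase.charAt(i));
                PrefixTrie child = node.children.get(c);
                if (child == null)
                {
                    child = new PrefixTrie();
                    node.children.put(c, child);
                }
                node = child;
            }
            node.isPhrase = true;
        }

        public List<String> suggest(String prefix, int max)
        {
            PrefixTrie node = this;
            for (int i = 0; i < prefix.length() && node != null; i++)
                node = node.children.get(Character.valueOf(prefix.charAt(i)));
            List<String> out = new ArrayList<String>();
            if (node != null)
                node.collect(prefix, out, max);
            return out;
        }

        private void collect(String sofar, List<String> out, int max)
        {
            if (out.size() >= max)
                return;
            if (isPhrase)
                out.add(sofar);
            for (Map.Entry<Character, PrefixTrie> e : children.entrySet())
            {
                if (out.size() >= max)
                    return;
                e.getValue().collect(sofar + e.getKey(), out, max);
            }
        }
    }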
>>
>> On 8 Apr 2009, at 20.50, Matt Schraeder wrote:
>>
>>
>>> Correct me if I'm wrong, but I don't think n-grams are really what I'm
>>> looking for here.  I'm not looking for a spellchecker or phrase checker
>>> style suggestive search, but only based on the exact phrases the user is
>>> currently typing.  Since Lucene uses term-based searching, I'm not sure
>>> how to have it search on portions of a full phrase.  Using a standard
>>> lucene search typing in "harr" will result in searching for "harr" as a
>>> term, which will not find "Harry Potter".  Using ngrams it would find
>>> "Harry" as a term, but not at the beginning of an entire phrase.  This
>>> would bring back "My Dog Harry" as a result, which isn't what I'm
>>> looking for. I just want phrases from fields beginning with "Harr"
>>> only.
>>>
>>> I could easily do this all with our database server by simply doing a
>>> query for "where searchqueries like 'harr%'" but we're trying to limit
>>> our hits to the database to keep speed up on the site.
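
One way to get the LIKE 'harr%' behaviour out of Lucene rather than the
database is to index each phrase as a single untokenized, lowercased term
and match it with a PrefixQuery. A rough sketch, with an illustrative
"phrase" field name:

    // Sketch: whole-phrase prefix matching, mirroring SQL LIKE 'harr%'.
    // Assumes an open IndexWriter (writer) and IndexSearcher (searcher).
    Document doc = new Document();
    doc.add(new Field("phrase", "harry potter", Field.Store.YES, Field.Index.NOT_ANALYZED));
    writer.addDocument(doc);

    Query q = new PrefixQuery(new Term("phrase", "harr")); // lowercase the input first
    TopDocs hits = searcher.search(q, 10);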
>>>
>>>  karl.wet...@gmail.com 4/8/2009 12:49:45 PM >>>
>>> For this you probably want to use ngrams. Whether or not this is
>>> something that fits in your current index is hard to say. My guess is
>>> that you want to create a new index with one document per unique
>>> phrase. You might also want to try to load this index into an
>>> InstantiatedIndex; that could speed things up quite a bit if the
>>> corpus is not too large.
>>>
>>> If your suggestion text corpus is really large and you only want
>>> forward-only suggestions then you might want to consider a trie-
>>> pattern solution instead. These can be rather resource efficient, even
>>> when loaded into memory.
>>>
>>> If you have a lot of user load on your search engine then it might be
>>> interesting to use old user queries as the base of your suggestions
>>> and perhaps boost a bit on trends, i.e. the more people search for
>>> something the more it gets boosted in the suggestions list.
>>>
>>>     karl
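
Loading the finished suggestion index into an InstantiatedIndex
(contrib/instantiated) would look roughly like this sketch; the path is the
one used in the attached code, and the snippet assumes the contrib jar is
on the classpath:

    // Sketch: copy an on-disk index fully into memory for faster lookups.
    // import org.apache.lucene.store.instantiated.InstantiatedIndex;
    // import org.apache.lucene.store.instantiated.InstantiatedIndexReader;
    IndexReader source = IndexReader.open(FSDirectory.getDirectory("wrapper/index/dictionary"));
    InstantiatedIndex inMemory = new InstantiatedIndex(source);
    IndexSearcher searcher = new IndexSearcher(new InstantiatedIndexReader(inMemory));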
>>>
>>> On 8 Apr 2009, at 15.26, Matt Schraeder wrote:
>>>
>>>> I want to add a suggestive search similar to Google's to autocomplete
>>>> search phrases as the user types.  It doesn't have to be very
>>>> elaborate and for the most part will just involve searching single
>>>> fields.  How can I perform a search to be able to fill in autocomplete
>>>> text?
>>>>
>>>> For instance, if I start typing "Harr" it should bring up "Harry
>>>> Potter", "Harry Houdini" and "Harry S. Truman".
>>>>
>>>> I have tried doing search queries for "Harr*" but it's still doing
>>>> term-based searching rather than searching a full field.  To make a
>>>> field both searchable as the full field as well as tokenized, would I
>>>> have to duplicate the field and make one a keyword field? Is there a
>>>> more convenient way to do this? I have also considered making a second
>>>> index for suggestive search, which would only have the fields that I
>>>> want to enable suggestive search on, but this seems like it would be
>>>> unnecessary duplication of data as well, though it would probably make
>>>> suggestive search faster due to a smaller index.
>>>>
>>>> Ideally it would also be nice to be able to rank these terms based on
>>>> the number of times they have been searched for so that the results
>>>> are tailored more to our users rather than simply the score that
>>>> Lucene chooses.
>>>
>>>
>



-- 
王巍巍(Weiwei Wang)
Department of Computer Science
Gulou Campus of Nanjing University
Nanjing, P.R.China, 210093

Mobile: 86-13913310569
MSN: ww.wang...@gmail.com
Homepage: http://cs.nju.edu.cn/rl/weiweiwang
package spellchecker;

import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import log.CrawlerLogger;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.ISOLatin1AccentFilter;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.ngram.EdgeNGramTokenFilter;
import org.apache.lucene.analysis.ngram.EdgeNGramTokenFilter.Side;
import org.apache.lucene.analysis.snowball.SnowballFilter;
import org.apache.lucene.analysis.standard.StandardFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

import database.Criteria;
import database.QueryStatistics;
import database.QueryStatisticsPeer;
import database.Scroller;

/**
 * Search term auto-completer, works for single terms (so use on the last term
 * of the query).
 * <p>
 * Returns more popular terms first.
 * 
 * @author Mat Mannion, m.mann...@warwick.ac.uk
 */
public final class AutoCompleter
{

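    // Index layout: one document per unique keyword from the query log.
    // GRAMMED_WORDS_FIELD holds its edge n-grams for prefix matching,
    // SOURCE_WORD_FIELD the original keyword returned as a suggestion,
    // and COUNT_FIELD how often it was searched, used for ranking.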
    private static final String GRAMMED_WORDS_FIELD = "words";

    private static final String SOURCE_WORD_FIELD = "sourceWord";

    private static final String COUNT_FIELD = "count";

    private final Directory autoCompleteDirectory;

    private IndexReader autoCompleteReader;

    private IndexSearcher autoCompleteSearcher;

    public AutoCompleter(String autoCompleteDir) throws Exception
    {
	this.autoCompleteDirectory = FSDirectory.getDirectory(autoCompleteDir);

	reOpenReader();
    }

    public List<String> suggestTermsFor(String term) throws IOException, ParseException
    {
	QueryParser qp = new QueryParser(GRAMMED_WORDS_FIELD, getAnalyzer());
	qp.setDefaultOperator(QueryParser.OR_OPERATOR);
	qp.setAllowLeadingWildcard(true);
	// sort numerically on the stored count, most-searched first
	Sort sort = new Sort(new SortField(COUNT_FIELD, SortField.INT, true));

	TopDocs docs = autoCompleteSearcher.search(qp.parse(term), null, 10, sort);
	List<String> suggestions = new ArrayList<String>();
	for (ScoreDoc doc : docs.scoreDocs)
	{
	    suggestions.add(autoCompleteReader.document(doc.doc).get(SOURCE_WORD_FIELD));
	}

	return suggestions;
    }

    public Analyzer getAnalyzer()
    {
	return new Analyzer()
	{
	    public TokenStream tokenStream(String fieldName, Reader reader)
	    {
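		// Pipeline: tokenize, normalize (case, accents, stop words,
		// stemming), then expand each token into front n-grams of
		// length 1-20 so any typed prefix of it can match.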
		TokenStream result = new StandardTokenizer(reader);
		result = new StandardFilter(result);
		result = new LowerCaseFilter(result);
		result = new ISOLatin1AccentFilter(result);
		result = new StopFilter(result, StopAnalyzer.ENGLISH_STOP_WORDS);
		result = new SnowballFilter(result, "English");
		result = new EdgeNGramTokenFilter(result, Side.FRONT, 1, 20);
		return result;
	    }
	};
    }

    public void reIndex() throws Exception
    {
	// load the logged query keywords from the database
	Criteria c = new Criteria();
	Scroller<QueryStatistics> scro = QueryStatisticsPeer.doSelect(c);

	// use a custom analyzer so we can do EdgeNGramFiltering
	IndexWriter writer = new IndexWriter(autoCompleteDirectory, getAnalyzer(), true,
		IndexWriter.MaxFieldLength.UNLIMITED);

	writer.setMergeFactor(300);
	writer.setMaxBufferedDocs(150);

	// go through every word, storing the original word (incl. n-grams)
	// and the number of times it occurs

	Map<String, Integer> wordsMap = new HashMap<String, Integer>();

	while (scro.hasNext())
	{
	    QueryStatistics qs = scro.next();
	    String word = qs.getKeyword();
	    int len = word.length();
	    if (len < 2)
	    {
		continue;
	    }

	    if (wordsMap.containsKey(word))
		wordsMap.put(word, wordsMap.get(word) + 1);
	    else
		wordsMap.put(word, 1);
	}
	for (String word : wordsMap.keySet())
	{
	    // index the word
	    Document doc = new Document();
	    // original term, stored untokenized so it can be returned verbatim
	    doc.add(new Field(SOURCE_WORD_FIELD, word, Field.Store.YES, Field.Index.NOT_ANALYZED));
	    // grammed copy used for prefix matching
	    doc.add(new Field(GRAMMED_WORDS_FIELD, word, Field.Store.YES, Field.Index.ANALYZED));
	    // occurrence count, used for popularity sorting
	    doc.add(new Field(COUNT_FIELD, Integer.toString(wordsMap.get(word)), Field.Store.YES,
		    Field.Index.NOT_ANALYZED));

	    writer.addDocument(doc);
	}

	// close writer
	writer.optimize();
	writer.close();

	// re-open our reader
	reOpenReader();
    }

    private void reOpenReader() throws CorruptIndexException, Exception
    {
	if (autoCompleteReader == null)
	{
	    if (!IndexReader.indexExists(autoCompleteDirectory))
	    {
		CrawlerLogger.logger.info("Dictionary index does not exist, building it now");
		reIndex();
	    }
	    autoCompleteReader = IndexReader.open(autoCompleteDirectory);
	}
	else
	{
	    // reopen() returns a new reader when the index has changed; the
	    // result must be assigned or the searcher keeps seeing stale data
	    IndexReader newReader = autoCompleteReader.reopen();
	    if (newReader != autoCompleteReader)
	    {
		autoCompleteReader.close();
		autoCompleteReader = newReader;
	    }
	}

	autoCompleteSearcher = new IndexSearcher(autoCompleteReader);
    }

    public static void main(String[] args) throws Exception
    {
	AutoCompleter autocomplete = new AutoCompleter("wrapper/index/dictionary");
	String term = "电视"; // "television" in Chinese
	System.out.println(autocomplete.suggestTermsFor(term));
    }
}